How to read a CSV file using Haskell
2021-09-22
In this entry we explore how to read a CSV file using Haskell. If you are new to Haskell, the easiest way to install it is GHCup.
Description
Suppose we want to read a CSV file that contains basic information about financial instruments (bonds, stocks, funds, etc.) listed in the US stock market.
The CSV file we want to read has the following structure:
There are two important things we need to take into account about the file:
- There are different types of financial instruments in the file: Common stock, ETF, Fund.
- The Isin column can be empty.
Steps
We want to write a Haskell program that reads the file and returns a list containing financial instruments of type Common Stock. In order to do that, our program needs to perform (at least) the following steps:
- Receive the CSV file path as argument
- Check if the CSV file exists
- Read the CSV file
- Filter the elements of type Common Stock
- Return the resulting list
Let's see how to do that using Haskell!
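The first step, receiving the file path as an argument, is not shown in the code discussed in this post. A minimal sketch of how it could look is below; parseArgs and csvPathFromArgs are hypothetical names introduced only for this illustration, and only getArgs comes from the standard library.

```haskell
import System.Environment (getArgs)

-- Hypothetical helper: accept exactly one argument, the CSV file path
parseArgs :: [String] -> Either String FilePath
parseArgs [path] = Right path
parseArgs args =
  Left ("Expected exactly one argument (the CSV path), got " ++ show (length args))

-- Hypothetical wrapper that reads the real command-line arguments
csvPathFromArgs :: IO (Either String FilePath)
csvPathFromArgs = parseArgs <$> getArgs
```

Keeping the validation in a pure function makes it easy to test without touching the command line.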
Source code
Data modeling
One of the first things we usually do before we start writing code is think about the data we are going to work with. In this case we need to create a data type to hold the information from the CSV file.
We will model each row of the CSV file using the following data type:
-- data type to model a FinancialInstrument
data FinancialInstrument = FinancialInstrument
  { code :: String,
    name :: String,
    country :: String,
    exchange :: String,
    currency :: String,
    instrumentType :: String
  }
  deriving (Show, Eq)
We can ignore the value from the column Isin. Why? Because we don't need it for this exercise.
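If the Isin column were needed, one possible way to handle its empty values would be a Maybe String field, since Cassava decodes an empty field to Nothing for Maybe types. The sketch below goes beyond what this post implements: IsinRow, its fields, and the sample data are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy as BL
import Data.Csv (FromNamedRecord (..), decodeByName, (.:))
import qualified Data.Vector as V

-- Hypothetical, trimmed-down record that keeps the Isin column
data IsinRow = IsinRow
  { rowCode :: String,
    rowIsin :: Maybe String -- Nothing when the CSV field is empty
  }
  deriving (Show, Eq)

instance FromNamedRecord IsinRow where
  parseNamedRecord record =
    IsinRow
      <$> record .: "Code"
      <*> record .: "Isin"

-- Made-up sample data: the second row has an empty Isin
sampleCsv :: BL.ByteString
sampleCsv = "Code,Isin\nAAA,US0000000001\nBBB,\n"

-- Decode the sample and drop the header
decodeSample :: Either String [IsinRow]
decodeSample = V.toList . snd <$> decodeByName sampleCsv
```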
How to read a CSV file
The next thing we need to define is how we are going to read the CSV file. To read the file we are going to use two libraries:
- Data.ByteString.Lazy: to read the file as a stream (lazily).
- Cassava: to parse a CSV row to a FinancialInstrument.
Let's see how to use them:
-- OverloadedStrings lets us write column names such as "Code" as ByteString literals
{-# LANGUAGE OverloadedStrings #-}

import Prelude hiding (filter)
import qualified Data.ByteString.Lazy as BL
import Data.Csv -- this is from the Cassava library
import qualified Data.Vector as V
import System.Directory (doesFileExist)
import Text.Printf (printf) -- used to build the error message below

-- Define how to get a FinancialInstrument from a record (CSV row)
-- by implementing the FromNamedRecord type class
instance FromNamedRecord FinancialInstrument where
  parseNamedRecord record =
    FinancialInstrument
      <$> record .: "Code"
      <*> record .: "Name"
      <*> record .: "Country"
      <*> record .: "Exchange"
      <*> record .: "Currency"
      <*> record .: "Type"

type ErrorMsg = String

-- type synonym to handle CSV contents
type CsvData = (Header, V.Vector FinancialInstrument)

-- Function to read the CSV
parseCsv :: FilePath -> IO (Either ErrorMsg CsvData)
parseCsv filePath = do
  fileExists <- doesFileExist filePath
  if fileExists
    then decodeByName <$> BL.readFile filePath
    else return . Left $ printf "The file %s does not exist" filePath
Parse a CSV row
The import lines at the top of the previous code fragment bring in the libraries we need to read the CSV file.
Then, we make our FinancialInstrument data type an instance of the type class FromNamedRecord. Think about this like implementing an interface in Java.
Java:
class FinancialInstrument implements FromNamedRecord { ... }
Haskell:
instance FromNamedRecord FinancialInstrument where ...
In order to be an instance of FromNamedRecord we need to implement the function parseNamedRecord. Again, following our Java analogy:
Java:
public Parser<FinancialInstrument> parseNamedRecord(NamedRecord record) { ... }
Haskell:
parseNamedRecord record = ...
In this function we define how we can create a FinancialInstrument from a row in the CSV file.
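To see the instance at work without a file on disk, the sketch below decodes a small in-memory CSV. The sample rows are made up, and decodeInstruments is a hypothetical helper name. Because decoding is by column name, the extra Isin column is simply ignored.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Csv (FromNamedRecord (..), decodeByName, (.:))
import qualified Data.Vector as V

data FinancialInstrument = FinancialInstrument
  { code :: String,
    name :: String,
    country :: String,
    exchange :: String,
    currency :: String,
    instrumentType :: String
  }
  deriving (Show, Eq)

instance FromNamedRecord FinancialInstrument where
  parseNamedRecord record =
    FinancialInstrument
      <$> record .: "Code"
      <*> record .: "Name"
      <*> record .: "Country"
      <*> record .: "Exchange"
      <*> record .: "Currency"
      <*> record .: "Type"

-- Made-up rows; the column order of the real file is not assumed here
sampleCsv :: BLC.ByteString
sampleCsv =
  BLC.unlines
    [ "Code,Name,Country,Exchange,Currency,Type,Isin",
      "AAA,Example Corp,USA,NYSE,USD,Common Stock,",
      "BBB,Example Fund,USA,NYSE,USD,ETF,US0000000002"
    ]

-- Hypothetical helper: decode by column name and drop the header
decodeInstruments :: BLC.ByteString -> Either String [FinancialInstrument]
decodeInstruments bytes = V.toList . snd <$> decodeByName bytes
```

If a required column such as Type is missing from the header, decodeInstruments returns a Left with a parse error instead of a list.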
Read and return
Then, we define two type synonyms (also known as type aliases):
- ErrorMsg to define an error message.
- CsvData is a tuple to handle the result of reading the CSV file.
Finally, we define the function parseCsv. We use it to read the CSV file and return its contents. The function performs the following steps:
- Check if the file exists
- If the file exists, read each row and return IO (Right CsvData), or IO (Left ErrorMsg) if the contents cannot be parsed
- If the file does not exist, return IO (Left ErrorMsg)
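A caller eventually has to deal with both outcomes. The sketch below shows one way to do that by pattern matching on the Either value; describeResult is a hypothetical helper, and a plain list stands in for the vector so the example needs only the standard library.

```haskell
-- Hypothetical helper: turn either outcome into a message for the user.
-- The tuple mirrors CsvData, with a plain list standing in for the vector.
describeResult :: Either String (header, [row]) -> String
describeResult (Left err) = "Could not read the CSV: " ++ err
describeResult (Right (_, rows)) = "Read " ++ show (length rows) ++ " rows"
```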
Filter stocks
If everything went well, the function parseCsv returns an IO (Right CsvData).
We have read the CSV file! But we are not done yet.
Please remember that:
type CsvData = (Header, V.Vector FinancialInstrument)
Therefore, if we replace CsvData by its definition in IO (Right CsvData), we have:
IO (Right (Header, V.Vector FinancialInstrument))
Let's drop IO and Right. Don't worry about them, we will add them back later.
In essence, we have a tuple (Header, V.Vector FinancialInstrument) and we want to have V.Vector FinancialInstrument, where the vector only has financial instruments of type Common Stock. Therefore, we are missing two steps:
- Remove the headers from (Header, V.Vector FinancialInstrument)
- Filter the V.Vector FinancialInstrument to only keep the financial instruments of type Common Stock
1. Remove the headers from CsvData
Our starting point is (Header, V.Vector FinancialInstrument) and we want to remove the headers, meaning that we want V.Vector FinancialInstrument.
In Haskell terms we want the following function:
someFunction :: (Header, V.Vector FinancialInstrument) -> V.Vector FinancialInstrument
We have a tuple, and we need to get its second element. Fantastic! Haskell has a function for that:
:t snd
snd :: (a, b) -> b
To have a meaningful name (in the context of our problem), we define a function that will be equal to snd.
-- Discard headers from CsvData
removeHeaders :: CsvData -> V.Vector FinancialInstrument
removeHeaders = snd
Excellent! We have a function to get the second element of the tuple.
2. Filter the FinancialInstrument vector
Now, we need to filter the V.Vector FinancialInstrument to keep the instruments of type Common Stock and discard the rest. Once again, Haskell has a function for that, the name of the function is filter:
:t filter
filter :: (a -> Bool) -> [a] -> [a]
-- Which means:
-- applied to a predicate (a -> Bool) and a list [a],
-- returns the list [a] of those elements that satisfy the predicate
--
-- For example: filter odd [1, 2, 3] -> [1, 3]
The Prelude version shown above works on lists. Data.Vector exports a function with the same name and behavior for vectors, V.filter :: (a -> Bool) -> V.Vector a -> V.Vector a, and that is the one we need here.
We would like to create a function that has a meaning in the context of our problem:
-- Given a vector, return only the elements with instrumentType "Common Stock"
filterStocks :: V.Vector FinancialInstrument -> V.Vector FinancialInstrument
filterStocks = V.filter isStock
  where
    isStock :: FinancialInstrument -> Bool
    isStock instrument = instrumentType instrument == "Common Stock"
Great! We have functions to remove the headers and filter the vector. What happens if we compose these two functions?
:t (filterStocks . removeHeaders)
(filterStocks . removeHeaders) :: CsvData -> V.Vector FinancialInstrument
We get a function that takes a CsvData and returns a filtered vector V.Vector FinancialInstrument. That's almost exactly what we need.
Why almost? Because we have an IO (Right CsvData), not just a CsvData.
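The composition can be tried on small values first. The sketch below uses simplified stand-ins for the types in this post: a list of column names for the header and (code, type) pairs for the rows. The primed names and the sample data are hypothetical.

```haskell
-- Simplified stand-in for CsvData (an assumption for this sketch)
type SimpleCsv = ([String], [(String, String)])

-- Same idea as removeHeaders: keep the second element of the tuple
removeHeaders' :: SimpleCsv -> [(String, String)]
removeHeaders' = snd

-- Same idea as filterStocks, on a list instead of a vector
filterStocks' :: [(String, String)] -> [(String, String)]
filterStocks' = filter (\(_, kind) -> kind == "Common Stock")

-- Composition: first drop the header, then filter the rows
stocksOnly :: SimpleCsv -> [(String, String)]
stocksOnly = filterStocks' . removeHeaders'

-- Made-up sample data
sample :: SimpleCsv
sample =
  ( ["Code", "Type"],
    [("AAA", "Common Stock"), ("BBB", "ETF"), ("CCC", "Common Stock")]
  )
```

Applying stocksOnly to sample keeps the AAA and CCC rows and drops the ETF.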
This takes us to the last question of the exercise. How do we apply the function
(filterStocks . removeHeaders) :: CsvData -> V.Vector FinancialInstrument
to IO (Right CsvData)?
Lifting functions
We don't want to explain IO and Either in this blog post.
However, you can read about them in these links:
For now, let's call them a container or a context.
Taking that into account, IO (Right CsvData) means that we have a tuple (remember, CsvData is a tuple) inside an Either context: Right CsvData, and we have an Either inside an IO context: IO (Right CsvData).
How can we modify our composed function (filterStocks . removeHeaders) to operate on IO (Right CsvData)?
There is a simple way we can lift a function to operate on contexts. Enter fmap:
:t fmap
fmap :: Functor f => (a -> b) -> (f a -> f b)
fmap is a function that takes a function from a to b (a -> b) and returns the same function lifted to the context f (f a -> f b). Therefore we can apply the initial function in the context f.
What is Functor f? For the sake of this exercise, a functor is a container or a context. In fact IO and Either (more precisely, Either e for a fixed error type e) are functors!
For a proper explanation of what a Functor is, please read this chapter:
Functors, Applicative Functors and Monoids.
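A few small values make the behavior of fmap concrete. In the sketch below, the names and numbers are chosen only for illustration: the function is applied to the value inside a Right, a Left passes through unchanged, and the same lifting works in IO.

```haskell
-- fmap applies the function to the value inside a Right...
incrementedRight :: Either String Int
incrementedRight = fmap (+ 1) (Right 41)

-- ...and leaves a Left untouched
incrementedLeft :: Either String Int
incrementedLeft = fmap (+ 1) (Left "no value")

-- The same lifted function also works inside IO
incrementedIO :: IO Int
incrementedIO = fmap (+ 1) (pure 41)
```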
What happens if we apply fmap to (filterStocks . removeHeaders)? Let's see:
:t fmap (filterStocks . removeHeaders)
fmap (filterStocks . removeHeaders)
:: Functor f => f CsvData -> f (V.Vector FinancialInstrument)
We are close! We lifted the function and we can apply it over one context.
But we have two contexts (Either and IO). What if we apply fmap again (that is, lift the function again)?
:t fmap (fmap (filterStocks . removeHeaders))
fmap (fmap (filterStocks . removeHeaders))
:: (Functor f1, Functor f2) => f1 (f2 CsvData) -> f1 (f2 (V.Vector FinancialInstrument))
-- which can also be written as:
(fmap . fmap) (filterStocks . removeHeaders)
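The double lift can be checked on small values before using it with the CSV functions. In this sketch, liftTwice is a hypothetical name for (fmap . fmap) specialised to IO and Either, and the values are made up.

```haskell
-- Hypothetical name for the double lift, specialised to IO and Either
liftTwice :: (a -> b) -> IO (Either e a) -> IO (Either e b)
liftTwice = fmap . fmap

-- The function reaches the value inside both contexts
doubledRight :: IO (Either String Int)
doubledRight = liftTwice (* 2) (pure (Right 21))

-- A Left inside IO passes through unchanged
doubledLeft :: IO (Either String Int)
doubledLeft = liftTwice (* 2) (pure (Left "file not found"))
```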
We did it! Now we can apply our function over two nested contexts! And this is how our program ends:
-- Read stocks from a CSV file
readStocks :: FilePath -> IO (Either ErrorMsg (V.Vector FinancialInstrument))
readStocks filePath =
  (fmap . fmap) -- lift the function twice
    (filterStocks . removeHeaders) -- remove headers and filter stocks
    (parseCsv filePath) -- read CSV from file path
If you want to check the complete program and run it yourself, you can find the instructions here: README.md. The file we have explained is Csv.hs.
Tests
We want to add tests to check the behavior of our program. We created the test cases using Hspec:
-- imports are omitted for brevity
spec :: Spec
spec = do
  describe "readStocks" $ do
    it "returns IO (Left ErrorMsg) when the file does not exist" $ do
      let nonExistentFile = "test-resources/no-file.csv"
      let errorMessage = printf "The file %s does not exist" "test-resources/no-file.csv"
      readStocks nonExistentFile `shouldReturn` Left errorMessage

    it "returns 'not enough input' when the file is empty" $ do
      let emptyFile = "test-resources/empty-file.csv"
      let errorMessage = "parse error (not enough input) at \"\""
      readStocks emptyFile `shouldReturn` Left errorMessage

    it "returns the same rows as the file when the file only contains stocks" $ do
      let stocksOnlyFile = "test-resources/stocks-only.csv"
      either <- readStocks stocksOnlyFile
      either `shouldSatisfy` isRight
      length <$> either `shouldBe` Right 5

    it "returns fewer rows than the file because it filters out non-stocks" $ do
      let stocksAndFundsFile = "test-resources/stocks-and-funds.csv"
      either <- readStocks stocksAndFundsFile
      either `shouldSatisfy` isRight
      length <$> either `shouldBe` Right 7
You can find the whole test file here: CsvSpec.hs
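For readers who want to try the same Hspec expectations without the CSV files, here is a small self-contained sketch. It is not the post's test file: keepStocks is a hypothetical in-memory stand-in for readStocks, and the import list is an assumption about what a spec like the one above typically needs.

```haskell
import Data.Either (isRight)
import Test.Hspec (Spec, describe, hspec, it, shouldBe, shouldReturn, shouldSatisfy)

-- Hypothetical stand-in for readStocks that works on instrument types in memory
keepStocks :: [String] -> IO (Either String [String])
keepStocks [] = pure (Left "no rows")
keepStocks kinds = pure (Right (filter (== "Common Stock") kinds))

exampleSpec :: Spec
exampleSpec =
  describe "keepStocks" $ do
    it "returns Left when there are no rows" $
      keepStocks [] `shouldReturn` Left "no rows"

    it "keeps only common stocks" $ do
      result <- keepStocks ["Common Stock", "ETF", "Common Stock"]
      result `shouldSatisfy` isRight
      length <$> result `shouldBe` Right 2

-- Run the spec, for example from GHCi
runExample :: IO ()
runExample = hspec exampleSpec
```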
Closing words
Thanks for reading this far! I hope you have enjoyed this post.