Python Pandas {read_csv}
A Different Perspective
An analytics project starts off with data gathering, and data normally comes in CSV format. In this article, I would like to highlight some useful yet underrated tips for the Pandas read_csv method.
First and foremost, I will make use of New York bike sharing data for a simple demo. You may obtain the data from bikeshare.com, specifically the data for Jan 2017. This exercise needs pandas & functools (reduce).
Most beginners, including myself, will use the read_csv method to import a CSV, just like below:
```python
# The Most Common Way
import pandas as pd
from functools import reduce
```
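A minimal sketch of that plain import; the file name is my assumption for the Jan 2017 download, so adjust it to match yours:

```python
# Read the whole file into a data frame, then inspect it
df = pd.read_csv("201701-citibike-tripdata.csv")  # assumed file name
df.info()  # row/column counts, dtypes and memory usage
```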
This way, we import a 122 MB file with all columns and observations into a data frame. Such an action consumes 83+ MB of RAM (take note of the memory usage). With the .info() method, we can see the file contains 726K+ rows & 15 columns.
Now, let me introduce some parameters bundled with read_csv().
# Parameter 01: nrows
This parameter allows me to read a limited number of rows from the file, only 50,000 for instance.
This parameter is useful if I only need to explore a large file (imagine 1 GB+). I can still tell that the file contains 15 columns and the data type of each column, and this action only consumes approximately 6 MB of RAM.
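A sketch of the same read limited with nrows (same assumed file name):

```python
# Read only the first 50,000 rows for a cheap preview of a large file
df_sample = pd.read_csv("201701-citibike-tripdata.csv", nrows=50000)
df_sample.info()  # still 15 columns, but a much smaller memory footprint
```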
# Parameter 02: usecols
After nrows, if I decide that only a few columns are required for further exploration, I can make use of usecols to do exactly that. In this case, I choose ‘Trip Duration’, ‘Start Time’, ‘Start Station Name’ & ‘User Type’. This is to facilitate the next point of discussion.
Please note that I can select the columns by index (the number under the # column) or by name. Now, my MacBook only consumes 1.5+ MB of memory by combining nrows & usecols.
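A sketch combining nrows with usecols, selecting the four columns by their header names (file name assumed as before):

```python
# Keep only the four columns needed for further exploration
cols = ["Trip Duration", "Start Time", "Start Station Name", "User Type"]
df_small = pd.read_csv("201701-citibike-tripdata.csv", nrows=50000, usecols=cols)
df_small.info()
```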
# Parameter 03: names
I can assign a new name to each column with the names parameter. One thing to note about usecols & names: I must use column indices in usecols in order for names to work. Now, I get a data frame with the column names that I desired (note the changes under Column). I can skip the step of renaming these columns later.
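A sketch with names added. The integer positions in usecols are my assumption about where these four columns sit in the file, and start_station_name is a name I made up for the third column; header=0 tells pandas to skip the original header row:

```python
# Select columns by position so the new names line up with them
df_named = pd.read_csv(
    "201701-citibike-tripdata.csv",
    nrows=50000,
    header=0,                     # skip the original header row
    usecols=[0, 1, 4, 12],        # assumed positions of the four columns
    names=["duration", "start_time", "start_station_name", "user_type"],
)
df_named.head()
```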
# Parameter 04: parse_dates
Here, I know that start_time should be in pandas datetime64 format. I can get pandas to convert this column into datetime format on read. Again, this saves my effort on data type conversion. One thing to take note of: because I supply my own column names, parse_dates must be used together with the parameter header = 0 so that the original header row is skipped. Otherwise, it will not work.
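A sketch adding parse_dates on top of the previous call (same assumptions about the file name, positions and new names):

```python
# Parse start_time into datetime64 while reading
df_dates = pd.read_csv(
    "201701-citibike-tripdata.csv",
    nrows=50000,
    header=0,                     # required so the header row is not read as data
    usecols=[0, 1, 4, 12],
    names=["duration", "start_time", "start_station_name", "user_type"],
    parse_dates=["start_time"],
)
df_dates.dtypes                   # start_time now shows datetime64[ns]
```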
# Parameter 05: dtype
Finally, read_csv can help me perform data type conversion with the dtype parameter. The input for this parameter must be in dictionary format: I specify key-value pairs to let pandas know the data type of each column. In this case, I set duration to a 32-bit integer (np.int32) & user_type to category.
By applying the parameters nrows, usecols, names, parse_dates & dtype in one go in read_csv, memory usage is significantly reduced to 1.0+ MB.
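A sketch of the consolidated call with dtype included (file name, column positions and new names remain my assumptions):

```python
import numpy as np

df_final = pd.read_csv(
    "201701-citibike-tripdata.csv",
    nrows=50000,
    header=0,
    usecols=[0, 1, 4, 12],
    names=["duration", "start_time", "start_station_name", "user_type"],
    parse_dates=["start_time"],
    dtype={"duration": np.int32, "user_type": "category"},  # per-column types
)
df_final.info()
```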
# Parameter 06: chunksize
Some day we might come across an issue when importing data: an extremely large data set (1 GB+). In this particular situation, chunksize comes to our rescue. Such a large data set normally contains millions of rows, and the RAM of our machine might not be capable of handling the task at hand.
We could set the chunk size to 50,000. By doing so, the large dataset is broken down into many small data frames, each containing up to 50,000 rows and served one at a time by a pandas iterator object.
We can loop through this iterator and store the chunks in a list.
Let’s take a look at one of the dataframes.
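A sketch of the chunked read, the loop into a list, and a peek at the first chunk (same assumed file name and column setup as before):

```python
# chunksize makes read_csv return an iterator of data frames
reader = pd.read_csv(
    "201701-citibike-tripdata.csv",
    header=0,
    usecols=[0, 1, 4, 12],
    names=["duration", "start_time", "start_station_name", "user_type"],
    parse_dates=["start_time"],
    dtype={"duration": np.int32, "user_type": "category"},
    chunksize=50000,
)

chunk_list = []
for chunk in reader:
    chunk_list.append(chunk)      # each chunk is a data frame of up to 50,000 rows

chunk_list[0].info()              # inspect the first chunk
```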
We can use Python's functools reduce function as a replacement for a for-loop to combine all the data frames into a single data frame.
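A sketch of the reduce step, using pd.concat for the pairwise merge (my choice; a single pd.concat(chunk_list) call would give the same result):

```python
# Fold the list of chunks into one data frame
df_all = reduce(lambda left, right: pd.concat([left, right]), chunk_list)
df_all.info()                     # 4 columns, 726K+ rows
```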
And here we are: the final data frame has only 4 columns & 726K rows, and memory usage is only 14.6+ MB.
I hope these tips help you in your future data science endeavours.