Python Pandas {read_csv}

A Different Perspective

ong kam siong
Feb 17, 2021

An analytics project starts off with data gathering, and the data normally comes in CSV format. In this article, I would like to highlight some useful yet underrated tips for the pandas read_csv method.

First and foremost, I will make use of New York bike sharing data for a simple demo. You may obtain the data from bikeshare.com, specifically the data for Jan 2017. This exercise needs pandas & functools (reduce).

Most beginners, including myself, will use the read_csv method to import a CSV, just like below:

# The Most Common Way

import pandas as pd
from functools import reduce

Most common way of reading CSV
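The screenshot is not reproduced here, but a minimal sketch of that call might look like this; the filename is only a placeholder for wherever the Jan 2017 file was saved:

df = pd.read_csv('nyc-bike-share-jan2017.csv')  # placeholder filename
df.info()  # note the memory usage reported at the bottom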

This way, we import a 122 MB file with all columns and observations into a data frame. Such an action consumes 83+ MB of RAM (take note of the memory usage line). With the .info() method, we can see the file contains 726K+ rows & 15 columns.

Now, let me introduce some parameters bundled with read_csv().

# Parameter 01: nrows

This parameter allows me to read a limited number of rows from the file, for instance only 50,000.

read_csv with nrows
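A sketch of that call, using the same placeholder filename as above:

df = pd.read_csv('nyc-bike-share-jan2017.csv', nrows=50000)  # only the first 50,000 rows
df.info()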

This parameter is useful if I only need to explore a large file (imagine 1 GB+). I can tell the file contains 15 columns and the data type of each column, and this action only consumes approximately 6 MB of RAM.

# Parameter 02: usecols

After nrows, if I decide that only a few columns are required for further exploration, I can make use of usecols to do exactly that. In this case, I choose ‘Trip Duration’, ‘Start Time’, ‘Start Station Name’ & ‘User Type’. This is to facilitate the next point of discussion.

read_csv with usecols
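A sketch of that call; the column names are assumed to match the headers in the downloaded file:

cols = ['Trip Duration', 'Start Time', 'Start Station Name', 'User Type']
df = pd.read_csv('nyc-bike-share-jan2017.csv', nrows=50000, usecols=cols)
# usecols=[0, 1, 4, 12] would select the same columns by index (positions assumed)
df.info()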

Please note I can select the columns by index (the number under the # column) or by name. Now, my MacBook only consumes 1.5+ MB of memory by combining nrows & usecols.

# Parameter 03: names

I can assign a new name to each column with the names parameter. One thing to note about usecols & names: I must use column indices in usecols in order for names to work. Now I get a data frame with the column names I desired (note the changes under Column). I can skip the step of renaming these columns later.

read_csv with names
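A sketch of that call. The column positions are assumed, as is the name start_station_name; duration, start_time and user_type are the names used later in this article:

df = pd.read_csv('nyc-bike-share-jan2017.csv',
                 nrows=50000,
                 usecols=[0, 1, 4, 12],  # indices required once names is used (positions assumed)
                 names=['duration', 'start_time', 'start_station_name', 'user_type'],
                 header=0)  # skip the original header row
df.info()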

# Parameter 04: parse_dates

Here, I know that start_time should be in pandas datetime64 format. I can get pandas to convert this column into datetime format during the read, which again saves me the effort of data type conversion later. One thing to take note of: since I am also replacing the column names with names, parse_dates must be used together with header=0 so the original header row is skipped. Otherwise, it will not work.

read_csv with parse_dates
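Continuing the same sketch, with the start_time column parsed while reading:

df = pd.read_csv('nyc-bike-share-jan2017.csv',
                 nrows=50000,
                 usecols=[0, 1, 4, 12],
                 names=['duration', 'start_time', 'start_station_name', 'user_type'],
                 header=0,  # required so the old header row is not read as data
                 parse_dates=['start_time'])  # convert to datetime64 during the read
df.dtypes  # start_time is now datetime64[ns]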

# Parameter 05: dtype

Finally, read_csv can help me perform data type conversion with the dtype parameter. The input for this parameter must be a dictionary: I specify key-value pairs to let pandas know the data type for each column. In this case, I set duration to 32-bit integer (np.int32) & user_type to category.

read_csv with dtype
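A sketch of the full call with all of the parameters combined; np.int32 requires numpy to be imported:

import numpy as np

df = pd.read_csv('nyc-bike-share-jan2017.csv',
                 nrows=50000,
                 usecols=[0, 1, 4, 12],
                 names=['duration', 'start_time', 'start_station_name', 'user_type'],
                 header=0,
                 parse_dates=['start_time'],
                 dtype={'duration': np.int32, 'user_type': 'category'})
df.info()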

By applying parameters nrows, usecols, names, parse_dates & dtype at one go in read_csv, memory usage is significantly reduced to 1.0+ MB.

# Parameter 06: chunksize

Some day we might come across an issue when importing data: an extremely large data set (1 GB+). In this particular situation, chunksize comes to our rescue. Such a large data set normally contains millions of rows, and the RAM of our machine might not be capable of handling the task at hand.

We could set the chunk size to 50,000. By doing so, a large dataset will be broken down into many small data frames. Each data frame contains up to 50,000 rows and is delivered through a pandas object (an iterator rather than a single data frame).

read_csv with chunksize
pandas object
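A sketch of the chunked read; with chunksize set, read_csv returns an iterator instead of a data frame:

reader = pd.read_csv('nyc-bike-share-jan2017.csv',
                     usecols=[0, 1, 4, 12],
                     names=['duration', 'start_time', 'start_station_name', 'user_type'],
                     header=0,
                     parse_dates=['start_time'],
                     chunksize=50000)
type(reader)  # a TextFileReader (iterator of data frames), not a DataFrame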

We can loop through the pandas object and store the chunks in a list.
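For example:

chunks = []
for chunk in reader:  # each chunk is a DataFrame with up to 50,000 rows
    chunks.append(chunk)

len(chunks)  # roughly 15 chunks for 726K+ rows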

Let’s take a look at one of the dataframes.

basic info — one of the data frames
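For instance, inspecting the first chunk:

chunks[0].info()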

We can use Python’s functools.reduce as a replacement for a for-loop to combine all the data frames into a single data frame.

merge dataframes with reduce
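One way this step might look, pairwise-concatenating the chunks:

df = reduce(lambda left, right: pd.concat([left, right]), chunks)
df.info()  # 4 columns, 726K+ rows

(pd.concat(chunks) would produce the same result in a single call.)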

And here we are: the final data frame with only 4 columns & 726K+ rows, and a memory usage of only 14.6+ MB.

I hope these tips help in your future data science endeavours.
