How to read a large CSV file with pandas?

With the abundance of data in recent times, reading and processing large data files have become challenging. We often face memory issues on local machines while reading large CSV files. In this article, we will discuss how to read such large CSV files in a more optimized manner.

Table of Contents

Preparing Large CSV file

Before getting started, let’s create a sample DataFrame containing some large values and save it in the local system for experimentation.

# importing libraries
import pandas as pd
import numpy as np
import time

# dataframe with random values
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000,14)),
                  columns=['col_' + str(i) for i in range(14)])
print (df.shape)

# save the dataframe as CSV file
df.to_csv("file1.csv")

Output

(10000000, 14)
....
....

We have created a sample DataFrame with 1M rows and 14 columns containing some large integer values. It takes about 1.32GB of local storage which is a decent size for experimentation.

Advertisements

Using pandas.read_csv() method

Let’s start with the basic pandas.read_csv method to understand how much time it take to read this CSV file.

import pandas as pd
import time

start = time.time()

df = pd.read_csv('file1.csv')

end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 12.11653995513916 sec

It roughly took around 12 secs to read the entire file. Note that this might vary based on the system configuration.

Using chunksize in pandas.read_csv() method

Now let’s look at a slightly more optimized way to reading such large CSV files using pandas.read_csv method. It contains an attribute called chunksize, meaning, instead of reading the whole CSV at once, chunks of CSV are read into memory. This method optimizes time and memory effectively.

import pandas as pd
import time

start = time.time()
# read chunks
chunk = pd.read_csv('file1.csv', chunksize=500000)

# concat chunks
df = pd.concat(chunk)
end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 9.14691710472107 sec

Using chunksize has helped reduce the time taken from 12 secs to 9 secs, which is a good improvement.

Using Dask library

Dask is an open-source python library that includes features of parallelism and scalability in Python. It utilizes multiple CPU cores to reduce the processing time and has syntax similar to pandas for easy adaptability.

We can simply install the Dask library via conda using the following –

conda install dask

Once installed, let’s read the CSV file using the dataframe method from Dask.

# import libraries
import time
from dask import dataframe as dd

# read file
start = time.time()
df = dd.read_csv('file1.csv')
end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 0.08547091484069824 sec

Dask has reduced the time to under 1sec, which is amazing optimization. It can be converted easily back to pandas DataFrame for further processing and modeling.

Summary

In this article, we have discussed how to read a large CSV file with pandas.

Pandas Tutorials -Learn Data Analysis with Python

   

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top