How to read a large CSV file with pandas?

With the abundance of data these days, reading and processing large data files has become a common challenge, and we often run into memory issues on local machines when reading large CSV files. In this article, we will discuss how to read such large CSV files in a more optimized manner.

Preparing a Large CSV File

Before getting started, let's create a sample DataFrame with some large values and save it to the local file system for experimentation.

# importing libraries
import pandas as pd
import numpy as np

# DataFrame with 10 million rows of random integer values
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000, 14)),
                  columns=['col_' + str(i) for i in range(14)])
print(df.shape)

# save the DataFrame as a CSV file
df.to_csv("file1.csv")

Output

(10000000, 14)

We have created a sample DataFrame with 10 million rows and 14 columns containing some large integer values. It takes about 1.32 GB of local storage, which is a decent size for experimentation.
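If you want to check the size of the generated file yourself, the standard library is enough. A quick sketch, using the same file1.csv path as above:

import os

# size of the generated CSV on disk, in gigabytes
size_gb = os.path.getsize("file1.csv") / (1024 ** 3)
print(f"file1.csv is {size_gb:.2f} GB")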

Using pandas.read_csv() method

Let's start with the basic pandas.read_csv() method to see how much time it takes to read this CSV file.

import pandas as pd
import time

start = time.time()

# read the whole CSV into memory in a single call
df = pd.read_csv('file1.csv')

end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 12.11653995513916 sec

It took around 12 seconds to read the entire file. Note that this may vary based on the system configuration.
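Before moving on to chunks, note that read_csv() itself has parameters that can cut both time and memory when you do not need every column. The example below is only a sketch: it assumes we care about just the first four columns of the sample file, and it uses int32 because we know the generated values stay below 100 million.

import pandas as pd
import time

start = time.time()

# read only four of the fourteen columns and downcast the integers
df = pd.read_csv('file1.csv',
                 usecols=['col_0', 'col_1', 'col_2', 'col_3'],
                 dtype='int32')   # safe here: all sample values fit in a 32-bit integer

end = time.time()
print("Total Time:", (end - start), "sec")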

Using chunksize in pandas.read_csv() method

Now let's look at a slightly more optimized way of reading such a large CSV file with the pandas.read_csv() method. It accepts a parameter called chunksize: instead of reading the whole CSV into memory at once, it returns an iterator that yields the file in chunks of the given number of rows. This keeps memory usage under control and can also reduce the overall read time.

import pandas as pd
import time

start = time.time()
# chunksize returns an iterator that yields DataFrames of 500,000 rows each
chunks = pd.read_csv('file1.csv', chunksize=500000)

# concatenate the chunks back into a single DataFrame
df = pd.concat(chunks)
end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 9.14691710472107 sec

Using chunksize reduced the read time from roughly 12 seconds to 9 seconds, which is a noticeable improvement.
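The real payoff of chunksize comes when you do not need the full DataFrame in memory at all. Instead of concatenating the chunks, you can iterate over them and keep only an aggregate. A minimal sketch, computing the mean of col_0 from the sample file purely for illustration:

import pandas as pd

total = 0
rows = 0

# process the file in chunks of 500,000 rows without holding it all in memory
for chunk in pd.read_csv('file1.csv', chunksize=500000):
    total += chunk['col_0'].sum()
    rows += len(chunk)

print("mean of col_0:", total / rows)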

Using Dask library

Dask is an open-source Python library that brings parallelism and scalability to Python. It uses multiple CPU cores to reduce processing time and offers a pandas-like syntax, which makes it easy to adopt.

We can install the Dask library via conda using the following command:

conda install dask

Once installed, let's read the CSV file using the dataframe module from Dask.

# import libraries
import time
from dask import dataframe as dd

# read file (Dask is lazy: this builds the task graph, it does not load the data yet)
start = time.time()
df = dd.read_csv('file1.csv')
end = time.time()

print("Total Time:",(end-start),"sec")

Output

Total Time: 0.08547091484069824 sec

Dask finished the call in well under a second. Keep in mind that dd.read_csv is lazy: this timing mostly reflects building the task graph, and the data is actually read when a computation is triggered. Even so, Dask parallelizes that work across CPU cores, and the result can easily be converted back to a pandas DataFrame for further processing and modeling.
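To materialize the result as a regular pandas DataFrame (and force the actual read), call .compute() on the Dask DataFrame. A minimal sketch of the full round trip, reusing the same file1.csv; actual timings will depend on the machine and the number of cores:

import time
from dask import dataframe as dd

start = time.time()

# dd.read_csv only builds the task graph; .compute() triggers the actual read
ddf = dd.read_csv('file1.csv')
pdf = ddf.compute()   # returns an ordinary pandas DataFrame

end = time.time()
print("Total Time:", (end - start), "sec")
print(type(pdf))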

Summary

In this article, we discussed three ways to read a large CSV file: a plain pandas.read_csv() call, chunked reading with the chunksize parameter, and the Dask library, which parallelizes the work across CPU cores.
