With the abundance of data in recent times, reading and processing large data files has become challenging. We often run into memory issues on local machines while reading large CSV files. In this article, we will discuss how to read such large CSV files in a more optimized manner.
Preparing a Large CSV file
Before getting started, let’s create a sample DataFrame containing some large values and save it in the local system for experimentation.
# importing libraries
import pandas as pd
import numpy as np
import time

# dataframe with random values
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000, 14)),
                  columns=['col_' + str(i) for i in range(14)])

print(df.shape)

# save the dataframe as a CSV file
df.to_csv("file1.csv")
Output
(10000000, 14)
We have created a sample DataFrame with 10 million rows and 14 columns containing some large integer values. It takes about 1.32 GB of local storage, which is a decent size for experimentation.
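As a quick sanity check (not part of the original walkthrough), we can compare the DataFrame's in-memory footprint with the size of the CSV on disk. This sketch assumes the DataFrame df from the snippet above is still in memory and that file1.csv has been written to the working directory.

import os

# approximate in-memory size of the DataFrame, in GB
print("In-memory size:", df.memory_usage(deep=True).sum() / 1024**3, "GB")

# size of the saved CSV file on disk, in GB
print("On-disk size:", os.path.getsize("file1.csv") / 1024**3, "GB")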
Using pandas.read_csv() method
Let’s start with the basic pandas.read_csv() method to understand how much time it takes to read this CSV file.
import pandas as pd
import time

start = time.time()
df = pd.read_csv('file1.csv')
end = time.time()

print("Total Time:", (end - start), "sec")
Output
Total Time: 12.11653995513916 sec
It took around 12 seconds to read the entire file. Note that this might vary based on your system configuration.
Using chunksize in pandas.read_csv() method
Now let’s look at a slightly more optimized way of reading such large CSV files using the pandas.read_csv() method. It accepts a parameter called chunksize, which means that instead of reading the whole CSV into memory at once, the file is read in chunks of the given number of rows. This helps keep both time and memory usage in check.
import pandas as pd
import time

start = time.time()

# read the CSV file in chunks of 500,000 rows
chunk = pd.read_csv('file1.csv', chunksize=500000)

# concatenate the chunks into a single DataFrame
df = pd.concat(chunk)

end = time.time()
print("Total Time:", (end - start), "sec")
Output
Total Time: 9.14691710472107 sec
Using chunksize has helped reduce the time taken from around 12 seconds to 9 seconds, which is a good improvement.
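Keep in mind that concatenating all the chunks still builds the full DataFrame in memory. If the goal is to keep memory usage low, each chunk can be processed on its own instead. The sketch below is an illustrative example (not from the original article) that sums the col_1 column created earlier, chunk by chunk, so only the running total is kept in memory.

import pandas as pd

total = 0

# process the file chunk by chunk without keeping all rows in memory
for chunk in pd.read_csv('file1.csv', chunksize=500000):
    # aggregate over the current chunk; only the running total is retained
    total += chunk['col_1'].sum()

print("Sum of col_1:", total)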
Using Dask library
Dask is an open-source Python library that brings parallelism and scalability to Python. It utilizes multiple CPU cores to reduce processing time and has a syntax similar to pandas, which makes it easy to adopt.
We can simply install the Dask library via conda using the following –
conda install dask
Once installed, let’s read the CSV file using the dataframe module from Dask.
# import libraries
import time
from dask import dataframe as dd

# read file
start = time.time()
df = dd.read_csv('file1.csv')
end = time.time()

print("Total Time:", (end - start), "sec")
Output
Total Time: 0.08547091484069824 sec
Dask has brought the call down to well under a second, which is an impressive speed-up. Note, however, that dd.read_csv() is lazy: it builds a task graph and only reads the data when a computation is triggered, so this timing does not include a full read of the file. The Dask DataFrame can easily be converted back to a pandas DataFrame for further processing and modeling.
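For example, calling compute() on the Dask DataFrame from the step above triggers the actual read and materializes it as a regular pandas DataFrame. This is a minimal sketch, not timed in the original article.

# trigger the read and convert the Dask DataFrame back to pandas
pdf = df.compute()

print(type(pdf))   # <class 'pandas.core.frame.DataFrame'>
print(pdf.shape)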
Summary
In this article, we discussed how to read a large CSV file efficiently using pandas (with and without chunksize) and the Dask library.