In this article we will discuss how to read a CSV file with different types of delimiters into a DataFrame.
Python's Pandas library provides a function to load a CSV file into a DataFrame, i.e.
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None, header='infer', names=None, index_col=None, ....)
It reads the content of the CSV file at the given path, loads it into a DataFrame and returns that DataFrame. It uses a comma (,) as the default delimiter or separator while parsing a file, but we can also specify a custom separator, or a regular expression to be used as the separator.
To use pandas.read_csv(), import the pandas module, i.e.
import pandas as pd
Using read_csv() with custom delimiter
Suppose we have a file 'users.csv' in which columns are separated by the string '__'.
Contents of file users.csv are as follows,
Name__Age__City
jack__34__Sydeny
Riti__31__Delhi
Aadi__16__New York
Suse__32__Lucknow
Mark__33__Las vegas
Suri__35__Patna
Now to load this kind of file to a dataframe object using pandas.read_csv() we have to pass the sep & engine arguments to pandas.read_csv() i.e.
# Read a csv file to a dataframe with custom delimiter
usersDf = pd.read_csv('users.csv', sep='__', engine='python')

print('Contents of Dataframe : ')
print(usersDf)
Output:
Contents of Dataframe : 
   Name  Age       City
0  jack   34     Sydeny
1  Riti   31      Delhi
2  Aadi   16   New York
3  Suse   32    Lucknow
4  Mark   33  Las vegas
5  Suri   35      Patna
Here, the sep argument is used as the separator or delimiter. By default read_csv() parses with the C engine, which supports only single-character separators (plus the special case '\s+'). A separator longer than one character, such as '__', is interpreted as a regular expression and needs the Python engine. If we pass such a separator without setting the engine argument to 'python', pandas falls back to the Python engine by itself and issues a warning like this,
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex);
You can avoid this warning by specifying engine='python'.
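The engine behaviour described above can be checked with a small sketch. The sample data and the helper read_and_count_parser_warnings() are made up for illustration, and io.StringIO stands in for a file on disk so that the snippet runs without creating any files.

```python
import io
import warnings

import pandas as pd

# Hypothetical sample data, held in memory instead of in files on disk.
semicolon_data = "Name;Age;City\nRiti;31;Delhi\nSuri;35;Patna\n"
underscore_data = "Name__Age__City\nRiti__31__Delhi\nSuri__35__Patna\n"


def read_and_count_parser_warnings(text, **kwargs):
    """Parse text with read_csv() and count the ParserWarnings it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        frame = pd.read_csv(io.StringIO(text), **kwargs)
    count = sum(issubclass(w.category, pd.errors.ParserWarning) for w in caught)
    return frame, count


# A single-character separator is handled by the default C engine.
single_df, single_count = read_and_count_parser_warnings(semicolon_data, sep=";")

# A multi-character separator without engine='python' triggers the fallback warning.
fallback_df, fallback_count = read_and_count_parser_warnings(underscore_data, sep="__")

# The same separator with engine='python' parses identically, without the warning.
explicit_df, explicit_count = read_and_count_parser_warnings(
    underscore_data, sep="__", engine="python"
)

print(single_count, fallback_count, explicit_count)
```

So the engine argument only matters for separators longer than one character. For a single character such as ';' or '\t', passing sep alone is enough.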
Using read_csv() with white space or tab as delimiter
As we saw in the example above, we can pass custom delimiters. Now suppose we have a file in which columns are separated by spaces or tabs, i.e.
Contents of file users_4.csv are,
Name Age City
jack 34 Sydeny
Riti 31 Delhi
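Now, to load this kind of file to a dataframe with pandas.read_csv(), pass '\s+' as the separator. Here \s+ means one or more whitespace characters (spaces or tabs). It is the one multi-character separator that the default C engine also supports, so engine='python' is optional in this case. Write the pattern as a raw string (r'\s+') so that Python does not treat \s as a string escape sequence.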
# Read a csv file to a dataframe with delimiter as space or tab
usersDf = pd.read_csv('users_4.csv', sep=r'\s+', engine='python')

print('Contents of Dataframe : ')
print(usersDf)
Contents of the dataframe returned are,
Contents of Dataframe : 
   Name  Age    City
0  jack   34  Sydeny
1  Riti   31   Delhi
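As a small sketch of the whitespace case, the snippet below parses made-up data in which tabs and runs of spaces are mixed on the same line. The data is hypothetical and is read from an in-memory io.StringIO buffer rather than from users_4.csv. No engine argument is passed, so the default engine is used.

```python
import io

import pandas as pd

# Hypothetical data: columns separated by a mix of tabs and multiple spaces.
mixed_whitespace = "Name\tAge   City\nRiti  31\tDelhi\nSuri\t35 Patna\n"

# The raw string r'\s+' matches any run of one or more whitespace characters.
whitespace_df = pd.read_csv(io.StringIO(mixed_whitespace), sep=r"\s+")

print(whitespace_df)
```

Note that with this separator every run of whitespace splits a column, so it only suits files whose values do not themselves contain spaces.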
Using read_csv() with regular expression for delimiters
Suppose we have a file in which several different characters are used as delimiters instead of a single one, like this.
Contents of file users_5.csv are,
Name,Age|City
jack,34_Sydeny
Riti:31,Delhi
Aadi,16:New York
Suse,32:Lucknow
Mark,33,Las vegas
Suri,35:Patna
Now, to load this kind of file to a dataframe with read_csv(), pass a regular expression, i.e. '[:,|_]', in the sep argument. This regular expression is a character class: any one of the characters , : | _ is treated as a delimiter or separator, i.e.
# Read a csv file to a dataframe with multiple delimiters in regular expression
usersDf = pd.read_csv('users_5.csv', sep='[:,|_]', engine='python')

print('Contents of Dataframe : ')
print(usersDf)
Output:
Contents of Dataframe : 
   Name  Age       City
0  jack   34     Sydeny
1  Riti   31      Delhi
2  Aadi   16   New York
3  Suse   32    Lucknow
4  Mark   33  Las vegas
5  Suri   35      Patna
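Because a separator longer than one character is interpreted as a regular expression, a literal delimiter that contains regex metacharacters has to be escaped. The sketch below first uses re.split() to show what the character class matches on one line of the file above, then reads made-up data delimited by a literal '||'. The '||' data and the variable names are assumptions for illustration, and re.escape() is just one convenient way to build the escaped pattern.

```python
import io
import re

import pandas as pd

# What the character class does to one line of users_5.csv.
fields = re.split(r"[:,|_]", "Aadi,16:New York")
print(fields)

# Hypothetical data that uses a literal '||' as its delimiter.
pipe_data = "Name||Age||City\nRiti||31||Delhi\nSuri||35||Patna\n"

# '|' means alternation in a regex, so escape it to match the two characters literally.
literal_sep = re.escape("||")

pipe_df = pd.read_csv(io.StringIO(pipe_data), sep=literal_sep, engine="python")
print(pipe_df)
```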
Complete example is as follows:
import pandas as pd

def main():
    print(' *** Using pandas.read_csv() with Custom delimiter ***')
    # Read a csv file to a dataframe with custom delimiter
    usersDf = pd.read_csv('users_3.csv', sep='__', engine='python')
    print('Contents of Dataframe : ')
    print(usersDf)

    print('********')

    print(' *** Using pandas.read_csv() with space or tab as delimiters ***')
    # Read a csv file to a dataframe with delimiter as space or tab
    # (raw string so that \s is passed to the regex engine unchanged)
    usersDf = pd.read_csv('users_4.csv', sep=r'\s+', engine='python')
    print('Contents of Dataframe : ')
    print(usersDf)

    print(' *** Using pandas.read_csv() with multiple char delimiters ***')
    # Read a csv file to a dataframe with multiple delimiters in regular expression
    usersDf = pd.read_csv('users_5.csv', sep='[:,|_]', engine='python')
    print('Contents of Dataframe : ')
    print(usersDf)

if __name__ == '__main__':
    main()
Output:
 *** Using pandas.read_csv() with Custom delimiter ***
Contents of Dataframe : 
   Name  Age       City
0  jack   34     Sydeny
1  Riti   31      Delhi
2  Aadi   16   New York
3  Suse   32    Lucknow
4  Mark   33  Las vegas
5  Suri   35      Patna
********
 *** Using pandas.read_csv() with space or tab as delimiters ***
Contents of Dataframe : 
   Name  Age    City
0  jack   34  Sydeny
1  Riti   31   Delhi
 *** Using pandas.read_csv() with multiple char delimiters ***
Contents of Dataframe : 
   Name  Age       City
0  jack   34     Sydeny
1  Riti   31      Delhi
2  Aadi   16   New York
3  Suse   32    Lucknow
4  Mark   33  Las vegas
5  Suri   35      Patna