Multiple times while processing data, it is important to extract specific columns and store them in a new DataFrame. In this article, we will cover multiple ways to achieve that.
Table of Contents
Preparing Dataset
To quickly get started, let’s create a sample dataframe to experiment. We’ll use the pandas library with some random data.
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Tech', 5), ('Riti', 'Data Engineer', 'Tech' , 7), ('Shanky', 'Program Manager', 'PMO' , 2), ('Shreya', 'Graphic Designer', 'Design' , 2), ('Aadi', 'Backend Developer', 'Tech', 11), ('Sim', 'Data Engineer', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'Team', 'Experience'], index=[0, 1, 2, 3, 4, 5]) print(df)
Contents of the created dataframe are,
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 5 Sim Data Engineer Tech 4
Now let’s see how to select specific columns from this, to a new DataFrame as copy.
Select subset of columns using copy() function
The simplest way to extract columns is to select the columns from the original DataFrame using [] operator and then copy it using the pandas.DataFrame.copy() function. Let’s take a look at code using an example, say, we need to select the columns “Name” and “Team” from the above DataFrame and save it as a new DataFrame.
# saving columns Name and Team into df_new df_new = df[['Name', 'Team']].copy() print (df_new)
Output
Frequently Asked:
- Pandas – Check if all values in a Column are Equal
- GroupBy two columns and get count in Pandas DataFrame
- Pandas: Drop last N columns of dataframe
- Replace NaN with 0 in Pandas DataFrame
Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech
As observed, the new DataFrame (df_new) contains the two selected columns (Name and Team) from the original DataFrame.
Select subset of columns using filter() function
The filter() function also works similarly to the above method. Here, we don’t need to explicitly copy the DataFrame, as the filter() function automatically creates one with the selected columns. Let’s execute the above task using the filter function.
# Select Name and Team columns from DataFrame as copy # using filter() function df_new = df.filter(['Name', 'Team']) print (df_new)
Output
Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech
Here you go, we have similar output, i.e., the columns Name and Team saved in a new DataFrame (df_new).
Select subset of columns using drop() function
The drop() function is used to drop the non-required columns from the DataFrame. It comes in handy, in cases where we need to select the majority of the columns from a large dataset. So instead of listing down the column names to be selected, we can mention the column names that need to be dropped.
In the above example, we need the columns Name and Team, meaning that we will have to drop the other columns (Designation and Experience). Here, since the columns are very few, it is not making much sense, but below is the code to execute this in case of ideal situations.
# Drop few columns from DataFrame and select remaining df_new = df.drop(['Designation', 'Experience'], axis=1) print (df_new)
Output
Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech
Select subset of columns using assign() function
Another interesting method is to use the DataFrame.assign() method to create a new DataFrame and then assign different columns, finally returning a new DataFrame with the new columns created. Let’s look at the code below which is actually simple to understand.
# Assign few columns from a DataFrame to a new DataFrame df_new = pd.DataFrame().assign(employee_name=df['Name'], team_name=df['Team']) print (df_new)
Output
employee_name team_name 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech
The real advantage of this function is that you can rename the columns as well at the same time. Here, we have renamed column “Name” as “employee_name” and “Team” as “team_name”.
The complete example is as follows,
import pandas as pd # List of Tuples employees= [('Shubham', 'Data Scientist', 'Tech', 5), ('Riti', 'Data Engineer', 'Tech' , 7), ('Shanky', 'Program Manager', 'PMO' , 2), ('Shreya', 'Graphic Designer', 'Design' , 2), ('Aadi', 'Backend Developer', 'Tech', 11), ('Sim', 'Data Engineer', 'Tech', 4)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'Team', 'Experience'], index=[0, 1, 2, 3, 4, 5]) print(df) # saving columns Name and Team into df_new df_new = df[['Name', 'Team']].copy() print (df_new) # Select Name and Team columns from DataFrame as copy # using filter() function df_new = df.filter(['Name', 'Team']) print (df_new) # Drop few columns from DataFrame and select remaining df_new = df.drop(['Designation', 'Experience'], axis=1) print (df_new) # Assign few columns from a DataFrame to a new DataFrame df_new = pd.DataFrame().assign(employee_name=df['Name'], team_name=df['Team']) print (df_new)
Output:
Name Designation Team Experience 0 Shubham Data Scientist Tech 5 1 Riti Data Engineer Tech 7 2 Shanky Program Manager PMO 2 3 Shreya Graphic Designer Design 2 4 Aadi Backend Developer Tech 11 5 Sim Data Engineer Tech 4 Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech Name Team 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech employee_name team_name 0 Shubham Tech 1 Riti Tech 2 Shanky PMO 3 Shreya Design 4 Aadi Tech 5 Sim Tech
Summary
In this article, we have discussed multiple ways to extract specific columns to a new DataFrame as a copy in pandas. Thanks.