In the data analysis, we often need to create a new column a DataFrame based on values from other columns in Pandas. In this article, we will cover multiple scenarios and functions that we can use to achieve that.
Table of Contents
- Add new Column based on other Columns using basic functions (mean, max, etc.)
- Add new Column based on other Columns using custom defined function (apply() method)
- Add new Column based on other Columns using numpy.where() function
- Add new Column based on other Columns using custom defined function (vectorize method)
- Summary
To quickly get started, let’s create a sample DataFrame to experiment. We’ll use the pandas library with some random data.
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Sydney', 5, 5), ('Riti', 'Data Analyst', 'Delhi' , 7, 3), ('Shanky', 'Program Manager', 'Delhi' , 2, 2), ('Shreya', 'Graphic Designer', 'Mumbai' , 2, 2), ('Aadi', 'Data Engineering', 'New York', 11, 3)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'City', 'Experience', 'Tenure'], index=[0, 1, 2, 3, 4]) print(df)
Contents of the created dataframe are,
Name Designation City Experience Tenure 0 Shubham Data Scientist Sydney 5 5 1 Riti Data Analyst Delhi 7 3 2 Shanky Program Manager Delhi 2 2 3 Shreya Graphic Designer Mumbai 2 2 4 Aadi Data Engineering New York 11 3
Now, let’s look at different ways in which we could add a new column based on values from other columns in this DataFrame.
Add new Column based on other Columns using basic functions (mean, max, etc.)
In this scenario, we need to calculate the row-wise average of two columns and store it as a new column. Let’s quickly try this by taking the average of columns “Experience” and “Tenure” and saving it as a new column named “avg_exp_tenure” in the existing DataFrame.
# avg of experience and tenure df['avg_exp_tenure'] = df[['Experience', 'Tenure']].mean(axis=1) print (df)
Output
Frequently Asked:
Name Designation City Experience Tenure avg_exp_tenure 0 Shubham Data Scientist Sydney 5 5 5.0 1 Riti Data Analyst Delhi 7 3 5.0 2 Shanky Program Manager Delhi 2 2 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 4 Aadi Data Engineering New York 11 3 7.0
Here you go, we have the average of two columns stored as a new column. We can use other basic operators such as mean, median, sum, etc. in the same way.
Add new Column based on other Columns using custom defined function (apply() method)
In most cases, we need to define a custom function to replicate the business understanding. Let’s take an example here, we need to create a new column containing the weighted score, by giving 0.75 weight to column “Experience” and 0.25 weight to column “Tenure”.
# weighted score of experience and tenure df['weighted_score'] = df.apply(lambda row: 0.75*row['Experience'] + 0.25*row['Tenure'], axis=1) print (df)
Output
Name Designation City Experience Tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 1 Riti Data Analyst Delhi 7 3 6.0 2 Shanky Program Manager Delhi 2 2 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 4 Aadi Data Engineering New York 11 3 9.0
In case, we have more complicated logic, feel free to create a separate function with all the conditions. For example, let’s say we want to tweak the weighted score based on the “City” information.
def weighted_score(row): if row['City'] in ['Delhi', 'Mumbai']: return 0.75*row['Experience'] + 0.25*row['Tenure'] else: return 0.5*row['Experience'] + 0.5*row['Tenure'] df['weighted_score'] = df.apply(weighted_score, axis=1) print (df)
Output
Name Designation City Experience Tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 1 Riti Data Analyst Delhi 7 3 6.0 2 Shanky Program Manager Delhi 2 2 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 4 Aadi Data Engineering New York 11 3 7.0
In the above example, we are doing a row-wise operation by checking the “City” information in each row and then applying the required operation. Finally, returning the pandas Series which is stored as the new column in the existing DataFrame.
Add new Column based on other Columns using numpy.where() function
The numpy.where() is a little convenient approach to achieve the most task in cases where there are not too many conditions. It works similar to the IF statements in excel, let’s quickly try to replicate the above scenario.
# weighted score of experience and tenure df['weighted_score'] = np.where(df['City'].isin(['Delhi', 'Mumbai']), # condition 0.75*df['Experience'] + 0.25*df['Tenure'], # operation if the condition is TRUE 0.5*df['Experience'] + 0.5*df['Tenure']) # operation if the condition is FALSE print (df)
Output
Name Designation City Experience Tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 1 Riti Data Analyst Delhi 7 3 6.0 2 Shanky Program Manager Delhi 2 2 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 4 Aadi Data Engineering New York 11 3 9.0
As observed, the output is the same as the above apply function and is achieved by using a one-liner statement. However, in case of complicated conditions, it is recommended to use the apply function for cleaner code.
Add new Column based on other Columns using custom defined function (vectorize method)
All the above approaches work on element-wise operations which are computationally heavy and take more time. Let’s look at a vectorize approach using the np.select() function.
In np.select, we need to define the conditions (list) and their corresponding outputs (list). Both the conditions and outputs are passed as argument are passed in the np.select() as arguments along with a default condition in case none of the conditions matches.
Let’s try to create a new column “Department” based on the conditions on the “Designation” column.
# define condition conditions = [ df['Designation'].str.contains("Data"), # condition1 df['Designation'].str.contains("Program"), # condition2 df['Designation'].str.contains("Designer"), # condition3 ] # define the corresponding output output = ['Data', 'PMO', 'Graphics'] df['Department'] = pd.Series(np.select(conditions, output, default = 'Others')) print (df)
Output
Name Designation City Experience Tenure Department 0 Shubham Data Scientist Sydney 5 5 Data 1 Riti Data Analyst Delhi 7 3 Data 2 Shanky Program Manager Delhi 2 2 PMO 3 Shreya Graphic Designer Mumbai 2 2 Graphics 4 Aadi Data Engineering New York 11 3 Data
Note that the same can be achieved using apply or np.where function as well, but they will be more time consuming that the np.select function.
The complete example containing all the techniques is as follows,
import pandas as pd import numpy as np # List of Tuples employees= [('Shubham', 'Data Scientist', 'Sydney', 5, 5), ('Riti', 'Data Analyst', 'Delhi' , 7, 3), ('Shanky', 'Program Manager', 'Delhi' , 2, 2), ('Shreya', 'Graphic Designer', 'Mumbai' , 2, 2), ('Aadi', 'Data Engineering', 'New York', 11, 3)] # Create a DataFrame object from list of tuples df = pd.DataFrame(employees, columns=['Name', 'Designation', 'City', 'Experience', 'Tenure'], index=[0, 1, 2, 3, 4]) print(df) print("** Add new Column based on other Columns using basic functions ***") # avg of experience and tenure df['avg_exp_tenure'] = df[['Experience', 'Tenure']].mean(axis=1) print (df) print("** Add new Column based on other Columns using custom defined function ***") # weighted score of experience and tenure df['weighted_score'] = df.apply(lambda row: 0.75*row['Experience'] + 0.25*row['Tenure'], axis=1) print (df) print("** Another example ***") def weighted_score(row): if row['City'] in ['Delhi', 'Mumbai']: return 0.75*row['Experience'] + 0.25*row['Tenure'] else: return 0.5*row['Experience'] + 0.5*row['Tenure'] df['weighted_score'] = df.apply(weighted_score, axis=1) print (df) print("** Add new Column based on other Columns using numpy.where() ***") # weighted score of experience and tenure df['weighted_score'] = np.where(df['City'].isin(['Delhi', 'Mumbai']), # condition 0.75*df['Experience'] + 0.25*df['Tenure'], # operation if the condition is TRUE 0.5*df['Experience'] + 0.5*df['Tenure']) # operation if the condition is FALSE print (df) print("** Add new Column based on other Columns using vectorization ***") # define condition conditions = [ df['Designation'].str.contains("Data"), # condition1 df['Designation'].str.contains("Program"), # condition2 df['Designation'].str.contains("Designer"), # condition3 ] # define the corresponding output output = ['Data', 'PMO', 'Graphics'] df['Department'] = pd.Series(np.select(conditions, output, default = 'Others')) print (df)
Output
Name Designation City Experience Tenure 0 Shubham Data Scientist Sydney 5 5 1 Riti Data Analyst Delhi 7 3 2 Shanky Program Manager Delhi 2 2 3 Shreya Graphic Designer Mumbai 2 2 4 Aadi Data Engineering New York 11 3 ** Add new Column based on other Columns using basic functions *** Name Designation City Experience Tenure avg_exp_tenure 0 Shubham Data Scientist Sydney 5 5 5.0 1 Riti Data Analyst Delhi 7 3 5.0 2 Shanky Program Manager Delhi 2 2 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 4 Aadi Data Engineering New York 11 3 7.0 ** Add new Column based on other Columns using custom defined function *** Name Designation City Experience Tenure avg_exp_tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 5.0 1 Riti Data Analyst Delhi 7 3 5.0 6.0 2 Shanky Program Manager Delhi 2 2 2.0 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 2.0 4 Aadi Data Engineering New York 11 3 7.0 9.0 ** Another example *** Name Designation City Experience Tenure avg_exp_tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 5.0 1 Riti Data Analyst Delhi 7 3 5.0 6.0 2 Shanky Program Manager Delhi 2 2 2.0 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 2.0 4 Aadi Data Engineering New York 11 3 7.0 7.0 ** Add new Column based on other Columns using numpy.where() *** Name Designation City Experience Tenure avg_exp_tenure weighted_score 0 Shubham Data Scientist Sydney 5 5 5.0 5.0 1 Riti Data Analyst Delhi 7 3 5.0 6.0 2 Shanky Program Manager Delhi 2 2 2.0 2.0 3 Shreya Graphic Designer Mumbai 2 2 2.0 2.0 4 Aadi Data Engineering New York 11 3 7.0 7.0 ** Add new Column based on other Columns using vectorization *** Name Designation City Experience Tenure avg_exp_tenure weighted_score Department 0 Shubham Data Scientist Sydney 5 5 5.0 5.0 Data 1 Riti Data Analyst Delhi 7 3 5.0 6.0 Data 2 Shanky Program Manager Delhi 2 2 2.0 2.0 PMO 3 Shreya Graphic Designer Mumbai 2 2 2.0 2.0 Graphics 4 Aadi Data Engineering New York 11 3 7.0 7.0 Data
Summary
Great, you made it! In this article, we have discussed multiple ways to create a new column based on the values from other columns.