Create New Column based on values of other columns in Pandas

In the data analysis, we often need to create a new column a DataFrame based on values from other columns in Pandas. In this article, we will cover multiple scenarios and functions that we can use to achieve that.

Table of Contents

To quickly get started, let’s create a sample DataFrame to experiment. We’ll use the pandas library with some random data.

import pandas as pd
import numpy as np

# List of Tuples
employees= [('Shubham', 'Data Scientist', 'Sydney',   5, 5),
            ('Riti', 'Data Analyst', 'Delhi' ,   7, 3),
            ('Shanky', 'Program Manager', 'Delhi' ,   2, 2),
            ('Shreya', 'Graphic Designer', 'Mumbai' ,   2, 2),
            ('Aadi', 'Data Engineering', 'New York', 11, 3)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience', 'Tenure'],
                  index=[0, 1, 2, 3, 4])
print(df)

Contents of the created dataframe are,

      Name       Designation      City  Experience  Tenure
0  Shubham    Data Scientist    Sydney           5       5
1     Riti      Data Analyst     Delhi           7       3
2   Shanky   Program Manager     Delhi           2       2
3   Shreya  Graphic Designer    Mumbai           2       2
4     Aadi  Data Engineering  New York          11       3

Now, let’s look at different ways in which we could add a new column based on values from other columns in this DataFrame.

Advertisements

Add new Column based on other Columns using basic functions (mean, max, etc.)

In this scenario, we need to calculate the row-wise average of two columns and store it as a new column. Let’s quickly try this by taking the average of columns “Experience” and “Tenure” and saving it as a new column named “avg_exp_tenure” in the existing DataFrame.

# avg of experience and tenure
df['avg_exp_tenure'] = df[['Experience', 'Tenure']].mean(axis=1)

print (df)

Output

      Name       Designation      City  Experience  Tenure  avg_exp_tenure
0  Shubham    Data Scientist    Sydney           5       5             5.0
1     Riti      Data Analyst     Delhi           7       3             5.0
2   Shanky   Program Manager     Delhi           2       2             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0
4     Aadi  Data Engineering  New York          11       3             7.0

Here you go, we have the average of two columns stored as a new column. We can use other basic operators such as mean, median, sum, etc. in the same way.

Add new Column based on other Columns using custom defined function (apply() method)

In most cases, we need to define a custom function to replicate the business understanding. Let’s take an example here, we need to create a new column containing the weighted score, by giving 0.75 weight to column “Experience” and 0.25 weight to column “Tenure”.

# weighted score of experience and tenure
df['weighted_score'] = df.apply(lambda row: 0.75*row['Experience'] + 0.25*row['Tenure'], axis=1)

print (df)

Output

      Name       Designation      City  Experience  Tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0
1     Riti      Data Analyst     Delhi           7       3             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0
4     Aadi  Data Engineering  New York          11       3             9.0

In case, we have more complicated logic, feel free to create a separate function with all the conditions. For example, let’s say we want to tweak the weighted score based on the “City” information.

def weighted_score(row):
    if row['City'] in ['Delhi', 'Mumbai']:
        return 0.75*row['Experience'] + 0.25*row['Tenure']

    else:
        return 0.5*row['Experience'] + 0.5*row['Tenure']

df['weighted_score'] = df.apply(weighted_score, axis=1)

print (df)

Output

      Name       Designation      City  Experience  Tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0
1     Riti      Data Analyst     Delhi           7       3             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0
4     Aadi  Data Engineering  New York          11       3             7.0

In the above example, we are doing a row-wise operation by checking the “City” information in each row and then applying the required operation. Finally, returning the pandas Series which is stored as the new column in the existing DataFrame.

Add new Column based on other Columns using numpy.where() function

The numpy.where() is a little convenient approach to achieve the most task in cases where there are not too many conditions. It works similar to the IF statements in excel, let’s quickly try to replicate the above scenario.

# weighted score of experience and tenure
df['weighted_score'] = np.where(df['City'].isin(['Delhi', 'Mumbai']), # condition
                                0.75*df['Experience'] + 0.25*df['Tenure'], # operation if the condition is TRUE
                                0.5*df['Experience'] + 0.5*df['Tenure']) # operation if the condition is FALSE

print (df)

Output

      Name       Designation      City  Experience  Tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0
1     Riti      Data Analyst     Delhi           7       3             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0
4     Aadi  Data Engineering  New York          11       3             9.0

As observed, the output is the same as the above apply function and is achieved by using a one-liner statement. However, in case of complicated conditions, it is recommended to use the apply function for cleaner code.

Add new Column based on other Columns using custom defined function (vectorize method)

All the above approaches work on element-wise operations which are computationally heavy and take more time. Let’s look at a vectorize approach using the np.select() function.

In np.select, we need to define the conditions (list) and their corresponding outputs (list). Both the conditions and outputs are passed as argument are passed in the np.select() as arguments along with a default condition in case none of the conditions matches.

Let’s try to create a new column “Department” based on the conditions on the “Designation” column.

# define condition
conditions = [
    df['Designation'].str.contains("Data"), # condition1
    df['Designation'].str.contains("Program"), # condition2
    df['Designation'].str.contains("Designer"), # condition3
]

# define the corresponding output
output = ['Data', 'PMO', 'Graphics']

df['Department'] = pd.Series(np.select(conditions, output, default = 'Others'))
print (df)

Output

      Name       Designation      City  Experience  Tenure Department
0  Shubham    Data Scientist    Sydney           5       5       Data
1     Riti      Data Analyst     Delhi           7       3       Data
2   Shanky   Program Manager     Delhi           2       2        PMO
3   Shreya  Graphic Designer    Mumbai           2       2   Graphics
4     Aadi  Data Engineering  New York          11       3       Data

Note that the same can be achieved using apply or np.where function as well, but they will be more time consuming that the np.select function.

The complete example containing all the techniques is as follows,

import pandas as pd
import numpy as np

# List of Tuples
employees= [('Shubham', 'Data Scientist', 'Sydney',   5, 5),
            ('Riti', 'Data Analyst', 'Delhi' ,   7, 3),
            ('Shanky', 'Program Manager', 'Delhi' ,   2, 2),
            ('Shreya', 'Graphic Designer', 'Mumbai' ,   2, 2),
            ('Aadi', 'Data Engineering', 'New York', 11, 3)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Designation', 'City', 'Experience', 'Tenure'],
                  index=[0, 1, 2, 3, 4])
print(df)

print("** Add new Column based on other Columns using basic functions ***")

# avg of experience and tenure
df['avg_exp_tenure'] = df[['Experience', 'Tenure']].mean(axis=1)

print (df)

print("** Add new Column based on other Columns using custom defined function ***")

# weighted score of experience and tenure
df['weighted_score'] = df.apply(lambda row: 0.75*row['Experience'] + 0.25*row['Tenure'], axis=1)

print (df)

print("** Another example ***")

def weighted_score(row):
    if row['City'] in ['Delhi', 'Mumbai']:
        return 0.75*row['Experience'] + 0.25*row['Tenure']

    else:
        return 0.5*row['Experience'] + 0.5*row['Tenure']

df['weighted_score'] = df.apply(weighted_score, axis=1)

print (df)

print("** Add new Column based on other Columns using numpy.where() ***")

# weighted score of experience and tenure
df['weighted_score'] = np.where(df['City'].isin(['Delhi', 'Mumbai']), # condition
                                0.75*df['Experience'] + 0.25*df['Tenure'], # operation if the condition is TRUE
                                0.5*df['Experience'] + 0.5*df['Tenure']) # operation if the condition is FALSE

print (df)

print("** Add new Column based on other Columns using vectorization ***")

# define condition
conditions = [
    df['Designation'].str.contains("Data"), # condition1
    df['Designation'].str.contains("Program"), # condition2
    df['Designation'].str.contains("Designer"), # condition3
]

# define the corresponding output
output = ['Data', 'PMO', 'Graphics']

df['Department'] = pd.Series(np.select(conditions, output, default = 'Others'))
print (df)

Output

      Name       Designation      City  Experience  Tenure
0  Shubham    Data Scientist    Sydney           5       5
1     Riti      Data Analyst     Delhi           7       3
2   Shanky   Program Manager     Delhi           2       2
3   Shreya  Graphic Designer    Mumbai           2       2
4     Aadi  Data Engineering  New York          11       3

** Add new Column based on other Columns using basic functions ***

      Name       Designation      City  Experience  Tenure  avg_exp_tenure
0  Shubham    Data Scientist    Sydney           5       5             5.0
1     Riti      Data Analyst     Delhi           7       3             5.0
2   Shanky   Program Manager     Delhi           2       2             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0
4     Aadi  Data Engineering  New York          11       3             7.0


** Add new Column based on other Columns using custom defined function ***

      Name       Designation      City  Experience  Tenure  avg_exp_tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0             5.0
1     Riti      Data Analyst     Delhi           7       3             5.0             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0             2.0
4     Aadi  Data Engineering  New York          11       3             7.0             9.0

** Another example ***

      Name       Designation      City  Experience  Tenure  avg_exp_tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0             5.0
1     Riti      Data Analyst     Delhi           7       3             5.0             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0             2.0
4     Aadi  Data Engineering  New York          11       3             7.0             7.0


** Add new Column based on other Columns using numpy.where() ***

      Name       Designation      City  Experience  Tenure  avg_exp_tenure  weighted_score
0  Shubham    Data Scientist    Sydney           5       5             5.0             5.0
1     Riti      Data Analyst     Delhi           7       3             5.0             6.0
2   Shanky   Program Manager     Delhi           2       2             2.0             2.0
3   Shreya  Graphic Designer    Mumbai           2       2             2.0             2.0
4     Aadi  Data Engineering  New York          11       3             7.0             7.0


** Add new Column based on other Columns using vectorization ***

      Name       Designation      City  Experience  Tenure  avg_exp_tenure  weighted_score Department
0  Shubham    Data Scientist    Sydney           5       5             5.0             5.0       Data
1     Riti      Data Analyst     Delhi           7       3             5.0             6.0       Data
2   Shanky   Program Manager     Delhi           2       2             2.0             2.0        PMO
3   Shreya  Graphic Designer    Mumbai           2       2             2.0             2.0   Graphics
4     Aadi  Data Engineering  New York          11       3             7.0             7.0       Data

Summary

Great, you made it! In this article, we have discussed multiple ways to create a new column based on the values from other columns.

Pandas Tutorials -Learn Data Analysis with Python

   

Are you looking to make a career in Data Science with Python?

Data Science is the future, and the future is here now. Data Scientists are now the most sought-after professionals today. To become a good Data Scientist or to make a career switch in Data Science one must possess the right skill set. We have curated a list of Best Professional Certificate in Data Science with Python. These courses will teach you the programming tools for Data Science like Pandas, NumPy, Matplotlib, Seaborn and how to use these libraries to implement Machine learning models.

Checkout the Detailed Review of Best Professional Certificate in Data Science with Python.

Remember, Data Science requires a lot of patience, persistence, and practice. So, start learning today.

Join a LinkedIn Community of Python Developers

Leave a Comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Scroll to Top