How to normalize columns in a Pandas DataFrame?

When we apply statistical or machine learning models, we are often advised to normalize the data before fitting. Normalization rescales numerical variables to a common scale. There are several kinds of normalization; this tutorial covers the one given by the formula below, which is usually called standardization or z-score normalization (it is sometimes loosely called mean normalization).

Normalized values = (values - mean(values)) / std(values)
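As a quick illustration of the formula, here is a minimal sketch on a tiny hand-made Series. The values and variable names are made up for this example and are not part of the tutorial's dataset.

```python
import pandas as pd

# Illustrative values chosen so the arithmetic is easy to follow
values = pd.Series([2.0, 4.0, 6.0])

mean = values.mean()   # 4.0
std = values.std()     # 2.0 (pandas defaults to the sample standard deviation, ddof=1)

# Apply the formula: (values - mean) / std
normalized = (values - mean) / std
print(normalized.tolist())  # [-1.0, 0.0, 1.0]
```

After the transformation the values are centered on 0 and have a standard deviation of 1.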

Preparing Dataset for solution

To get started quickly, let’s create a sample DataFrame to experiment with. We’ll use pandas and NumPy to fill it with random integers. Because no random seed is set, your values will differ from the ones shown below.

import pandas as pd
import numpy as np

# create a DataFrame with random integers
df = pd.DataFrame(np.random.randint(0,50000,size=(6, 4)), columns=list('ABCD'))

# print df
print(df)

The contents of the created DataFrame are:

       A      B      C      D
0   9546   8717  36607  25438
1  12304  48707  42890  34911
2  28405  26525  40570  19785
3  29137  16876  11103  19178
4  45744  48931  36715  14560
5  29980  22805  40708  27517

Normalizing the entire DataFrame

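When all the DataFrame columns are numerical and we want to normalize all of them, we can apply the formula to the entire DataFrame at once. pandas aligns df.mean() and df.std() with the columns by label, so each column is scaled by its own mean and standard deviation. Let’s try it on the sample DataFrame above.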

# normalizing the entire DataFrame
df_norm = (df-df.mean())/df.std()

print(df_norm)

Output

          A         B         C         D
0 -1.229528 -1.202905  0.155400  0.258530
1 -1.021574  1.197124  0.685609  1.565972
2  0.192447 -0.134145  0.489829 -0.521684
3  0.247640 -0.713237 -1.996827 -0.605461
4  1.499813  1.210567  0.164514 -1.242827
5  0.311202 -0.357404  0.501475  0.545469

The output shows the normalized values stored in the new DataFrame (df_norm).
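If a DataFrame also contains non-numerical columns, the one-line formula cannot be applied to everything. The sketch below shows one possible way to handle that case by normalizing only the numeric columns selected with select_dtypes. The DataFrame `mixed` and its column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical DataFrame with one text column and two numeric columns
mixed = pd.DataFrame({
    "city": ["X", "Y", "Z"],
    "A": [2.0, 4.0, 6.0],
    "B": [10.0, 20.0, 60.0],
})

# Pick out only the numeric columns
numeric_cols = mixed.select_dtypes(include="number").columns

# Normalize the numeric columns and leave the rest untouched
mixed_norm = mixed.copy()
mixed_norm[numeric_cols] = (
    mixed[numeric_cols] - mixed[numeric_cols].mean()
) / mixed[numeric_cols].std()

print(mixed_norm)
```

The text column is carried over unchanged, while each numeric column ends up with mean 0 and standard deviation 1.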

Using Scikit-learn normalization

Scikit-learn is a popular library for machine learning tasks. To install it, use the following command:

pip3 install scikit-learn

It also provides a class, StandardScaler, that normalizes the columns of a DataFrame. Let’s normalize the same data again, this time with scikit-learn.

# using scikit preprocessing
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_norm = scaler.fit_transform(df)
print(pd.DataFrame(df_norm, columns=df.columns))

Output

          A         B         C         D
0 -1.346880 -1.317717  0.170232  0.283206
1 -1.119078  1.311384  0.751047  1.715437
2  0.210815 -0.146949  0.536581 -0.571476
3  0.271276 -0.781312 -2.187414 -0.663249
4  1.642963  1.326110  0.180216 -1.361449
5  0.340905 -0.391516  0.549338  0.597532

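The fit_transform method returns a NumPy array, which we convert back into a DataFrame as above. Note that the values differ from those of the first method. This is because StandardScaler divides by the population standard deviation (the biased estimator, ddof=0), whereas pandas’ std() defaults to the sample standard deviation (ddof=1). The two results differ by a constant factor of sqrt(n/(n-1)), which is about 1.095 for our 6 rows and approaches 1 as the number of rows grows, so on datasets of realistic size the difference is unlikely to affect model performance.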
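The source of the difference can be checked with pandas alone. The sketch below rebuilds the sample DataFrame from the values printed earlier (the name `df_fixed` is introduced only for this example) and normalizes it with both ddof=1, the pandas default, and ddof=0, the population form that StandardScaler uses.

```python
import numpy as np
import pandas as pd

# The sample data printed earlier in the article, hard-coded so the run is reproducible
df_fixed = pd.DataFrame({
    "A": [9546, 12304, 28405, 29137, 45744, 29980],
    "B": [8717, 48707, 26525, 16876, 48931, 22805],
    "C": [36607, 42890, 40570, 11103, 36715, 40708],
    "D": [25438, 34911, 19785, 19178, 14560, 27517],
})

# Sample standard deviation (ddof=1): matches the plain pandas formula
sample_norm = (df_fixed - df_fixed.mean()) / df_fixed.std()

# Population standard deviation (ddof=0): matches the scikit-learn output
population_norm = (df_fixed - df_fixed.mean()) / df_fixed.std(ddof=0)

# Every entry differs by the same factor, sqrt(n / (n - 1))
n = len(df_fixed)
ratio = population_norm / sample_norm

print(population_norm.round(6))
print(ratio.round(6))
print(np.sqrt(n / (n - 1)))
```

Passing ddof=0 to std() is therefore enough to make the pandas formula agree with the scikit-learn result.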

Using custom function

An alternative is to write a custom function for normalization and then use the apply function to run it on each DataFrame column. Below, we have created a function “normalize_data”, which contains the same logic.

# custom function for normalization
def normalize_data(d):
    d = (d - d.mean())/d.std()
    return d

print(df.apply(normalize_data, axis=0))

Output

          A         B         C         D
0 -1.229528 -1.202905  0.155400  0.258530
1 -1.021574  1.197124  0.685609  1.565972
2  0.192447 -0.134145  0.489829 -0.521684
3  0.247640 -0.713237 -1.996827 -0.605461
4  1.499813  1.210567  0.164514 -1.242827
5  0.311202 -0.357404  0.501475  0.545469

As expected, the output is the same as in the first method, since the function applies the same formula column by column. A custom function also leaves room for whatever other customization is needed.
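As one example of such a customization, the sketch below guards against constant columns, whose standard deviation is 0 and which would otherwise turn into NaN. The function name `normalize_data_safe` and the `demo` DataFrame are hypothetical, and returning the centered values for a constant column is just one possible choice.

```python
import pandas as pd

def normalize_data_safe(col):
    """Normalize a column, but only center it if its standard deviation is 0 or undefined."""
    std = col.std()
    if pd.isna(std) or std == 0:
        # Constant (or single-value) column: dividing would give NaN, so only center it
        return col - col.mean()
    return (col - col.mean()) / std

# Hypothetical data with one ordinary column and one constant column
demo = pd.DataFrame({"A": [2.0, 4.0, 6.0], "const": [5.0, 5.0, 5.0]})

# The plain formula produces NaN for the constant column (0 divided by 0)
plain = (demo["const"] - demo["const"].mean()) / demo["const"].std()
print(plain.isna().all())

# The guarded version returns zeros for it instead
result = demo.apply(normalize_data_safe, axis=0)
print(result)
```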

Summary

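In this article, we discussed three ways to normalize columns in a pandas DataFrame: applying the formula directly to the DataFrame, using scikit-learn’s StandardScaler, and applying a custom function to each column.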
