When applying statistical or machine learning models, we are often told to normalize the data before fitting. Normalization transforms numerical variables to a common scale. There are several types of normalization; in this tutorial we discuss mean normalization, defined below. Note that this particular formula, which divides by the standard deviation, is also widely known as standardization or z-score normalization.
Mean normalization = (values - mean(values))/std(values)
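As a quick worked example of the formula, here is a minimal sketch on three hand-picked values. The values and variable names are made up for illustration, and the sketch relies on the pandas default for std(), which is the sample standard deviation.

```python
import pandas as pd

# hypothetical toy values, chosen so the arithmetic is easy to follow
values = pd.Series([2.0, 4.0, 6.0])

mean = values.mean()  # (2 + 4 + 6) / 3 = 4.0
std = values.std()    # sample std (ddof=1): sqrt((4 + 0 + 4) / 2) = 2.0

# apply the formula: (values - mean(values)) / std(values)
normalized = (values - mean) / std
print(normalized.tolist())  # [-1.0, 0.0, 1.0]
```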
Preparing Dataset for solution
To get started quickly, let’s create a sample DataFrame to experiment with. We’ll use the pandas library with some random data.
import pandas as pd
import numpy as np

# create a DataFrame with random integers
df = pd.DataFrame(np.random.randint(0, 50000, size=(6, 4)), columns=list('ABCD'))

# print df
print(df)
The contents of the created DataFrame are shown below. Since the data is random, your values will differ.
       A      B      C      D
0   9546   8717  36607  25438
1  12304  48707  42890  34911
2  28405  26525  40570  19785
3  29137  16876  11103  19178
4  45744  48931  36715  14560
5  29980  22805  40708  27517
Normalizing the entire DataFrame
When all the DataFrame columns are numerical and we want to normalize all of them, it is easy to apply the formula to the entire DataFrame at once. Let’s try it on the sample DataFrame above.
# normalizing the entire DataFrame
df_norm = (df - df.mean()) / df.std()
print(df_norm)
Output
          A         B         C         D
0 -1.229528 -1.202905  0.155400  0.258530
1 -1.021574  1.197124  0.685609  1.565972
2  0.192447 -0.134145  0.489829 -0.521684
3  0.247640 -0.713237 -1.996827 -0.605461
4  1.499813  1.210567  0.164514 -1.242827
5  0.311202 -0.357404  0.501475  0.545469
The output shows the normalized values stored in the new DataFrame (df_norm).
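One way to sanity-check the result is to confirm that each normalized column has a mean of roughly 0 and a standard deviation of 1. The sketch below rebuilds a random DataFrame of the same shape; the seed and the use of NumPy's default_rng generator are assumptions made only so the example is reproducible.

```python
import numpy as np
import pandas as pd

# rebuild a random DataFrame like the sample one (seed 0 is an arbitrary choice)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 50000, size=(6, 4)), columns=list("ABCD"))

# same formula as above
df_norm = (df - df.mean()) / df.std()

# after normalization, every column should have mean ~0 and std ~1
col_means = df_norm.mean()
col_stds = df_norm.std()
print(np.allclose(col_means, 0.0))  # True
print(np.allclose(col_stds, 1.0))   # True
```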
Using Scikit-learn normalization
Scikit-learn is a very popular library for machine learning tasks. To install it, use the following command (note that the package name on PyPI is scikit-learn, while the import name is sklearn):
pip3 install scikit-learn
It also provides a class, StandardScaler, to normalize the columns of a DataFrame. Let’s normalize the same data again, this time using scikit-learn.
# using scikit preprocessing
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_norm = scaler.fit_transform(df)
print(pd.DataFrame(df_norm, columns=df.columns))
Output
          A         B         C         D
0 -1.346880 -1.317717  0.170232  0.283206
1 -1.119078  1.311384  0.751047  1.715437
2  0.210815 -0.146949  0.536581 -0.571476
3  0.271276 -0.781312 -2.187414 -0.663249
4  1.642963  1.326110  0.180216 -1.361449
5  0.340905 -0.391516  0.549338  0.597532
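The fit_transform method returns a NumPy array, which we convert back into a DataFrame as shown above. Note that the values differ slightly from the first method. This is because StandardScaler divides by the population standard deviation (ddof=0), a biased estimator, whereas the pandas std() method defaults to the sample standard deviation (ddof=1). With six rows, the two results differ by a constant factor of sqrt(6/5) ≈ 1.095. On larger datasets the difference becomes negligible and is unlikely to affect model performance.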
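The difference between the two outputs can be reproduced with pandas alone by changing the ddof argument of std(). The following is a minimal sketch using column A of the sample DataFrame; it does not call scikit-learn, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# column A from the sample DataFrame above
df = pd.DataFrame({"A": [9546, 12304, 28405, 29137, 45744, 29980]})
n = len(df)

# pandas default: sample standard deviation (ddof=1)
z_sample = (df - df.mean()) / df.std()

# population standard deviation (ddof=0), the convention StandardScaler follows
z_population = (df - df.mean()) / df.std(ddof=0)

# the two results differ by a constant factor of sqrt(n / (n - 1))
ratio = (z_population / z_sample)["A"]
print(np.allclose(ratio, np.sqrt(n / (n - 1))))  # True
```

If the pandas result needs to match scikit-learn exactly, passing ddof=0 to std() is one option.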
Using custom function
An alternate way is to write a custom function for normalization and then use the apply function to execute it over DataFrame columns. Below, we have created a function “normalize_data”, which contains the same logic.
# custom function for normalization
def normalize_data(d):
    d = (d - d.mean()) / d.std()
    return d

print(df.apply(normalize_data, axis=0))
Output
          A         B         C         D
0 -1.229528 -1.202905  0.155400  0.258530
1 -1.021574  1.197124  0.685609  1.565972
2  0.192447 -0.134145  0.489829 -0.521684
3  0.247640 -0.713237 -1.996827 -0.605461
4  1.499813  1.210567  0.164514 -1.242827
5  0.311202 -0.357404  0.501475  0.545469
As expected, the output is the same as with the first method. A custom function also lets us add whatever other customization is needed.
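As one illustration of such customization, the sketch below skips non-numeric columns and avoids dividing by zero when a column is constant. The function name normalize_numeric and the sample data are hypothetical, and centering a constant column instead of scaling it is just one possible design choice.

```python
import pandas as pd

def normalize_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Normalize numeric columns; leave all other columns untouched."""
    result = frame.copy()
    numeric_cols = result.select_dtypes(include="number").columns
    for col in numeric_cols:
        std = result[col].std()
        if std == 0 or pd.isna(std):
            # constant (or single-row) column: only center it, to avoid dividing by zero
            result[col] = result[col] - result[col].mean()
        else:
            result[col] = (result[col] - result[col].mean()) / std
    return result

# hypothetical sample with a constant column and a text column
sample = pd.DataFrame({
    "A": [2.0, 4.0, 6.0],
    "B": [5.0, 5.0, 5.0],
    "label": ["x", "y", "z"],
})
out = normalize_numeric(sample)
print(out)
```

Here column A is scaled as usual, column B becomes all zeros, and the label column is returned unchanged.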
Summary
In this article, we have discussed multiple ways to normalize columns in a pandas DataFrame. Thanks.