How to shuffle DataFrame rows in Pandas?

In this article, we will discuss how to shuffle DataFrame rows in Pandas. Shuffling rows is commonly done to randomize a dataset before using it to train a Machine Learning model.

Table Of Contents

Preparing DataSet
Method 1: Using pandas.DataFrame.sample() function
Method 2: Using shuffle from sklearn
Method 3: Using permutation from NumPy
Summary

Preparing DataSet

To quickly get started, let’s create a sample DataFrame to experiment with. We’ll use the pandas library with some sample employee data.

import pandas as pd
import numpy as np

# List of tuples
employees = [('Shubham', 'India', 'Tech', 5),
             ('Riti', 'India', 'Tech', 7),
             ('Shanky', 'India', 'PMO', 2),
             ('Shreya', 'India', 'Design', 2),
             ('Aadi', 'US', 'Tech', 11),
             ('Sim', 'US', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])
print(df)

The contents of the created DataFrame are:

      Name Location    Team  Experience
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
4     Aadi       US    Tech          11
5      Sim       US    Tech           4

Method 1: Using pandas.DataFrame.sample() function

The sample() function from pandas is generally used to pick a random sample of rows from a dataset. We can also use it to shuffle the rows by setting the “frac” parameter to 1. The “frac” parameter specifies the fraction of rows to return in the random sample, so a value of 1 returns all the rows, in random order.

# Shuffle all rows; random_state makes the result reproducible
print(df.sample(frac=1, random_state=2022))

Output

      Name Location    Team  Experience
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
4     Aadi       US    Tech          11
5      Sim       US    Tech           4

As observed, the DataFrame rows are now shuffled in random order, and each row keeps its original index label. We passed “random_state” so that the same shuffled order is reproduced whenever the same code is run again.
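
A shuffled DataFrame keeps its original index labels, as the output above shows. If a fresh index from 0 to n-1 is wanted, one common follow-up is to chain reset_index(drop=True) after sample(). The sketch below rebuilds the sample data so that it runs on its own; the variable name shuffled is only illustrative.

```python
import pandas as pd

employees = [('Shubham', 'India', 'Tech', 5),
             ('Riti', 'India', 'Tech', 7),
             ('Shanky', 'India', 'PMO', 2),
             ('Shreya', 'India', 'Design', 2),
             ('Aadi', 'US', 'Tech', 11),
             ('Sim', 'US', 'Tech', 4)]
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])

# Shuffle all rows, then discard the old index labels
# so the result is numbered 0..n-1 again
shuffled = df.sample(frac=1, random_state=2022).reset_index(drop=True)
print(shuffled)
```

Without drop=True, reset_index() would keep the old labels as an extra column named index, which is sometimes useful for tracing rows back to their original position.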


Method 2: Using shuffle from sklearn

The sklearn.utils module also provides a shuffle() function that can shuffle a pandas DataFrame. Let’s use it to shuffle the original DataFrame again. Since no random_state is passed here, the order will differ from run to run.

# Import shuffle from scikit-learn
from sklearn.utils import shuffle

# Shuffle rows
print(shuffle(df))

Output

      Name Location    Team  Experience
5      Sim       US    Tech           4
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
4     Aadi       US    Tech          11
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
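
Like sample(), the sklearn shuffle() function accepts a random_state argument. The following is a minimal sketch, assuming scikit-learn is installed, that checks whether two calls with the same seed give the same row order. The seed value 42 and the names first and second are arbitrary choices made for illustration.

```python
import pandas as pd
from sklearn.utils import shuffle

employees = [('Shubham', 'India', 'Tech', 5),
             ('Riti', 'India', 'Tech', 7),
             ('Shanky', 'India', 'PMO', 2),
             ('Shreya', 'India', 'Design', 2),
             ('Aadi', 'US', 'Tech', 11),
             ('Sim', 'US', 'Tech', 4)]
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])

# Two shuffles with the same seed (42 is an arbitrary choice)
first = shuffle(df, random_state=42)
second = shuffle(df, random_state=42)

# Compare the two results row by row
print(first.equals(second))
```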

Method 3: Using permutation from NumPy

Another interesting way to shuffle the DataFrame rows is using the numpy.random.permutation() function. When given an integer n, it returns a single randomly ordered array of the integers from 0 to n-1 (it does not generate all possible permutations). Here, we pass the DataFrame length to get a random ordering of the row positions, and then use iloc to select the rows in that order.

# shuffle using permutation function
print(df.iloc[np.random.permutation(len(df))])

Output

      Name Location    Team  Experience
5      Sim       US    Tech           4
1     Riti    India    Tech           7
0  Shubham    India    Tech           5
4     Aadi       US    Tech          11
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
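
The call above is not seeded, so its output changes on every run. One possible way to make the NumPy approach reproducible is to draw the permutation from a seeded Generator created with np.random.default_rng(). This is a sketch rather than the only option; the seed 2022 and the names rng, order and shuffled are illustrative.

```python
import numpy as np
import pandas as pd

employees = [('Shubham', 'India', 'Tech', 5),
             ('Riti', 'India', 'Tech', 7),
             ('Shanky', 'India', 'PMO', 2),
             ('Shreya', 'India', 'Design', 2),
             ('Aadi', 'US', 'Tech', 11),
             ('Sim', 'US', 'Tech', 4)]
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])

# Seeded random generator (2022 is an arbitrary seed)
rng = np.random.default_rng(seed=2022)

# A random ordering of the row positions 0..len(df)-1
order = rng.permutation(len(df))

# Select the rows in that order
shuffled = df.iloc[order]
print(shuffled)
```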

The complete example is as follows:

import pandas as pd
import numpy as np
from sklearn.utils import shuffle

# List of tuples
employees = [('Shubham', 'India', 'Tech', 5),
             ('Riti', 'India', 'Tech', 7),
             ('Shanky', 'India', 'PMO', 2),
             ('Shreya', 'India', 'Design', 2),
             ('Aadi', 'US', 'Tech', 11),
             ('Sim', 'US', 'Tech', 4)]

# Create a DataFrame object from list of tuples
df = pd.DataFrame(employees,
                  columns=['Name', 'Location', 'Team', 'Experience'])
print(df)

# Shuffle with sample(); random_state makes the result reproducible
print(df.sample(frac=1, random_state=2022))

# Shuffle rows with sklearn
print(shuffle(df))

# shuffle using permutation function
print(df.iloc[np.random.permutation(len(df))])

Output (the last two tables will differ on each run, because those two calls are not seeded):

      Name Location    Team  Experience
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
4     Aadi       US    Tech          11
5      Sim       US    Tech           4

      Name Location    Team  Experience
2   Shanky    India     PMO           2
3   Shreya    India  Design           2
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
4     Aadi       US    Tech          11
5      Sim       US    Tech           4

      Name Location    Team  Experience
0  Shubham    India    Tech           5
1     Riti    India    Tech           7
3   Shreya    India  Design           2
5      Sim       US    Tech           4
2   Shanky    India     PMO           2
4     Aadi       US    Tech          11

      Name Location    Team  Experience
4     Aadi       US    Tech          11
5      Sim       US    Tech           4
3   Shreya    India  Design           2
1     Riti    India    Tech           7
2   Shanky    India     PMO           2
0  Shubham    India    Tech           5

Summary

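In this article, we discussed three ways to shuffle DataFrame rows in pandas: DataFrame.sample() with frac=1, shuffle() from sklearn.utils, and numpy.random.permutation() combined with iloc.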
