In this article, we will discuss how to find the largest file in a directory and its sub-directories using Python.
Table of contents
- Find the largest file in a directory using Python
- Find the largest file in a directory and its sub-directories using Python
Find the largest file in a directory using Python
In Python, the glob module provides the glob() function, which finds files and directories whose paths match a given pattern. The pattern follows Unix shell-style path expansion rules, so we can use wildcards such as *, ? and [ ] (but not full regular expressions) to match some or all of the entries in a directory. We will use glob() to get a list of everything in a directory and then look for the largest file in that list. The steps are as follows:
- Get a list of all files & directories in a given directory using glob().
- Filter the list and select files only, using the filter() and os.path.isfile() functions.
- Find the file with the maximum size using the max() function, passing lambda x: os.stat(x).st_size as its key argument.
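The steps above can be sketched one at a time on a throwaway directory, so the result is easy to check by eye. This is only an illustration: the file names, the sizes and the sub-directory name are made up, and tempfile is used so the snippet runs on any machine.

```python
import glob
import os
import tempfile

# Build a temporary directory with three files of known size
# and one sub-directory (all names and sizes are hypothetical).
with tempfile.TemporaryDirectory() as dir_name:
    for name, size in [("small.txt", 10), ("medium.txt", 200), ("large.bin", 3000)]:
        with open(os.path.join(dir_name, name), "wb") as f:
            f.write(b"x" * size)
    os.mkdir(os.path.join(dir_name, "subdir"))

    # Step 1: everything in the directory (files and sub-directories).
    entries = glob.glob(os.path.join(dir_name, "*"))

    # Step 2: keep regular files only.
    files = list(filter(os.path.isfile, entries))

    # Step 3: pick the file whose size in bytes is the largest.
    largest = max(files, key=lambda x: os.stat(x).st_size)
    largest_name = os.path.basename(largest)
    largest_size = os.stat(largest).st_size

print(largest_name, largest_size)  # large.bin 3000
```

Here glob() returns four entries, filter() drops the sub-directory, and max() compares the remaining three paths by the value the key function returns for each of them.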
A complete example that searches for the largest file in a directory is as follows:
import glob
import os

dir_name = 'C:/Program Files/Java/jdk1.8.0_191/'

# Get list of files in a directory
list_of_files = filter(os.path.isfile, glob.glob(dir_name + '*'))

# Find the file with max size from the list of files
max_file = max(list_of_files, key=lambda x: os.stat(x).st_size)

print('Max File: ', max_file)
print('Max File size in bytes: ', os.stat(max_file).st_size)
Output:
Max File: C:/Program Files/Java/jdk1.8.0_191\src.zip
Max File size in bytes: 21245025
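One edge case is worth keeping in mind: if the directory contains no regular files, max() receives an empty sequence and raises a ValueError. A minimal sketch of one way to guard against that is shown below. The helper name largest_file_in is hypothetical, and os.path.getsize() is used as a shorthand for os.stat(x).st_size.

```python
import glob
import os
import tempfile


def largest_file_in(dir_name):
    """Return (path, size_in_bytes) of the largest file directly inside
    dir_name, or None if it contains no regular files.

    Hypothetical helper, not part of the original example.
    """
    # glob.escape() keeps characters such as '[' in dir_name from being
    # interpreted as wildcards.
    pattern = os.path.join(glob.escape(dir_name), "*")
    files = filter(os.path.isfile, glob.glob(pattern))
    # default=None avoids the ValueError that max() raises on empty input.
    path = max(files, key=os.path.getsize, default=None)
    if path is None:
        return None
    return path, os.path.getsize(path)


# An empty temporary directory has no largest file.
with tempfile.TemporaryDirectory() as empty_dir:
    result_for_empty = largest_file_in(empty_dir)

print(result_for_empty)  # None
```

Returning None is just one possible convention; raising a clearer exception would work equally well, depending on how the caller wants to handle an empty directory.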
This solution built a list of the files in a folder and then selected the one with the largest size. However, it looked only in the given directory; it did not look inside its sub-directories or the directories nested within them. What if we want to find the largest file in the complete directory hierarchy, even if it sits several levels deep inside the given directory? Let’s see how to do that.
Find the largest file in a directory and its sub-directories (recursively)
In the previous example we searched for the largest file in a single directory, without descending into nested directories. To find the largest file in the complete directory hierarchy, check out this example:
import glob
import os

dir_name = 'C:/Program Files/Java/jdk1.8.0_191/'

# Get list of files in a directory & sub-directories.
# dir_name already ends with '/', so the pattern needs no leading slash.
list_of_files = filter(os.path.isfile, glob.glob(dir_name + '**/*', recursive=True))

# Find the file with max size from the list of files
max_file = max(list_of_files, key=lambda x: os.stat(x).st_size)

print('Max File: ', max_file)
print('Max File size in bytes: ', os.stat(max_file).st_size)
Output:
Max File: C:/Program Files/Java/jdk1.8.0_191\jre\lib\rt.jar
Max File size in bytes: 63596151
We called the glob() function with a pattern ending in ‘**/*’ and the recursive=True argument. With recursive=True, ‘**’ matches zero or more levels of directories, so the call returned every file and directory in the given directory and in all of its sub-directories. Then, using the filter() and os.path.isfile() functions, we dropped the directories and kept the file paths only. Finally, applying the max() function with the key lambda x: os.stat(x).st_size gave us the largest file.
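To see the difference between the two searches side by side, here is a small sketch that runs both on a temporary directory tree. The layout (top.txt, a/b/deep.bin), the sizes and the helper name largest are all made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: a small file at the top level and a
    # bigger file two levels down.
    nested_dir = os.path.join(root, "a", "b")
    os.makedirs(nested_dir)
    with open(os.path.join(root, "top.txt"), "wb") as f:
        f.write(b"x" * 100)
    with open(os.path.join(nested_dir, "deep.bin"), "wb") as f:
        f.write(b"x" * 5000)

    def largest(pattern, recursive):
        # Same filter() + max() combination as in the examples above.
        files = filter(os.path.isfile, glob.glob(pattern, recursive=recursive))
        return max(files, key=lambda x: os.stat(x).st_size)

    # Top-level search only sees top.txt (and the directory "a").
    flat = largest(os.path.join(root, "*"), recursive=False)
    # Recursive search also reaches a/b/deep.bin.
    deep = largest(os.path.join(root, "**", "*"), recursive=True)
    flat_name = os.path.basename(flat)
    deep_name = os.path.basename(deep)

print(flat_name)  # top.txt
print(deep_name)  # deep.bin
```

Note that glob patterns skip names starting with a dot by default, so hidden files are not considered by either search.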
Summary:
We learned how to search for the largest file in a directory, and in all of its sub-directories, using Python.