In this article we will discuss different ways to count the occurrences of a sub-string in another string and to find their index positions.
Count occurrences of a sub-string in another string using string.count()
Python’s str class contains a method to count the non-overlapping occurrences of a sub-string in the string object, i.e.
str.count(sub[, start[, end]])
It looks for the sub-string sub in the range start to end and returns its occurrence count. If start and end are not provided, it searches the complete string. For example,
mainStr = 'This is a sample string and a sample code. It is very short.'

# Get the occurrence count of sub-string in main string.
count = mainStr.count('sample')

print("'sample' sub string frequency / occurrence count : ", count)
Output:
'sample' sub string frequency / occurrence count : 2
As the sub-string ‘sample’ occurs at 2 places in the main string, count() returned 2.
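The optional start and end arguments described above can be sketched as follows. This is only a minimal illustration; the variable names main_str, total, from_20 and up_to_20 are made up for this example and are not part of the original code.

```python
main_str = 'This is a sample string and a sample code. It is very short.'

# Count over the complete string.
total = main_str.count('sample')

# Count only from index 20 onward; the first 'sample' (index 10) is skipped.
from_20 = main_str.count('sample', 20)

# Count only within [0, 20); the second 'sample' (index 30) is excluded.
up_to_20 = main_str.count('sample', 0, 20)

print(total, from_20, up_to_20)  # 2 1 1
```

Both start and end follow the usual slice convention, so end is exclusive.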
Using Python Regex : Count occurrences of a sub string in string
We can also get the occurrence count using Python regex. For that, we create a regex pattern from the sub-string and then find all matches of that pattern in the main string, i.e.
import re

# Create a Regex pattern to match the substring
regexPattern = re.compile("sample")

# Get a list of strings that matches the given pattern i.e. substring
listOfMatches = regexPattern.findall(mainStr)

print("'sample' sub string frequency / occurrence count : ", len(listOfMatches))
As the sub-string ‘sample’ occurs at 2 places in the main string, the regex pattern is matched at 2 places and a list of those matches is returned. The length of the returned list gives the total occurrence count of the sub-string in the main string.
Output:
'sample' sub string frequency / occurrence count : 2
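One caveat with the regex approach is that the sub-string is interpreted as a pattern, so characters such as '.' have a special meaning. A minimal sketch of how re.escape() can guard against this is shown below; the sample text and variable names are hypothetical and chosen only to make the difference visible.

```python
import re

text = 'Price is 3.5 today, not 315 or 3.5.'
sub = '3.5'

# Unescaped, '.' matches any character, so '315' is counted as well.
unescaped = len(re.findall(sub, text))

# re.escape() makes every character of the sub-string literal.
escaped = len(re.findall(re.escape(sub), text))

print(unescaped, escaped)  # 3 2
```

With the escaped pattern, the result agrees with text.count('3.5').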
Count Overlapping occurrences of a sub-string in another string
The approaches we have seen so far cannot count overlapping sub-strings. Let’s understand this with an example.
Suppose we have a string that contains overlapping occurrences of the sub-string ‘that’, i.e.,
mainStr = 'thathatthat'
Now if we count the occurrence of a sub-string ‘that’ in this string using string.count(),
# string.count() will not be able to count occurrences of overlapping sub-strings
count = mainStr.count('that')
string.count() returns 2, whereas there are 3 overlapping occurrences of ‘that’ in the main string.
As string.count() cannot find the overlapping occurrences of a sub-string, let’s create a function to do this,
def frequencyCount(mainStr, subStr):
    '''
    Find occurrence count of overlapping substrings.
    Start from the left and search for the substring; when found, increment
    the counter and keep searching from the next index position.
    '''
    counter = pos = 0
    while True:
        pos = mainStr.find(subStr, pos)
        if pos > -1:
            counter = counter + 1
            # Move one position ahead so that overlapping matches are found
            pos = pos + 1
        else:
            break
    return counter
Now let’s use this function to find the occurrence count of the overlapping sub-string ‘that’ in the main string,
# count occurrences of overlapping substrings
count = frequencyCount(mainStr, 'that')

print("'that' sub string frequency count : ", count)
Output:
'that' sub string frequency count : 3
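As an alternative to the find() loop above, overlapping occurrences can also be counted with a regex zero-width lookahead. This is a different technique from the function in this article, shown only as a sketch; the name count_overlapping and the handling of an empty sub-string are assumptions made for this example.

```python
import re


def count_overlapping(main_str, sub_str):
    """Count overlapping occurrences using a zero-width lookahead."""
    if not sub_str:
        # Assumption: an empty sub-string is treated as having no matches.
        return 0
    # (?=...) matches at a position without consuming characters, so the
    # next search starts one character later and overlaps are not skipped.
    pattern = re.compile('(?=' + re.escape(sub_str) + ')')
    return sum(1 for _ in pattern.finditer(main_str))


print(count_overlapping('thathatthat', 'that'))  # 3
print(count_overlapping('aaaa', 'aa'))  # 3
```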
Find occurrence count and index positions of a sub-string in another string
Find indices of non-overlapping sub-string in string using Python regex finditer()
Using regex, find all the matches of a sub-string in the main string and iterate over those matches to get their index positions, i.e.
import re

mainStr = 'This is a sample string and a sample code. It is very Short.'

# Create a Regex pattern to match the substring
regexPattern = re.compile('sample')

# Iterate over all the matches of substring using iterator of matchObjects returned by finditer()
iteratorOfMatchObs = regexPattern.finditer(mainStr)
indexPositions = []
count = 0
for matchObj in iteratorOfMatchObs:
    indexPositions.append(matchObj.start())
    count = count + 1

print("Occurrence Count of substring 'sample' : ", count)
print("Index Positions of 'sample' are : ", indexPositions)
Output:
Occurrence Count of substring 'sample' : 2
Index Positions of 'sample' are : [10, 30]
It returns the count and indices of non-overlapping sub-strings only. To find the occurrence count and indices of overlapping sub-strings, let’s modify the function created above.
Find indices of overlapping sub-string in string using Python
def frequencyCountAndPositions(mainStr, subStr):
    '''
    Find occurrence count of overlapping substrings and get their count and index positions.
    Start from the left and search for the substring; when found, increment
    the counter and keep searching from the next index position.
    '''
    counter = pos = 0
    indexpos = []
    while True:
        pos = mainStr.find(subStr, pos)
        if pos > -1:
            indexpos.append(pos)
            counter = counter + 1
            # Move one position ahead so that overlapping matches are found
            pos = pos + 1
        else:
            break
    return (counter, indexpos)
Let’s use this function to find indices of overlapping sub-strings in main string,
mainStr = 'thathatthat'

result = frequencyCountAndPositions(mainStr, 'that')

print("Occurrence Count of overlapping sub-strings 'that' : ", result[0])
print("Index Positions of 'that' are : ", result[1])
Output:
Occurrence Count of overlapping sub-strings 'that' : 3
Index Positions of 'that' are : [0, 3, 7]
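The overlapping and non-overlapping searches differ only in how far the search position advances after each match. The sketch below makes that explicit with a generator; the name iter_positions and its overlapping flag are hypothetical and introduced only for illustration.

```python
def iter_positions(main_str, sub_str, overlapping=True):
    """Yield the start index of each occurrence of sub_str in main_str."""
    if not sub_str:
        # Assumption: an empty sub-string yields nothing.
        return
    # Advance by 1 to allow overlaps, or by the sub-string length to skip them.
    step = 1 if overlapping else len(sub_str)
    pos = main_str.find(sub_str)
    while pos != -1:
        yield pos
        pos = main_str.find(sub_str, pos + step)


print(list(iter_positions('thathatthat', 'that')))  # [0, 3, 7]
print(list(iter_positions('thathatthat', 'that', overlapping=False)))  # [0, 7]
```

Because it is a generator, positions are produced lazily, which may help when only the first few matches of a long string are needed.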
Find nth occurrence of a sub-string in another string
Let’s use the same function frequencyCountAndPositions() to find the nth occurrence of a sub-string in another string i.e.
mainStr = 'This is a sample string and a sample code. It is very Short.'

result = frequencyCountAndPositions(mainStr, 'is')
if result[0] >= 2:
    print("Index Positions of 2nd Occurrence of sub-string 'is' : ", result[1][1])
Output:
Index Positions of 2nd Occurrence of sub-string 'is' : 5
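If only the nth occurrence is needed, collecting every position first is not strictly necessary. The following is a minimal sketch of a helper that stops as soon as the nth match is found; the name find_nth, the 1-based n, and the -1 return value for a missing occurrence are assumptions of this example rather than part of the code above.

```python
def find_nth(main_str, sub_str, n):
    """Return the index of the n-th (1-based) occurrence, or -1 if absent.

    Overlapping occurrences are counted, matching the loop used above.
    """
    if n < 1 or not sub_str:
        return -1
    pos = -1
    for _ in range(n):
        # Continue searching one position after the previous match.
        pos = main_str.find(sub_str, pos + 1)
        if pos == -1:
            return -1
    return pos


main_str = 'This is a sample string and a sample code. It is very Short.'
print(find_nth(main_str, 'is', 2))  # 5
print(find_nth(main_str, 'is', 3))  # 46
print(find_nth(main_str, 'is', 10))  # -1
```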
The complete example is as follows,
import re


def frequencyCount(mainStr, subStr):
    '''
    Find occurrence count of overlapping substrings.
    Start from the left and search for the substring; when found, increment
    the counter and keep searching from the next index position.
    '''
    counter = pos = 0
    while True:
        pos = mainStr.find(subStr, pos)
        if pos > -1:
            counter = counter + 1
            pos = pos + 1
        else:
            break
    return counter


def frequencyCountAndPositions(mainStr, subStr):
    '''
    Find occurrence count of overlapping substrings and get their count and index positions.
    Start from the left and search for the substring; when found, increment
    the counter and keep searching from the next index position.
    '''
    counter = pos = 0
    indexpos = []
    while True:
        pos = mainStr.find(subStr, pos)
        if pos > -1:
            indexpos.append(pos)
            counter = counter + 1
            pos = pos + 1
        else:
            break
    return (counter, indexpos)


def main():

    print(' **** Get occurrence count of a sub string in string using string.count() ****')

    mainStr = 'This is a sample string and a sample code. It is very short.'

    # Get the occurrence count of sub-string in main string.
    count = mainStr.count('sample')

    print("'sample' sub string frequency / occurrence count : ", count)

    print(' **** Get occurrence count of a sub string in string using Python Regex ****')

    # Create a Regex pattern to match the substring
    regexPattern = re.compile("sample")

    # Get a list of strings that matches the given pattern i.e. substring
    listOfMatches = regexPattern.findall(mainStr)

    print("'sample' sub string frequency / occurrence count : ", len(listOfMatches))

    print(' **** Count overlapping sub-strings in the main string ****')

    mainStr = 'thathatthat'

    # string.count() will not be able to count occurrences of overlapping substrings
    count = mainStr.count('that')
    print("'that' sub string frequency count : ", count)

    # count occurrences of overlapping substrings
    count = frequencyCount(mainStr, 'that')
    print("'that' sub string frequency count : ", count)

    print('**** Find Occurrence count and all index position of a sub-string in a String **** ')

    mainStr = 'This is a sample string and a sample code. It is very Short.'

    # Create a Regex pattern to match the substring
    regexPattern = re.compile('sample')

    # Iterate over all the matches of substring using iterator of matchObjects returned by finditer()
    iteratorOfMatchObs = regexPattern.finditer(mainStr)
    indexPositions = []
    count = 0
    for matchObj in iteratorOfMatchObs:
        indexPositions.append(matchObj.start())
        count = count + 1

    print("Occurrence Count of substring 'sample' : ", count)
    print("Index Positions of 'sample' are : ", indexPositions)

    mainStr = 'thathatthat'

    result = frequencyCountAndPositions(mainStr, 'that')

    print("Occurrence Count of sub string 'that' : ", result[0])
    print("Index Positions of 'that' are : ", result[1])

    print('*** Find the nth occurrence of sub-string in a string ****')

    mainStr = 'This is a sample string and a sample code. It is very Short.'

    result = frequencyCountAndPositions(mainStr, 'is')
    if result[0] >= 2:
        print("Index Positions of 2nd Occurrence of sub-string 'is' : ", result[1][1])


if __name__ == '__main__':
    main()
Output:
**** Get occurrence count of a sub string in string using string.count() ****
'sample' sub string frequency / occurrence count : 2
**** Get occurrence count of a sub string in string using Python Regex ****
'sample' sub string frequency / occurrence count : 2
**** Count overlapping sub-strings in the main string ****
'that' sub string frequency count : 2
'that' sub string frequency count : 3
**** Find Occurrence count and all index position of a sub-string in a String ****
Occurrence Count of substring 'sample' : 2
Index Positions of 'sample' are : [10, 30]
Occurrence Count of sub string 'that' : 3
Index Positions of 'that' are : [0, 3, 7]
*** Find the nth occurrence of sub-string in a string ****
Index Positions of 2nd Occurrence of sub-string 'is' : 5