Python string manipulation

Common string methods in Python

join: join an iterable (for instance, a list) of strings, using the string the method is called on as the separator

' / '.join([
    'first',
    'second',
    'third'
])

'first / second / third'

'_joinedBy_'.join([
    'first',
    'second',
    'third'
])

'first_joinedBy_second_joinedBy_third'
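
join only accepts strings, so an iterable that contains other types has to be converted first. A minimal sketch, with made-up values:

```python
# join raises a TypeError on non-string items, so convert each item with str first
values = [2020, 'Covid', 3.5]
joined = ', '.join(str(v) for v in values)
print(joined)
```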

split

Split: the reverse of join, that is, split a long string into smaller pieces around a separator. With no argument, it splits on any run of whitespace (spaces, tabs, newlines).

'this is a long string with 11 parts separated by space'.split()

['this',
 'is',
 'a',
 'long',
 'string',
 'with',
 '11',
 'parts',
 'separated',
 'by',
 'space']

'we_can_also_split_with_other_characters'.split('_')

['we', 'can', 'also', 'split', 'with', 'other', 'characters']

'we_??_can_??_also_??_split_with_??_long_strings'.split('_??_')

['we', 'can', 'also', 'split_with', 'long_strings']
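
Two details are easy to miss: split() with no argument treats a run of whitespace as one separator, while split(' ') splits on every single space, and the optional maxsplit argument limits the number of splits. A small sketch with illustrative strings:

```python
text = 'a  b   c'  # two spaces, then three spaces

# no argument: a run of whitespace counts as a single separator
no_arg = text.split()

# explicit ' ': every single space splits, which produces empty strings
single_space = text.split(' ')

# maxsplit=1: split only at the first separator and keep the rest intact
key, value = 'name=first=second'.split('=', maxsplit=1)

print(no_arg)
print(single_space)
print(key, value)
```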

replace

'why would you replace this?'.replace('this', 'that')

'why would you replace that?'
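
replace substitutes every occurrence by default; an optional third argument limits how many are replaced. A short sketch:

```python
text = 'one fish, two fish, red fish'

all_replaced = text.replace('fish', 'cat')   # every occurrence
first_only = text.replace('fish', 'cat', 1)  # only the first occurrence

print(all_replaced)
print(first_only)
```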

strip, rstrip, lstrip

The strip methods remove whitespace on both sides (strip), on the right only (rstrip), or on the left only (lstrip):

'    please cleanup extra spaces    '.strip()

'please cleanup extra spaces'

'    please cleanup extra spaces, right only    '.rstrip()

'    please cleanup extra spaces, right only'

'    please cleanup extra spaces, left only    '.lstrip()

'please cleanup extra spaces, left only    '
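
The strip methods also accept a string of characters to remove instead of whitespace. The argument is treated as a set of characters, not as an exact prefix or suffix. A minimal sketch:

```python
# remove any leading/trailing '.' or ',' characters; stripping stops at the first other character
stripped = '...,hello, world!,..'.strip('.,')

# without an argument, tabs and newlines are removed too, not only spaces
cleaned = '\t\n text \n'.strip()

print(stripped)
print(cleaned)
```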

casefold, lower, upper

These are Python string methods to change the case of a string, often useful to compare strings while ignoring case.

'I AM SHOUTING no more'.lower()

'i am shouting no more'

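Casefold is like lower, but more aggressive: it also converts some characters that lower leaves unchanged. For plain ASCII text like this, the result is the same.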

'I AM SHOUTING no more'.casefold()

'i am shouting no more'

'Please make this more visible'.upper()

'PLEASE MAKE THIS MORE VISIBLE'

Comparing strings after folding the case.

'I AM SHOUTING'.casefold() == 'I am Shouting'.casefold()

True

'I AM SHOUTING' == 'I am Shouting'

False

Use casefold instead of lower for case-insensitive string comparison.

casefold and lower give the same result except for a few Unicode characters:

# From https://stackoverflow.com/a/74702121/13430450
import sys
import unicodedata as ud

print("Unicode version:", ud.unidata_version, "\n")
total = 0
for codepoint in map(chr, range(sys.maxunicode)):
    lower, casefold = codepoint.lower(), codepoint.casefold()
    if lower != casefold:
        total += 1
        if total < 7:  # only print the first 6 mismatching examples
            for conversion, converted in zip(
                ("origin", "lower", "casefold"),
                (codepoint, lower, casefold)
            ):
                print(conversion, [ud.name(cp) for cp in converted], converted)
            print()
print("Total differences:", total)

Unicode version: 13.0.0 

origin ['MICRO SIGN'] µ
lower ['MICRO SIGN'] µ
casefold ['GREEK SMALL LETTER MU'] μ

origin ['LATIN SMALL LETTER SHARP S'] ß
lower ['LATIN SMALL LETTER SHARP S'] ß
casefold ['LATIN SMALL LETTER S', 'LATIN SMALL LETTER S'] ss

origin ['LATIN SMALL LETTER N PRECEDED BY APOSTROPHE'] ŉ
lower ['LATIN SMALL LETTER N PRECEDED BY APOSTROPHE'] ŉ
casefold ['MODIFIER LETTER APOSTROPHE', 'LATIN SMALL LETTER N'] ʼn

origin ['LATIN SMALL LETTER LONG S'] ſ
lower ['LATIN SMALL LETTER LONG S'] ſ
casefold ['LATIN SMALL LETTER S'] s

origin ['LATIN SMALL LETTER J WITH CARON'] ǰ
lower ['LATIN SMALL LETTER J WITH CARON'] ǰ
casefold ['LATIN SMALL LETTER J', 'COMBINING CARON'] ǰ

origin ['COMBINING GREEK YPOGEGRAMMENI'] ͅ
lower ['COMBINING GREEK YPOGEGRAMMENI'] ͅ
casefold ['GREEK SMALL LETTER IOTA'] ι

Total differences: 297
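
To see what these differences mean for a comparison, here is a small sketch using the sharp s from the list above (the two spellings are just an illustration):

```python
a = 'Straße'
b = 'STRASSE'

# lower keeps 'ß', so the two strings still differ: 'straße' vs 'strasse'
lower_equal = a.lower() == b.lower()

# casefold turns 'ß' into 'ss', so both become 'strasse'
casefold_equal = a.casefold() == b.casefold()

print(lower_equal, casefold_equal)
```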

startswith

startswith returns a boolean: True if the string begins with the given prefix, False otherwise.

'I am'.startswith('I ')

True

'I am foooo'.startswith('I am f')

True

'I am foooo'.startswith('You')

False

This can be chained with casefold for case-insensitive comparisons:

'I am foooo'.casefold().startswith('i am')

True
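
startswith also accepts a tuple of prefixes and returns True if any of them matches. A minimal sketch, with made-up lines:

```python
lines = ['# comment', 'x = 1', '// other comment', 'y = 2']

# keep only the lines that start with one of the two comment markers
comments = [line for line in lines if line.startswith(('#', '//'))]

print(comments)
```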

endswith
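
endswith is the counterpart of startswith: it returns True when the string ends with the given suffix. A short sketch, with made-up file names:

```python
is_csv = 'results.csv'.endswith('.csv')

# like startswith, it accepts a tuple of candidates
is_table = 'results.tsv'.endswith(('.csv', '.tsv'))

# chained with casefold for a case-insensitive check
is_csv_any_case = 'RESULTS.CSV'.casefold().endswith('.csv')

print(is_csv, is_table, is_csv_any_case)
```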

splitlines

"""
This is a long text in multiple
lines but I would rather
have a list of lines instead
""".splitlines()

['',
 'This is a long text in multiple',
 'lines but I would rather',
 'have a list of lines instead']
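
The empty first item comes from the newline right after the opening quotes. One way to avoid it, sketched below, is to strip the text first; keepends=True keeps the line terminators instead of dropping them.

```python
text = """
first line
second line
"""

# strip removes the leading and trailing newline before splitting
lines = text.strip().splitlines()

# keepends=True keeps '\n' and '\r\n' at the end of each line
with_ends = 'a\nb\r\nc'.splitlines(keepends=True)

print(lines)
print(with_ends)
```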

count

'this string contains 2 times letter a'.count('a')

2

We may also count substrings (more than one character):

'this string contains 2 times the word contains'.count('contain')

2
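
count only counts non-overlapping occurrences, and it is case-sensitive. A minimal sketch:

```python
# 'aa' fits at positions 0 and 2; the overlapping match at position 1 is not counted
non_overlapping = 'aaaa'.count('aa')

# the capitalized 'Contains' is not counted unless the case is folded first
case_sensitive = 'Contains contains'.count('contains')
case_insensitive = 'Contains contains'.casefold().count('contains')

print(non_overlapping, case_sensitive, case_insensitive)
```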

removeprefix

'16:954:597 Data Wrangling and Husbandry (3)'.removeprefix('16:')

'954:597 Data Wrangling and Husbandry (3)'

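If the prefix does not match exactly (note the '.' instead of ':' below), the string is returned unchanged:

'16:954:597 Data Wrangling and Husbandry (3)'.removeprefix('16:954.597')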

'16:954:597 Data Wrangling and Husbandry (3)'

With the exact prefix, chaining strip also removes the leftover leading space:

'16:954:597 Data Wrangling and Husbandry (3)'.removeprefix('16:954:597').strip()

'Data Wrangling and Husbandry (3)'
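
removeprefix (available since Python 3.9) should not be confused with lstrip: removeprefix removes one exact prefix, while lstrip removes any leading characters found in its argument. A small sketch, with a made-up string:

```python
code = '16:61 Data'

# removes exactly the prefix '16:' and nothing more
exact = code.removeprefix('16:')

# removes every leading '1', '6' or ':' character, so '61' disappears as well
char_set = code.lstrip('16:')

print(repr(exact))
print(repr(char_set))
```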

removesuffix

'16:954:597 Data Wrangling and Husbandry (3)'.removesuffix('(3)').strip()

'16:954:597 Data Wrangling and Husbandry'
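
Both methods can be chained to keep only the middle part of a string. A minimal sketch on the same course string (the variable names are illustrative):

```python
course = '16:954:597 Data Wrangling and Husbandry (3)'

# drop the course code, drop the credits, then clean up the leftover spaces
title = course.removeprefix('16:954:597').removesuffix('(3)').strip()

print(title)
```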

Formatting strings with variables

Here are a few examples of format strings (f-strings). They have the form f'something {variable}': the opening quote is prefixed with f, and each expression in curly brackets is replaced by its value.

year = 2020
event = 'Covid'
f'Results of the {year} {event}'

'Results of the 2020 Covid'

It also works with lists:

l = [year, event]
f'Results of the {l[0]} {l[1]}'

'Results of the 2020 Covid'

as well as dictionaries:

dic = {'year': 2023,
       'event': 'some referendum'}
f'Results of the {dic["year"]} {dic["event"]}'

'Results of the 2023 some referendum'

We may further use methods inside the format string:

dic = {'year': 2024,
       'event': 'some referendum          '}
f'Results of the {dic["year"]} {dic["event"].upper().strip()}'

'Results of the 2024 SOME REFERENDUM'
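
Format strings also accept a format specification after a colon, to control things like decimals, percentages and padding. A short sketch, with made-up values (the = form needs Python 3.8 or later):

```python
price = 1234.5678
share = 0.4567
name = 'total'

formatted_price = f'{price:,.2f}'  # thousands separator, 2 decimals
formatted_share = f'{share:.1%}'   # percentage with 1 decimal
padded = f'{name:>8}|'             # right-aligned in a field of width 8
debug = f'{price=}'                # expression text followed by its value

print(formatted_price, formatted_share, padded, debug)
```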