Unicode Regular Expressions in Python

November 12, 2010

I was struggling with some unicode issues in Python. I wanted my app to match Eikanger-Bjørsvik Musikklag, which contained extended characters.

I was able to achieve this by doing the following:

Regex

Prefixing my regular expression with ur made it into a unicode raw string:

lRegEx = ur'\s*([\w\(\)&\'\-\. ]+)'

and then the regular expression was compiled as unicode:

lMatches = re.compile(lRegEx, re.U).match(line)  
# This used to be lMatches = re.match(lRegEx, line)

This meant that the \w would match unicode characters too. We extract the group values like normal:

lBandName = lMatches.group(1).strip()

Form Errors

In this case I was returning a form error which contained the unicode string, using the following code:

lBandName = lFormErrors[len(_BAND_PREFIX):-15]

This string wasn't unicode, and was failing later in the code when it was used as a %s replacement in a string, with:

'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)

To fix this, the above line of code was changed to:

lBandName = unicode(lFormErrors[len(_BAND_PREFIX):-15], 'UTF-8', errors='strict')

References

Tags: python regex unicode