Unicode Regular Expressions in Python
November 12, 2010
I was struggling with some Unicode issues in Python. I wanted my app to match Eikanger-Bjørsvik Musikklag, which contains a non-ASCII character (ø).
I was able to achieve this by doing the following:
Regex
Prefixing my regular expression with ur made it a raw unicode string:
lRegEx = ur'\s*([\w\(\)&\'\-\. ]+)'
and then I compiled the regular expression with the re.U (Unicode) flag:
lMatches = re.compile(lRegEx, re.U).match(line) # This used to be lMatches = re.match(lRegEx, line)
This meant that \w would match Unicode letters and digits too, not just ASCII ones. The group values are extracted as normal:
lBandName = lMatches.group(1).strip()
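To see the effect of the flag in isolation, here is a minimal sketch. It is not the code from my app: the names BAND_PATTERN, ASCII_PATTERN and line are made up for illustration, the pattern is written as a plain raw string (instead of ur) so that it also runs on Python 3, and the ø is spelled as \xf8 so the snippet does not depend on the file's source encoding.

```python
import re

# Same character class as the pattern above, as a plain raw string.
PATTERN_TEXT = r"\s*([\w\(\)&'\-\. ]+)"

# With re.U, \w also matches non-ASCII letters such as the o-slash.
BAND_PATTERN = re.compile(PATTERN_TEXT, re.U)

# For contrast: ASCII-only matching. Python 2 has no re.ASCII flag
# (ASCII-only is its default for \w), so fall back to no flags there.
ASCII_PATTERN = re.compile(PATTERN_TEXT, getattr(re, "ASCII", 0))

line = u"  Eikanger-Bj\xf8rsvik Musikklag  "

unicode_match = BAND_PATTERN.match(line)
band_name = unicode_match.group(1).strip()   # the full name

ascii_match = ASCII_PATTERN.match(line)
truncated = ascii_match.group(1).strip()     # stops just before the o-slash
```

With the Unicode flag the whole name is captured; without it the match ends at "Eikanger-Bj", because ø is not in the ASCII \w set.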
Form Errors
In this case I was returning a form error which contained the unicode string, using the following code:
lBandName = lFormErrors[len(_BAND_PREFIX):-15]
The result was a byte string rather than a unicode object, and it failed later in the code when it was used as a %s replacement in a string, with:
'ascii' codec can't decode byte 0xc3 in position 11: ordinal not in range(128)
To fix this, the above line of code was changed to:
lBandName = unicode(lFormErrors[len(_BAND_PREFIX):-15], 'UTF-8', errors='strict')