fíam

(rhymes with liam)

  • URLsafe base64 encoding/decoding in two lines

    Aug. 28, 2008 at 02:42:51 CEST

    Simon has recently featured this snippet which shrinks a SHA1 hash from 40 to 27 characters using base65 encoding. I've been using another approach for some time and, modestly, I think mine is better, so let me tell you how I do it before you start using the suboptimal approach.

    Python does it for us

    First, the base64 module already provides a pair of functions, aptly named urlsafe_b64encode and urlsafe_b64decode, which do almost URL-safe encoding using base64. For example:

    >>> import base64
    >>> from hashlib import sha1
    >>> base64.urlsafe_b64encode(sha1('foobar').digest())
    'iEPX-SQWIR3p67lj_0zigSWTKHg='
    

    The problem here is the equal sign, since we can't put it safely in a URL. However, since we are using this to store a hash and we're going to generate the hash again when we verify the data, we can encode the hash again instead of decoding the base64-encoded string, which lets us safely remove the equal sign. Furthermore, we are also using 27 characters, the same as the other more complex implementation.
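
    Here is a minimal sketch of that verify-by-re-encoding idea. It assumes Python 3, where hashlib and base64 work on bytes rather than on the str objects of the Python 2.5 session above. The names make_token and verify_token are hypothetical and exist only for this illustration.

```python
import base64
import hashlib


def make_token(data: bytes) -> str:
    """Return the URL-safe base64 SHA-1 digest of data, without '=' padding."""
    digest = hashlib.sha1(data).digest()          # 20 raw bytes
    encoded = base64.urlsafe_b64encode(digest)    # 28 bytes, the last one is b'='
    return encoded.rstrip(b"=").decode("ascii")   # 27 characters


def verify_token(data: bytes, token: str) -> bool:
    """Check data against token by hashing and encoding again; nothing is decoded."""
    return make_token(data) == token


print(make_token(b"foobar"))  # iEPX-SQWIR3p67lj_0zigSWTKHg
```

    Because verification only ever goes from data to token, the stripped padding never has to be restored in this use case.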

    Can we decode it? Sure!

    Sometimes, you'll want to decode the base64-encoded data. However, once we remove the equal sign, we can no longer pass the base64-encoded string to urlsafe_b64decode:

    >>> base64.urlsafe_b64decode('iEPX-SQWIR3p67lj_0zigSWTKHg')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.5/base64.py", line 112, in urlsafe_b64decode
        return b64decode(s, '-_')
      File "/usr/lib/python2.5/base64.py", line 76, in b64decode
        raise TypeError(msg)
    TypeError: Incorrect padding
    

    Let's examine how base64 works. The input string is processed in 3-byte blocks, each of which is split into four 6-bit pieces. Each piece has 2^6 == 64 possible values and maps to a single character of the base64 alphabet, so every 3-byte block is translated into a 4-character string. When the last block has fewer than three bytes, it's padded with zeros and one equal sign is added for every padding byte. This leads us to the conclusion that base64-encoded strings always have a length divisible by four. Let's take advantage of this:

    >>> s = 'iEPX-SQWIR3p67lj_0zigSWTKHg'
    >>> base64.urlsafe_b64decode(s + '=' * (-len(s) % 4))  # re-add only the missing '='
    '\x88C\xd7\xf9$\x16!\x1d\xe9\xeb\xb9c\xffL\xe2\x81%\x93(x'
    

    So now we have our original string back.
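
    The padding arithmetic can be checked for a few input lengths. This is a small sketch assuming Python 3 (bytes in, bytes out). The helper name padding_needed is hypothetical. It uses -length % 4 so that a string whose length is already a multiple of four gets no padding at all.

```python
import base64


def padding_needed(stripped_length: int) -> int:
    """Number of '=' characters missing from a stripped base64 string."""
    return -stripped_length % 4


for n in range(1, 7):
    encoded = base64.urlsafe_b64encode(b"x" * n)
    stripped = encoded.rstrip(b"=")
    # Padded output is always a multiple of four characters long.
    assert len(encoded) % 4 == 0
    # Re-adding the computed padding restores the original encoding.
    assert stripped + b"=" * padding_needed(len(stripped)) == encoded
    print(n, encoded, padding_needed(len(stripped)))
```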

    Putting it all together

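    from base64 import urlsafe_b64encode, urlsafe_b64decode

    def uri_b64encode(s):
        # '=' only ever appears as trailing padding, so strip it from the right
        return urlsafe_b64encode(s).rstrip('=')

    def uri_b64decode(s):
        # -len(s) % 4 is the number of stripped '=' characters;
        # it is 0 when the length is already a multiple of four
        return urlsafe_b64decode(s + '=' * (-len(s) % 4))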
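
    On Python 3 the same two helpers have to deal with the split between bytes and str. The following is a hedged sketch rather than a drop-in port. It keeps the function names from above, but having the encoder return str and the decoder accept str is an assumption about how the tokens would be used in URLs.

```python
from base64 import urlsafe_b64encode, urlsafe_b64decode


def uri_b64encode(data: bytes) -> str:
    """Encode bytes as URL-safe base64 text with the '=' padding removed."""
    return urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def uri_b64decode(text: str) -> bytes:
    """Restore the stripped padding and decode back to the original bytes."""
    return urlsafe_b64decode(text + "=" * (-len(text) % 4))


# Round trip: decoding an encoded value gives the input back.
assert uri_b64decode(uri_b64encode(b"foobar")) == b"foobar"
```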
    