Base 52 is defined as "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP". MECE never uses any non-alphanumeric characters, unlike Base64, whose last two characters have spawned multiple variants due to availability issues on some systems.
Encoding:
When encoding, the escape character is placed after the base 52 data (the reverse of the order used by quoted-printable). This avoids having to escape a commonly occurring sequence: the first word of a sentence, where an uppercase first character is followed by a lowercase character. If an escape character (one of "VWXYZ") in the original data is preceded by a base 52 character or one of "QRSTU", it must be escaped. Otherwise, it may be left intact to conserve space and preserve readability.
The higher-order digits always appear first in the encoded sequence. For example, the pair "fX" carries the value 15 × 5 + 2 = 77, because "f" is digit 15 of the base 52 set and "X" is digit 2 of "VWXYZ". Characters above Unicode page zero are represented with 4-symbol sequences. For example, 2919 encodes to "2cTZ", that is ((2 × 52 + 12) × 5 + 3) × 5 + 4. The third symbol is drawn from the base 5 character set "QRSTU", which avoids overlap with the base 52 characters. Characters beyond the BMP are encoded as surrogate pairs.
To further improve readability, 4 is added to the integer value of the escaped character before it is encoded, which maps each escape character to itself plus a base 52 prefix. For example, "X" (88) becomes 92 = 18 × 5 + 2 and is encoded as "iX"; without the offset, 88 = 17 × 5 + 3 would be encoded as "hY".
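The encoding rules above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names escape_unit and encode and the unsafe parameter are hypothetical, escaping every non-ASCII character is an assumption, and the offset of 4 is applied only to 2-symbol sequences because the 2919 example uses no offset.

```python
BASE52 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"
MID5 = "QRSTU"  # third symbol of a 4-symbol sequence
ESC5 = "VWXYZ"  # final (escape) symbol of every sequence


def escape_unit(unit: int) -> str:
    """Encode one 16-bit code unit as a 2- or 4-symbol sequence."""
    if unit <= 0xFF:
        # Page zero: add the offset of 4, then split into base 52 and base 5.
        high, low = divmod(unit + 4, 5)
        return BASE52[high] + ESC5[low]
    # Rest of the BMP: base 52, base 52, base 5, base 5 (assumed: no offset).
    rest, low = divmod(unit, 5)
    rest, mid = divmod(rest, 5)
    first, second = divmod(rest, 52)
    return BASE52[first] + BASE52[second] + MID5[mid] + ESC5[low]


def encode(text: str, unsafe: str = "") -> str:
    """Escape non-ASCII characters, caller-chosen unsafe characters,
    and any of "VWXYZ" that would otherwise be misread as an escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        prev = out[-1][-1] if out else ""  # last symbol emitted so far
        if cp > 0xFFFF:
            # Beyond the BMP: encode the UTF-16 surrogate pair.
            cp -= 0x10000
            out.append(escape_unit(0xD800 + (cp >> 10)))
            out.append(escape_unit(0xDC00 + (cp & 0x3FF)))
        elif cp > 0x7F or ch in unsafe:
            out.append(escape_unit(cp))
        elif ch in ESC5 and prev and prev in BASE52 + MID5:
            # Preceded by a base 52 or "QRSTU" symbol: must be escaped.
            out.append(escape_unit(cp))
        else:
            out.append(ch)
    return "".join(out)


print(encode("aX"), encode("Xa"), escape_unit(2919))
```

Under these assumptions an "X" that follows "a" is escaped to "iX", while an "X" at the start of a word is left intact.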
Decoding:
When decoding, a buffer needs to hold up to 3 previous symbols before they are released to the decoded data stream. When an escape character is seen, the previous symbol is examined. If the previous symbol is neither a base 52 character nor one of "QRSTU", the previous and current symbols are treated as raw data. If the previous symbol is a base 52 character, it is removed from the buffer and forms a 2-symbol sequence with the current character, representing a character within Unicode page zero. If it is one of "QRSTU", the last four symbols form a 4-symbol sequence, which may represent a character in the Unicode BMP; the sequence must be discarded if its first two symbols are not base 52 characters. Discarding invalid 4-symbol sequences allows the insertion of soft line breaks where needed, a presence signature (indicating that the text has been encoded), and other implementation-specific sidebands.
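The decoding procedure can be sketched in the same way. Again this is only an illustration under stated assumptions: the function name decode is hypothetical, the offset of 4 is removed from 2-symbol sequences only, out-of-range values are silently dropped, and when an invalid 4-symbol sequence is found the sketch discards whatever part of it is still in the buffer.

```python
BASE52 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOP"
MID5 = "QRSTU"  # third symbol of a 4-symbol sequence
ESC5 = "VWXYZ"  # final (escape) symbol of every sequence


def decode(encoded: str) -> str:
    out = []  # symbols already released to the decoded stream
    buf = []  # up to 3 previous symbols that are still held back

    for sym in encoded:
        prev = buf[-1] if buf else ""
        if sym not in ESC5:
            # Ordinary symbol: hold it, releasing the oldest if needed.
            buf.append(sym)
            if len(buf) > 3:
                out.append(buf.pop(0))
        elif prev and prev in BASE52:
            # 2-symbol sequence for a character in page zero.
            buf.pop()
            out.extend(buf)
            buf.clear()
            value = BASE52.index(prev) * 5 + ESC5.index(sym) - 4
            if value >= 0:  # assumption: drop values below the offset
                out.append(chr(value))
        elif prev and prev in MID5:
            # 4-symbol sequence; valid only if it starts with two base 52 symbols.
            if len(buf) == 3 and buf[0] in BASE52 and buf[1] in BASE52:
                unit = BASE52.index(buf[0]) * 52 + BASE52.index(buf[1])
                unit = (unit * 5 + MID5.index(prev)) * 5 + ESC5.index(sym)
                if unit <= 0xFFFF:
                    out.append(chr(unit))
            buf.clear()  # valid or not, the sequence is consumed
        else:
            # Escape character with no usable prefix: raw data.
            out.extend(buf)
            buf.clear()
            out.append(sym)

    out.extend(buf)
    # Join surrogate pairs produced by 4-symbol sequences into single characters.
    text = "".join(out)
    return text.encode("utf-16-le", "surrogatepass").decode("utf-16-le", "surrogatepass")


print(decode("aiX"), decode("hello \r\nQVworld"))
```

The second call uses "\r\nQV" as a hypothetical soft line break: its first two symbols are not base 52 characters, so all four symbols are discarded.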
Estimated size of the encoded data, in symbols per character, for each Unicode range (bounds in hexadecimal):
Unicode lower bound | 0 | 0* | 80 | 100 | 800 | 10000 |
Unicode upper bound | 7f | 7f | ff | 7ff | ffff | 10ffff |
UTF8 Quoted Printable | 1 | 3 | 6 | 6 | 9 | 12 |
MECE case sensitive | 1 | 2 | 2 | 4 | 4 | 8 |
MECE case insensitive | 1 | 2 | 4 | 4 | 4 | 8 |
UTF8 Base64 | 1⅓ | 1⅓ | 2⅔ | 2⅔ | 4 | 5⅓ |
UTF9 Base64 | 1½ | 1½ | 1½ | 3 | 3 | 4½ |
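The two UTF8 rows of the table can be reproduced with a short calculation, assuming quoted-printable spends 3 symbols on every escaped byte and Base64 spends 4 symbols on every 3 bytes. The function names are illustrative only.

```python
from fractions import Fraction


def utf8_len(cp: int) -> int:
    """Number of UTF-8 bytes used by one code point."""
    return len(chr(cp).encode("utf-8"))


def qp_size(cp: int) -> int:
    """Symbols used when every UTF-8 byte is escaped as "=XX"."""
    return 3 * utf8_len(cp)


def base64_size(cp: int) -> Fraction:
    """Average symbols per character: Base64 maps 3 bytes to 4 symbols."""
    return Fraction(4, 3) * utf8_len(cp)


# One row per lower bound from the table (surrogates are not used as inputs).
for lower in (0x00, 0x80, 0x100, 0x800, 0x10000):
    print(f"{lower:>6x}  QP {qp_size(lower):>2}  Base64 {base64_size(lower)}")
```

Unescaped ASCII in quoted-printable costs 1 symbol instead, which is the difference between the first two columns of that row.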
Example Implementation