In order to store a message in its database, Archiveopteryx has to ensure that it is syntactically valid. This is usually not a problem: most mail is valid, despite the urban legends. Some mail does have errors, but even that tends to be valid enough for
Having a better mail parser than Outlook isn't a cinch, but it isn't impossible either. Archiveopteryx has one.
This page lists some examples of syntax errors seen in real mail, and describes how Archiveopteryx handles them. The errors are roughly sorted from innocent/common to evil/uncommon.
Note that even when a message cannot be parsed, Archiveopteryx does not reject it. Rather, it is stored in a special manner, and after each upgrade, Archiveopteryx will offer to try parsing again. aox reparse -e can also help you compose a bug report containing an anonymised version of the invalid messages. (Most people don't bother, because it's all spam these days.)
A typical example.
Subject: Diese Probleme löst ab jetzt Kerstin
The ö is undeclared 8-bit, so we don't know which encoding it uses.
Archiveopteryx usually handles it by guessing the encoding. It has a dictionary of common words in several languages, so it can see that löst is a common German word and guess the encoding based on that. Or it might assume that Subject uses the same encoding as the body text and check whether the body text's encoding is declared. If it is, and that encoding works for the subject, Archiveopteryx uses it.
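To make the two strategies concrete, here is a minimal Python sketch. It is not Archiveopteryx's code: the word list, the candidate encodings and the function name guess_encoding are all assumptions made up for illustration, and a real dictionary would be far larger.

```python
from typing import Optional

# Hypothetical, tiny word list standing in for a real multi-language dictionary.
COMMON_WORDS = {"löst", "für", "über", "größe", "señor", "garçon"}

# Candidate encodings to try, in order (an assumption for this sketch).
CANDIDATES = ["utf-8", "iso-8859-1", "windows-1252", "iso-8859-2", "koi8-r"]


def guess_encoding(raw: bytes, body_charset: Optional[str] = None) -> Optional[str]:
    """Guess the encoding of undeclared 8-bit header text."""
    # Strategy 1: pick the first encoding that yields a known common word.
    for enc in CANDIDATES:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue
        words = {w.strip(".,:;!?").lower() for w in text.split()}
        if words & COMMON_WORDS:
            return enc
    # Strategy 2: reuse the body's declared charset if it can decode the text.
    if body_charset:
        try:
            raw.decode(body_charset)
            return body_charset
        except (UnicodeDecodeError, LookupError):
            pass
    return None
```

Several single-byte encodings decode ö from the same byte, so the order of the candidate list decides ties in this sketch.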
In some cases, Archiveopteryx can determine that the text doesn't matter, for example here:
Date: Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleuropäische Zeit)
That +0100 already specifies the timezone, so the timezone's name is redundant. In this case Archiveopteryx handles the syntax error (…ä…) by removing the redundant specification.
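As an illustration only, the idea of dropping a redundant zone name can be sketched in Python like this. The function name and the regular expression are assumptions for this sketch, not Archiveopteryx's implementation, and it only handles a single unnested comment at the end of the field.

```python
import re

# A numeric zone such as +0100, followed by one parenthesised comment
# at the very end of the field value. Works on bytes, so undeclared
# 8-bit characters inside the comment are not a problem.
_ZONE_COMMENT = re.compile(rb"([+-]\d{4})\s*\([^()]*\)\s*$")


def drop_redundant_zone_name(value: bytes) -> bytes:
    """Remove a trailing comment when a numeric zone already precedes it."""
    return _ZONE_COMMENT.sub(rb"\1", value)


raw = "Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleuropäische Zeit)".encode("iso-8859-1")
print(drop_redundant_zone_name(raw))
```

The comment is only removed when a numeric zone is present, since otherwise the name might be the only zone information in the field.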
Finally, if all else fails, Archiveopteryx can store the text
using the special character encoding
Old archived mail and spam often contain illegally formed email addresses. Occasionally bugs also mangle addresses. Usually this is fixable, but not always.
The proper form of that is simply
Other errors appear occasionally, such as this:
Archiveopteryx contains fixes for dozens or hundreds of errors (dozens? hundreds? it's often difficult to count this kind of bug, so we just pile on the test cases). If a message contains a really bad email address, Archiveopteryx may be unable to receive that message.
RFC 2047 specifies a way to embed non-ASCII data in (some/most) header fields. Many encoders get it wrong.
Archiveopteryx decodes the right form first, and if any signs of 2047-encoding remain, it tries any of a dozen workarounds. (No examples here, they're so very ugly and I'm tired.)
Bad 2047 encoding crops up fairly often in addresses.
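The "decode the right form first, then look for leftovers" approach might look roughly like this in Python, using the standard library's email.header module. The helper names decode_2047 and still_encoded are made up for this sketch, and the workarounds themselves are not shown.

```python
import re
from email.header import decode_header, make_header

# Matches the start of an RFC 2047 encoded-word: =?charset?B? or =?charset?Q?
_ENCODED_WORD_START = re.compile(r"=\?[^?\s]+\?[bBqQ]\?")


def decode_2047(value: str) -> str:
    """Decode well-formed RFC 2047 encoded-words in a header value."""
    return str(make_header(decode_header(value)))


def still_encoded(value: str) -> bool:
    """True if something that looks like 2047 encoding remains."""
    return _ENCODED_WORD_START.search(value) is not None


subject = decode_2047("=?iso-8859-1?q?Diese_Probleme_l=F6st_ab_jetzt_Kerstin?=")
print(subject, still_encoded(subject))
```

A value for which still_encoded returns true after decoding would be the one handed to the workarounds.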
For some reason, the MIME Content-* fields suffer from many more errors than most other fields. A typical example:
Content-Type: text/plain; iso-8859-1
The proper form is:
Content-Type: text/plain; charset=iso-8859-1
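One way to repair this particular error, sketched in Python: if a parameter has no name but is itself a known character set name, treat it as the charset value. The function name repair_content_type is hypothetical, Python's codec registry stands in for a list of known character sets, and the naive split on ";" ignores quoted parameter values.

```python
import codecs


def repair_content_type(value: str) -> str:
    """Turn a bare charset name into a proper charset= parameter."""
    mime_type, *params = [part.strip() for part in value.split(";")]
    fixed = []
    for param in params:
        if not param:
            continue  # skip empty parameters such as a trailing ";"
        if "=" not in param:
            try:
                codecs.lookup(param)  # is this a known charset name?
            except LookupError:
                pass  # not a charset, leave it alone
            else:
                param = "charset=" + param
        fixed.append(param)
    return "; ".join([mime_type] + fixed)


print(repair_content_type("text/plain; iso-8859-1"))
```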
Occasionally there are two different Content-Type fields. Archiveopteryx tries to select the right one by looking at the message body.
The benign case is when addresses are spread over several fields (only one is legal):
Cc: firstname.lastname@example.org
Cc: email@example.com
The proper form is a single field:
Cc: firstname.lastname@example.org, email@example.com
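Merging repeated address fields is simple enough to sketch in a few lines of Python. This is an illustration under assumptions, not Archiveopteryx's code: headers are modelled as a list of (name, value) pairs and merge_address_fields is a made-up name.

```python
def merge_address_fields(headers, name="Cc"):
    """Merge all fields called `name` into one, at the first one's position."""
    out = []
    merged = []
    first_index = None
    for field, value in headers:
        if field.lower() == name.lower():
            if first_index is None:
                first_index = len(out)
                out.append(None)  # placeholder, filled in below
            merged.append(value.strip())
        else:
            out.append((field, value))
    if first_index is not None:
        out[first_index] = (name, ", ".join(merged))
    return out


headers = [
    ("Cc", "firstname.lastname@example.org"),
    ("Cc", "email@example.com"),
]
print(merge_address_fields(headers))
```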
Sometimes we see two date fields, or other repetition. It's usually easy to resolve that. The worst case is when two different Subject fields are present (some spammers like to do that).
Sometimes illegal characters arrive in a body text, e.g. null bytes
in ASCII, single-byte
double-byte segments in euc-2022-jp,
Microsoft extensions, etc.
Archiveopteryx generally handles these by changing the character
set label to fit the contents. For example, when Microsoft's smart
quotes are used in ASCII or Latin-1, Archiveopteryx relabels the text as
In some cases, that isn't possible, and Archiveopteryx has to label the text as unknown-8bit (and store the 8-bit input unchanged).
Finally, there are cases where the input cannot be decoded, such as when a gb2312 body text contains a byte value greater than 128, or a double-byte segment contains an odd number of bytes. In this case, Archiveopteryx stores a replacement character (U+FFFD), and mail readers display � in place of the garbage.
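The effect can be reproduced with Python's built-in codecs, which substitute U+FFFD for undecodable input when asked to. This only illustrates the general technique, not how Archiveopteryx implements it, and decode_lossy is a made-up name.

```python
def decode_lossy(raw: bytes, charset: str) -> str:
    """Decode raw bytes, putting U+FFFD wherever the input is undecodable."""
    return raw.decode(charset, errors="replace")


# Two bytes forming one valid gb2312 character, then a stray lead byte
# with no second byte: an odd number of bytes in a double-byte run.
broken = b"\xc4\xe3\xba"
text = decode_lossy(broken, "gb2312")
print("\ufffd" in text)
```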
The Archiveopteryx developers have several test corpora containing many thousands of invalid addresses and messages.
These corpora aren't public, since they contain (traces of) real people's email addresses and actual body text. Usually private content has been scrubbed (see aox anonymise), but we're sure there's some private data still hiding in the corners.
All of the canonicalisations are based on these test corpora, and each new version of Archiveopteryx is tested to ensure that it produces the expected output for the invalid input.
In case of questions, please write to firstname.lastname@example.org.
Last modified: 2011-12-22