In order to store a message in its database, Archiveopteryx has to ensure that it is syntactically valid. This is usually not a problem: most mail is valid, despite the urban legends. Some mail does have errors, but even that tends to be valid enough for Outlook.
Having a better mail parser than Outlook isn't a cinch, but it isn't impossible either. Archiveopteryx has one.
This page lists some examples of syntax errors seen in real mail, and describes how Archiveopteryx handles them. The errors are roughly sorted from innocent/common to evil/uncommon.
Note that even when a message cannot be parsed, Archiveopteryx does not reject it. Rather, it is stored in a special manner, and after each upgrade, Archiveopteryx will offer to try parsing again. aox reparse -e can also help you compose a bug report containing an anonymised version of the invalid messages. (Most people don't bother, because it's all spam these days.)
A typical example.
Subject: Die Lösung wäre dann 42
The ö and ä are undeclared 8-bit, so we don't know which encoding they use.
Archiveopteryx usually handles this by guessing the encoding. It has a dictionary of common words in several languages, so it can see that lösung and wäre are common German words and guess the encoding based on that. Or it might assume that the Subject field uses the same encoding as the body text and check whether the body text's encoding is declared. If it is, and that encoding works for the subject, Archiveopteryx uses it.
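The dictionary approach can be sketched roughly as follows. This is only an illustration in Python, not Archiveopteryx's actual code: the word list, the candidate encodings and the function name guess_encoding are all assumptions made for the example.

```python
# Hypothetical word list; a real one would hold many words per language.
COMMON_WORDS = {"die", "lösung", "wäre", "dann", "und", "nicht"}

# Hypothetical candidate list, tried in this order.
CANDIDATES = ["utf-8", "iso-8859-1", "iso-8859-15", "cp1252"]


def guess_encoding(raw, candidates=CANDIDATES, words=COMMON_WORDS):
    """Return (encoding, text) for the candidate that yields the most
    known words, or None if no candidate yields any."""
    best = None
    best_score = 0
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot represent the bytes at all
        score = sum(
            1 for w in text.lower().split() if w.strip(".,;:!?") in words
        )
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = (enc, text), score
    return best
```

When several single-byte encodings agree on every byte in the input, the sketch cannot tell them apart and simply keeps the first one in the list.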
In some cases, Archiveopteryx can determine that the text doesn't matter, for example here:
Date: Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleuropäische Zeit)
That +0100 already specifies the timezone, so the timezone's name is redundant. In this case Archiveopteryx handles the syntax error (…ä…) by removing the redundant specification.
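As a rough illustration of that idea (not the parser's real logic), the following Python sketch drops a trailing parenthesised comment only when a numeric offset precedes it and the comment contains non-ASCII characters. The function name and the regular expression are assumptions made for this example.

```python
import re
from email.utils import parsedate_to_datetime

# A numeric zone such as +0100, followed by a trailing (comment).
ZONE_COMMENT = re.compile(r"([+-]\d{4})\s*\(([^()]*)\)\s*$")


def drop_redundant_zone_name(value):
    """Strip a trailing comment that follows a numeric zone offset,
    but only if the comment contains undeclared 8-bit text."""
    m = ZONE_COMMENT.search(value)
    if m and not m.group(2).isascii():
        return value[:m.end(1)]  # keep everything up to the offset
    return value


fixed = drop_redundant_zone_name(
    "Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleuropäische Zeit)"
)
when = parsedate_to_datetime(fixed)  # parses cleanly once the comment is gone
```

A plain ASCII comment such as (CET) is legal and is left alone.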
Finally, if all else fails, Archiveopteryx can store the text using the special character encoding unknown-8bit.
Old archived mail and spam often contain illegally formed email addresses. Occasionally bugs also mangle addresses. Usually this is fixable, but not always.
To: <Undisclosed-Recipient:@uucp-relay.eunet.no;>
The proper form of that is simply
To: Undisclosed-Recipient:;
Other errors appear occasionally, such as this:
To: <"Enquiries@example.com">
Archiveopteryx contains fixes for dozens or hundreds of errors (dozens? hundreds? it's often difficult to count this kind of bug, so we just pile on the test cases). If a message contains a really bad email address, Archiveopteryx may be unable to receive that message.
RFC 2047 specifies a way to embed non-ASCII data in (some/most) header fields. Many encoders get it wrong.
Archiveopteryx decodes the right form first, and if any signs of 2047-encoding remain, it tries any of a dozen workarounds. (No examples here, they're so very ugly and I'm tired.)
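The two-step shape (decode the correct form, then look for leftovers) might look like this in Python, using the standard library's email.header module rather than anything from Archiveopteryx. The helper names are invented for the example, and the leftover check is deliberately crude.

```python
from email.header import decode_header, make_header


def decode_2047(value):
    """Decode well-formed RFC 2047 encoded-words in a header value."""
    return str(make_header(decode_header(value)))


def looks_still_encoded(text):
    """Crude check: does the text still contain the start of an
    encoded-word after decoding? If so, a workaround is needed."""
    return "=?" in text


good = decode_2047("=?iso-8859-1?q?L=F6sung?=")
# Missing the closing "?=", so the standard decoder leaves it untouched.
bad = decode_2047("=?iso-8859-1?q?L=F6sung")
needs_workaround = looks_still_encoded(bad)
```

The workarounds themselves are not shown here; which ones apply depends on what kind of damage is found.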
Bad 2047 encoding crops up fairly often in addresses.
For some reason, the MIME Content-* fields suffer from many more errors than most other fields. A typical example:
Content-Type: text/plain; iso-8859-1
Fixed:
Content-Type: text/plain; charset=iso-8859-1
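A repair of this particular error could be sketched as below. This is an assumption-laden Python example, not the real fix: the set of recognised charset names is hypothetical and far too small, and splitting on semicolons ignores quoted parameter values.

```python
# Hypothetical list of charset names the repair is willing to recognise.
KNOWN_CHARSETS = {"us-ascii", "utf-8", "iso-8859-1", "iso-8859-15", "windows-1252"}


def fix_bare_charset(value):
    """Turn a bare charset name among the parameters into charset=NAME."""
    parts = [p.strip() for p in value.split(";")]
    fixed = [parts[0]]  # the media type itself, e.g. text/plain
    for p in parts[1:]:
        if not p:
            continue  # skip empty parameters such as a trailing ";"
        if "=" not in p and p.lower() in KNOWN_CHARSETS:
            p = "charset=" + p
        fixed.append(p)
    return "; ".join(fixed)
```

A parameter that already has the name=value form is passed through unchanged.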
Occasionally there are two different Content-Type fields. Archiveopteryx tries to select the right one by looking at the message body.
The benign case is when addresses are spread over several fields (only one is legal):
Cc: max@example.org
Cc: team@exemplo.com.br
Correct:
Cc: max@example.org, team@exemplo.com.br
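For the benign case, the merge amounts to collecting the values of the repeated field and joining them with commas. A minimal Python sketch, assuming the header has already been split into (name, value) pairs; the function name and data layout are made up for illustration:

```python
def merge_address_fields(headers, name="Cc"):
    """Merge repeated address fields into one, placed where the first
    occurrence was. headers is a list of (name, value) tuples."""
    merged = []
    values = []
    first_index = None
    for field, value in headers:
        if field.lower() == name.lower():
            if first_index is None:
                first_index = len(merged)
                merged.append(None)  # placeholder, filled in below
            values.append(value.strip())
        else:
            merged.append((field, value))
    if first_index is not None:
        merged[first_index] = (name, ", ".join(values))
    return merged
```

Fields that hold a single value, such as Date or Subject, cannot be merged this way and need a different decision.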
Sometimes we see two date fields, or other repetition. It's usually easy to resolve that. The worst case is when two different Subject fields are present (some spammers like to do that).
Sometimes illegal characters arrive in a body text, e.g. null bytes in ASCII, single-byte double-byte segments in iso-2022-jp, Microsoft extensions, etc.
Archiveopteryx generally handles these by changing the character set label to fit the contents. For example, when Microsoft's smart quotes are used in ASCII or Latin-1, Archiveopteryx relabels the text as cp1252.
In some cases, that isn't possible, and Archiveopteryx has to relabel as unknown-8bit (and store the 8-bit input unchanged).
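The relabelling decision can be sketched like this. It is a simplified Python illustration covering only the ASCII and Latin-1 cases mentioned above; the function name and the exact tests are assumptions, not the real implementation.

```python
def relabel(raw, declared):
    """Return a charset label that fits the bytes actually present."""
    label = declared.lower()
    if label == "us-ascii":
        fits = all(b < 0x80 for b in raw)
    elif label == "iso-8859-1":
        # 0x80-0x9F are C1 control codes in Latin-1; ordinary text does
        # not use them, but cp1252 puts smart quotes and the like there.
        fits = not any(0x80 <= b <= 0x9F for b in raw)
    else:
        return label  # other charsets are outside this sketch
    if fits:
        return label
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError:
        return "unknown-8bit"  # keep the bytes, admit we cannot name them
    return "cp1252"
```

Python's cp1252 codec leaves a few byte values undefined (0x81, for instance), which is what pushes the sketch into the unknown-8bit branch.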
Finally, there are cases where the input cannot be decoded, such as when a gb2312 body text contains a byte value greater than 128, or when a double-byte segment contains an odd number of bytes. In this case, Archiveopteryx stores a replacement character, and mail readers display � in place of the garbage.
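Python's built-in codecs show the same idea: decoding with errors="replace" substitutes U+FFFD for whatever cannot be decoded. This only illustrates the general technique with Python's own codecs, which need not agree with Archiveopteryx about which input counts as undecodable. The helper name is an assumption.

```python
def decode_lossy(raw, charset):
    """Decode raw bytes, substituting U+FFFD for undecodable input.
    Raises LookupError if Python does not know the charset name."""
    return raw.decode(charset, errors="replace")


# A double-byte text cut off in the middle of its second character.
truncated = decode_lossy(b"\xc4\xe3\xba", "gb2312")

# A byte that can never occur in UTF-8.
garbage = decode_lossy(b"abc\xff", "utf-8")
```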
The Archiveopteryx developers have several test corpora containing many thousands of invalid addresses and messages.
These corpora aren't public, since they contain (traces of) real people's email addresses and actual body text. Usually private content has been scrubbed (see aox anonymise), but we're sure there's some private data still hiding in the corners.
All of the canonicalisations are based on these test corpora, and each new version of Archiveopteryx is tested to ensure that it produces the expected output for the invalid input.
In case of questions, please write to info@aox.org.
Last modified: 2011-12-22
Location: aox.org/badmail/