Syntactically invalid messages

In order to store a message in its database, Archiveopteryx has to ensure that it is syntactically valid. This is usually not a problem — most mail is valid, despite the urban legends. Some mail does have errors, but even that tends to be valid enough for Outlook.

Having a better mail parser than Outlook isn't a cinch, but it isn't impossible either. Archiveopteryx has one.

This page lists some examples of syntax errors seen in real mail, and describes how Archiveopteryx handles them. The errors are roughly sorted from innocent/common to evil/uncommon.

Note that even when a message cannot be parsed, Archiveopteryx does not reject it. Rather, it is stored in a special manner, and after each upgrade, Archiveopteryx will offer to try parsing again. aox reparse -e can also help you compose a bug report containing an anonymised version of the invalid messages. (Most people don't bother, because it's all spam these days.)

Illegal 8-bit text

A typical example.

Subject: Die Lösung wäre dann 42

The ö and ä are undeclared 8-bit, so we don't know which encoding they use.

Archiveopteryx usually handles this by guessing the encoding. It has a dictionary of common words in several languages, so it can see that Lösung and wäre are common German words and guess the encoding based on that. Or it might assume that the Subject field uses the same encoding as the body text and check whether the body text's encoding is declared. If it is, and that encoding works for the subject, Archiveopteryx uses it.
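The dictionary approach can be sketched roughly as follows. This is an illustration in Python, not Archiveopteryx's own code: the word list, the candidate encodings and the name guess_encoding are all made up for the example, and a real dictionary would be far larger.

```python
# Hypothetical word list: only words that contain non-ASCII letters,
# since those are the ones that tell candidate encodings apart.
COMMON_WORDS = {"lösung", "wäre", "für", "über"}

# Hypothetical candidate list, tried in order.
CANDIDATES = ["utf-8", "iso-8859-1", "koi8-r"]


def guess_encoding(raw: bytes):
    """Return the candidate encoding under which the most known words
    appear, or None if no candidate yields a known word."""
    best_score, best_encoding = 0, None
    for encoding in CANDIDATES:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding at all
        score = sum(1 for word in text.lower().split() if word in COMMON_WORDS)
        if score > best_score:
            best_score, best_encoding = score, encoding
    return best_encoding


subject = "Die Lösung wäre dann 42".encode("iso-8859-1")
print(guess_encoding(subject))
```

When the function returns None, the other strategies described here (borrowing the body's declared encoding, or falling back to unknown-8bit) would take over.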

In some cases, Archiveopteryx can determine that the text doesn't matter, for example here:

Date: Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleuropäische Zeit)

That +0100 already specifies the timezone, so the timezone's name is redundant. In this case Archiveopteryx handles the syntax error (…ä…) by removing the redundant specification.
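As a rough sketch of that repair, assuming the field value is available as raw bytes: the regular expression and the name drop_redundant_zone_name are invented for this example, and the sketch only touches a trailing comment that follows a numeric offset and contains 8-bit bytes.

```python
import re

# A numeric offset, followed by a trailing parenthesised comment that
# contains at least one byte outside ASCII.
_REDUNDANT_ZONE_NAME = re.compile(
    rb"([+-]\d{4})\s*\([^()]*[\x80-\xff][^()]*\)\s*$"
)


def drop_redundant_zone_name(date_value: bytes) -> bytes:
    """Remove an 8-bit comment after a numeric timezone offset.

    Values without a numeric offset, or with a pure ASCII comment,
    are returned unchanged.
    """
    return _REDUNDANT_ZONE_NAME.sub(rb"\1", date_value)


raw = b"Tue, 13 Sep 2005 14:33:27 +0100 (Mitteleurop\xe4ische Zeit)"
print(drop_redundant_zone_name(raw))
```

The offset is kept, so nothing about the actual point in time is lost.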

Finally, if all else fails, Archiveopteryx can store the text using the special character encoding unknown-8bit.

Address syntax

Old archived mail and spam often contain illegally formed email addresses. Occasionally bugs also mangle addresses. Usually this is fixable, but not always.

To: <Undisclosed-Recipient:@uucp-relay.eunet.no;>

The proper form of that is simply

To: Undisclosed-Recipient:;

Other errors appear occasionally, such as this:

To: <"Enquiries@example.com">

Archiveopteryx contains fixes for dozens or hundreds of errors (dozens? hundreds? it's often difficult to count this kind of bug, so we just pile on the test cases). If a message contains a really bad email address, Archiveopteryx may be unable to receive that message.

Bad 2047 encoding

RFC 2047 specifies a way to embed non-ASCII data in (some/most) header fields. Many encoders get it wrong.

Archiveopteryx decodes the right form first, and if any signs of 2047-encoding remain, it tries any of a dozen workarounds. (No examples here, they're so very ugly and I'm tired.)
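The two-step shape (decode the correct form, then look for leftovers) might be sketched like this, using Python's standard email.header module as a stand-in for Archiveopteryx's decoder. The leftover pattern and the name decode_2047 are assumptions made for the example, and the workarounds themselves are not shown.

```python
import re
from email.header import decode_header, make_header

# Hypothetical check for something that still looks like the start of
# an encoded-word ("=?charset?q?" or "=?charset?b?") after decoding.
_LEFTOVER = re.compile(r"=\?[^?\s]+\?[bBqQ]\?")


def decode_2047(value: str):
    """Decode well-formed encoded-words and report whether any
    2047-looking residue remains for a workaround to deal with."""
    text = str(make_header(decode_header(value)))
    return text, bool(_LEFTOVER.search(text))


print(decode_2047("=?iso-8859-1?q?L=F6sung?="))
print(decode_2047("=?iso-8859-1?q?L=F6sung"))  # terminator missing
```

A True in the second position is the signal that the correct form wasn't enough and a workaround should be tried.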

Bad 2047 encoding crops up fairly often in addresses.

MIME syntax errors

For some reason, the MIME Content-* fields suffer from many more errors than most other fields. A typical example:

Content-Type: text/plain; iso-8859-1

Fixed:

Content-Type: text/plain; charset=iso-8859-1
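A fix of this kind could look roughly like the sketch below. It assumes that a bare parameter naming a character set was meant as charset=..., and uses Python's codec registry as a stand-in for a list of known character sets. The names repair_content_type and is_charset_name are invented here, and the naive split on ";" ignores quoted parameter values.

```python
import codecs


def is_charset_name(token: str) -> bool:
    """True if Python's codec registry knows this name."""
    try:
        codecs.lookup(token)
        return True
    except LookupError:
        return False


def repair_content_type(value: str) -> str:
    """Turn a bare charset name among the parameters into charset=NAME."""
    media_type, *params = [part.strip() for part in value.split(";")]
    fixed = []
    for param in params:
        if not param:
            continue  # stray semicolon
        if "=" not in param and is_charset_name(param):
            param = "charset=" + param
        fixed.append(param)
    return "; ".join([media_type] + fixed)


print(repair_content_type("text/plain; iso-8859-1"))
```

Fields that are already correct pass through unchanged.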

Occasionally there are two different Content-Type fields. Archiveopteryx tries to select the right one by looking at the message body.

Repeated header fields

The benign case is when addresses are spread over several fields (only one is legal):

Cc: max@example.org
Cc: team@exemplo.com.br

Correct:

Cc: max@example.org, team@exemplo.com.br
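The benign case is simple enough to sketch. The example assumes the header has already been split into (name, value) pairs; the function name merge_repeated_field is made up for the illustration.

```python
def merge_repeated_field(headers, name="Cc"):
    """Fold all occurrences of an address field into one, keeping the
    position of the first occurrence and the order of the addresses."""
    merged_values = []
    result = []
    slot = None  # index in result where the merged field will go
    for field, value in headers:
        if field.lower() == name.lower():
            if slot is None:
                slot = len(result)
                result.append(None)  # placeholder, filled in below
            merged_values.append(value.strip())
        else:
            result.append((field, value))
    if slot is not None:
        result[slot] = (name, ", ".join(merged_values))
    return result


headers = [
    ("From", "someone@example.org"),
    ("Cc", "max@example.org"),
    ("Cc", "team@exemplo.com.br"),
    ("Subject", "hello"),
]
print(merge_repeated_field(headers))
```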

Sometimes we see two date fields, or other repetition. It's usually easy to resolve that. The worst case is when two different Subject fields are present (some spammers like to do that).

Invalid body encoding

Sometimes illegal characters arrive in a body text, e.g. null bytes in ASCII, single-byte double-byte segments in iso-2022-jp, Microsoft extensions, etc.

Archiveopteryx generally handles these by changing the character set label to fit the contents. For example, when Microsoft's smart quotes are used in ASCII or Latin-1, Archiveopteryx relabels the text as cp1252.

In some cases, that isn't possible, and Archiveopteryx has to relabel as unknown-8bit (and store the 8-bit input unchanged).
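The smart-quotes case and its fallback can be sketched as below. This is an assumption-laden illustration: it treats any byte in the range 0x80 to 0x9F as a sign of Microsoft extensions, uses Python's cp1252 codec to check whether the relabelling works, and the name relabel_for_microsoft_extensions is invented.

```python
def relabel_for_microsoft_extensions(raw: bytes, declared: str) -> str:
    """Return the character set label to store for this body text."""
    if declared.lower() not in ("us-ascii", "iso-8859-1"):
        return declared  # this sketch only covers ASCII and Latin-1
    if not any(0x80 <= byte <= 0x9F for byte in raw):
        return declared  # no bytes from the Windows-only range
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError:
        # Some byte has no meaning in cp1252 either: keep the bytes
        # unchanged and admit that the encoding is unknown.
        return "unknown-8bit"
    return "cp1252"


print(relabel_for_microsoft_extensions(b"\x93quoted\x94", "iso-8859-1"))
print(relabel_for_microsoft_extensions(b"plain text", "us-ascii"))
```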

Finally, there are cases where the input cannot be decoded, such as when a gb2312 body text contains a byte value greater than 128, or when a double-byte segment contains an odd number of bytes. In these cases, Archiveopteryx stores a replacement character, and mail readers display � in place of the garbage.
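The odd-number-of-bytes case is easy to reproduce. The sketch below uses Python's euc_jp codec purely as a convenient double-byte example (it is not one of the encodings named above), and decode_with_replacement is a made-up name; errors="replace" is Python's way of substituting U+FFFD for bytes that cannot be decoded.

```python
def decode_with_replacement(raw: bytes, charset: str) -> str:
    """Decode what can be decoded; undecodable input becomes U+FFFD."""
    return raw.decode(charset, errors="replace")


# Two double-byte characters with the final byte chopped off, so the
# second character is incomplete.
truncated = "日本".encode("euc_jp")[:-1]
text = decode_with_replacement(truncated, "euc_jp")
print(text)
```

The first character survives intact and the dangling byte turns into the replacement character.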

Verifying correctness

The Archiveopteryx developers have several test corpora containing many thousands of invalid addresses and messages.

These corpora aren't public, since they contain (traces of) real people's email addresses and actual body text. Usually private content has been scrubbed (see aox anonymise), but we're sure there's some private data still hiding in the corners.

All of the canonicalisations are based on these test corpora, and each new version of Archiveopteryx is tested to ensure that it produces the expected output for the invalid input.

In case of questions, please write to info@aox.org.

About this page

Last modified: 2011-12-22
Location: aox.org/badmail/