Email Address Format Myths

I'm in the habit of entering a unique email address whenever I really have to provide one. Until a few weeks ago I used the reg.{0}@mydomain.com pattern, replacing {0} with a name I associate with the site, often the second-level domain (e.g. "amazon"). I applied a set of server-side processing rules to move some of these mails to special folders and to automatically delete mail sent to blacklisted names. This worked very well for the last 7-8 years or so. More than once I saw that a name I had used to register for a once-free service had evidently been sold to spammers some months after the service was discontinued, so I simply blacklisted that name.

I can do this because I have my own domain and mail server, and I've set up a catch-all alias. Unfortunately I receive a lot of spam addressed to obviously randomly generated aliases (like abnd798as8fihkasd89@...), so I finally decided to upgrade my mail server to SmarterMail 3. It not only provides a much better spam filter, but also supports so-called plus-addressing, which achieves the same thing without the hassle of a catch-all alias: the format is now reg+{0}@mydomain.com, and the server automatically moves mail to a {0} folder (configurable: always, only if the folder already exists, or never).
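To make the reg+{0}@mydomain.com format concrete, here is a minimal Python sketch of how such an address could be taken apart. It is only an illustration: the function name `split_plus_address` is made up for this example, and it says nothing about how SmarterMail actually implements the feature.

```python
def split_plus_address(address: str, separator: str = "+"):
    """Split 'reg+amazon@mydomain.com' into (base, tag, domain).

    Hypothetical helper for illustration; tag is None if there is no separator.
    """
    # Split at the last @ so the domain is whatever follows it.
    local_part, at, domain = address.rpartition("@")
    if not at:
        raise ValueError("address contains no @")
    # Everything after the first separator in the local part is the tag.
    base, sep, tag = local_part.partition(separator)
    return base, (tag if sep else None), domain


base, tag, domain = split_plus_address("reg+amazon@mydomain.com")
```

In this sketch the tag ("amazon") would be the name of the folder the mail is sorted into, and the base ("reg") is the real mailbox.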

At first this worked very well. But I soon ran into problems because some web apps don't allow the + character in email addresses. This is strange, since a + in the local part (the part before the @) is perfectly legal.

Apparently some strange myths exist about which characters are allowed in an email address. I suspect the reason for some of these myths is that the rules for domain names are very restrictive: domain names indeed don't allow + characters. But the local part does, and much more. I've also heard the myth that everything after a + is treated as a comment, or that + is treated as equivalent to @ in case you can't enter a real @ for whatever reason. By the way, GMail supports plus-addressing too (no, they did not invent it!), but the part after the + is apparently limited to six characters (I haven't tested that).

According to RFC2821 (SMTP) and RFC2822 (Internet Message Format), a mailbox address is constructed as follows:

Mailbox = Local-part "@" Domain
Local-part = Dot-string / Quoted-string
Dot-string = Atom *("." Atom)
Atom = 1*atext
atext = ALPHA / DIGIT / "!" / "#" / "$" / "%" / "&" / "'" /
"*" / "+" / "-" / "/" / "=" / "?" / "^" / "_" / "`" / "{" /
"|" / "}" / "~"

To summarize, the local part (before @) allows the following characters:

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
.!#$%&'*+-/=?^_`{|}~

The domain part (after the @), in contrast, only allows the following characters (IDN aside, but that's another topic):

ABCDEFGHIJKLMNOPQRSTUVWXYZ
abcdefghijklmnopqrstuvwxyz
0123456789
.-

Note that domain names are never case-sensitive, but the local part may be: a mail server is allowed to treat it as case-sensitive, though it doesn't have to.
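The two character sets above can be turned into a small check that treats each side of the @ by its own rules. The following Python sketch is only an illustration under a few assumptions: the function names are invented for this example, quoted strings and address literals are ignored, and the domain check additionally requires at least two labels with no leading or trailing hyphen.

```python
import string

# Characters allowed in an atom of the local part (atext in the grammar above).
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")
# Characters allowed in a (non-IDN) domain label.
LABEL_CHARS = set(string.ascii_letters + string.digits + "-")


def is_dot_string(local_part: str) -> bool:
    """Dot-string = Atom *("." Atom), with Atom = 1*atext."""
    # Splitting at dots yields an empty atom for leading, trailing or doubled dots.
    return all(atom and set(atom) <= ATEXT for atom in local_part.split("."))


def is_domain_name(domain: str) -> bool:
    """Dot-separated labels of letters, digits and hyphens.

    Assumption: at least two labels, no label starting or ending with a hyphen.
    """
    labels = domain.split(".")
    return len(labels) >= 2 and all(
        label
        and set(label) <= LABEL_CHARS
        and not label.startswith("-")
        and not label.endswith("-")
        for label in labels
    )


def check_address(address: str):
    """Return the address with a lower-cased domain, or None if a check fails."""
    local_part, at, domain = address.rpartition("@")
    if not at or not is_dot_string(local_part) or not is_domain_name(domain):
        return None
    # Only the domain is safe to lower-case; the local part may be case-sensitive.
    return local_part + "@" + domain.lower()
```

The point of the sketch is the split: a + is accepted on the left of the @ and rejected on the right, and only the domain is normalized to lower case.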

Besides the odd web app, I've also heard of some MTAs having trouble with local parts that wouldn't be valid domain names. Why on earth do developers still validate the whole email address with the domain name rules? Just for simplicity? Please don't do that, or I'll have to put your name on the "plus-haters list of shame" ;)

(Migrated Comments)

MovGP0, September 28, 2006

As an Internet technician I see such problems all day. It seems that most mail application programmers don't read RFCs...

Even when you restrict your mail address to letters, digits and a dot or underscore, you can still run into trouble if the dot comes right before the @ sign...

ys, MovGP0

Christoph Ruegg, September 29, 2006

Thanks, looks like I'm not alone with these problems then :)

Yes, according to the mentioned RFCs a dot is indeed not allowed right before the @ sign (nor as the first character), since Atom = 1*atext means "one or more atext characters".

Christoph Ruegg, October 5, 2006

Just as a reference: I built the following regular expression (regex) directly from the two RFCs 2821 & 2822, but I dropped support for the following special cases:

  • General Address Literal (domain-part: generalization of the IPv6 literal); However, IPv4 and IPv6 literals are supported (although the IPv6 format is not checked for valid addresses)!
  • I don't check the total lengths
  • Quoted-Strings in the local part (they're not recommended anyway)

Some RFC-valid email addresses:

RFC-invalid email addresses:

And this is the regex (should be all in one line):

[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+
(?:\.[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+)*
@([a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?
(?:\.[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?)+
|\[(?:\d{1,3}(?:\.\d{1,3}){3}|IPv6:[0-9A-Fa-f:]{4,39})\])
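For anyone who wants to try the pattern, here is a minimal Python sketch that assembles it from named pieces and applies it to the whole string. It assumes the pattern is meant with plain * quantifiers and a backtick in the character class, as in the atext definition. The names `EMAIL_RE` and `is_valid` and the sample addresses are made up for this example; they are not the example addresses from the original comment.

```python
import re

# One atom: one or more atext characters.
ATOM = r"[-a-zA-Z0-9!#$%&'*+/=?^_`{|}~]+"
# One domain label: starts and ends with a letter or digit.
LABEL = r"[a-zA-Z0-9](?:[-a-zA-Z0-9]*[a-zA-Z0-9])?"
# IPv4 or (loosely checked) IPv6 address literal in square brackets.
LITERAL = r"\[(?:\d{1,3}(?:\.\d{1,3}){3}|IPv6:[0-9A-Fa-f:]{4,39})\]"

EMAIL_RE = re.compile(
    ATOM + r"(?:\." + ATOM + r")*"
    + "@(" + LABEL + r"(?:\." + LABEL + r")+|" + LITERAL + ")"
)


def is_valid(address: str) -> bool:
    # fullmatch anchors the pattern at both ends of the string.
    return EMAIL_RE.fullmatch(address) is not None


# Illustrative addresses only (hypothetical).
accepted = ["reg+amazon@mydomain.com", "john.doe@example.org", "user@[192.168.0.1]"]
rejected = ["reg.@mydomain.com", ".reg@mydomain.com", "reg@my_domain.com", "reg@mydomain"]
```

Using `fullmatch` instead of `search` matters here, because the pattern itself carries no ^ and $ anchors and would otherwise accept a valid address buried inside a longer invalid string.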