On Tue, 08 Jun 2010 13:50:08 -0700, "H. Peter Anvin" <hpa@xxxxxxxxx> wrote:
> On 06/08/2010 12:57 PM, Carl Worth wrote:
> > When I did that, I was careful to escape lines from the bodies of email
> > messages that begin with zero or more '>' characters followed
> > immediately by "From " (From_ lines) by adding an initial '>'. [2]
...
> The problem with that is that it is not universally applied.

Right. And since I can't fix this universe, I'd like to at least start
with getting notmuch and git to use the same thing.

Currently, git is using a non-standard not-quite-safe mbox format while
notmuch doesn't yet emit anything like mbox. So we have a nice
opportunity to fix these two projects to at least work well together,
(if we can agree on a format).

> As far as I can tell, the Content-Length: is the most reliably handled
> format and probably is what we should use. This is the "mboxcl2" format
> in your list.[*] Unfortunately "mboxcl2" and "mboxrd" cannot be
> distinguished from each other by inspection, which is a major defect of
> both formats.

What do you mean by "most reliably handled format"?

Of the four mbox formats listed on the page I cited[*], "mboxo" and
"mboxcl" are easy to discard as they both irreversibly corrupt
messages. That leaves both "mboxrd" and "mboxcl2" as candidates.

Either of these formats is reliable if both the reader and writer use
the same format. When the reader and writer don't agree, then there are
problems as follows ("W:" indicates writing, "R:" indicates reading
expecting a particular format):

  W:mboxrd then R:mboxcl2 ->

      Reader may corrupt by failing to remove '>'
      Reader must give up/guess without CL headers
      Guessing is at least unlikely to mis-split messages

  W:mboxcl2 then R:mboxrd ->

      Reader may corrupt by erroneously removing '>'
      Reader may mis-split messages on "From " in content

I preferred to implement mboxrd over mboxcl2 for several reasons:

  1. The mboxrd writer implementation is much simpler.

     This format affords a simple streaming implementation where
     mboxcl2 requires knowing the length of the message in advance.

  2. The mboxrd format is robust in the face of file changes that
     invalidate the Content-Length headers, (for example, a person can
     hand-edit an mboxrd file without invalidating it, but cannot do
     the same with an mboxcl2 file).

  3. The mboxrd reader implementation is much simpler.

     An mboxcl2 reader necessarily has special-cases that an mboxrd
     implementation does not. What to do if there is no Content-Length
     header? What to do if the Content-Length header appears wrong?
     etc. Recovery code for these cases might well be to fallback to
     something like an mboxrd implementation, which demonstrates the
     increased complexity here.

As can be seen in my patch, doing an mboxrd reader in git-mailsplit was
quite simple. An mboxcl2 reader would be quite a bit more complicated,
but with no actual benefit in reliability, (assuming that the reader
matches the writer).

> The statement that "the entire "mbox" family of mailbox formats is
> gradually becoming irrelevant, and of only historical interest" is also
> pretty silly -- mbox is still the preferred format for moving groups of
> email from MUA to MUA, even if it is no longer used for active live
> spool storage. But, of course, you knew that already.

Indeed. Though I was surprised to recently find that postfix does still
by default deliver to /var/mail/$user in "mboxo" format (ugh).

-Carl

[*] http://homepage.ntlworld.com/jonathan.deboynepollard/FGA/mail-mbox-formats.html

-- 
carl.d.worth@xxxxxxxxx
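
For readers who want the mboxrd quoting rule from the message above in concrete form, here is a minimal Python sketch. It is an illustration only, not the code from the git-mailsplit patch, and the function names mboxrd_escape and mboxrd_unescape are hypothetical. It assumes the body has already been split into lines and that the "From " separator lines between messages are handled elsewhere.

```python
import re

# A "From_" line: zero or more '>' followed immediately by "From ".
_FROM_LINE = re.compile(r'^>*From ')
# An escaped "From_" line: one or more '>' followed immediately by "From ".
_QUOTED_FROM_LINE = re.compile(r'^>+From ')


def mboxrd_escape(body_lines):
    """Writer side: add one leading '>' to every From_ line in a body."""
    for line in body_lines:
        yield '>' + line if _FROM_LINE.match(line) else line


def mboxrd_unescape(body_lines):
    """Reader side: strip one leading '>' from every escaped From_ line."""
    for line in body_lines:
        yield line[1:] if _QUOTED_FROM_LINE.match(line) else line


if __name__ == '__main__':
    body = ['Hello', 'From here on it gets tricky', '>From a quoted reply']
    escaped = list(mboxrd_escape(body))
    print(escaped)
    print(list(mboxrd_unescape(escaped)) == body)
```

Both functions are generators that look at one line at a time, which is the streaming property mentioned in reason 1: the writer never needs the message length in advance. Running mboxrd_unescape over a body that was never escaped (for example, one written as mboxcl2) would strip a '>' that belongs to the message, which is the "erroneously removing '>'" failure listed above.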