Re: Cyrus Patches used at FastMail.FM

Henrique de Moraes Holschuh <hmh@xxxxxxxxxx> · Sat, 27 May 2006 10:18:17 -0300

On Sat, 27 May 2006, Bron Gondwana wrote:
> > > 4. Recode correctly in MIME encoding based on the message charset - Would 
> > > be great but:
> > 
> > Nah, just according to some site-wide config option from imapd.conf would
> > be more than good enough.
> 
> Wow, I'm glad you know enough about everyone else's situation to
> be able to make a blanket determination like that.  We have users
> from all over the world, using:

That's not it.  I was just pointing out that the bar for a "won't break
anything" patch is not as high.  Well, for the old "won't break anything"
definition on this matter, anyway.

The best heuristic I can think of for this is a bit more complex, but it
is still not perfect: you *are* always reduced to guessing if the broken PoS
that generated the email gives you no charset information at all anywhere...

0. [config: optional] Do "MUA imbecility" charset fixup, checking for and
repairing mistakes like those in this table (each one configurable to not
be done):

   ISO-8859-1 [config: ISO-8859-15 instead] data encoded as UTF-8
   Windows CP1252 data encoded as ISO-8859-1 or ISO-8859-15 (check both)
   Windows CP1252 data encoded as UTF-8
   <other rules people contribute as common MUA fuckups; I bet there
    are some common patterns for CJK and Cyrillic, at the very least>

1. Reject messages with invalid charset (note: data *missing* charset
information is ignored for this step: this step is not to reject 8bit data
in headers, just completely bogus data).  This step is non-optional.

2. In "non-strict-mode": Determine the set of all charsets declared for the
headers, using frequency count as the key for priority.  Then do the same for
the message body, but at a lower priority band.  Then do the same for an
admin-configured set of charsets, at a yet lower priority band.  Do the same
for US-ASCII at the lowest possible priority.  Try to encode all header data
missing charset information using the priority-ordered set.  If no valid
encodings are possible, reject the message or do the "X" thing (according to
config).
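
A rough sketch of the step 0 fixups, in Python rather than Cyrus C.  It
assumes that "X data encoded as Y" in the table means "bytes that are really
in X but were declared as Y".  The function names and the latin_fallback
parameter are made up for illustration; nothing here is existing Cyrus code.

```python
# Hypothetical sketch of the step 0 fixups; names are illustrative only.

def has_c1_bytes(raw: bytes) -> bool:
    """True if any byte is in 0x80-0x9F.

    Those bytes are control codes in ISO-8859-x but mostly printable
    characters (curly quotes, dashes, ...) in Windows CP1252.
    """
    return any(0x80 <= b <= 0x9F for b in raw)


def fixup_declared_charset(raw: bytes, declared: str,
                           latin_fallback: str = "iso-8859-1") -> str:
    """Return the charset the bytes most plausibly use.

    latin_fallback stands in for the "[config: ISO-8859-15 instead]" knob.
    """
    name = declared.lower()
    if name in ("utf-8", "utf8"):
        try:
            raw.decode("utf-8")
            return "utf-8"  # the declaration was honest
        except UnicodeDecodeError:
            # Declared UTF-8 but not valid UTF-8: guess a single-byte
            # Latin/Windows charset instead.
            return "cp1252" if has_c1_bytes(raw) else latin_fallback
    if name in ("iso-8859-1", "iso-8859-15") and has_c1_bytes(raw):
        # C1 bytes in "ISO-8859-x" text are almost certainly CP1252.
        return "cp1252"
    return name  # no rule applies, keep the declaration


print(fixup_declared_charset(b"caf\xe9", "UTF-8"))
print(fixup_declared_charset(b"\x93quoted\x94", "ISO-8859-1"))
```

Each rule would need its own config switch, and this is still guesswork: a
byte sequence that happens to be valid UTF-8 is trusted as UTF-8 no matter
what the sender really meant.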

As you see, it is far more complex, but also far more likely to do what you
want.  But I have never seen anyone on this list advocate that such a level
of functionality should be required of a patch dealing with 8bit data in
headers...
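
The priority ordering of step 2 could be sketched like this, again in Python
and purely as an illustration.  The function names are hypothetical, and
"try to encode" is read here as "the raw bytes decode cleanly in that
charset", checked one header value at a time.

```python
from collections import Counter


def candidate_charsets(header_charsets, body_charsets, admin_charsets):
    """Build the priority-ordered candidate list from step 2.

    Bands, highest priority first: charsets declared in the headers (by
    frequency), charsets declared in the body (by frequency), the
    admin-configured list (in configured order), then US-ASCII.
    """
    bands = (
        [cs for cs, _ in Counter(c.lower() for c in header_charsets).most_common()],
        [cs for cs, _ in Counter(c.lower() for c in body_charsets).most_common()],
        [c.lower() for c in admin_charsets],
        ["us-ascii"],
    )
    ordered = []
    for band in bands:
        for cs in band:
            if cs not in ordered:  # keep the highest-priority occurrence
                ordered.append(cs)
    return ordered


def guess_charset(raw: bytes, candidates):
    """Return the first candidate in which raw decodes, or None."""
    for cs in candidates:
        try:
            raw.decode(cs)
            return cs
        except (UnicodeDecodeError, LookupError):
            continue  # invalid bytes, or a charset Python does not know
    return None  # caller rejects the message or does the "X" thing


candidates = candidate_charsets(
    ["utf-8", "koi8-r", "utf-8"], ["iso-8859-1"], ["cp1252"])
print(candidates)
print(guess_charset(b"caf\xe9", ["utf-8", "iso-8859-1", "us-ascii"]))
```

Note the weakness: most single-byte charsets accept any byte sequence, so
the first one in the list wins and everything after it is never reached.
That is the "reduced to guessing" problem, just made explicit.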

BTW: suggestions to improve the above algo are welcome.

Implementation suggestion: add filter plugins (for all ways a message can
enter the spool), and the above as a filter.  I know a lot of people would
be happy to plug an AV as a filter too for IMAP APPEND...
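
To make the filter idea concrete, here is a minimal Python sketch of a
filter chain that every entry point into the spool would call.  All names
(SpoolFilters, reject_if_contains, the marker string) are invented for the
example; Cyrus has no such API as far as this message is concerned.

```python
from typing import Callable, List, Optional

# A filter gets the raw message and returns the (possibly rewritten)
# message, or None to reject it.
Filter = Callable[[bytes], Optional[bytes]]


class SpoolFilters:
    """Hypothetical chain run by every path into the spool
    (delivery, IMAP APPEND, ...)."""

    def __init__(self) -> None:
        self._filters: List[Filter] = []

    def register(self, flt: Filter) -> None:
        self._filters.append(flt)

    def run(self, message: bytes) -> Optional[bytes]:
        for flt in self._filters:
            result = flt(message)
            if result is None:
                return None  # first rejection wins
            message = result
        return message


def reject_if_contains(marker: bytes) -> Filter:
    """Stand-in for an AV hook: reject messages containing marker."""
    def flt(message: bytes) -> Optional[bytes]:
        return None if marker in message else message
    return flt


filters = SpoolFilters()
filters.register(reject_if_contains(b"X-Test-Virus"))
print(filters.run(b"Subject: hi\r\n\r\nhello\r\n") is not None)
```

The charset fixup would then be one more registered filter, and sites that
do not want it simply would not register it.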

> 59 different character sets - all on the same set of servers.  I
> don't know any character set that would be sufficient to put in
> imapd.conf that would give the expected results for all of them.

There is none.

> If everyone is applying the patch because the real world isn't such a
> pretty place as all that, then maybe it really does belong in upstream,

Or a different version of the patch... but really, so far nobody who asked
for (or wrote) such a patch really cared for even the simplest "recode to
configured charset" version.  IMAP SEARCH is simply kicked to hell (or
whatever, I am still waiting for someone to tell me it doesn't break) :-)

> In this case, I'm presuming the "breaks multi-language search" is an
> indexing issue - and an alternative would be to skip/replace/guess the
> character at indexing stage but leave the full message on disk untouched.

Yes, you can do that too.  Flag all charset-less 8bit data as charset
"Unknown", never give it to any unaware UTF-8-processing function (security
issue), and do binary matches against it.
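
A minimal sketch of that approach in Python, for illustration only.  The
HeaderValue type and the classify/matches helpers are hypothetical, and the
search needle is assumed to arrive as UTF-8 bytes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeaderValue:
    raw: bytes              # stored exactly as received
    charset: Optional[str]  # None means charset "Unknown"


def classify(raw: bytes, declared: Optional[str] = None) -> HeaderValue:
    """Flag charset-less 8bit data as Unknown instead of guessing."""
    if declared is None and any(b >= 0x80 for b in raw):
        return HeaderValue(raw, None)
    return HeaderValue(raw, declared or "us-ascii")


def matches(value: HeaderValue, needle: bytes) -> bool:
    """Search one header value for needle (UTF-8 bytes)."""
    if value.charset is None:
        # Unknown data is never decoded: plain byte comparison only.
        return needle in value.raw
    text = value.raw.decode(value.charset, errors="replace")
    wanted = needle.decode("utf-8", errors="replace")
    return wanted.casefold() in text.casefold()


unknown = classify(b"caf\xe9")
print(unknown.charset, matches(unknown, b"caf\xe9"))
```

The price shows up immediately: a UTF-8 search for the same word will not
match the Unknown value, because the bytes differ.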

Sounds like quite a good compromise to me, actually.  But it does cause
reduced functionality when compared to the "do your best to fix the message
when it arrives in the spool" approach.  OTOH, it is probably a lot easier
to implement...

> Tough luck if you can't search it as expected - at least you haven't
> LOST information.

Well, basically that's what all of us have been doing when we have no
choice, I suppose.

> That last point is particularly important.  By rejecting the message out of
> hand, you are preserving your pristine innards but lose interoperability

We *have* to preserve our innards to a degree where the code will never
malfunction because we broke expectations.  If the Cyrus code can, and does,
deal well with the unexpected 8bit data, then the condition of preserving
the innards is *already met*.  If it doesn't, we have to either fix it so
that it does, or we should never accept such data.

> with reality.  By silently replacing unknown 8bit data with an 'X' you are
> throwing away information and lying to whoever delivered it that you've
> faithfully saved/reproduced what you were told.  The middle ground is to

Well, they lied to you when they said they had an e-mail for you too, so
don't give me that excuse :-)   An e-mail is something that perfectly
follows the proper RFCs; anything else is just __broken__ e-mail, and all
bets are off.

> accept the message, store it "as is" and ignore the stuff you don't
> understand when building indexes.

And ignore it when using the indexes. Yes, but where's the patch to do that?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
----
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html