On Sat, 27 May 2006, Bron Gondwana wrote:
> > > 4. Recode correctly in MIME encoding based on the message charset - Would
> > >    be great but:
> >
> > Nah, just according to some site-wide config option from imapd.conf would
> > be more than good enough.
>
> Wow, I'm glad you know enough about everyone else's situation to
> be able to make a blanket determination like that.  We have users
> from all over the world, using:

That's not it.  I was just pointing out that the bar for a "won't break
anything" patch is not as high.  Well, for the old "won't break anything"
definition on this matter, anyway.

The best heuristic I can think of for this is a bit more complex, and it
still is not perfect: you *are* always reduced to guessing if the broken PoS
that generated the email gives you no charset information at all anywhere...

0. [config: optional] Do "MUA imbecility" charset fixup, checking for and
   repairing mistakes like the ones in this table (each rule can be
   disabled in the config):

     ISO-8859-1 [config: ISO-8859-15 instead] data encoded as UTF-8
     Windows CP1252 data encoded as ISO-8859-1 or ISO-8859-15 (check both)
     Windows CP1252 data encoded as UTF-8
     <other rules people contribute as common MUA fuckups; I bet there are
      some common patterns for CJK and Cyrillic, at the very least>

1. Reject messages with an invalid charset (note: data *missing* charset
   information is ignored for this step: this step is not to reject 8bit
   data in headers, just completely bogus data).  This step is
   non-optional.

2. In "non-strict-mode":
   Determine the set of all charsets declared for the headers, and use the
   frequency count as the key for priority.  Then do the same for the
   message body, but at a lower priority band.  Then do the same for an
   admin-configured set of charsets, at a yet lower priority band.  Do the
   same for US-ASCII at the lowest possible priority.

   Try to encode all header data missing charset information using the
   priority-ordered set.
   If no valid encodings are possible, reject the message or do the "X"
   thing (according to config).

As you see, it is far more complex, but also far more likely to do what you
want.  But I have never seen anyone on this list advocate that such a level
of functionality should be required of a patch dealing with 8bit data in
headers...

BTW: suggestions to improve the above algo are welcome.

Implementation suggestion: add filter plugins (for all the ways a message
can enter the spool), and implement the above as a filter.  I know a lot of
people would be happy to plug in an AV as a filter too, for IMAP APPEND...

> 59 different character sets - all on the same set of servers.  I
> don't know any character set that would be sufficient to put in
> imapd.conf that would give the expected results for all of them.

There is none.

> If everyone is applying the patch because the real world isn't such a
> pretty place as all that, then maybe it really does belong in upstream,

Or a different version of the patch... but really, so far nobody who asked
for (or wrote) such a patch really cared for even the simplest "recode to
configured charset" version.  IMAP SEARCH is simply kicked to hell (or
whatever, I am still waiting for someone to tell me it doesn't break) :-)

> In this case, I'm presuming the "breaks multi-language search" is an
> indexing issue - and an alternative would be to skip/replace/guess the
> character at indexing stage but leave the full message on disk untouched.

Yes, you can do that too.  Flag all charset-less 8bit data as charset
"Unknown", never give it to any UTF-8-processing function that is unaware
of that flag (security issue), and do binary matches against it.

Sounds like quite a good compromise to me, actually.  But it does cause
reduced functionality when compared to the "do your best to fix the message
when it arrives in the spool" approach.  OTOH, it is probably a lot easier
to implement...

> Tough luck if you can't search it as expected - at least you haven't
> LOST information.

Well, basically that's what all of us have been doing when we have no
choice, I suppose.

> That last point is particularly important.  By rejecting the message out of
> hand, you are preserving your pristine innards but lose interoperability

We *have* to preserve our innards to a degree where the code will never
malfunction because we broke expectations.  If the Cyrus code can, and does,
deal well with the unexpected 8bit data, then the condition of preserving
the innards is *already met*.  If it doesn't, we either have to fix it so
that it does, or we should never accept such data.

> with reality.  By silently replacing unknown 8bit data with an 'X' you are
> throwing away information and lying to whoever delivered it that you've
> faithfully saved/reproduced what you were told.  The middle ground is to

Well, they lied to you when they said they had an e-mail for you too, so
don't give me that excuse :-)  An e-mail is something that perfectly follows
the proper RFCs; anything else is just __broken__ e-mail, and all bets are
off.

> accept the message, store it "as is" and ignore the stuff you don't
> understand when building indexes.  And ignore it when using the indexes.

Yes, but where's the patch to do that?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
----
Cyrus Home Page: http://asg.web.cmu.edu/cyrus
Cyrus Wiki/FAQ: http://cyruswiki.andrew.cmu.edu
List Archives/Info: http://asg.web.cmu.edu/cyrus/mailing-list.html
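
For reference, step 2 of the heuristic described earlier in this message
(priority-ordered charset guessing) could be sketched roughly as follows in
Python.  This is only an illustration, not Cyrus code: the function names
candidate_charsets, guess_charset and recode_header are hypothetical, and
the sketch assumes the charsets declared in the headers and body have
already been collected by the caller.  Keep in mind that a successful decode
only shows the bytes are *valid* in that charset, not that the guess is
right; ISO-8859-1 in particular accepts any byte sequence, so nothing listed
after it is ever tried.

```python
from collections import Counter
from email.header import Header


def candidate_charsets(header_charsets, body_charsets, admin_charsets):
    """Order charsets by priority band, then by frequency within a band.

    Bands, highest priority first: charsets declared in the headers,
    charsets declared in the body, the admin-configured set, US-ASCII.
    """
    ordered = []
    for band in (header_charsets, body_charsets, admin_charsets, ["us-ascii"]):
        counts = Counter(name.lower() for name in band)
        # most_common() sorts by count; ties keep first-seen order.
        for name, _count in counts.most_common():
            if name not in ordered:
                ordered.append(name)
    return ordered


def guess_charset(raw, candidates):
    """Return the first candidate in which the raw bytes decode cleanly."""
    for name in candidates:
        try:
            raw.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Bytes not valid in this charset, or charset name unknown
            # to Python; this sketch simply skips it.
            continue
        return name
    return None


def recode_header(raw, candidates):
    """Return an RFC 2047 encoded header value, or None if no charset fits.

    On None, the caller would reject the message or do the "X" thing,
    according to config.
    """
    name = guess_charset(raw, candidates)
    if name is None:
        return None
    return Header(raw.decode(name), charset=name).encode()


if __name__ == "__main__":
    order = candidate_charsets(
        ["ISO-8859-1", "UTF-8", "utf-8"],  # declared in headers
        ["koi8-r"],                        # declared in body parts
        ["iso-8859-15", "utf-8"],          # admin-configured fallback set
    )
    print(order)
    print(recode_header(b"caf\xc3\xa9", order))
```

The "MUA imbecility" fixups of step 0 and the mandatory rejection of step 1
are deliberately left out; they would run before this, as separate passes.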