On Wed, Sep 15, 2010 at 08:40:59AM +0530, Shuvam Misra wrote:
> Dear Rob,
>
> I had reservations about some of these things too. :( In particular,
> I was wondering about having to remember and recreate the exact
> transfer-encoding. If both of us forward the same attachment in two
> emails, and one encodes in quoted-printable, the other in base64, Cyrus
> had better be able to recreate them exactly or have some other
> workarounds.
>
> I wasn't aware of the mmap() usage and the direct seeking into the middle
> of the message body. But the bigger problem is what you've described about
> reproducing the message byte-identically. If that can be solved, then we
> can make Cyrus re-create the message while loading from disk and stick it
> into RAM.

There's not actually THAT much parsing of the message body.  I would guess
it's about 9 places:

imap/cyrdump.c
250:    r = mailbox_map_message(state->mailbox, uids[i], &base, &len);

imap/index.c
1013:    if (mailbox_map_message(mailbox, im->record.uid,
1535:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
2441:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size)) {
2716:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))
3152:    if (mailbox_map_message(mailbox, im->record.uid,
3337:    if (mailbox_map_message(mailbox, uid, &msgfile.base, &msgfile.size)) {
5112:    if (mailbox_map_message(mailbox, im->record.uid, &msg_base, &msg_size))

(those 8 plus one in imap/message.c where it gets parsed originally)

> Can we just brainstorm with you and others in this thread... how do we
> re-create a byte-identical attachment from a disk file? What is the list
> of attributes we will need to store per stripped attachment to allow an
> exact re-creation?

I did a bunch of work on this a while back.  Basically for the
byte-identical reverse, as I said - keep a list of the most common mapping
functions and try to figure out which one it is algorithmically.
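
A minimal sketch of that "list of common mapping functions" idea, in
Python rather than Cyrus's C to keep it short.  Everything here is an
assumption for illustration: the candidate widths and line endings are
guesses at what common mailers emit, and encode_base64() / find_encoder()
are hypothetical names, not anything in the Cyrus tree.  It decodes the
attachment, re-encodes it with each candidate, and accepts a candidate
only if the output is byte-identical to the original.

```python
import base64

# Hypothetical candidate list: guessed-at base64 line widths and line endings.
CANDIDATE_WIDTHS = (76, 72, 64, 60)
CANDIDATE_EOLS = (b"\r\n", b"\n")


def encode_base64(raw, width, eol):
    """Base64-encode raw bytes, ending every `width`-character line with `eol`."""
    flat = base64.b64encode(raw)
    lines = [flat[i:i + width] for i in range(0, len(flat), width)]
    return b"".join(line + eol for line in lines)


def find_encoder(encoded):
    """Return (raw, width, eol) if a candidate reproduces `encoded` exactly.

    Returns None when the body cannot be decoded or no candidate matches.
    """
    try:
        # The default (non-validating) decode skips line breaks in the body.
        raw = base64.b64decode(encoded)
    except ValueError:
        return None
    for width in CANDIDATE_WIDTHS:
        for eol in CANDIDATE_EOLS:
            if encode_base64(raw, width, eol) == encoded:
                return raw, width, eol
    return None
```

If find_encoder() returns None, the simplest fallback would be to leave
that attachment in the message untouched.
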
In theory we can work out what the common ones are pretty fast.

> - file name/reference
>
> - full MIME header of the attachment block
>
> - separator string (this will be retained in the message body anyway)
>
> - transfer encoding

All this stuff I'd keep as a binary diff from the "nearly right"
re-encoding.

> - if encoding = base64 then
>     base64 line length

Yeah, that's an interesting one.  Assuming it's not totally pathological
there will be some base64 pattern you can find quickly.

> - checksum of encoded attachment (as a sanity check in case the re-encoding
>   fails to recreate exactly the same image as the original)

We like sha1s.

> If encoding = quoted-printable or uuencode, then don't strip the
> attachment at all.

Makes sense.  There might be some size-based logic here too - only bother
applying this on messages over 20k, and where the attachment is at least
20k in size.  Anything smaller than that is pretty pointless.

> What other conditions may we need to look for to bypass attachment
> stripping?
>
> Can we just tap into all of you to get the ideas on paper, even if
> it's not being implemented by anyone right now? It'll at least help us
> understand the system's internals better.

Sure.  Ideas are good :)  I don't think I'm sold on the value though,
given that Rob is actually the one who argued me down from implementing
this years ago ;)  But maybe our use case isn't the same as yours.

Bron.
----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/