> A 500-user company can easily acquire an email archive of 2-5TB. I don't
> care how much the IO load of that archive server increases, but I'd like
> to reduce disk space utilisation. If the customer can stick to 2TB of

It would be interesting to measure the amount of duplication that is
going on with attachments in emails. While we could do that with
Fastmail data, I think because of the broad range of users, we'd be
getting one data point, which might be quite different to a data point
inside one company. E.g. an architectural firm might end up sending big
blueprint documents back and forth between each other a lot, so they'd
gain a lot from deduplication.

Also, even within deduplication, there are some interesting ideas. For
instance, if you know the same file is being sent back and forth a lot
with minor changes, you might want to store the most "recent" version,
and store binary diffs between the most recent and old versions (e.g.
xdelta). Yes, accessing the older versions would be much slower (you
have to get the most recent version and apply N deltas), but the space
savings could be huge.

> Can we just brainstorm with you and others in this thread... how do we
> re-create a byte-identical attachment from a disk file?

One overall implementation issue. With the message file, do you:

1. Completely rewrite the message file, removing the attachments and
   adding any extra metadata you want in its place
2. Leave the message file exactly the same size, just don't write out
   the attachment content, and assume your filesystem supports sparse
   files (http://en.wikipedia.org/wiki/Sparse_file)

The advantage of 2 is that it leaves the message file size correct, and
all the offsets in the file are still correct. The downsides are that
you must ensure your FS supports sparse files well, and there's the
question of where you actually store the information that links to the
external file.
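
To make option 2 concrete, here is a minimal sketch, assuming a POSIX-style filesystem that leaves a hole wherever you seek past a range without writing to it. The names `write_with_hole`, `_copy`, `start` and `end` are made up for illustration and are not anything that exists in Cyrus; `start` and `end` are assumed to be the byte offsets of the attachment body inside the message file, found by some MIME parser not shown here.

```python
import os


def _copy(src, dst, count, chunk):
    """Copy exactly `count` bytes from src to dst."""
    while count > 0:
        data = src.read(min(chunk, count))
        if not data:
            raise IOError("unexpected end of file")
        dst.write(data)
        count -= len(data)


def write_with_hole(src_path, dst_path, start, end, chunk=1 << 16):
    """Copy src to dst but never write bytes [start, end).

    Hypothetical helper: on a filesystem with sparse-file support the
    skipped range becomes a hole; elsewhere it is simply zero-filled.
    Returns the (unchanged) file size.
    """
    size = os.path.getsize(src_path)
    if not 0 <= start <= end <= size:
        raise ValueError("range lies outside the file")
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        _copy(src, dst, start, chunk)       # headers and text before the attachment
        src.seek(end)
        dst.seek(end)                       # skip the attachment without writing it
        _copy(src, dst, size - end, chunk)  # everything after the attachment
        dst.truncate(size)                  # keep the size right if the hole reaches EOF
    return size
```

The copy keeps the original size and every offset after the attachment, and the skipped range reads back as zero bytes. Whether any disk space is actually saved depends on the filesystem; on POSIX systems comparing `os.stat(path).st_blocks` for the two files is one way to check. The sketch says nothing about where the link to the external file is stored, which is still the open question above.
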
> - file name/reference
> - full MIME header of the attachment block

I'd leave these intact in the actual message, and just add an extra
X-Detached-File header or something like that, which includes some
external reference to the file. Hmmm, that'll break signing though. Not
so easy...

> - separator string (this will be retained in the message body anyway)
> - transfer encoding
> - if encoding = base64 then
>     base64 line length

Remember every line can actually be a different length! In most cases
they will be the same length, but you can't assume it. And you do see
messages that have lines in repeating groups like

  76, 76, 76, 76, 74, 76, 76, 76, 76, 74, ... repeat ...

or similar cases, which are a pain to deal with.

> - checksum of encoded attachment (as a sanity check in case the
>   re-encoding fails to recreate exactly the same image as the original)

This is seeming a bit more tricky...

Rob

----
Cyrus Home Page: http://www.cyrusimap.org/
List Archives/Info: http://lists.andrew.cmu.edu/pipermail/info-cyrus/