On 8/29/06, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
Jon Smirl <jonsmirl@xxxxxxxxx> wrote:
> sha1s are effectively 20 byte pointer addresses into the pack. With 2M
> objects you can easily get away with 4 byte address and a mapping
> table. Another idea would be to replace the 20 byte sha1 in tree
> objects with 32b file offsets - requiring that anything the tree
> refers to has to already be in the pack before the tree entry can be
> written.

I've thought of that, but when you transfer a "thin" pack over the wire the base object may not even be in the pack. Thus you can't use an offset to reference it. Otherwise there's probably little reason why the base couldn't be referenced by its 4 byte offset rather than its full 20 byte object ID. Added up over all deltas in the mozilla pack it saves a whopping 23 MiB.
Every time an object goes on the wire these 'pack internal' optimizations need to be undone. If you are sending the whole pack everything can be sent as is. These intense compression schemes are meant for archival level data. Everybody should end up with a copy of the entire archive and that will be the end of those objects moving on the wire.
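To make the mapping-table idea concrete, here is a rough sketch in Python. It is not git code: the function names (build_table, to_internal, to_wire) are made up for illustration, and the object IDs are synthetic. It only shows that references can be held as 4 byte indexes inside the pack and expanded back to full 20 byte IDs when they have to go on the wire.

```python
import hashlib
import struct

def build_table(object_ids):
    """Return (table, index_of): a sorted list of 20 byte IDs and the
    reverse lookup. An ID's position in the table is its 4 byte index."""
    table = sorted(set(object_ids))
    index_of = {oid: i for i, oid in enumerate(table)}
    return table, index_of

def to_internal(refs, index_of):
    """Pack-internal form: each reference becomes a 4 byte big-endian index."""
    return b"".join(struct.pack(">I", index_of[r]) for r in refs)

def to_wire(packed, table):
    """Undo the optimization: expand each 4 byte index to the full 20 byte ID."""
    return [table[i] for (i,) in struct.iter_unpack(">I", packed)]

# Synthetic stand-ins for object IDs (assumption: any 20 byte values will do).
ids = [hashlib.sha1(b"object %d" % n).digest() for n in range(1000)]
table, index_of = build_table(ids)

refs = ids[:10]                      # references some tree or delta would hold
internal = to_internal(refs, index_of)
saved = len(refs) * 20 - len(internal)
```

Each reference shrinks from 20 bytes to 4, so the saving is 16 bytes per reference, at the cost of one table lookup whenever a reference leaves the pack.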
From what I was able to gather I don't think Clucene stores the documents themselves as the tokenized compressed data. Or if it does you lose everything between the tokens. There's a number of things we want to preserve in the original "document" like whitespace that would be likely stripped when constructing tokens.
I can't remember if the Clucene code includes the ability to compress using the dictionary. I had thought that the code was in there but maybe not. Things that aren't in the dictionary use an escape code and are copied intact. I have the Lucene book on my desk, I'll flip through it and see what it says. Might be worthwhile to poke around on the net and see if you can come up with an existing dictionary based compressor. There has got to be one out there, this is a 30 year old concept.
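A minimal sketch of what I mean by a dictionary compressor with an escape code, in Python. This is not how CLucene or git does anything; the ESCAPE value, the one byte codes, and the function names are all assumptions picked to keep the example short. Whitespace and punctuation runs are treated as tokens too, so nothing between the words is lost.

```python
import re
from collections import Counter

ESCAPE = 0xFF  # hypothetical marker: a length-prefixed literal follows

def tokenize(data):
    # Alternating runs of word and non-word bytes; concatenating the
    # tokens reproduces the input exactly, whitespace included.
    return re.findall(rb"\w+|\W+", data)

def build_dictionary(corpus, size=255):
    # At most 255 entries, so codes 0..254 never collide with ESCAPE.
    counts = Counter(tok for doc in corpus for tok in tokenize(doc))
    return [tok for tok, _ in counts.most_common(size)]

def compress(data, dictionary):
    code_of = {tok: i for i, tok in enumerate(dictionary)}
    out = bytearray()
    for tok in tokenize(data):
        if tok in code_of:
            out.append(code_of[tok])
        else:
            # Not in the dictionary: escape it and copy it intact,
            # in chunks of at most 255 bytes.
            for start in range(0, len(tok), 255):
                chunk = tok[start:start + 255]
                out += bytes([ESCAPE, len(chunk)]) + chunk
    return bytes(out)

def decompress(blob, dictionary):
    out = bytearray()
    i = 0
    while i < len(blob):
        code = blob[i]
        if code == ESCAPE:
            n = blob[i + 1]
            out += blob[i + 2:i + 2 + n]
            i += 2 + n
        else:
            out += dictionary[code]
            i += 1
    return bytes(out)

corpus = [b"int main(void)\n{\n\treturn 0;\n}\n",
          b"int  main(void)\n{\n\treturn 1;\n}\n"]
dictionary = build_dictionary(corpus)
doc = b"int main(void)\n{\n\treturn 42;\n}\n"
packed = compress(doc, dictionary)
```

Here "42" is not in the dictionary, so it goes through the escape path; everything else is a single byte code. A real dictionary would need far more than 255 entries and wider codes, but the round trip works the same way.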
But it shouldn't be that difficult to produce a rough estimate of what that storage size would be.
--
Jon Smirl
jonsmirl@xxxxxxxxx