On 8/14/06, Johannes Schindelin <Johannes.Schindelin@xxxxxx> wrote:
I still think that this is important to think through: Is it worth a couple of kilobytes (I doubt that it would be as much as 1MB in _total_), and be on the unsafe side?
The maps look something like this: void => 10101010 char => 111101 int => 1110101 tree => 11010111 commit => 101011001 These maps are repeated in every one of my 1M revisions including the deltas. I have 1GB pack files with 1M entries in them - 1K each entry. Each byte saved out of a zlib entry take 1MB off my pack. Note that the current internal maps aren't the same in each of each of the zlib blobs since the algorithm that builds the internal maps depends on the order the identifiers were encountered. If the git tools add a global dictionary the tools would still be able to read existing packs. If old tools try to read a new dictionary based pack they will get the zlib NEED_DICT error. If the entire file was one big zlib blob there would only be one dictionary and adding a fixed dictionary wouldn't make any difference. But since it is 1M little zlib blobs it is has 1M dictionaries. The only "unsafe" aspect I see to this is if the global dictionary doesn't contain any of the words in the documents being encoded. In that case the global dictionary will occupy the short huffman keys forcing longer internal keys. The keys for the words in the document would be longer by a about a bit on average. A solution for making this work over time would be to store the global dictionary at the front of the pack file and for the unpack tools to use the stored copy. This would let us change the global dictionary in the pack tool with no downside, you could even support multiple dictionaries in the pack tool. If someone wants to get fancy you could write a tool that would scan a pack file and compute an optimal fixed dictionary. Store it at the front of the pack file and repack using it. Global dictionaries are common in full text searching. I seem to recall an article stating that Google's global dictionary has about 250K entries in it. If git packs switch to a global dictionary model it's not a big leap to add a full text search index. You just need objects for each word in the dictionary pointing to the revisions that contain it. -- Jon Smirl jonsmirl@xxxxxxxxx - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html