On Feb 26, 2007, at 23:32, Nicolas Pitre wrote:
One thought I had here was to expand the fan-out table from 1<<8
entries to 1<<16 entries, then store only the low 18 bytes of
the SHA-1. We would have another 2 bytes worth of space to store
the offset, pushing our total offset up to 48 bits.
That would penalize small packs a lot. The index would always start
from 256KB in size. With a pack of 100 objects (our current threshold
for keeping a pack) that means a 258KB index file. Currently the
index file for a 100-object pack is 3.4KB.
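For reference, the two sizes quoted above can be reproduced with a bit of arithmetic. This is only a sketch: it assumes 4-byte fan-out slots, 24 bytes per object entry (offset plus SHA-1 bytes, in either layout) and a 40-byte trailer holding two SHA-1 checksums. The names are made up for illustration.

```python
# Rough index-size arithmetic; the constants are assumptions about the layout.
FANOUT_SLOT_BYTES = 4    # one 32-bit count per fan-out slot
ENTRY_BYTES = 24         # 4-byte offset + 20-byte SHA-1, or 6 + 18 as proposed
TRAILER_BYTES = 2 * 20   # two SHA-1 checksums at the end of the index


def index_size(num_objects, fanout_bits):
    """Return the index file size in bytes for a given fan-out width."""
    fanout_table = (1 << fanout_bits) * FANOUT_SLOT_BYTES
    return fanout_table + num_objects * ENTRY_BYTES + TRAILER_BYTES


current = index_size(100, 8)     # 1<<8 fan-out slots
proposed = index_size(100, 16)   # 1<<16 fan-out slots

print(f"current:  {current} bytes ({current / 1024:.1f} KB)")
print(f"proposed: {proposed} bytes ({proposed / 1024:.1f} KB)")
```

The fan-out table alone is 1KB in the first case and 256KB in the second, which is where the fixed cost comes from.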
Why can't we do it with the current 1<<8 entry fan-out?
This would allow increases of pack file size up to 1 TB.
For larger repositories, we just need to use multiple
pack files. A couple hundred 1 TB pack files doesn't seem
to be a big issue.
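To spell out where the 1 TB figure comes from, here is a small sketch. It assumes that each whole byte of the SHA-1 implied by the fan-out slot is dropped from the entry and handed to the offset, on top of the existing 32 bits. The function name is hypothetical.

```python
# Offset width gained by reusing SHA-1 bytes that the fan-out already implies.
BASE_OFFSET_BITS = 32   # offset width assumed for the current index entries


def max_pack_bytes(fanout_bits):
    """Largest addressable pack size if implied SHA-1 bytes become offset bits."""
    implied_sha1_bytes = fanout_bits // 8
    offset_bits = BASE_OFFSET_BITS + 8 * implied_sha1_bytes
    return 1 << offset_bits


TIB = 1 << 40
print(max_pack_bytes(8) // TIB, "TiB with a 1<<8 fan-out (40-bit offsets)")
print(max_pack_bytes(16) // TIB, "TiB with a 1<<16 fan-out (48-bit offsets)")
```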
Say a couple years from now, we can write data to stable storage
(disks/flash/holograms or whatever) at 1 GB/sec, then it would still
take 16 minutes to write a 1 TB file. At that point we'd need a
bigger overhaul than just larger offsets in the pack file.
BTW, here are a few issues with the current pack file format:
- The final SHA1 consists of the count of objects in the file
and all compressed data. Why? This is horrible for streaming
applications where you only know the count of objects at the
end, then you need to access *all* data to compute the SHA-1.
Much better to just compute a SHA1 over the SHA1's of each
object. That way at least the data streamed can be streamed to
disk. Buffering one SHA1 per object is probably going to be OK.
- The object count is implicit in the SHA1 of all objects and the
objects we find in the file. Why do we need it in the first place?
Better to recompute it when necessary. This makes true streaming
possible.
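A minimal sketch of what the two points above could look like for a streaming writer. This is not the existing pack format: the layout, stream_pack() and object_id() are hypothetical, every object is treated as a blob, and zlib stands in for whatever compression is used. The only thing buffered is one 20-byte SHA1 per object, and the count falls out at the end.

```python
import hashlib
import io
import zlib


def object_id(data):
    """SHA1 of a blob the way git names it: header plus content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).digest()


def stream_pack(objects, out):
    """Write objects as they arrive; return (count, trailer SHA1).

    Hypothetical layout: compressed objects back to back, followed by a
    SHA1 computed over the per-object SHA1s instead of over all the data.
    """
    ids = []
    for data in objects:
        out.write(zlib.compress(data))   # streamed out immediately
        ids.append(object_id(data))      # 20 bytes buffered per object
    trailer = hashlib.sha1(b"".join(ids)).digest()
    out.write(trailer)
    return len(ids), trailer             # count is known only at the end


# The producer does not need to know up front how many objects it has.
out = io.BytesIO()
count, trailer = stream_pack(iter([b"hello\n", b"world\n"]), out)
print(count, trailer.hex())
```

A reader would still have to walk the objects to recount them, which is the "recompute it when necessary" trade-off.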
-Geert