On Thu, Jan 31, 2013 at 06:06:56PM +0700, Nguyen Thai Ngoc Duy wrote:

> On Wed, Jan 30, 2013 at 09:16:29PM +0700, Duy Nguyen wrote:
> > Perhaps we could store abbrev sha-1 instead of full sha-1. Nice
> > space/time trade-off.
>
> Following the on-disk format experiment yesterday, I changed the
> format to:
>
>  - a list of _short_ SHA-1 of cached commits
>  - a list of cache entries, each (5 uint32_t) consists of:
>    - uint32_t for the index in .idx sha-1 table to get full SHA-1 of
>      the commit
>    - uint32_t for timestamp
>    - uint32_t for tree, 1st and 2nd parents for the index in .idx
>      table

Thanks for working on this, as it was the next step I was going to
take. :)

The short-sha1 is a clever idea. Looks like it saves us on the order of
4MB for linux-2.6 (versus the full 20-byte sha1). Not as big as the
savings we get from dropping the other 3 sha1's to uint32_t, but still
not bad.

I guess the next steps in iterating on this would be:

  1. splitting out the refactoring here into separate patches

  2. squashing the cleaned-up bits into my patch 4/6

  3. deciding whether this should go into a separate file or as part of
     index v3. Your offsets depend on the .idx file having a sorted sha1
     list. That is not likely to change, but it would still be nice to
     make sure they cannot get out of sync. I'm still curious what the
     performance impact is for mmap-ing N versus N+8MB.

> The length of SHA-1 is chosen to be able to unambiguously identify any
> cached commits. Full SHA-1 check is done after to catch false
> positives.

Just to be clear, these false positives come because the abbreviation is
unambiguous within the packfile, but we might be looking for a commit
that is not even in our pack, right?

-Peff
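
For reference, here is a rough sketch of how I read the format and the
lookup described above. It is not the code from the patch. The struct
and function names (commit_cache_entry, commit_cache, cache_lookup) are
made up for illustration, and I am assuming that the short SHA-1 list is
sorted and parallel to the entry list, and that the .idx sha-1 table is
available as a flat array of 20-byte entries. Byte order and the
encoding of a missing parent are left out. The point is only to show
where the full SHA-1 check catches a false positive.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA1_LEN 20

/* One cache entry: five uint32_t fields, as in the quoted format. */
struct commit_cache_entry {
	uint32_t commit_idx;  /* position of the commit in the .idx sha-1 table */
	uint32_t timestamp;   /* commit timestamp */
	uint32_t tree_idx;    /* position of the tree in the .idx table */
	uint32_t parent1_idx; /* position of the 1st parent in the .idx table */
	uint32_t parent2_idx; /* position of the 2nd parent in the .idx table */
};

/* Hypothetical in-memory view of the cache plus the pack's .idx table. */
struct commit_cache {
	const unsigned char *short_sha1;          /* nr * abbrev_len bytes (assumed sorted) */
	const struct commit_cache_entry *entries; /* nr entries, parallel to short_sha1 */
	const unsigned char *idx_sha1;            /* .idx sha-1 table, SHA1_LEN bytes each */
	uint32_t nr;                              /* number of cached commits */
	uint32_t abbrev_len;                      /* bytes kept per short SHA-1 */
};

/*
 * Binary-search the short SHA-1 list, then confirm the hit against the
 * full SHA-1 in the .idx table. Returns NULL if the commit is not cached,
 * including the case where only the abbreviation matched.
 */
static const struct commit_cache_entry *
cache_lookup(const struct commit_cache *c, const unsigned char *sha1)
{
	uint32_t lo = 0, hi = c->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		const unsigned char *abbrev =
			c->short_sha1 + (size_t)mid * c->abbrev_len;
		int cmp = memcmp(sha1, abbrev, c->abbrev_len);

		if (!cmp) {
			const struct commit_cache_entry *e = &c->entries[mid];
			const unsigned char *full =
				c->idx_sha1 + (size_t)e->commit_idx * SHA1_LEN;

			/* The prefix matched; reject it if the full SHA-1 differs. */
			return memcmp(sha1, full, SHA1_LEN) ? NULL : e;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

Since the abbreviations are unambiguous among the cached commits, the
search can match at most one entry, so a single full comparison is
enough to reject a commit that merely shares a prefix with a cached one.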