Jeff King wrote:
> On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote:

>>> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to
>>> read the SHA-3.
>>> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to
>>> translate offset to SHA-1.
>>
>> Thanks for this suggestion. I was initially vaguely nervous about
>> lookup times in an idx-style file, but as you say, object reads from a
>> packfile already have to deal with this kind of lookup and work fine.
>
> Not exactly. The "reverse .idx" step has to build the reverse mapping on
> the fly, and it's non-trivial.

Sure. To be clear, I was handwaving over that since adding an on-disk
reverse .idx is a relatively small change.

[...]
> So I think it's solvable, but I suspect we would want an extension to
> the .idx format to store the mapping array, in order to keep it log-n.

i.e., this.

The loose object side is the more worrying bit, since we currently don't
have any practical bound on the number of loose objects.

One way to deal with that is to disallow loose objects completely.
Use packfiles for new objects, batching the objects produced by a
single process into a single packfile. Teach "git gc --auto" a
behavior similar to Martin Fick's "git exproll" to combine packfiles
between full gcs to maintain reasonable performance. For unreachable
objects, instead of using loose objects, use "unreachable garbage"
packs explicitly labeled as such, with similar semantics to what
JGit's DfsRepository backend uses (described in the discussion at
https://git.eclipse.org/r/89455).

That's a direction that I want in the long term anyway. I was hoping
not to couple such changes with the hash transition but it might be
one of the simpler ways to go.

Jonathan
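
P.S. To make the "reverse .idx" lookup concrete, here is a rough sketch
in Python. It is only an illustration under assumptions: each index is
modeled as two parallel in-memory lists (names sorted by hash, plus the
pack offset of each entry), and every function name is hypothetical.
It is not git's on-disk layout or API. The point is that once the
offset-sorted "mapping array" exists, translating a name costs two
binary searches, whereas building that array is the O(n log n) step one
would rather store than redo on the fly.

```python
from bisect import bisect_left


def build_reverse_index(offsets):
    """Return idx positions ordered by pack offset (the 'mapping array').

    This is the expensive step: a full sort over every entry.
    """
    return sorted(range(len(offsets)), key=offsets.__getitem__)


def position_for_offset(offsets, reverse, target):
    """Binary-search the reverse array for the entry stored at `target`."""
    lo, hi = 0, len(reverse)
    while lo < hi:
        mid = (lo + hi) // 2
        if offsets[reverse[mid]] < target:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(reverse) and offsets[reverse[lo]] == target:
        return reverse[lo]
    return None


def translate(name, src_names, src_offsets, dst_names, dst_offsets, dst_reverse):
    """Map a name in one hash to the other hash via the shared pack offset."""
    # Step 1: forward lookup by name in the source index (sorted by hash).
    i = bisect_left(src_names, name)
    if i == len(src_names) or src_names[i] != name:
        return None
    # Step 2: reverse lookup by offset in the destination index.
    pos = position_for_offset(dst_offsets, dst_reverse, src_offsets[i])
    return None if pos is None else dst_names[pos]
```

Going the other way (SHA-3 to SHA-1) is the same call with the two
indexes swapped and a reverse array built over the other file's
offsets.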