On Fri, Sep 28, 2012 at 5:00 AM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote:
> On Thu, Sep 27, 2012 at 7:47 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>> * https://git.eclipse.org/r/7939
>>
>>   Defines the new E003 index format and the bit set
>>   implementation logic.
>
> Quote from the patch's message:
>
> "Currently, the new index format can only be used with pack files that
> contain a complete closure of the object graph e.g. the result of a
> garbage collection."
>
> You mentioned this before in your idea mail a while back. I wonder if
> it's worth storing bitmaps for all packs, not just the self contained
> ones.

Colby and I started talking about this late last week too. It seems
feasible, but it does add a bit more complexity to the algorithm used
when enumerating.

> We could have one leaf bitmap per pack to mark all leaves where
> we'll need to traverse outside the pack. Commit leaves are the best as
> we can potentially reuse commit bitmaps from other packs. Tree leaves
> will be followed in the normal/slow way.

Yes, Colby proposed the same idea.

We cannot make a "leaf bitmap per pack" as such. The leaf SHA-1s are
not in the pack and therefore cannot have a bit assigned to them. We
could add a new section that lists the unique leaf SHA-1s in their own
private table, and then give each bitmap a leaf bitmap whose bits are
set to 1 for every leaf object that is outside of the pack.

This would probably take up the least amount of disk space, vs. storing
the list of leaf SHA-1s after each bitmap. If a pack has only 1 bitmap
(e.g. it is a small chunk of recent history) there is really no
difference in disk usage. If the pack has 2 or 3 commit bitmaps along a
string of approximately 300 commits, you will have an identical leaf
set for each of those bitmaps, so a single leaf SHA-1 table lets them
share the leaf entries instead of storing redundant copies.

One of the problems we have seen with these non-closed packs is that
they waste an incredible amount of disk space.
As an example, do a `git fetch` from Linus' tree when you are more than
a few weeks behind. You will get back more than 100 objects, so the
thin pack will be saved and completed with additional base objects.
That thin pack will go from a few MiB to more than 40 MiB of data on
disk, thanks to the redundant base objects being appended to the end of
the pack. For most uses these packs are best eliminated and replaced
with a new complete closure pack. The redundant base objects disappear,
and Git stops wasting a huge amount of disk space.

> For connectivity check, fewer trees/commits to deflate/parse means
> less time. And connectivity check is done on every git-fetch (I
> suspect the other end of a push also has the same check). It's not
> unusual for me to fetch some repos once every few months so these
> incomplete packs could be quite big and it'll take some time for gc
> --auto to kick in (of course we could adjust gc --auto to start based
> on the number of non-bitmapped objects, in additional to number of
> packs).

Yes, of course.
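
For illustration only, the shared leaf table idea discussed earlier in
this message can be sketched in Python. This is not the JGit
implementation and not the E003 on-disk format: the class name
`PackLeafIndex`, its methods, and the use of a Python integer as the
bit set are all assumptions made for the example. It only shows how one
table of out-of-pack leaf SHA-1s could be shared by several commit
bitmaps, each carrying a small leaf bitmap indexed into that table.

```python
class PackLeafIndex:
    """Hypothetical per-pack structure: one shared table of leaf SHA-1s
    that live outside the pack, plus one leaf bitmap per commit bitmap."""

    def __init__(self):
        self.leaf_table = []    # unique out-of-pack leaf SHA-1s, in bit order
        self._position = {}     # leaf SHA-1 -> bit position in leaf_table
        self.leaf_bitmaps = {}  # commit SHA-1 -> int used as a bit set

    def _leaf_bit(self, sha1):
        # Assign the next free bit the first time a leaf is seen,
        # so identical leaves are stored once for the whole pack.
        pos = self._position.get(sha1)
        if pos is None:
            pos = len(self.leaf_table)
            self.leaf_table.append(sha1)
            self._position[sha1] = pos
        return pos

    def add_bitmap(self, commit, leaves):
        # Build the leaf bitmap for one commit bitmap.
        mask = 0
        for sha1 in leaves:
            mask |= 1 << self._leaf_bit(sha1)
        self.leaf_bitmaps[commit] = mask

    def leaves_of(self, commit):
        # Recover the out-of-pack leaves a traversal must still follow.
        mask = self.leaf_bitmaps[commit]
        return [s for i, s in enumerate(self.leaf_table) if (mask >> i) & 1]


# Two bitmaps with an identical leaf set share the same table entries.
index = PackLeafIndex()
index.add_bitmap("commit-1", ["leaf-a", "leaf-b"])
index.add_bitmap("commit-2", ["leaf-a", "leaf-b"])
index.add_bitmap("commit-3", ["leaf-b", "leaf-c"])
```

With three bitmaps the table holds only three leaf entries, and each
bitmap costs a few bits rather than a repeated list of SHA-1s, which is
the disk space argument made above.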