On Sun, Sep 30, 2012 at 6:59 PM, Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> wrote: > On Mon, Oct 1, 2012 at 8:07 AM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote: >>> You mentioned this before in your idea mail a while back. I wonder if >>> it's worth storing bitmaps for all packs, not just the self contained >>> ones. >> >> Colby and I started talking about this late last week too. It seems >> feasible, but does add a bit more complexity to the algorithm used >> when enumerating. > > Yes. Though at server side, if it's too much trouble, the packer can > just ignore open packs and use only closed ones. Its not trouble once the code is written. We were just trying to be expedient in producing a prototype that we could start to deploy on real-world workloads. Enumerating the non-closed-pack objects using the classical implementation is still slow and consumes CPU time at the server, using partial bitmaps should eliminate most of that CPU usage and reduce server loads. One of the more troublesome problems is building the bitmaps is difficult from a streaming processor like index-pack. You need the reachability graph for all objects, which is not currently produced when moving data over the wire. We do an fsck after-the-fact to verify we didn't get corrupt data, but this is optional and currently after the pack is stored. We need to refactor this code to run earlier to get the bitmap built. If we take Peff's idea and put the bitmap data into a new stream rather than the pack-*.idx file we can produce the bitmap at the same time as the fsck check, which is probably a simpler change. >>> We could have one leaf bitmap per pack to mark all leaves where >>> we'll need to traverse outside the pack. Commit leaves are the best as >>> we can potentially reuse commit bitmaps from other packs. Tree leaves >>> will be followed in the normal/slow way. >> >> Yes, Colby proposed the same idea. >> >> We cannot make a "leaf bitmap per pack". The leaf SHA-1s are not in >> the pack and therefore cannot have a bit assigned to them. > > We could mark all objects _in_ the pack that lead to an external > object. That's what I meant by leaves. We need to parse the leaves to > find out actual SHA-1s that are outside the pack. OK, yes, I was being pretty stupid. Of course we could mark the objects _in_ the pack as leaves and parse the leaves when we need to find the external pointers. > Or we could go with > your approach below too. I actually think my approach may be better. The root tree of a leaf commit would need to be scanned every time to identify trees that are reachable from this leaf commit, but are not reachable in the ancestry of the leaf's parents. This turns out to be rather expensive to compute every time a server or an fsck algorithm considers the partial pack. Its also somewhat uncommon for such an object to exist, it has to happen by an unrelated side branch introducing the same object and the creator of the pack seeing that object already exist and not including it in the pack. Defining the pack's "edge" as a list of SHA-1s not in this pack but known to be required allows us to compute that leaf root tree reachability once, and never consider parsing it again. Which saves servers that host frequently accessed Git repositories but aren't repacking all of the time. (FWIW we repack frequently, I hear GitHub does too, because a fully repacked repository serves clients better than a partially packed one.) >> We could >> add a new section that listed the unique leaf SHA-1s in their own >> private table, and then assigned per bitmap a leaf bitmap that set to >> 1 for any leaf object that is outside of the pack. > >> One of the problems we have seen with these non-closed packs is they >> waste an incredible amount of disk. As an example, do a `git fetch` >> from Linus tree when you are more than a few weeks behind. You will >> get back more than 100 objects, so the thin pack will be saved and >> completed with additional base objects. That thin pack will go from a >> few MiBs to more than 40 MiB of data on disk, thanks to the redundant >> base objects being appended to the end of the pack. For most uses >> these packs are best eliminated and replaced with a new complete >> closure pack. The redundant base objects disappear, and Git stops >> wasting a huge amount of disk. > > That's probably a different problem. I appreciate disk savings but I > would not want to wait a few more minutes for repack on every > git-fetch. But if this bitmap thing makes repack much faster than > currently, repacking after every git-fetch may become practical. I know the "Counting objects" phase of a repack is very expensive, but so too is the time required to do the IO in and out of every object in the repository. Spinning disks only transfer data so fast. Assuming the drive can move 50 MiB/s each way, a 500 MiB repository will take 10 seconds just to write back out the new pack. You aren't ever going to want to wait for repacking during git-fetch. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html