On Fri, Mar 14, 2014 at 10:29 PM, Jeff King <peff@xxxxxxxx> wrote:
>> If an object is reused, we already know its compressed size. If it's
>> not reused and is a loose object, we could use the on-disk size. It's a
>> lot harder to estimate a not-reused, deltified object. All we have is
>> the uncompressed size, and the size of each delta in the delta chain.
>> Neither gives a good hint of what the compressed size would be.
>
> Hmm. I think we do have the compressed delta size after having run the
> compression phase (because that is ultimately what we compare to find
> the best delta).

There are cases where we deliberately do not look for deltas (large
blobs, files that are too small, or the -delta attribute). The large
blob case is especially interesting because the progress bar crawls
slowly while we write those objects.

> Loose objects are probably the hardest here, as we
> actually recompress them (IIRC, because packfiles encode the type/size
> info outside of the compressed bit, whereas it is inside for loose
> objects; the "experimental loose" format harmonized this, but it never
> caught on).
>
> Without doing that recompression, any value you came up with would be an
> estimate, though it would be pretty close (not off by more than a few
> bytes per object).

That's my hope, although if the user tweaks the compression level the
estimate could be off (gzip -9 and gzip -1 produce a big difference in
size).

> However, you can't just run through the packing list
> and add up the object sizes; you'd need to do a real "dry-run" through
> the writing phase. There are probably more I'm missing, but you need at
> least to figure out:
>
>   1. The actual compressed size of a full loose object, as described
>      above.
>
>   2. The variable-length headers for each object based on its type and
>      size.

We could run through a "typical" repo, calculate the average header
length, then use it for all objects? (A rough sketch of such a
header-length estimate is appended at the end of this message.)

>
>   3. The final form that the object will take based on what has come
>      before. For example, if there is a max pack size, we may split an
>      object from its delta base, in which case we have to throw away the
>      delta. We don't know where those breaks will be until we walk
>      through the whole list.

Ah, this one could probably be avoided. A max pack size does not apply
to streaming pack-objects, which is where the progress bar is shown
most often. Falling back to an object count in that case does not
sound too bad.

>
>   4. If an object we attempt to reuse turns out to be corrupted, we
>      fall back to the non-reuse code path, which will have a different
>      size. So you'd need to actually check the reused object CRCs during
>      the dry-run (and for local repacks, not transfers, we actually
>      inflate and check the zlib, too, for safety).

Ugh..

>
> So I think it's _possible_. But it's definitely not trivial. For now, I
> think it makes sense to go with something like the patch I posted
> earlier (which I'll re-roll in a few minutes). That fixes what is IMHO a
> regression in the bitmaps case. And it does not make it any harder for
> somebody to later convert us to a true byte-counter (i.e., it is the
> easy half already).

Agreed.
--
Duy
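
On the compression-level point above, here is a hypothetical stand-alone
zlib experiment (not git code; the buffer contents and sizes are invented
purely for illustration) that deflates the same input at level 1 and
level 9. The gap between the two results is roughly the per-object error
you would inherit by trusting a loose object's on-disk size when
core.compression differs from the level used for the pack:

	#include <stdio.h>
	#include <zlib.h>

	int main(void)
	{
		static unsigned char in[1 << 16];   /* 64 kB of mildly repetitive data */
		static unsigned char out[1 << 17];  /* plenty of room for the result */
		int level;

		/* fill the buffer with something compressible but not trivial */
		for (size_t i = 0; i < sizeof(in); i++)
			in[i] = (unsigned char)("packfile"[i % 8] + i % 13);

		/* deflate once at level 1 and once at level 9 */
		for (level = 1; level <= 9; level += 8) {
			uLongf outlen = sizeof(out);
			if (compress2(out, &outlen, in, sizeof(in), level) != Z_OK)
				return 1;
			printf("level %d: %lu bytes\n", level, (unsigned long)outlen);
		}
		return 0;
	}

The exact numbers depend entirely on the input, but even a few percent
per object adds up across a large pack, so a byte total estimated this
way could noticeably over- or under-shoot.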
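
On point 2 and the "average header length" idea: the pack format records
each entry's type and uncompressed size in a short variable-length
header, so the header's length is determined entirely by the object's
size. A minimal sketch of that length calculation (a rough illustration,
not git's actual header-encoding code) could look like this:

	#include <stdio.h>
	#include <stdint.h>

	/*
	 * The first header byte holds the object type plus the low 4 bits
	 * of the size; each further byte carries 7 more size bits, with the
	 * high bit acting as a continuation marker.
	 */
	static unsigned pack_header_len(uintmax_t size)
	{
		unsigned len = 1;       /* type + low 4 bits of size */

		size >>= 4;
		while (size) {
			size >>= 7;     /* one extra byte per 7 remaining size bits */
			len++;
		}
		return len;
	}

	int main(void)
	{
		/* a few sample sizes: small blob, typical source file, large blob */
		uintmax_t sizes[] = { 100, 40000, 50u * 1024 * 1024 };

		for (int i = 0; i < 3; i++)
			printf("size %ju -> %u header bytes\n", sizes[i],
			       pack_header_len(sizes[i]));
		return 0;
	}

Since most blobs and trees fall well under 256 KB, the header typically
comes out at 2-3 bytes, so treating it as a small per-object constant,
as suggested above, would probably not move the total by much.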