Re: non-smooth progress indication for git fsck and git gc

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 01 Sep 2018 14:53:28 +0200

On Thu, Aug 16 2018, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Aug 16 2018, Jeff King wrote:
>
>> On Thu, Aug 16, 2018 at 08:54:25AM +0200, Ulrich Windl wrote:
>>
>>> I'd like to point out some minor issue observed while processing some
>>> 50000-object repository with many binary objects, but most are rather
>>> small:
>>>
>>> Between the two phases of "git fsck" (checking directories and
>>> checking objects) there was a break of several seconds where no
>>> progress was indicated.
>>>
>>> During "git gc" the writing objects phase did not update for some
>>> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> I understand that updating the progress output too often can be a
>>> performance bottleneck, while upating it too rarely might only bore
>>> the user... ;-)
>>
>> We update the counter integer for every object we process, and then
>> actually update the display whenever the percentage increases or a
>> second has elapsed, whichever comes first.
>>
>> What you're seeing is likely an artifact of _what_ we're counting:
>> written objects. Not all objects are the same size, so it's not uncommon
>> to see thousands/sec when dealing with small ones, and then several
>> seconds for one giant blob.
>>
>> The only way to solve that is to count bytes. We don't have a total byte
>> count in most cases, and it wouldn't always make sense (e.g., the
>> "Compressing objects" meter can show the same issue, but it's not really
>> putting through bytes in a linear way).  In some cases we do show
>> transmitted size and throughput, but that's just for network operations.
>> We could do the same for "gc" with the patch below. But usually
>> throughput isn't all that interesting for a filesystem write, because
>> bandwidth isn't the bottleneck.
>>
>> Possibly we could have a "half throughput" mode that counts up the total
>> size written, but omits the speed indicator. That's not an unreasonable
>> thing to show for a local pack, since you end up with the final pack
>> size. The object counter would still be jumpy, but you'd at least have
>> one number updated at least once per second as you put through a large
>> blob.
>>
>> If you really want a smooth percentage, then we'd have to start counting
>> bytes instead of objects. Two reasons we don't do that are:
>>
>>   - we often don't know the total byte size exactly. E.g., for a
>>     packfile write, it depends on the result of deflating each object.
>>     You can make an approximation and just pretend at the end that you
>>     hit 100%.  Or you can count the pre-deflate sizes, but then your
>>     meter doesn't match the bytes from the throughput counter.
>>
>>   - for something like fsck, we're not actually writing out bytes.  So I
>>     guess you'd be measuring "here's how many bytes of objects I
>>     fsck-ed". But is that on-disk compressed bytes, or decompressed
>>     bytes?
>>
>>     If the former, that's only marginally better as a measure of effort,
>>     since delta compression means that a small number of on-disk bytes
>>     may require a big effort (imagine processing a 100 byte blob versus
>>     a 100 byte delta off of a 100MB blob).
>>
>>     The latter is probably more accurate. But it's also not free to
>>     pre-generate the total. We can get the number of objects or the size
>>     of the packfile in constant-time, but totaling up the uncompressed
>>     size of all objects is O(n). So that's extra computation, but it
>>     also means a potential lag before we can start the progress meter.
>>
>>     I'm also not sure how meaningful a byte count is for a user there.
>>
>> So there. That's probably more than you wanted to know about Git's
>> progress code. I think it probably _could_ be improved by counting
>> more/different things, but I also think it can be a bit of a rabbit
>> hole. Which is why AFAIK nobody's really looked too seriously into
>> changing it.
>>
>> -Peff
>
> This is all interesting, but I think unrelated to what Ulrich is talking
> about. Quote:
>
>     Between the two phases of "git fsck" (checking directories and
>     checking objects) there was a break of several seconds where no
>     progress was indicated
>
> I.e. it's not about the pause you get with your testcase (which is
> certainly another issue) but the break between the two progress bars.
>
> Here's a test case you can clone:
> https://github.com/avar/2015-04-03-1M-git (or might already have
> "locally" :)
>
> If you fsck this repository it'll take around (on my spinning rust
> server) 30 seconds between 100% of "Checking object directories" before
> you get any output from "Checking objects".
>
> The breakdown of that is (this is from approximate eyeballing):
>
>  * We spend 1-3 seconds just on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>
>  * We spend the majority of the ~30s on this:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>
>  * Wes spend another 3-5 seconds on this QSORT:
>    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>
> I.e. it's not about objects v.s. bytes, but the structural problem with
> the code that we pass a progress bar down to verify_pack() which does a
> lot of work in verify_pack_index() and verify_packfile() before we even
> get to iterating over the objects in the file, and only then do we start
> displaying progress.

This patch almost entirely fixes the problem, except for the 1-3 seconds
we take on the qsort:

    diff --git a/sha1dc/sha1.c b/sha1dc/sha1.c
    index df0630bc6d..6d6b9b54c4 100644
    --- a/sha1dc/sha1.c
    +++ b/sha1dc/sha1.c
    @@ -25,6 +25,7 @@

     #include "sha1.h"
     #include "ubc_check.h"
    +#include "../progress.h"

     #if (defined(__amd64__) || defined(__amd64) || defined(__x86_64__) || defined(__x86_64) || \
          defined(i386) || defined(__i386) || defined(__i386__) || defined(__i486__)  || \
    @@ -1820,6 +1821,8 @@ void SHA1DCSetCallback(SHA1_CTX* ctx, collision_block_callback callback)
     void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
     {
            unsigned left, fill;
    +       struct progress *progress;
    +       size_t len_cp = len, done = 0;

            if (len == 0)
                    return;
    @@ -1836,6 +1839,8 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
                    len -= fill;
                    left = 0;
            }
    +
    +       progress = start_delayed_progress(_("Hashing"), len_cp);
            while (len >= 64)
            {
                    ctx->total += 64;
    @@ -1848,7 +1853,12 @@ void SHA1DCUpdate(SHA1_CTX* ctx, const char* buf, size_t len)
     #endif /* defined(SHA1DC_ALLOW_UNALIGNED_ACCESS) */
                    buf += 64;
                    len -= 64;
    +               done += 64;
    +               display_progress(progress, done);
            }
    +       display_progress(progress, len_cp);
    +       stop_progress(&progress);
    +
            if (len > 0)
            {
                    ctx->total += len;

With this we'll get output like:

    $ ~/g/git/git  -C ~/g/2015-04-03-1M-git/  --exec-path=$PWD fsck
    Checking object directories: 100% (256/256), done.
    Hashing: 100% (452634108/452634108), done.
    Hashing: 100% (1073741824/1073741824), done.
    Hashing: 100% (1073741824/1073741824), done.
    Hashing: 100% (1008001572/1008001572), done.
    Checking objects:   2% (262144/13064614)
    ^C

All tests pass with this. Isn't it awesome? Except it's of course a
massive hack, we wouldn't want to just hook into SHA1DC like this.

The problem comes down to us needing to call git_hash_sha1_update() with
some really huge input, that function is going to take a *long* time,
and the only way we're getting incremental progress is:

 1) If we ourselves split the input into N chunks
 2) If we hack into the SHA1 library itself

This patch does #2, but for this to be acceptable we'd need to do
something like #1.

The least hacky way I can think of doing this is something like the
following: We'd define two git_hash_algo entries for SHA-1, one that could give you
progress reports, and that one would only ever hash in chunks of at most
size N, and would wrap git_SHA1_Update() in something similar to the
progress code above.

We'd also need to extend the git_hash_algo struct to e.g. have a char *
we'd use as a description, instead of just "Hashing".

Then the caller here could just switch out the hash algo if
show_progress was true, temporarily some "int do_progress" and "char
*progress_description" entries in the struct, and off we go:

    https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L69-L80

Still a hack, but I can't see another way of making it work without
compromising all hashing performance with some progress reporting.