Hi!

Here are some stats from the repository. First the fast-import ones
(performance was good, but probably everything was cached, too):

% git fast-import <../git-stream
/usr/lib/git/git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects:      55000
Total objects:        51959 (    56 duplicates                  )
      blobs  :        47991 (     0 duplicates    0 deltas of    0 attempts)
      trees  :         3946 (    56 duplicates  994 deltas of 3929 attempts)
      commits:            22 (     0 duplicates    0 deltas of    0 attempts)
      tags   :             0 (     0 duplicates    0 deltas of    0 attempts)
Total branches:          15 (    15 loads     )
      marks:          1048576 ( 48013 unique   )
      atoms:          43335
Memory total:          9611 KiB
       pools:         7033 KiB
     objects:         2578 KiB
---------------------------------------------------------------------
pack_report: getpagesize()            =       4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit      = 8589934592
pack_report: pack_used_ctr            =       1780
pack_report: pack_mmap_calls          =         23
pack_report: pack_open_windows        =          1 /          1
pack_report: pack_mapped              =    2905751 /    2905751
---------------------------------------------------------------------

Then the output from git-sizer:

Processing blobs: 47991
Processing trees: 3946
Processing commits: 22
Matching commits to trees: 22
Processing annotated tags: 0
Processing references: 15

| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Blobs                      |           |                                |
|   * Total size               | 13.7 GiB  | *                              |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Trees                      |           |                                |
|   * Maximum entries [1]      | 13.4 k    | *************                  |
| * Blobs                      |           |                                |
|   * Maximum size [2]         | 279 MiB   | *****************************  |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Maximum path depth [3]     | 10        | *                              |
| * Maximum path length [3]    | 130 B     | *                              |
| * Total size of files [3]    | 8.63 GiB  | *********                      |

[1] b701345cbd4317276568b9d9890fd77f38933bdc (refs/heads/master:Resources/default/CIFP)
[2] 19f54c4a7595667329c1be23200234f0fe50ac56 (refs/heads/master:Resources/default/apt.dat)
[3] b0e3d3a2b7f2504117408f13486c905a8eb8fb1e (refs/heads/master^{tree})

Some notes:
[1] is a directory with many short (typically < 1 kB) text files
[2] is a very large text file with many changes
[3] Yes, some paths are long

Regards,
Ulrich

>>> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote on 20.08.2018 at 10:57 in
message <87woslpg9i.fsf@xxxxxxxxxxxxxxxxxxx>:
> On Mon, Aug 20 2018, Ulrich Windl wrote:
>
>>>>> Jeff King <peff@xxxxxxxx> wrote on 16.08.2018 at 22:55 in message
>> <20180816205556.GA8257@xxxxxxxxxxxxxxxxxxxxx>:
>>> On Thu, Aug 16, 2018 at 10:35:53PM +0200, Ævar Arnfjörð Bjarmason wrote:
>>>
>>>> This is all interesting, but I think unrelated to what Ulrich is talking
>>>> about. Quote:
>>>>
>>>>     Between the two phases of "git fsck" (checking directories and
>>>>     checking objects) there was a break of several seconds where no
>>>>     progress was indicated
>>>>
>>>> I.e. it's not about the pause you get with your testcase (which is
>>>> certainly another issue) but the break between the two progress bars.
>>>
>>> I think he's talking about both. What I said responds to this:
>>
>> Hi guys!
>>
>> Yes, I was wondering what git does between the two visible phases, and
>> between the lines I was suggesting another progress message between those
>> phases. At least the maximally unspecific three-dot message "Thinking..."
>> could be displayed ;-) Of course anything more appropriate would be welcome.
>>
>> Also, that message should only be displayed if it's foreseeable that the
>> operation will take significant time. In my case (I just repeated it a few
>> minutes ago) the delay is significant (at least 10 seconds). As noted
>> earlier, I was hoping to capture the timing in a screencast, but it seems
>> all the delays were just optimized away in the recording.
>>
>>>
>>>> >> During "git gc" the writing objects phase did not update for some
>>>> >> seconds, but then the percentage counter jumped like from 15% to 42%.
>>>
>>> But yeah, I missed that the fsck thing was specifically about a break
>>> between two meters. That's a separate problem, but also worth
>>> discussing (and hopefully much easier to address).
>>>
>>>> If you fsck this repository it'll take around (on my spinning rust
>>>> server) 30 seconds between 100% of "Checking object directories" before
>>>> you get any output from "Checking objects".
>>>>
>>>> The breakdown of that is (this is from approximate eyeballing):
>>>>
>>>> * We spend 1-3 seconds just on this:
>>>>
>>>>   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L181
>>>
>>> OK, so that's checking the sha1 over the .idx file. We could put a meter
>>> on that. I wouldn't expect it to generally be all that slow outside of
>>> pathological cases, since it scales with the number of objects (and 1s
>>> is our minimum update anyway, so that might be OK as-is). Your case has
>>> 13M objects, which is quite large.
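For illustration, a meter over the .idx checksum could look roughly like
the sketch below. git_SHA1_*() and the progress.c calls are git's
existing APIs; the function name, the 8 MiB chunk size, and the placement
are invented here, not taken from the thread:

  /*
   * Sketch only: a chunked variant of the one-shot .idx checksum in
   * pack-check.c, so that a progress meter can tick between updates.
   * The chunk size and function name are made up for illustration.
   */
  #include "cache.h"
  #include "progress.h"

  #define IDX_HASH_CHUNK (8 * 1024 * 1024)

  static void hash_index_with_meter(const unsigned char *index_base,
                                    size_t index_size,
                                    unsigned char *hash)
  {
          git_SHA_CTX ctx;
          size_t offset = 0;
          struct progress *progress =
                  start_progress("Checking pack index checksum", index_size);

          git_SHA1_Init(&ctx);
          while (offset < index_size) {
                  size_t chunk = index_size - offset;
                  if (chunk > IDX_HASH_CHUNK)
                          chunk = IDX_HASH_CHUNK;
                  git_SHA1_Update(&ctx, index_base + offset, chunk);
                  offset += chunk;
                  /* counting bytes, so the meter shows a percentage */
                  display_progress(progress, offset);
          }
          git_SHA1_Final(hash, &ctx);
          stop_progress(&progress);
  }

Since the .idx is already mmap'd in one piece, splitting the update into
chunks costs essentially nothing and gives the meter a place to tick.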
>> Sometimes an oldish CPU can bring performance surprises, maybe. Anyway,
>> the CPU in question is an AMD Phenom II quad-core with 3.2GHz nominal, and
>> there is a classic spinning disk with 5400RPM built in...
>>
>>>> * We spend the majority of the ~30s on this:
>>>>
>>>>   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L70-L79
>>>
>>> This is hashing the actual packfile. This is potentially quite long,
>>> especially if you have a ton of big objects.
>>
>> That seems to apply. BTW: Is there a way to get some repository statistics,
>> like a histogram of object sizes (or whatever else might be useful to help
>> making decisions)?
>
> The git-sizer program is really helpful in this regard:
> https://github.com/github/git-sizer
>
>>> I wonder if we need to do this as a separate step anyway, though. Our
>>> verification is based on index-pack these days, which means it's going
>>> to walk over the whole content as part of the "Indexing objects" step to
>>> expand base objects and mark deltas for later. Could we feed this hash
>>> as part of that walk over the data? It's not going to save us 30s, but
>>> it's likely to be more efficient. And it would fold the effort naturally
>>> into the existing progress meter.
>>>
>>>> * We spend another 3-5 seconds on this QSORT:
>>>>
>>>>   https://github.com/git/git/blob/63749b2dea5d1501ff85bab7b8a7f64911d21dea/pack-check.c#L105
>>>
>>> That's a tough one. I'm not sure how we'd count it (how many compares we
>>> do?). And each item is doing so little work that hitting the progress
>>> code may make things noticeably slower.
>>
>> If it's sorting, maybe add some code like (wild guess):
>>
>>     if (objects_to_sort > magic_number)
>>         message("Sorting something...");
>
> I think a good solution to these cases is to just introduce something to
> the progress.c code where it learns how to display a counter when we
> don't know what the end-state will be. Something like your proposed
> magic_number can just be covered under the more general case where we
> don't show the progress bar unless it's been 1 second (which I believe
> is the default).
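For illustration, such a total-less counter around the QSORT might look
roughly like the sketch below. start_delayed_progress(), display_progress()
and the QSORT macro are git's existing APIs, and passing a total of 0
already makes progress.c print a bare count; ticking from inside the
comparison callback is invented here and would need exactly the profiling
Peff mentions:

  /*
   * Sketch only: a delayed, total-less counter around the entry sort
   * in pack-check.c.  Counting inside the comparison callback is an
   * illustration, not a measured or proposed patch.
   */
  #include "cache.h"
  #include "pack.h"
  #include "progress.h"

  static struct progress *sort_progress;
  static uint64_t compares_seen;

  static int compare_entries(const void *e1, const void *e2)
  {
          const struct pack_idx_entry * const *entry1 = e1;
          const struct pack_idx_entry * const *entry2 = e2;

          display_progress(sort_progress, ++compares_seen);

          /* same ordering as pack-check.c: sort by pack offset */
          if ((*entry1)->offset < (*entry2)->offset)
                  return -1;
          if ((*entry1)->offset > (*entry2)->offset)
                  return 1;
          return 0;
  }

  static void sort_entries_with_counter(struct pack_idx_entry **entries,
                                        unsigned nr_objects)
  {
          /* total == 0: progress.c shows "title: <n>", no percentage */
          sort_progress = start_delayed_progress("Sorting pack index entries", 0);
          QSORT(entries, nr_objects, compare_entries);
          stop_progress(&sort_progress);
  }

The delayed start means fast sorts would print nothing at all, which
covers the magic_number idea without a hard-coded threshold.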
>
>>> Again, your case is pretty big. Just based on the number of objects,
>>> linux.git should be 1.5-2.5 seconds on your machine for the same
>>> operation. Which I think may be small enough to ignore (or even just
>>> print a generic before/after). It's really the 30s packfile hash that's
>>> making the whole thing so terrible.
>>>
>>> -Peff