Re: Inefficiency of partial shallow clone vs shallow clone + "old-style" sparse checkout

On 3/31/2020 6:23 PM, Konstantin Tokarev wrote:
> 01.04.2020, 01:10, "Konstantin Tokarev" <annulen@xxxxxxxxx>:
>> 28.03.2020, 19:58, "Derrick Stolee" <stolee@xxxxxxxxx>:
>>>  On 3/28/2020 10:40 AM, Jeff King wrote:
>>>>   On Sat, Mar 28, 2020 at 12:08:17AM +0300, Konstantin Tokarev wrote:
>>>>
>>>>>   Is it a known thing that addition of --filter=blob:none to workflow
>>>>>   with shallow clone (e.g. --depth=1) and following sparse checkout may
>>>>>   significantly slow down process and result in much larger .git
>>>>>   repository?
>>>
>>>  In general, I would recommend not using shallow clones in conjunction
>>>  with partial clone. The blob:none filter will get you what you really
>>>  want from shallow clone without any of the downsides of shallow clone.
>>
>> Is it really so?
>>
>> As you can see from my measurements [1], in my case simple shallow clone (1)
>> runs faster than simple partial clone (2) and produces slightly smaller .git,
>> from which I can infer that (2) downloads some data which is not downloaded
>> in (1).
> 
> Actually, as I have full git logs for all these cases, there is no need to guess:
>     (1) downloads 295085 git objects of total size 1.00 GiB
>     (2) downloads 1949129 git objects of total size 1.01 GiB

It is worth pointing out that these sizes are very close. The much larger number
of objects may be part of why the timing is so different, as the client needs to
resolve all of the deltas to verify the object contents.
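
If you want to compare what each clone actually stored locally, running
'git count-objects -v' inside each repository reports object counts and pack
sizes:

	git count-objects -v   # count, size, in-pack, packs, size-pack, etc.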

Re-running the test with GIT_TRACE2_PERF=1 might reveal some interesting info
about which regions are slower than others.
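
For example, you can send the trace to a file by giving an absolute path as
the value (a value of 1 sends it to stderr); <url> and <dir> are placeholders
here:

	GIT_TRACE2_PERF=/tmp/clone-trace.perf \
		git clone --filter=blob:none --no-checkout <url> <dir>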

> Total sizes are very close, but (2) downloads many more objects, and it also uses
> three passes to download them, which leads to less efficient use of network bandwidth.

Three passes, being:

1. Download commits and trees.
2. Initialize sparse-checkout with blobs at root.
3. Expand sparse-checkout.

Is that right? You could combine steps 1 and 2 by setting your sparse-checkout
patterns before initializing the checkout (if you clone with --no-checkout).
Your link says you did this:

	git clone <mode> --no-checkout <url> <dir>
	git sparse-checkout init
	git sparse-checkout set '/*' '!LayoutTests'

Try doing it this way instead:

	git clone <mode> --no-checkout <url> <dir>
	git config core.sparseCheckout true
	git sparse-checkout set '/*' '!LayoutTests'

By doing it this way, you skip the step where the 'init' subcommand looks
for all blobs at the root and makes a network call for them. That should
remove some overhead.
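
For completeness: since the clone used --no-checkout, you still need a checkout
to populate the working tree afterwards, and that step should fetch the missing
blobs in a single batch. A sketch, with <branch> as a placeholder:

	git checkout <branch>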

Less efficient use of network bandwidth is one thing, but shallow clones are
also more CPU-intensive with the "counting objects" phase on the server. Your
link shares the following end-to-end timings:

* Shallow-clone: 234s
* Partial clone: 286s
* Both(???): 1023s

The data implies that by asking for both you actually got a full clone (4.1 GB).

The 234s to 286s difference is meaningful. Almost a minute.
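
If you want to reproduce the comparison, a rough sketch (with <url> as a
placeholder; the numbers will depend heavily on the server and network):

	time git clone --depth=1 --no-checkout <url> shallow-test
	time git clone --filter=blob:none --no-checkout <url> partial-test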

>> To be clear, the use case I'm interested in right now is checking out sources in
>> a cloud CI system like GitHub Actions for a one-shot build. Right now checkout usually
>> takes 1-2 minutes, and my hope was that someday in the future it would be possible
>> to make it faster.

As long as you delete the shallow clone every time, you also avoid the downsides
of a shallow clone related to later fetches or attempted pushes.

If possible, a repo this size would benefit from persistent build agents that
you control. They can keep a copy of the repo around and do incremental fetches
that are much faster. It's a larger investment to run your own build lab, though,
and sometimes making builds faster is expensive. It depends on how "expensive"
those four-minute clones per build are in terms of your team's waiting time.
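
As a sketch of what that incremental flow could look like on a persistent agent
(<url>, <dir>, and <sha> are placeholders for your setup):

	if [ -d "<dir>/.git" ]; then
		git -C "<dir>" fetch origin                 # incremental: only new objects
	else
		git clone --filter=blob:none <url> "<dir>"  # first run: partial clone
	fi
	git -C "<dir>" checkout --force <sha>           # detached checkout of the build commit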

Thanks,
-Stolee


