01.04.2020, 03:09, "Derrick Stolee" <stolee@xxxxxxxxx>:
> On 3/31/2020 6:23 PM, Konstantin Tokarev wrote:
>> 01.04.2020, 01:10, "Konstantin Tokarev" <annulen@xxxxxxxxx>:
>>> 28.03.2020, 19:58, "Derrick Stolee" <stolee@xxxxxxxxx>:
>>>> On 3/28/2020 10:40 AM, Jeff King wrote:
>>>>> On Sat, Mar 28, 2020 at 12:08:17AM +0300, Konstantin Tokarev wrote:
>>>>>
>>>>>> Is it a known thing that addition of --filter=blob:none to workflow
>>>>>> with shallow clone (e.g. --depth=1) and following sparse checkout may
>>>>>> significantly slow down process and result in much larger .git
>>>>>> repository?
>>>>
>>>> In general, I would recommend not using shallow clones in conjunction
>>>> with partial clone. The blob:none filter will get you what you really
>>>> want from shallow clone without any of the downsides of shallow clone.
>>>
>>> Is it really so?
>>>
>>> As you can see from my measurements [1], in my case simple shallow clone (1)
>>> runs faster than simple partial clone (2) and produces slightly smaller .git,
>>> from which I can infer that (2) downloads some data which is not downloaded
>>> in (1).
>>
>> Actually, as I have full git logs for all these cases, there is no need to be guessing:
>> (1) downloads 295085 git objects of total size 1.00 GiB
>> (2) downloads 1949129 git objects of total size 1.01 GiB
>
> It is worth pointing out that these sizes are very close. The number of objects
> may be part of why the timing is so different as the client needs to parse all
> deltas to verify the object contents.
>
> Re-running the test with GIT_TRACE2_PERF=1 might reveal some interesting info
> about which regions are slower than others.

Here are trace results for (1) with the fix discussed below:
https://gist.github.com/annulen/58b868e35e992105e7028946a8370795

Here are trace results for (2) with the fix discussed below:
https://gist.github.com/annulen/fa1ef1b5d1056e6dede815e9ebf85c03

>> Total sizes are very close, but (2) downloads much more objects, and also it uses
>> 3 passes to download them which leads to less efficient use of network bandwidth.
>
> Three passes, being:
>
> 1. Download commits and trees.
> 2. Initialize sparse-checkout with blobs at root.
> 3. Expand sparse-checkout.
>
> Is that right? You could group 1 & 2 by setting your sparse-checkout patterns
> before initializing a checkout (if you clone with --no-checkout). Your link
> says you did this:
>
>   git clone <mode> --no-checkout <url> <dir>
>   git sparse-checkout init
>   git sparse-checkout set '/*' '!LayoutTests'
>
> Try doing it this way instead:
>
>   git clone <mode> --no-checkout <url> <dir>
>   git config core.sparseCheckout true
>   git sparse-checkout set '/*' '!LayoutTests'
>
> By doing it this way, you skip the step where the 'init' subcommand looks
> for all blobs at root and does a network call for them. Should remove some
> overhead.

Thanks, that helped. Now git downloads objects only twice.

From reading the man page I assumed that `git sparse-checkout init` should do
the same as `git config core.sparseCheckout true`, unless the `--cone` argument
is specified.
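For reference, the adjusted sequence I'm timing now looks roughly like this
for the partial clone case (2); <url>, <dir> and <branch> are placeholders,
and case (1) only differs in using --depth=1 instead of --filter=blob:none:

    # clone without a worktree so sparse patterns can be set first
    git clone --filter=blob:none --no-checkout <url> <dir>
    cd <dir>
    # enable sparse checkout without the extra blob fetch done by 'init'
    git config core.sparseCheckout true
    git sparse-checkout set '/*' '!LayoutTests'
    # finally populate the worktree according to the sparse patterns
    git checkout <branch>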
> Less efficient use of network bandwidth is one thing, but shallow clones are
> also more CPU-intensive with the "counting objects" phase on the server. Your
> link shares the following end-to-end timings:
>
>  * Shallow-clone: 234s
>  * Partial clone: 286s
>  * Both(???): 1023s
>
> The data implies that by asking for both you actually got a full clone (4.1 GB).

No, this is still a partial clone; a full clone takes more than 6 GB.

> The 234s to 286s difference is meaningful. Almost a minute.
>
>>> To be clear, use case which I'm interested right now is checking out sources in
>>> cloud CI system like GitHub Actions for one shot build. Right now checkout usually
>>> takes 1-2 minutes and my hope was that someday in the future it would be possible
>>> to make it faster.
>
> As long as you delete the shallow clone every time, then you also remove the
> downsides of a shallow clone related to a later fetch or attempts to push.
>
> If possible, a repo this size would benefit from persistent build agents that
> you control. They can keep a copy of the repo around and do incremental fetches
> that are much faster. It's a larger investment to run your own build lab, though.
> But sometimes making builds faster is expensive. It depends on how "expensive" those
> four minute clones per build are in terms of your team waiting.

No, current checkout times for shallow clone + sparseCheckout are quite acceptable.
(FWIW, initially I used shallow clone without sparseCheckout, as the latter is not
supported by GitHub Actions out of the box, and those times were NOT acceptable, as
depending on server load a checkout could take 16 minutes or even more.)

For now it's just my curiosity and a desire to provide info which could make Git
better. It just seemed logical to me initially that if we limit both the required
paths in the worktree and the history depth at the object download stage, it should
be more efficient than limiting only the history depth (or, at least, be equally
efficient).

BTW, I did more measurements and the results seem to be highly dependent on the
server side. Once, partial clone (2) even worked faster than shallow clone +
sparseCheckout (1). Still, most of the time (1) is faster than (2).

--
Regards,
Konstantin
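P.S. In case anyone wants to collect traces like the gists above, setting
GIT_TRACE2_PERF to an absolute file path for the clone command is enough
(the path below is just an example; GIT_TRACE2_PERF=1 writes the trace to
stderr instead):

    # write a perf-format trace of the clone to a file
    GIT_TRACE2_PERF=/tmp/clone-trace.txt \
        git clone --filter=blob:none --no-checkout <url> <dir>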