This series adds parallel workers to the checkout machinery. The cache
entries are distributed among helper processes which are responsible for
reading, filtering and writing the blobs to the working tree. This should
benefit all commands that call unpack_trees() or check_updates(), such
as: checkout, clone, sparse-checkout, checkout-index, etc.

This proposal is based on two previous ones, by Duy [1] and Jeff [2]. It
uses some of the patches from these two series, with additional changes.

The final parallel version was benchmarked during three operations with
cold cache in the linux repo: cloning v5.8, checking out v5.8 from
v2.6.15 and checking out v5.8 from v5.7. The three tables below show the
mean run times and standard deviations for 5 runs in: a local file
system, a Linux NFS server and Amazon EFS. The number of workers was
chosen based on what produces the best result for each case.

Local:

             Clone                  Checkout I             Checkout II
Sequential   8.180 s ± 0.021 s      6.936 s ± 0.030 s      2.585 s ± 0.005 s
10 workers   3.406 s ± 0.187 s      2.164 s ± 0.033 s      1.050 s ± 0.021 s
Speedup      2.40 ± 0.13            3.21 ± 0.05            2.46 ± 0.05

Linux NFS server (v4.1, on EBS, single availability zone):

             Clone                  Checkout I             Checkout II
Sequential   208.069 s ± 2.522 s    198.610 s ± 1.979 s    54.376 s ± 1.333 s
32 workers   58.170 s ± 0.648 s     56.471 s ± 0.093 s     22.311 s ± 0.220 s
Speedup      3.58 ± 0.06            3.52 ± 0.04            2.44 ± 0.06

EFS (v4.1, replicated over multiple availability zones):

             Clone                  Checkout I             Checkout II
Sequential   1143.655 s ± 11.819 s  1277.891 s ± 10.481 s  396.891 s ± 7.505 s
64 workers   94.778 s ± 4.984 s     201.674 s ± 2.286 s    149.951 s ± 12.895 s
Speedup      12.07 ± 0.65           6.34 ± 0.09            2.65 ± 0.23

I also repeated the local benchmark tests including pc-p4-core [2], to
make sure the new proposal doesn't have performance regressions:

             Clone                  Checkout I             Checkout II
pc-p4-core   3.746 s ± 0.044 s      3.158 s ± 0.041 s      1.597 s ± 0.019 s
10 workers   3.595 s ± 0.111 s      2.263 s ± 0.027 s      1.098 s ± 0.023 s
Speedup      1.04 ± 0.03            1.40 ± 0.02            1.45 ± 0.04
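
For readers who want to re-derive the "Speedup" rows: the letter does not
say how the uncertainties were computed, but the reported values are
consistent with first-order error propagation for a ratio of two
independent means. The sketch below rests on that assumption, and the
function name speedup() is only illustrative.

```python
import math

def speedup(seq_mean, seq_std, par_mean, par_std):
    """Return (ratio, uncertainty) for seq_mean / par_mean.

    Assumes the two measurements are independent and combines their
    relative errors in quadrature (first-order propagation).
    """
    ratio = seq_mean / par_mean
    rel_err = math.hypot(seq_std / seq_mean, par_std / par_mean)
    return ratio, ratio * rel_err

# Local file system, clone: sequential vs. 10 workers
ratio, err = speedup(8.180, 0.021, 3.406, 0.187)
print(f"{ratio:.2f} ± {err:.2f}")  # 2.40 ± 0.13
```

Applied to the EFS clone timings, the same calculation gives
12.07 ± 0.65, which also agrees with the table.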

The series is divided into three blocks:

- The first 9 patches are preparatory steps in convert.c and entry.c.

- The middle 7 actually implement parallel checkout.

- The last 5 are ideas for further optimization of the parallel
  version. They don't make a big difference on local file systems
  (e.g. linux clone is only 1.04x faster than with the previous parallel
  code), but on distributed file systems the difference is significant:
  1.15x faster on NFS and 1.83x faster on Amazon EFS. (For comparison,
  the timings before these additional patches can be seen in the commit
  message of patch 11.)

The first 4 patches come from [2]. I haven't been able to get in touch
with Jeff yet to ask for his approval on them, so I didn't include his
Signed-off-by, for the time being.

Note: we probably want to add some extra validation and perf tests. But,
for now, parallel checkout is enabled by default in this series (with no
threshold on the minimum number of entries), so the test suite is already
exercising the parallel code (see [3]).

There are some additional optimization possibilities I want to
experiment with later, such as:

- Work stealing, to better redistribute tasks in case of non-uniform
  workloads. Duy already proposed a way to implement this in his
  original series.

- Add a --stat option to checkout--helper, to avoid calling stat() when
  state.refresh_cache is false.

- Try to detect when a repository is on NFS/EFS to automatically use a
  higher number of workers, as this turned out to be very effective on
  distributed file systems.

[1]: https://gitlab.com/pclouds/git/-/commits/parallel-checkout
[2]: https://github.com/jeffhostetler/git/commits/pc-p4-core
[3]: https://github.com/matheustavares/git/actions/runs/203036951

----

Notes on the benchmarks:

Local tests were executed on an i7-7700HQ (4 cores with hyper-threading)
running Manjaro Linux, with SSD. NFS and EFS tests were executed on an
Amazon EC2 c5n.large instance, with 2 vCPUs.
The Linux NFS server was running on an m6g.large instance with a 1 TB
EBS GP2 volume.

For pc-p4-core tests, I used the set of parameters that resulted in the
fastest mean execution (of 5 runs) on my machine, which was:

- For clone: async mode, 22 helpers, 2 writers, 10 preloading slots
- For checkout I: async mode, 20 helpers, 2 writers, 20 preloading slots
- For checkout II: sync mode, 4 helpers, 2 writers, 30 preloading slots

Jeff Hostetler (4):
  convert: make convert_attrs() and convert structs public
  convert: add [async_]convert_to_working_tree_ca() variants
  convert: add get_stream_filter_ca() variant
  convert: add conv_attrs classification

Matheus Tavares (17):
  entry: extract a header file for entry.c functions
  entry: make fstat_output() and read_blob_entry() public
  entry: extract cache_entry update from write_entry()
  entry: move conv_attrs lookup up to checkout_entry()
  entry: add checkout_entry_ca() which takes preloaded conv_attrs
  unpack-trees: add basic support for parallel checkout
  parallel-checkout: make it truly parallel
  parallel-checkout: add configuration options
  parallel-checkout: support progress displaying
  make_transient_cache_entry(): optionally alloc from mem_pool
  builtin/checkout.c: complete parallel checkout support
  checkout-index: add parallel checkout support
  parallel-checkout: avoid stat() calls in workers
  entry: use is_dir_sep() when checking leading dirs
  symlinks: make has_dirs_only_path() track FL_NOENT
  parallel-checkout: create leading dirs in workers
  parallel-checkout: skip checking the working tree on clone

 .gitignore                        |   1 +
 Documentation/config/checkout.txt |  16 +
 Makefile                          |   2 +
 apply.c                           |   1 +
 builtin.h                         |   1 +
 builtin/checkout--helper.c        | 135 +++++++
 builtin/checkout-index.c          |  17 +
 builtin/checkout.c                |  21 +-
 builtin/difftool.c                |   3 +-
 cache.h                           |  35 +-
 convert.c                         | 121 +++---
 convert.h                         |  68 ++++
 entry.c                           | 180 +++++++--
 entry.h                           |  54 +++
 git.c                             |   2 +
 parallel-checkout.c               | 611 ++++++++++++++++++++++++++++++
 parallel-checkout.h               | 103 +++++
 read-cache.c                      |  12 +-
 symlinks.c                        |  42 +-
 unpack-trees.c                    |  24 +-
 20 files changed, 1292 insertions(+), 157 deletions(-)
 create mode 100644 builtin/checkout--helper.c
 create mode 100644 entry.h
 create mode 100644 parallel-checkout.c
 create mode 100644 parallel-checkout.h

-- 
2.27.0