On Wed, Mar 31, 2021 at 1:33 AM Christian Couder <christian.couder@xxxxxxxxx> wrote: > > On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares > <matheus.bernardino@xxxxxx> wrote: > > > The above benchmarks show that parallel checkout is most effective on > > repositories located on an SSD or over a distributed file system. For > > local file systems on spinning disks, and/or older machines, the > > parallelism does not always bring a good performance. For this reason, > > the default value for checkout.workers is one, a.k.a. sequential > > checkout. > > I wonder how many people are still using HDD, and how many will still > use them in a few years. > > I think having 1 as the default value for checkout.workers might be > good for a while for backward compatibility and stability, while > people who are interested can test parallel checkout on different > environments. But we might want, in a few releases, after some bugs, > if any, have been fixed, to use a default, maybe 10, that will provide > significant speedup for most people, and will justify the added > complexity, especially as your numbers still show a small speedup for > HDD when using 10. Yeah, I agree that it would be nice to have a better default in the near future. Unfortunately, on some other HDD machines that I tested, parallel checkout was slower than the sequential version. So I think it may not be possible to enable parallelism by default now. Nevertheless, I was also experimenting with some ideas to auto-detect if parallelism is efficient in a given environment/repo and auto-enable it if so. One interesting possibility suggested by Ævar [1] was to implement this as a git maintenance task, which could periodically probe the system and tune the checkout settings appropriately. [1]: https://lore.kernel.org/git/87y2ixpvos.fsf@xxxxxxxxxxxxxxxxxxx/ > > @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void) > > return parallel_checkout.status; > > } > > > > +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100 > > Using a "static const int" might be a bit better. Ok, I will change that. > > +void get_parallel_checkout_configs(int *num_workers, int *threshold) > > +{ > > + if (git_config_get_int("checkout.workers", num_workers)) > > + *num_workers = 1; > > I think it's better when an important default value like this "1" is > made more visible using a "static const int" or a "#define". Will do. > > @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state) > > if (parallel_checkout.nr < num_workers) > > num_workers = parallel_checkout.nr; > > > > - if (num_workers <= 1) { > > + if (num_workers <= 1 || parallel_checkout.nr < threshold) { > > Here we check the number of workers... > > > write_items_sequentially(state); > > } else { > > struct pc_worker *workers = setup_workers(state, num_workers); > > [...] > > > @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o, > > } > > } > > stop_progress(&progress); > > - errs |= run_parallel_checkout(&state); > > + if (pc_workers > 1) > > + errs |= run_parallel_checkout(&state, pc_workers, pc_threshold); > ...but the number of workers was already checked here. The re-checking at run_parallel_checkout() is important because `num_workers` might actually become <= 1 after the above check at check_updates(). This happens when there aren't enough enqueued entries for 2+ workers, so we fall back to sequential checkout. It also kind of works as a safe-mechanism for the case where the run_parallel_checkout() caller forgot to check if the user actually wants parallelism before calling the function.