Re: [PATCH 3/5] parallel-checkout: add configuration options

Matheus Tavares Bernardino <matheus.bernardino@xxxxxx> · Fri, 2 Apr 2021 11:45:47 -0300

On Wed, Mar 31, 2021 at 1:33 AM Christian Couder
<christian.couder@xxxxxxxxx> wrote:
>
> On Wed, Mar 17, 2021 at 10:12 PM Matheus Tavares
> <matheus.bernardino@xxxxxx> wrote:
>
> > The above benchmarks show that parallel checkout is most effective on
> > repositories located on an SSD or over a distributed file system. For
> > local file systems on spinning disks, and/or older machines, the
> > parallelism does not always bring a good performance. For this reason,
> > the default value for checkout.workers is one, a.k.a. sequential
> > checkout.
>
> I wonder how many people are still using HDD, and how many will still
> use them in a few years.
>
> I think having 1 as the default value for checkout.workers might be
> good for a while for backward compatibility and stability, while
> people who are interested can test parallel checkout on different
> environments. But we might want, in a few releases, after some bugs,
> if any, have been fixed, to use a default, maybe 10, that will provide
> significant speedup for most people, and will justify the added
> complexity, especially as your numbers still show a small speedup for
> HDD when using 10.

Yeah, I agree that it would be nice to have a better default in the
near future. Unfortunately, on some other HDD machines that I tested,
parallel checkout was slower than the sequential version. So I think
it may not be possible to enable parallelism by default now.

Nevertheless, I was also experimenting with some ideas to auto-detect
if parallelism is efficient in a given environment/repo and
auto-enable it if so. One interesting possibility suggested by Ævar
[1] was to implement this as a git maintenance task, which could
periodically probe the system and tune the checkout settings
appropriately.

[1]: https://lore.kernel.org/git/87y2ixpvos.fsf@xxxxxxxxxxxxxxxxxxx/

> > @@ -23,6 +25,19 @@ enum pc_status parallel_checkout_status(void)
> >         return parallel_checkout.status;
> >  }
> >
> > +#define DEFAULT_THRESHOLD_FOR_PARALLELISM 100
>
> Using a "static const int" might be a bit better.

Ok, I will change that.

> > +void get_parallel_checkout_configs(int *num_workers, int *threshold)
> > +{
> > +       if (git_config_get_int("checkout.workers", num_workers))
> > +               *num_workers = 1;
>
> I think it's better when an important default value like this "1" is
> made more visible using a "static const int" or a "#define".

Will do.

> > @@ -568,7 +581,7 @@ int run_parallel_checkout(struct checkout *state)
> >         if (parallel_checkout.nr < num_workers)
> >                 num_workers = parallel_checkout.nr;
> >
> > -       if (num_workers <= 1) {
> > +       if (num_workers <= 1 || parallel_checkout.nr < threshold) {
>
> Here we check the number of workers...
>
> >                 write_items_sequentially(state);
> >         } else {
> >                 struct pc_worker *workers = setup_workers(state, num_workers);
>
> [...]
>
> > @@ -480,7 +483,8 @@ static int check_updates(struct unpack_trees_options *o,
> >                 }
> >         }
> >         stop_progress(&progress);
> > -       errs |= run_parallel_checkout(&state);
> > +       if (pc_workers > 1)
> > +               errs |= run_parallel_checkout(&state, pc_workers, pc_threshold);

> ...but the number of workers was already checked here.

The re-checking at run_parallel_checkout() is important because
`num_workers` might actually become <= 1 after the above check at
check_updates(). This happens when there aren't enough enqueued
entries for 2+ workers, so we fall back to sequential checkout. It
also kind of works as a safe-mechanism for the case where the
run_parallel_checkout() caller forgot to check if the user actually
wants parallelism before calling the function.