Re: [PATCH v4 10/19] unpack-trees: add basic support for parallel checkout

Hi, Christian

On Sun, Dec 6, 2020 at 8:36 AM Christian Couder
<christian.couder@xxxxxxxxx> wrote:
>
> On Wed, Nov 4, 2020 at 9:34 PM Matheus Tavares
> <matheus.bernardino@xxxxxx> wrote:
> >
> > This new interface allows us to enqueue some of the entries being
> > checked out to later call write_entry() for them in parallel. For now,
> > the parallel checkout machinery is enabled by default and there is no
> > user configuration, but run_parallel_checkout() just writes the queued
> > entries in sequence (without spawning additional workers). The next
> > patch will actually implement the parallelism and, later, we will make
> > it configurable.

I just noticed that this part of the commit message is a little
outdated. I'll fix it for v5. Currently, the parallelism and
configuration are added in the same patch (the next one). This way,
the patch that adds parallelism can already include runtime numbers
for different configuration values (which shows when the change is
beneficial).

> I would think that it might be more logical to first add a
> configuration that does nothing, then add writing the queued entries
> in sequence without parallelism, and then add actual parallelism.

I'm not sure I get the idea. Would the first patch add just the
documentation for `checkout.workers` and
`checkout.thresholdForParallelism` in
`Documentation/config/checkout.txt`, without any code support for
them? In that case, wouldn't that patch be somewhat incomplete on its
own?

> > When there are path collisions among the entries being written (which
> > can happen e.g. with case-sensitive files in case-insensitive file
> > systems), the parallel checkout code detects the problem and marks the
> > item with PC_ITEM_COLLIDED.
>
> Is this needed in this step that only writes the queued entries in
> sequence without parallelism, or could this be added later, before the
> step that adds actual parallelism?

This is already used in this patch. Even though the parallel checkout
machinery only learns to spawn additional workers in the next patch,
it can already encounter path collisions when writing the entries
sequentially. PC_ITEM_COLLIDED is then used to mark the colliding
entries, so that they can be properly handled later.
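
To make the "handled later" part concrete, here is a rough sketch
(illustrative names and trimmed types, not the code from the patch):
once the queue has been flushed, the items flagged as PC_ITEM_COLLIDED
are simply retried on the regular, sequential code path.

#include <stddef.h>

struct cache_entry;	/* from cache.h */
enum pc_item_status { PC_ITEM_COLLIDED = 1 /* other values omitted */ };

struct item_sketch {
	struct cache_entry *ce;	/* points into istate->cache[] */
	enum pc_item_status status;
};

static void handle_collided_items(struct item_sketch *items, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (items[i].status != PC_ITEM_COLLIDED)
			continue;
		/*
		 * The real code would feed items[i].ce back to
		 * checkout_entry() here, overwriting whatever is on
		 * disk, just like the sequential code handles
		 * colliding paths.
		 */
	}
}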

> > Later, these items are sequentially fed to
> > checkout_entry() again. This is similar to the way the sequential code
> > deals with collisions, overwriting the previously checked out entries
> > with the subsequent ones. The only difference is that, when we start
> > writing the entries in parallel, we won't be able to determine which of
> > the colliding entries will survive on disk (for the sequential
> > algorithm, it is always the last one).
>
> So I guess that PC_ITEM_COLLIDED will then be used to decide which
> entries will not be checked out in parallel?

Yes, in the next patch, the parallel workers will detect collisions
when `open(path, O_CREAT | O_EXCL)` fails with EEXIST or EISDIR. The
workers then mark such items with PC_ITEM_COLLIDED and let the main
process sequentially write them later.
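
In standalone form, the syscall behavior I mean is roughly this (just
a demonstration of the EEXIST case, not the worker code itself):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Returns 0 when the file was created, 1 when the path already exists
 * (i.e. what a worker would flag as PC_ITEM_COLLIDED), and -1 on
 * other errors.
 */
static int create_or_flag_collision(const char *path)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0666);

	if (fd < 0) {
		if (errno == EEXIST || errno == EISDIR)
			return 1;
		return -1;
	}
	/* ... the blob contents would be written through fd ... */
	close(fd);
	return 0;
}

int main(void)
{
	/* Assumes "demo.txt" does not exist yet. */
	printf("%d\n", create_or_flag_collision("demo.txt"));	/* 0 */
	printf("%d\n", create_or_flag_collision("demo.txt"));	/* 1 */
	return unlink("demo.txt") ? 1 : 0;
}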

> > I also experimented with the idea of not overwriting colliding entries,
> > and it seemed to work well in my simple tests.
>
> There are a number of co-author of this patch, so it's not very clear
> who "I" is. Maybe:
>
> "The idea of not overwriting colliding entries seemed to work well in
> simple tests, however ..."

Makes sense, thanks.

> > However, because just one
> > entry of each colliding group would be actually written, the others
> > would have null lstat() fields on the index. This might not be a problem
> > by itself, but it could cause performance penalties for subsequent
> > commands that need to refresh the index: when the st_size value cached
> > is 0, read-cache.c:ie_modified() will go to the filesystem to see if the
> > contents match. As mentioned in the function:
> >
> >     * Immediately after read-tree or update-index --cacheinfo,
> >     * the length field is zero, as we have never even read the
> >     * lstat(2) information once, and we cannot trust DATA_CHANGED
> >     * returned by ie_match_stat() which in turn was returned by
> >     * ce_match_stat_basic() to signal that the filesize of the
> >     * blob changed.  We have to actually go to the filesystem to
> >     * see if the contents match, and if so, should answer "unchanged".
> >
> > So, if we have N entries in a colliding group and we decide to write and
> > lstat() only one of them, every subsequent git-status will have to read,
> > convert, and hash the written file N - 1 times, to check that the N - 1
> > unwritten entries are dirty. By checking out all colliding entries (like
> > the sequential code does), we only pay the overhead once.
>
> Ok.
>
> >  5 files changed, 410 insertions(+), 3 deletions(-)
>
> It looks like a lot of new code in one patch/commit, which is why it
> might be interesting to split it.

Yeah, this and the following patch ended up quite big... But I wasn't
sure how to further split them while still keeping each part buildable
and self-contained :(

> > @@ -7,6 +7,7 @@
> >  #include "progress.h"
> >  #include "fsmonitor.h"
> >  #include "entry.h"
> > +#include "parallel-checkout.h"
> >
> >  static void create_directories(const char *path, int path_len,
> >                                const struct checkout *state)
> > @@ -426,8 +427,17 @@ static void mark_colliding_entries(const struct checkout *state,
> >         for (i = 0; i < state->istate->cache_nr; i++) {
> >                 struct cache_entry *dup = state->istate->cache[i];
> >
> > -               if (dup == ce)
> > -                       break;
> > +               if (dup == ce) {
> > +                       /*
> > +                        * Parallel checkout creates the files in no particular
> > +                        * order. So the other side of the collision may appear
> > +                        * after the given cache_entry in the array.
> > +                        */
>
> Is it really the case right now that the code creates files in no
> particular order or will that be the case later when actual
> parallelism is implemented?

In this patch, the code already creates the files in no particular
order. Since not all entries are eligible for parallel checkout, and
since the ineligible ones are written first, the files are not created
in the same order in which they appear in istate->cache[], even though
everything is still written sequentially at this point.
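
As a toy illustration of that reordering (the entry names and the
eligibility rule here are made up, the real checks are in the patch):

#include <stdio.h>

struct toy_entry {
	const char *path;
	int eligible;	/* would this entry go through the checkout queue? */
};

int main(void)
{
	struct toy_entry cache[] = {	/* index order */
		{ "first-blob", 1 },
		{ "some-ineligible-entry", 0 },
		{ "second-blob", 1 },
	};
	int i, n = sizeof(cache) / sizeof(cache[0]);

	/* Ineligible entries are written on the spot... */
	for (i = 0; i < n; i++)
		if (!cache[i].eligible)
			printf("write %s\n", cache[i].path);

	/*
	 * ...while the enqueued ones only hit the disk when the queue
	 * is flushed, so creation order differs from index order.
	 */
	for (i = 0; i < n; i++)
		if (cache[i].eligible)
			printf("write %s\n", cache[i].path);

	return 0;
}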

> > +                       if (parallel_checkout_status() == PC_RUNNING)
> > +                               continue;
> > +                       else
> > +                               break;
> > +               }
>
> > +struct parallel_checkout_item {
> > +       /* pointer to a istate->cache[] entry. Not owned by us. */
> > +       struct cache_entry *ce;
> > +       struct conv_attrs ca;
> > +       struct stat st;
> > +       enum pc_item_status status;
> > +};
>
> "item" seems not very clear to me. If there is only one
> parallel_checkout_item for each cache_entry then it might be better to
> use "parallel_checkout_entry" instead of "parallel_checkout_item".

Hmm, I'm a little uncertain about this one. I usually use "item" and
"entry" interchangeably when talking about elements of a list, as in
this case. Could it be that 'struct parallel_checkout_item' seems
unclear because its definition is far from 'struct parallel_checkout',
where the list itself is defined?
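
For reference, the relationship I have in mind is roughly this (a
trimmed sketch, not the exact definitions from the patch): each
enqueued cache_entry becomes one item of the list owned by 'struct
parallel_checkout'.

#include <stddef.h>

struct cache_entry;	/* from cache.h */

enum pc_status { PC_UNINITIALIZED = 0, PC_ACCEPTING_ENTRIES, PC_RUNNING };

struct parallel_checkout_item {
	struct cache_entry *ce;	/* one item per enqueued istate->cache[] entry */
	/* plus ca, st and status, as quoted above */
};

struct parallel_checkout {
	enum pc_status status;
	struct parallel_checkout_item *items;	/* the list the items belong to */
	size_t nr, alloc;
};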

> > +enum pc_status {
> > +       PC_UNINITIALIZED = 0,
> > +       PC_ACCEPTING_ENTRIES,
> > +       PC_RUNNING,
> > +};
> > +
> > +enum pc_status parallel_checkout_status(void);
> > +void init_parallel_checkout(void);
>
> Maybe a comment to tell what the above function does could be helpful.
> If I had to guess, I would write something like:
>
> /*
>  * Put parallel checkout into the PC_ACCEPTING_ENTRIES state.
>  * Should be used only when in the PC_UNINITIALIZED state.
>  */

OK, will do, thanks!
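
Just to sketch how I would apply it (a standalone version, so abort()
stands in for what the actual code would do with BUG()):

#include <stdlib.h>

enum pc_status { PC_UNINITIALIZED = 0, PC_ACCEPTING_ENTRIES, PC_RUNNING };
static enum pc_status pc_status = PC_UNINITIALIZED;

enum pc_status parallel_checkout_status(void)
{
	return pc_status;
}

/*
 * Put parallel checkout into the PC_ACCEPTING_ENTRIES state.
 * Should be used only when in the PC_UNINITIALIZED state.
 */
void init_parallel_checkout(void)
{
	if (pc_status != PC_UNINITIALIZED)
		abort();	/* the actual code would BUG() here */
	pc_status = PC_ACCEPTING_ENTRIES;
}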

> > +/*
> > + * Return -1 if parallel checkout is currently not enabled
>
> Is it "enabled" or "initialized" or "configured" here? Does it refer
> to `enum pc_status` or a config option or something else? Looking at
> the code, it is testing if the status PC_ACCEPTING_ENTRIES, so
> perhaps: s/not enabled/not accepting entries/

Yes, that's better, thanks!
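
In other words, the check is essentially a gate of this shape
(simplified, with the real parameters left out), so that the caller
can presumably fall back to writing the entry sequentially:

enum pc_status { PC_UNINITIALIZED = 0, PC_ACCEPTING_ENTRIES, PC_RUNNING };

extern enum pc_status parallel_checkout_status(void);	/* as sketched above */

/*
 * Return -1 if parallel checkout is currently not accepting entries.
 */
static int enqueue_entry_sketch(void /* real parameters omitted */)
{
	if (parallel_checkout_status() != PC_ACCEPTING_ENTRIES)
		return -1;

	/* ... append the entry to the parallel checkout queue ... */
	return 0;
}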


