Re: [PATCH v4 4/6] update-index: use the bulk-checkin infrastructure

Neeraj Singh <nksingh85@xxxxxxxxx> · Tue, 21 Sep 2021 18:27:25 -0700

On Tue, Sep 21, 2021 at 4:53 PM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
>
> On Mon, Sep 20 2021, Neeraj Singh via GitGitGadget wrote:
>
> > From: Neeraj Singh <neerajsi@xxxxxxxxxxxxx>
> >
> > The update-index functionality is used internally by 'git stash push' to
> > setup the internal stashed commit.
> >
> > This change enables bulk-checkin for update-index infrastructure to
> > speed up adding new objects to the object database by leveraging the
> > pack functionality and the new bulk-fsync functionality. This mode
> > is enabled when passing paths to update-index via the --stdin flag,
> > as is done by 'git stash'.
> >
> > There is some risk with this change, since under batch fsync, the object
> > files will not be available until the update-index is entirely complete.
> > This usage is unlikely, since any tool invoking update-index and
> > expecting to see objects would have to snoop the output of --verbose to
> > find out when update-index has actually processed a given path.
> > Additionally the index is locked for the duration of the update.
>
> Would you really need to sniff the verbose output? If I'm streaming data
> to update-index now it looks like I could assume before that
> update-index would have done the work if I managed to fflush() to it,
> since it's processing a line at a time and doing the work in that
> line-at-a-time loop.
>
> I.e. you could print lines to it, and then do concurrent object lookups
> knowing the data was written already...
>
> I think this is probably fine, but that case seems way likelier than
> someone sniffing back the verbose output, presumably for the "add" in
> update_one(), but that's called in the getline_fn() loop...

Does fflush really guarantee that the reader has picked up the input from
a pipe across all environments?  Even if a reader picks up the input, does
that mean that the reader is done processing it?

Do you think I really need to revise this comment? Maybe leave a terser,
'this usage is thought to be unlikely'?

>
> All of this makes me wonder why this isn't using tmp-objdir.c, i.e. we
> could have our cake and eat it too by writing the "real" objects, and
> then just renaming them between directories instead. But perhaps the
> answer has something to do with the metadata issues I raised.
>
> And well, tmp-objdir.c isn't going to help someone in practice that's
> relying on this "update-index --stdin" behavior, as they won't know
> where we staged the temporary files...
>

One motivation of the current design behind renaming the files is that
some networked filesystems don't seem to like cross-directory renames
much.  It also so happens that ReFS on Windows also prefers renames to
stay within the directory. Actually any filesystem would likely be
slightly faster,
since fewer objects are being modified (one dir versus two).