On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
>
> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>
> > * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
> >   'true', 'false', and 'batch'. This makes using old and new versions of
> >   git with 'batch' mode a little trickier, but hopefully people will
> >   generally be moving forward in versions.
> >
> > [1] See
> > https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@xxxxxxxxx/
> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>
> I really think leaving that in-place is just being unnecessarily
> cavalier. There's a lot of mixed-version environments where git is
> deployed in, and we almost never break the configuration in this way (I
> think in the past always by mistake).
>
> In this case it's easy to avoid it, and coming up with a less narrow
> config model[1] seems like a good idea in any case to unify the various
> outstanding work in this area.
>
> More generally on this series, per the thread ending in [2] I really

My primary goal in all of these changes is to move git-for-windows over
to a default of batch fsync, so that it gets closer to other platforms
in the performance of 'git add' while retaining the same level of data
integrity. I'm hoping that most end-users are just sticking to the
defaults here. I'm happy to change the configuration schema again if
there's a consensus from the Git community that backwards compatibility
of the configuration is actually important to someone.

Also, if we're doing a deeper rethink of the fsync configuration (as
prompted by this work and Eric Wong's and Patrick Steinhardt's work), do
we want to retain a mode where we fsync some parts of the persistent
repo data but not others? If we add fsyncing of the index in addition to
the refs, I believe we would have covered all of the critical data
structures needed to find the data a user has added to the repo if they
complete a series of git commands and then experience a system crash.

> don't get why we have code like this:
>
> @@ -503,10 +504,12 @@ static void unpack_all(void)
>  	if (!quiet)
>  		progress = start_progress(_("Unpacking objects"), nr_objects);
>  	CALLOC_ARRAY(obj_list, nr_objects);
> +	plug_bulk_checkin();
>  	for (i = 0; i < nr_objects; i++) {
>  		unpack_one(i);
>  		display_progress(progress, i + 1);
>  	}
> +	unplug_bulk_checkin();
>  	stop_progress(&progress);
>
>  	if (delta_list)
>
> As opposed to doing an fsync on the last object we're
> processing. I.e. why do we need the step of intentionally making the
> objects unavailable in the tmp-objdir, and creating a "cookie" file to
> sync at the start/end, as opposed to fsyncing on the last file (which
> we're writing out anyway).
>
> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@xxxxxxxxxxxxxxxxxxx/
> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/

It's important not to expose an object's final name until its contents
have been fsynced to disk. We want to ensure that wherever we crash, we
won't have a loose object that Git may later try to open where the
filename doesn't match the content hash. I believe it's okay for a given
OID to be missing, since a later command could recreate it, but an
object with the wrong hash looks like it would persist until we run
git-fsck.
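
To make the ordering concrete, here is a rough, self-contained POSIX
sketch of the invariant I mean. This is not the actual bulk-checkin
code: the "objdir" paths, the two-object batch, and the Linux-specific
sync_file_range() call are just illustrative assumptions, and it leans
on the 'batch' premise that one fsync of a cookie file also makes the
earlier, already-submitted writes durable on common filesystems.

/*
 * Illustrative sketch only (not Git code): write a batch of "objects"
 * under temporary names, do one flush for the whole batch via a dummy
 * cookie file, and only then rename them to their final names.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void write_tmp(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (write(fd, data, strlen(data)) < 0)	/* other checks omitted */
		perror("write");
	/* Start writeback to the device, but don't wait for a flush. */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
	close(fd);
}

int main(void)
{
	const char *tmp[]  = { "objdir/tmp_obj_1", "objdir/tmp_obj_2" };
	const char *dest[] = { "objdir/final_1",   "objdir/final_2" };
	int i, fd;

	mkdir("objdir", 0755);

	/* 1. Write every object in the batch, with no fsync per file. */
	for (i = 0; i < 2; i++)
		write_tmp(tmp[i], "loose object contents");

	/*
	 * 2. One hardware flush for the whole batch, via a cookie file.
	 *    Assumption of 'batch' mode: this flush also makes the
	 *    writes from step 1 durable.
	 */
	fd = open("objdir/sync_cookie", O_WRONLY | O_CREAT, 0600);
	fsync(fd);
	close(fd);

	/* 3. Only now expose the final (hash-derived) names. */
	for (i = 0; i < 2; i++)
		rename(tmp[i], dest[i]);

	/* 4. Make the renames themselves durable. */
	fd = open("objdir", O_RDONLY);
	fsync(fd);
	close(fd);
	return 0;
}

If we crash before step 3, the worst case is a few leftover temporary
files and a cookie in the object directory; we never end up with a
final-named object whose contents might not match its name.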
I thought about syncing the last object instead of a separate "cookie"
file, but it wasn't clear how library code could know which object is
actually last without burdening each command with tracking its final
object and communicating that. The 'cookie' approach leads to a cleaner
interface for callers.

Thanks,
Neeraj