On Tue, Nov 16 2021, Neeraj Singh wrote:

> On Tue, Nov 16, 2021 at 12:10 AM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>>
>> On Mon, Nov 15 2021, Neeraj K. Singh via GitGitGadget wrote:
>>
>> > * Per [2], I'm leaving the fsyncObjectFiles configuration as is with
>> >   'true', 'false', and 'batch'. This makes using old and new versions
>> >   of git with 'batch' mode a little trickier, but hopefully people
>> >   will generally be moving forward in versions.
>> >
>> > [1] See
>> >     https://lore.kernel.org/git/pull.1067.git.1635287730.gitgitgadget@xxxxxxxxx/
>> > [2] https://lore.kernel.org/git/xmqqh7cimuxt.fsf@gitster.g/
>>
>> I really think leaving that in place is just being unnecessarily
>> cavalier. There are a lot of mixed-version environments that git is
>> deployed in, and we almost never break the configuration in this way
>> (I think in the past always by mistake).
>>
>> In this case it's easy to avoid it, and coming up with a less narrow
>> config model[1] seems like a good idea in any case to unify the
>> various outstanding work in this area.
>>
>> More generally on this series, per the thread ending in [2] I really
>
> My primary goal in all of these changes is to move git-for-windows over
> to a default of batch fsync so that it can get closer to other
> platforms in the performance of 'git add' while still retaining the
> same level of data integrity. I'm hoping that most end-users are just
> sticking to defaults here.
>
> I'm happy to change the configuration schema again if there's a
> consensus from the Git community that backwards compatibility of the
> configuration is actually important to someone.
>
> Also, if we're doing a deeper rethink of the fsync configuration (as
> prompted by this work and Eric Wong's and Patrick Steinhardt's work),
> do we want to retain a mode where we fsync some parts of the persistent
> repo data but not others?
> If we add fsyncing of the index in addition to the refs, I believe we
> would have covered all of the critical data structures needed to find
> the data a user has added to the repo if they complete a series of git
> commands and then experience a system crash.

Just talking about it is how we'll find consensus; maybe you & Junio
would like to keep it as-is. I don't see why we'd expose this bad edge
case in configuration handling to users when it's entirely avoidable,
and we're still in the design phase.

>> don't get why we have code like this:
>>
>> @@ -503,10 +504,12 @@ static void unpack_all(void)
>>          if (!quiet)
>>                  progress = start_progress(_("Unpacking objects"), nr_objects);
>>          CALLOC_ARRAY(obj_list, nr_objects);
>> +        plug_bulk_checkin();
>>          for (i = 0; i < nr_objects; i++) {
>>                  unpack_one(i);
>>                  display_progress(progress, i + 1);
>>          }
>> +        unplug_bulk_checkin();
>>          stop_progress(&progress);
>>
>>          if (delta_list)
>>
>> As opposed to doing an fsync on the last object we're processing. I.e.
>> why do we need the step of intentionally making the objects
>> unavailable in the tmp-objdir, and creating a "cookie" file to sync at
>> the start/end, as opposed to fsyncing on the last file (which we're
>> writing out anyway)?
>>
>> 1. https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@xxxxxxxxxxxxxxxxxxx/
>> 2. https://lore.kernel.org/git/20211111000349.GA703@neerajsi-x1.localdomain/
>
> It's important to not expose an object's final name until its contents
> have been fsynced to disk. We want to ensure that wherever we crash, we
> won't have a loose object that Git may later try to open where the
> filename doesn't match the content hash. I believe it's okay for a
> given OID to be missing, since a later command could recreate it, but
> an object with a wrong hash looks like it would persist until we do a
> git-fsck.

Yes, we handle that rather badly, as I mentioned in some other threads,
but not doing the fsync on the last object vs. a "cookie" file right
afterwards seems like a hail-mary at best, no?

> I thought about figuring out how to sync the last object rather than
> some random "cookie" file, but it wasn't clear to me how I'd figure out
> which object is actually last from library code in a way that doesn't
> burden each command with somehow figuring out its last object and
> communicating that. The 'cookie' approach seems to lead to a cleaner
> interface for callers.

The above quoted code is looping through nr_objects, isn't it? Can't a
"do fsync" flag be passed down to unpack_one() when we process the last
loose object?