On Mon, Mar 21 2022, Neeraj Singh wrote:

[Don't have time for a full reply, sorry, just something quick]

> On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> [...]
>> So, my question is still why the temporary object dir migration part of
>> this is needed.
>>
>> We are writing N loose object files, and we write those to temporary
>> names already.
>>
>> AFAICT we could do all of this by doing the same
>> tmp/rename/sync_file_range dance on the main object store.
>
> Why not the main object store? We want to maintain the invariant that
> any name in the main object store refers to a file that durably has the
> correct contents.
>
> If we do sync_file_range and then rename, and then crash, we now have a
> file in the main object store with some SHA name, whose contents may or
> may not match the SHA. However, if we ensure an fsync happens before
> the rename, a crash at any point will leave us either with no file in
> the main object store or with a file that is durable on the disk.

Ah, I see. Why does that matter? If the "bulk" mode works as advertised
we might have such a corrupt loose or pack file, but we won't have
anything referring to it as far as reachability goes.

I'm aware that the various code paths that handle OID writing don't deal
too well with it in practice, to say the least, which one can try with
say:

    $ echo foo | git hash-object -w --stdin
    45b983be36b73c0788dc9cbcb76cbb80fc7bb057
    $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057

I.e. "fsck", "show" etc. will all scream bloody murder, and re-running
that hash-object command even reports success (we see the object is
there already, and think it's OK).

But in any case, I think it would be much easier to both review and
reason about this code if these concerns were split up. I.e.
things that want no fsync at all (I'd think especially so) might want to
have such updates serialized in this manner, and as Junio pointed out,
making these things inseparable as you've done creates API concerns &
fallout that's got nothing to do with what we need for the performance
gains of the bulk checkin fsyncing technique, e.g. concurrent
"update-index" consumers not being able to assume reported objects exist
as soon as they're reported.

>> Then instead of the "bulk_fsync" cookie file don't close() the last
>> object file we write until we issue the fsync on it.
>>
>> But maybe this is all needed, I just can't understand from the commit
>> message why the "bulk checkin" part is being done.
>>
>> I think since we've been over this a few times without any success it
>> would really help to have some example of the smallest set of syscalls
>> to write a file like this safely. I.e. this is doing (pseudocode):
>>
>>     /* first the bulk path */
>>     open("bulk/x.tmp");
>>     write("bulk/x.tmp");
>>     sync_file_range("bulk/x.tmp");
>>     close("bulk/x.tmp");
>>     rename("bulk/x.tmp", "bulk/x");
>>     open("bulk/y.tmp");
>>     write("bulk/y.tmp");
>>     sync_file_range("bulk/y.tmp");
>>     close("bulk/y.tmp");
>>     rename("bulk/y.tmp", "bulk/y");
>>     /* Rename to "real" */
>>     rename("bulk/x", "x");
>>     rename("bulk/y", "y");
>>     /* sync a cookie */
>>     fsync("cookie");
>
> The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> reversed in your above sequence. It should be

Sorry.

> 1: (for each file)
>    a) open
>    b) write
>    c) sync_file_range
>    d) close
>    e) rename in tmp_objdir -- The rename step is not required for
>       bulk-fsync. An earlier version of this series didn't do it, but
>       Jeff King pointed out that it was required for concurrency:
>       https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@xxxxxxxxxxxxxxxxxxxxxxx/

Yes, we definitely need the rename; I was wondering about why we needed
it 2x for each file, but that was answered above.
>> And I'm asking why it's not:
>>
>>     /* Rename to "real" as we go */
>>     open("x.tmp");
>>     write("x.tmp");
>>     sync_file_range("x.tmp");
>>     close("x.tmp");
>>     rename("x.tmp", "x");
>>     last_fd = open("y.tmp"); /* don't close() the last one yet */
>>     write("y.tmp");
>>     sync_file_range("y.tmp");
>>     rename("y.tmp", "y");
>>     /* sync the last fd instead of a cookie */
>>     fsync(last_fd);
>>
>> Which I guess is two questions:
>>
>> A. do we need the cookie, or can we re-use the fd of the last thing
>>    we write?
>
> We can re-use the FD of the last thing we write, but that results in a
> trickier API which is more intrusive on callers. I was originally
> using a lockfile, but found a usage where there was no lockfile in
> unpack-objects.

Ok, so it's something we could do, but passing down 2-3 functions to
object-file.c was a hassle. I tried to hack that up earlier and found
that it wasn't *too bad*. I.e. we'd pass some "flags" about our intent,
and amend various functions to take "don't close this one" and pass up
the fd (or even do that as a global).

In any case, having the commit message clearly document what's needed
for what, what's essential, and what's just a shortcut taken for the
convenience of the current implementation would be really useful. Then
we can always e.g. change this later to just do the fsync() on the last
of N we write.

[Ran out of time, sorry]