Re: [PATCH v2 2/7] core.fsyncmethod: batched disk flushes for loose-objects

On Mon, Mar 21, 2022 at 1:37 PM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
>
> On Mon, Mar 21 2022, Neeraj Singh wrote:
>
> [Don't have time for a full reply, sorry, just something quick]
>
> > On Mon, Mar 21, 2022 at 9:52 AM Ævar Arnfjörð Bjarmason
> > [...]
> >> So, my question is still why the temporary object dir migration part of
> >> this is needed.
> >>
> >> We are writing N loose object files, and we write those to temporary
> >> names already.
> >>
> >> AFAIKT we could do all of this by doing the same
> >> tmp/rename/sync_file_range dance on the main object store.
> >>
> >
> > Why not the main object store? We want to maintain the invariant that any
> > name in the main object store refers to a file that durably has the
> > correct contents.
> > If we do sync_file_range and then rename, and then crash, we now have a
> > file in the main object store with some SHA name, whose contents may or
> > may not match the SHA.  However, if we ensure an fsync happens before
> > the rename, a crash at any point will leave us either with no file in
> > the main object store or with a file that is durable on the disk.
>
> Ah, I see.
>
> Why does that matter? If the "bulk" mode works as advertised we might
> have such a corrupt loose or pack file, but we won't have anything
> referring to it as far as reachability goes.
>
> I'm aware that the various code paths that handle OID writing don't deal
> too well with it in practice to say the least, which one can try with
> say:
>
>     $ echo foo | git hash-object -w --stdin
>     45b983be36b73c0788dc9cbcb76cbb80fc7bb057
>     $ echo | sudo tee .git/objects/45/b983be36b73c0788dc9cbcb76cbb80fc7bb057
>
> I.e. "fsck", "show" etc. will all scream bloddy murder, and re-running
> that hash-object again even returns successful (we see it's there
> already, and think it's OK).
>

I was under the impression that, in practice, a corrupt loose object can
create persistent problems in the repo for future commands, since we might
not aggressively verify that an existing file with a certain OID really is
valid when adding a new instance of the data with the same OID.

If you don't have an fsync barrier before producing the final
content-addressable name, you can't reason about "this operation happened
before that operation," so it wouldn't really be valid to say that "we
won't have anything referring to it as far as reachability goes."

It's entirely possible that you'd have trees pointing to other trees or
blobs that aren't valid, since data writes can become durable in any order.
At that point, future attempts to add the same blobs or trees might
silently drop the updates.  I'm betting that's why core.fsyncObjectFiles
was added in the first place: someone observed severe persistent
consequences of this form of corruption.
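To make the ordering argument concrete, here is a minimal sketch of the
invariant being defended (in Python with illustrative names; git's actual
implementation lives in C in object-file.c): the fsync must complete before
the final content-addressed name exists, so a crash can never leave a named
but non-durable object.

```python
import os

def write_object_durably(path, data):
    """Write data to <path>.tmp, fsync it, then rename into place.

    A crash at any point leaves either no file at `path`, or a file
    whose contents are fully durable on disk -- never a file whose
    name exists but whose contents may not match.
    """
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o444)
    try:
        os.write(fd, data)
        os.fsync(fd)       # durability barrier BEFORE the name exists
    finally:
        os.close(fd)
    os.rename(tmp, path)   # the final name appears only after the flush
```

(For complete durability of the rename itself a directory fsync is also
needed on some filesystems; that is elided here.)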

> But in any case, I think it would be much easier to both review and
> reason about this code if these concerns were split up.
>
> I.e. things that want no fsync at all (I'd think especially so) might
> want to have such updates serialized in this manner, and as Junio
> pointed out making these things inseparable as you've done creates API
> concerns & fallout that's got nothing to do with what we need for the
> performance gains of the bulk checkin fsyncing technique,
> e.g. concurrent "update-index" consumers not being able to assume
> reported objects exist as soon as they're reported.
>

I want to explicitly not respond to this concern. I don't believe this
100-line patch can be usefully split.

> >> Then instead of the "bulk_fsync" cookie file don't close() the last file
> >> object file we write until we issue the fsync on it.
> >>
> >> But maybe this is all needed, I just can't understand from the commit
> >> message why the "bulk checkin" part is being done.
> >>
> >> I think since we've been over this a few times without any success it
> >> would really help to have some example of the smallest set of syscalls
> >> to write a file like this safely. I.e. this is doing (pseudocode):
> >>
> >>     /* first the bulk path */
> >>     open("bulk/x.tmp");
> >>     write("bulk/x.tmp");
> >>     sync_file_range("bulk/x.tmp");
> >>     close("bulk/x.tmp");
> >>     rename("bulk/x.tmp", "bulk/x");
> >>     open("bulk/y.tmp");
> >>     write("bulk/y.tmp");
> >>     sync_file_range("bulk/y.tmp");
> >>     close("bulk/y.tmp");
> >>     rename("bulk/y.tmp", "bulk/y");
> >>     /* Rename to "real" */
> >>     rename("bulk/x", x");
> >>     rename("bulk/y", y");
> >>     /* sync a cookie */
> >>     fsync("cookie");
> >>
> >
> > The '/* Rename to "real" */' and '/* sync a cookie */' steps are
> > reversed in your above sequence. It should be
>
> Sorry.
>
> > 1: (for each file)
> >     a) open
> >     b) write
> >     c) sync_file_range
> >     d) close
> >     e) rename in tmp_objdir  -- The rename step is not required for
> > bulk-fsync. An earlier version of this series didn't do it, but
> > Jeff King pointed out that it was required for concurrency:
> > https://lore.kernel.org/all/YVOrikAl%2Fu5%2FVi61@xxxxxxxxxxxxxxxxxxxxxxx/
>
> Yes we definitely need the rename, I was wondering about why we needed
> it 2x for each file, but that was answered above.
>
> >> And I'm asking why it's not:
> >>
> >>     /* Rename to "real" as we go */
> >>     open("x.tmp");
> >>     write("x.tmp");
> >>     sync_file_range("x.tmp");
> >>     close("x.tmp");
> >>     rename("x.tmp", "x");
> >>     last_fd = open("y.tmp"); /* don't close() the last one yet */
> >>     write("y.tmp");
> >>     sync_file_range("y.tmp");
> >>     rename("y.tmp", "y");
> >>     /* sync a cookie */
> >>     fsync(last_fd);
> >>
> >> Which I guess is two questions:
> >>
> >>  A. do we need the cookie, or can we re-use the fd of the last thing we
> >>     write?
> >
> > We can re-use the FD of the last thing we write, but that results in a
> > trickier API which is more intrusive on callers. I was originally
> > using a lockfile, but found a usage where there was no lockfile in
> > unpack-objects.
>
> Ok, so it's something we could do, but passing down 2-3 functions to
> object-file.c was a hassle.
>
> I tried to hack that up earlier and found that it wasn't *too
> bad*. I.e. we'd pass some "flags" about our intent, and amend various
> functions to take "don't close this one" and pass up the fd (or even do
> that as a global).
>
> In any case, having the commit message clearly document what's needed
> for what & what's essential & just shortcut taken for the convenience of
> the current implementation would be really useful.
>
> Then we can always e.g. change this later to just do the fsync() on
> the last of N we write.
>

I left a comment in the (now very long) commit message that indicates the
dummy file is there to make the API simpler.
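For comparison, the batched sequence discussed above could be sketched like
this (again Python with illustrative names, not git's C implementation; the
per-file sync_file_range step has no stdlib binding, so it appears only as
a comment, and the single "cookie" fsync stands in for the one hardware
cache flush covering the whole batch):

```python
import os
import tempfile

def write_durable_batch(entries, objdir):
    """Batched flush: write each object into a private tmp dir, start
    writeback per file, issue ONE fsync on a dummy 'cookie' file, and
    only then rename everything to its final content-addressed name."""
    tmpdir = tempfile.mkdtemp(dir=objdir)
    pending = []
    for name, data in entries:
        tmp = os.path.join(tmpdir, name)
        fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o444)
        os.write(fd, data)
        # On Linux this is where sync_file_range(fd, ...) would start
        # asynchronous writeback of the data without flushing the disk
        # cache; Python's stdlib has no binding for it.
        os.close(fd)
        pending.append((tmp, os.path.join(objdir, name)))
    # One flush for the whole batch: fsync a dummy cookie file.
    cfd = os.open(os.path.join(tmpdir, "cookie"),
                  os.O_WRONLY | os.O_CREAT, 0o600)
    os.fsync(cfd)
    os.close(cfd)
    # Only now expose the final names; a crash before this point leaves
    # nothing under a final name, a crash after it leaves durable files.
    for tmp, final in pending:
        os.rename(tmp, final)
```

This mirrors the ordering in the patch: no final name becomes visible
before the flush that makes its contents durable.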

Thanks,
Neeraj



