Re: [PATCH v9 0/9] Implement a batched fsync option for core.fsyncObjectFiles

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 10 Mar 2022 15:01:34 +0100

On Wed, Mar 09 2022, Neeraj Singh wrote:

> On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>>
>> Replying to an old-ish E-Mail of mine with some more thoughts that came
>> to mind after [1] (another recently resurrected fsync() thread).
>>
>> I wonder if there's another twist on the plan outlined in [2] that would
>> be both portable & efficient, i.e. the "slow" POSIX way to write files
>> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
>> N times.
>>
>> And as we've discussed, doing it just on Z will implicitly flush A..Y on
>> common OS's in the wild, which we're taking advantage of here.
>>
>> But aside from the rename() dance in [2], what do those OS's do if you
>> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
>> equivalently, in reverse order: Y..A)?
>>
>> I'd think they'd be smart enough to know that they already implicitly
>> flushed that data since Z was flushed, and make those fsync()'s a
>> rather cheap noop.
>>
>> But I don't know, hence the question.
>>
>> If that's true then perhaps it's a path towards having our cake and
>> eating it too in some cases?
>>
>> I.e. an FS that would flush A..Y if we flush Z would do so quickly and
>> reliably, whereas a FS that doesn't have such an optimization might be
>> just as slow for all of A..Y, but at least it'll be safe.
>>
>> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@xxxxxxxxxxxxxxxxxxx/
>> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@xxxxxxxxx/
>
> The important angle here is that we need some way to indicate to the
> OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
> writes in memory until some sync-ish operation is done on *that
> specific file*.  Syncing just 'Z' with no sync operations on A..Y
> doesn't indicate that A..Y would get written out.  Apparently the bad
> old ext3 behavior was similar to what you're proposing where a sync on
> 'Z' would imply something about independent files.

It's certainly starting to sound like I'm misunderstanding this whole
thing, but just to clarify again I'm talking about the sort of loops
mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):

    bulk_checkin_start_make_cookie();
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: 0);
    bulk_checkin_end_commit_cookie();

I.e. we have a "cookie" file in a given dir (where, in this example,
we'd also write files A..Z), so what we write is:

    cookie
    {A..Z}
    cookie

And then only fsync() on the "cookie" at the end, which "flushes" the
A..Z updates on some FS's (again, all per my possibly-incorrect
understanding).

Which is why I proposed that in many/all cases we could do this,
i.e. just the same without the "cookie" file (which AFAICT isn't needed
per se, but was just added to make the API a bit simpler by not needing
to modify the relevant loops):

    all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
    end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
    n = 10;
    for i in 1..n:
        write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);

I.e. we don't pay the cost of the fsync() while we're in the loop, but
only for the last file, which "flushes" the rest.
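
To make that concrete, here's a minimal runnable sketch of the "fsync
only the last of N" loop. The file names and both function names are
made up for illustration (write_nth() just mirrors the pseudocode
above), and it only shows which files get an fsync(); whether syncing
the last file really flushes the others is the FS-dependent assumption
under discussion, and the sketch doesn't prove it:

```python
import os
import tempfile


def write_nth(directory, i, data, do_fsync):
    """Hypothetical stand-in for write_nth(): write one file, optionally fsync it."""
    path = os.path.join(directory, f"obj-{i:04d}")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        if do_fsync:
            os.fsync(fd)
    finally:
        os.close(fd)
    return do_fsync


def write_batch(directory, n, bulk_checkin_mode, fsync_in_general=True):
    """Write files 1..n and return the indices that were fsync()ed."""
    # Same flag logic as the pseudocode: in bulk mode only the last file
    # is synced, otherwise every file follows the general setting.
    all_fsync = False if bulk_checkin_mode else fsync_in_general
    end_fsync = True if bulk_checkin_mode else all_fsync
    synced = []
    for i in range(1, n + 1):
        want = end_fsync if i == n else all_fsync
        if write_nth(directory, i, b"payload %d\n" % i, want):
            synced.append(i)
    return synced


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_batch(d, 10, bulk_checkin_mode=True))
```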

So far all of that's a paraphrasing of existing exchanges, but what I
was wondering now in [2] is if we add this to this last example above:

    for i in 1..n-1:
        fsync_nth(i);

Wouldn't those same OS's that are being clever about deferring the
syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing
into a NOOP?
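
One rough way to poke at that question empirically is to time the
calls. The sketch below is only that: measure_resync() and the file
names are made up, and a fast second-round fsync() says something about
latency on the FS you ran it on, not about what is guaranteed to be
durable:

```python
import os
import tempfile
import time


def timed_fsync(fd):
    """Return the wall-clock seconds a single fsync() call took."""
    start = time.perf_counter()
    os.fsync(fd)
    return time.perf_counter() - start


def measure_resync(directory, n=10):
    """Write n files, fsync the last one, then fsync files 1..n-1.

    Returns (seconds for the last file, list of seconds for the rest).
    """
    fds = []
    try:
        for i in range(1, n + 1):
            path = os.path.join(directory, f"obj-{i:04d}")
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            os.write(fd, b"x" * 4096)
        # fsync "Z" first, then go back over "A..Y" with the fds still open.
        last = timed_fsync(fds[-1])
        rest = [timed_fsync(fd) for fd in fds[:-1]]
    finally:
        for fd in fds:
            os.close(fd)
    return last, rest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        last, rest = measure_resync(d)
        print(f"last: {last:.6f}s, rest total: {sum(rest):.6f}s")
```

The numbers will vary a lot by FS, mount options and hardware, so this
is only useful for comparing runs on the same setup.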

Of course in this case we'd need to keep the fd's open and be clever
about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
file for some reasonable N, e.g. somewhere in the 2^10..2^12 range.
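
A minimal sketch of that chunked variant, assuming we process the files
in groups of at most chunk_size and never hold more than that many fd's
open. The function name, the file names and the default of 256 are
illustrative; the default is deliberately below the 2^10..2^12 range
above, since the soft open-file limit is often 1024:

```python
import os
import tempfile


def write_in_chunks(directory, n, chunk_size=256):
    """Write files 1..n, syncing each chunk "last file first".

    Returns (number of fsync() calls, max fds held open at once).
    """
    fsync_calls = 0
    max_open = 0
    for start in range(1, n + 1, chunk_size):
        fds = []
        try:
            for i in range(start, min(start + chunk_size, n + 1)):
                path = os.path.join(directory, f"obj-{i:04d}")
                fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
                fds.append(fd)
                os.write(fd, b"payload %d\n" % i)
            max_open = max(max_open, len(fds))
            # fsync the last file of the chunk first, hoping it flushes
            # the others, then re-fsync the rest so it's safe either way.
            os.fsync(fds[-1])
            fsync_calls += 1
            for fd in fds[:-1]:
                os.fsync(fd)
                fsync_calls += 1
        finally:
            # Close per chunk so the open-fd count stays bounded.
            for fd in fds:
                os.close(fd)
    return fsync_calls, max_open


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(write_in_chunks(d, 10, chunk_size=4))
```

Every file still gets its own fsync() call here, so the total is n
either way; the bet is only that most of them end up being cheap.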

But *if* this works it seems to me to be something we might be able to
enable when "core.fsyncObjectFiles" is configured on those systems.

I.e. the implicit assumption with that configuration was that if we sync
N loose objects and then update and fsync the ref, the FS would queue up
the ref update after the syncing of the loose objects.

This new "cookie" (or my suggested "fsync last of N") is basically
making the same assumption, just with the slight twist that some OSs/FSs
are known to behave like that on a per-subdir basis, no?

> Here's an interesting paper I recently came across that proposes the
> interface we'd really want, 'syncv':
> https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.

1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@xxxxxxxxxxxxxxxxxxx/
2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@xxxxxxxxxxxxxxxxxxx/