Re: [PATCH v9 0/9] Implement a batched fsync option for core.fsyncObjectFiles

On Thu, Mar 10, 2022 at 6:17 AM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
>
> On Wed, Mar 09 2022, Neeraj Singh wrote:
>
> > On Wed, Mar 9, 2022 at 3:10 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
> >>
> >> Replying to an old-ish E-Mail of mine with some more thought that came
> >> to mind after[1] (another recently resurrected fsync() thread).
> >>
> >> I wonder if there's another twist on the plan outlined in [2] that would
> >> be both portable & efficient, i.e. the "slow" POSIX way to write files
> >> A..Z is to open/write/close/fsync each one, so we'll trigger a HW flush
> >> N times.
> >>
> >> And as we've discussed, doing it just on Z will implicitly flush A..Y on
> >> common OS's in the wild, which we're taking advantage of here.
> >>
> >> But aside from the rename() dance in[2], what do those OS's do if you
> >> write A..Z, fsync() the "fd" for Z, and then fsync A..Y (or, presumably
> >> equivalently, in reverse order: Y..A).
> >>
> >> I'd think they'd be smart enough to know that they already implicitly
> >> flushed that data since Z was flushed, and make those fsync()'s a
> >> rather cheap noop.
> >>
> >> But I don't know, hence the question.
> >>
> >> If that's true then perhaps it's a path towards having our cake and
> >> eating it too in some cases?
> >>
> >> I.e. an FS that would flush A..Y if we flush Z would do so quickly and
> >> reliably, whereas a FS that doesn't have such an optimization might be
> >> just as slow for all of A..Y, but at least it'll be safe.
> >>
> >> 1. https://lore.kernel.org/git/220309.867d93lztw.gmgdl@xxxxxxxxxxxxxxxxxxx/
> >> 2. https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@xxxxxxxxx/
> >
> > The important angle here is that we need some way to indicate to the
> > OS what A..Y is before we fsync on Z.  I.e. the OS will cache any
> > writes in memory until some sync-ish operation is done on *that
> > specific file*.  Syncing just 'Z' with no sync operations on A..Y
> > doesn't indicate that A..Y would get written out.  Apparently the bad
> > old ext3 behavior was similar to what you're proposing where a sync on
> > 'Z' would imply something about independent files.
>
> It's certainly starting to sound like I'm misunderstanding this whole
> thing, but just to clarify again I'm talking about the sort of loops
> mentioned upthread in my [1]. I.e. you have (to copy from that E-Mail):
>
>     bulk_checkin_start_make_cookie():
>     n = 10
>     for i in 1..n:
>         write_nth(i, fsync: 0);
>     bulk_checkin_end_commit_cookie();
>
> I.e. we have a "cookie" file in a given dir (where, in this example,
> we'd also write files A..Z). I.e. we write:
>
>     cookie
>     {A..Z}
>     cookie
>
> And then only fsync() on the "cookie" at the end, which "flushes" the
> A..Z updates on some FS's (again, all per my possibly-incorrect
> understanding).
>
> Which is why I proposed that in many/all cases we could do this,
> i.e. just the same without the "cookie" file (which AFAICT isn't needed
> per-se, but was just added to make the API a bit simpler in not needing
> to modify the relevant loops):
>
>     all_fsync = bulk_checkin_mode() ? 0 : fsync_turned_on_in_general();
>     end_fsync = bulk_checkin_mode() ? 1 : all_fsync;
>     n = 10;
>     for i in 1..n:
>         write_nth(i, fsync: (i == n) ? end_fsync : all_fsync);
>
> I.e. we don't pay the cost of the fsync() as we're in the loop, but just
> for the last file, which "flushes" the rest.
>
> So far all of that's a paraphrasing of existing exchanges, but what I
> was wondering now in[2] is if we add this to this last example above:
>
>     for i in 1..n-1:
>         fsync_nth(i)
>
> Wouldn't those same OS's that are being clever about deferring the
> syncing of A..Z as a "batch" be clever enough to turn that (re-)syncing
> into a NOOP?
>
> Of course in this case we'd need to keep the fd's open and be clever
> about E[MN]FILE (i.e. "Too many open..."), or do an fsync() every Nth
> for some reasonable Nth, e.g. somewhere in the 2^10..2^12 range.
>
> But *if* this works it seems to me to be something we might be able to
> enable when "core.fsyncObjectFiles" is configured on those systems.
>
> I.e. the implicit assumption with that configuration was that if we sync
> N loose objects and then update and fsync the ref that the FS would
> queue up the ref update after the syncing of the loose objects.
>
> This new "cookie" (or my suggested "fsync last of N") is basically
> making the same assumption, just with the slight twist that some OSs/FSs
> are known to behave like that on a per-subdir basis, no?
>
> > Here's an interesting paper I recently came across that proposes the
> > interface we'd really want, 'syncv':
> > https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.924.1168&rep=rep1&type=pdf.
>
> 1. https://lore.kernel.org/git/211201.864k7sbdjt.gmgdl@xxxxxxxxxxxxxxxxxxx/
> 2. https://lore.kernel.org/git/220310.86lexilo3d.gmgdl@xxxxxxxxxxxxxxxxxxx/

On the actual FS implementations in the three common OSes I'm familiar
with (macOS, Windows, Linux), each file has its own independent data
caching in OS memory. Fsyncing one of them doesn't necessarily imply
writing out the OS cache for any other file. Except, apparently, on
ext3 in data=ordered mode, but that FS is no longer common. On Linux,
we use sync_file_range to get the OS to write the in-memory cache to
the storage hardware, which is what makes the data 'available' to
fsync.

Now, we could consider an implementation where we call sync_file_range
without the wait flags (i.e. without SYNC_FILE_RANGE_WAIT_BEFORE and
SYNC_FILE_RANGE_WAIT_AFTER). Then we could later fsync every file (or
batch of files), which might be more efficient if the OS coalesces the
disk cache flushes. I expect that this method is less likely to give
us the desired performance on common Linux FSes, however.

The macOS and Windows APIs are defined a bit differently from Linux.
In both those OSes, we're actually calling fsync-equivalent APIs that
are defined to write back all the relevant data and metadata, just
without the storage cache flush.

So to summarize:
1. We need to do write(2) to get the data out of Git and into the OS
   filesystem cache.
2. We need some API (macOS fsync, Windows NtFlushBuffersFileEx, Linux
   sync_file_range) to transfer the data per-file to the storage
   controller, but without flushing the storage controller cache.
3. We need some API (macOS F_FULLFSYNC, Windows NtFlushBuffersFile,
   Linux fsync) to push the storage controller cache to durable media.
   This only needs to be done once at the end to push out the data
   made available in step (2).


Thanks,
Neeraj



