Re: [PATCH 2/2] core.fsyncobjectfiles: batch disk flushes

On Tue, Aug 31, 2021 at 12:59:14PM -0700, Neeraj Singh wrote:
> > So in general there are very few metadata writes, and it is absolutely
> > essential to flush the cache before that, because otherwise your metadata
> > could point to data that might not actually have made it to disk.
> >
> > The best way to optimize such a workload is by first batching all the
> > data writeout for multiple files in step one, then only doing one cache
> > flush and one log force (as we call it) to cover all the files.  syncfs
> > will do that, but without a good way to pick individual files.
> 
> Yes, I think we want to do step (1) of your sequence for all of the
> files, then issue steps (2-4) for all files as a group.  Of course, if
> the log fills up then we can flush the intermediate steps.  The
> unfortunate thing is that there's no Linux interface to do step (1)
> and to also ensure that the relevant data is in the log stream or is
> otherwise available to be part of the durable metadata.
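
As a rough sketch of the sequence quoted above (not something from this
thread, just an approximation): about the closest thing userspace can do
today is to kick off data writeback for every file with
sync_file_range() and then pay for a single flush with syncfs().
sync_file_range() gives no integrity guarantee on its own, and syncfs()
covers the whole filesystem rather than just the files we picked, which
is exactly the limitation above:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Sketch only: start (and wait for) data writeback on every file in
 * the batch, then pay for one cache flush and log force via syncfs().
 * Note that syncfs() syncs the whole filesystem, not just fds[].
 */
static int flush_object_batch(const int *fds, size_t nr)
{
        size_t i;

        for (i = 0; i < nr; i++) {
                /* Step (1): push the data out without a cache flush. */
                if (sync_file_range(fds[i], 0, 0,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                        return -1;
        }

        /* Steps (2-4) once for the whole batch, filesystem-wide. */
        return syncfs(fds[0]);
}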

There is also no interface to do steps 2-4 separately, mostly because
they are so hard to separate.  The only API I could envision is one
that takes an array of file descriptors and has the semantics of doing
an fsync/fdatasync for all of them, allowing the implementation to
optimize the order.  It would be implementable, but not quite as
efficient as syncfs.  I'm also pretty sure we've seen a few attempts
at it in the past that ran into various issues and didn't really get
far.
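
Purely to make the shape of that concrete (hypothetical name and
prototype, nothing like it exists today): the envisioned call would take
the descriptor array, and the only portable fallback right now is a
plain loop that pays a cache flush and log force per file; that per-file
cost is exactly what a batched implementation could share:

#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical interface sketch: fsync a whole set of descriptors and
 * let the implementation reorder the writeback and share one cache
 * flush / log force across all of them.  This fallback body is just
 * the naive per-file loop an application has to use today.
 */
static int fsync_batch(const int *fds, size_t nr_fds)
{
        size_t i;

        for (i = 0; i < nr_fds; i++)
                if (fsync(fds[i]) < 0)
                        return -1;      /* errno left set by fsync() */
        return 0;
}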

> I looked at that paper and definitely agree with you about the
> questionable implementation strategy they picked.  I don't (yet) have
> detailed knowledge of SSD internals, but it's surprising to me that
> there is little value to barrier semantics within the drive as opposed
> to a full durability sync.  At least for Windows, we have a database
> (the Registry) for which any single-threaded latency improvement would
> be welcome.

The major issue with barriers, either as in the paper above or as
historic Linux 2.6 kernels had them, is that they enforce a global
ordering.  For a software-only implementation like the Linux one this
was already bad enough, but a hardware/firmware implementation in NVMe
would mean you'd have to add global serialization to a storage
interface standard and its implementations, while these are very much
about offering parallelism.  In fact we'd also have very similar
issues with the modern Linux block layer, which has applied some
similar ideas.


