Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device


 



On Mon, Feb 17, 2014 at 02:20:50AM -0800, Howard Chu wrote:
> Jan Kara wrote:
> >On Mon 17-02-14 19:56:27, Dave Chinner wrote:
> >>On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
> >>>On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >>>
> >>>>In response to Dave's call [1] and highlighting Jeff's attend request
> >>>>[2] I'd like to stoke a discussion on an emulation layer for atomic
> >>>>block commands.  Specifically, SNIA has laid out their position on the
> >>>>command set an atomic block device may support (NVM Programming Model
> >>>>[3]) and it is a good conversation piece for this effort.  The goal
> >>>>would be to review the proposed operations, identify the capabilities
> >>>>that would be readily useful to filesystems / existing use cases, and
> >>>>tear down a straw man implementation proposal.
> >>>>
> >>>...
> >>>
> >>>>The argument for not doing this as a
> >>>>device-mapper target or stacked block device driver is to ease
> >>>>provisioning and make the emulation transparent.  On the other hand,
> >>>>the argument for doing this as a virtual block device is that the
> >>>>"failed to parse device metadata" is a known failure scenario for
> >>>>dm/md, but not sd for example.
> >>>>
> >>>
> >>>Hi Dan,
> >>>
> >>>Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
> >>>with a couple observations.  I think the most interesting cases where
> >>>atomics provide a benefit are cases where storage is RAIDed across multiple
> >>>devices.  Part of the argument for atomic writes on SSDs is that databases
> >>>and file systems can save bandwidth and complexity by avoiding
> >>>write-ahead-logging.  But even if every SSD supported it, the majority of
> >>>production databases span across devices for either capacity, performance,
> >>>or, most likely, high availability reasons.  So in my opinion, that very
> >>>much supports the idea of doing atomics at a layer where it applies to SW
> >>>RAIDed storage (as I believe Dave and others are suggesting).
> >>>
> >>>On the other side of the coin, I remember Dave talking about this during
> >>>our NVM discussion at LSF last year and I got the impression the size and
> >>>number of writes he'd need supported before he could really stop using his
> >>>journaling code was potentially large.  Dave: perhaps you can re-state the
> >>>number of writes and their total size that would have to be supported by
> >>>block level atomics in order for them to be worth using by XFS?
> >>
> >>Hi Andy - the numbers I gave last year were at the upper end of the
> >>number of iovecs we can dump into an atomic checkpoint in the XFS
> >>log at a time. Because that is typically based on log size and the
> >>log can be up to 2GB in size, this tends to max out at somewhere
> >>around 150-200,000 individual iovecs and/or roughly 100MB of
> >>metadata.
> >>
> >>Yeah, it's a lot, but keep in mind that a workload running 250,000
> >>file creates per second on XFS is retiring somewhere around 300,000
> >>individual transactions per second, each of which will typically
> >>have 10-20 dirty regions in them.  If we were to write them as
> >>individual atomic writes at transaction commit time we'd need to
> >>sustain somewhere in the order of 3-6 _million IOPS_ to maintain
> >>this transaction rate with individual atomic writes for each
> >>transaction.
> >>
> >>That would also introduce unacceptable IO latency as we can't modify
> >>metadata while it is under IO, especially as a large number of these
> >>regions are redirtied repeatedly during ongoing operations (e.g.
> >>directory data and index blocks). Hence to avoid this problem with
> >>atomic writes, we still need asynchronous transactions and
> >>in-memory aggregation of changes.  IOWs, checkpoints are the unit
> >>of atomic write we need support for in XFS.
> >>
> >>We can limit the size of checkpoints in XFS without too much
> >>trouble, either by amount of data or number of iovecs, but that
> >>comes at a performance cost. To maintain current levels of
> >>performance we need a decent amount of in-memory change aggregation
> >>and hence we are going to need - at minimum - thousands of vectors
> >>in each atomic write. I'd prefer tens of thousands to hundreds of
> >>thousands of vectors because that's our typical unit of "atomic
> >>write" at current performance levels, but several thousand vectors
> >>and tens of MB is sufficient to start with....
> >   I did the math for ext4 and it worked out rather similarly. After the
> >transaction batching we do in memory, we have transactions which are tens
> >of MB in size. These go first to a physically contiguous journal during
> >transaction commit (that's the easy part but it would already save us one
> >cache flush + FUA write) and then during checkpoint to final locations on
> >disk which can be physically discontiguous, so that can be thousands to tens
> >of thousands of different locations (this would save us another cache flush +
> >FUA write).
> >
> >Similarly to the XFS case, it is easy to force smaller transactions in ext4,
> >but the smaller you make them, the larger the journaling overhead...
> 
> Again, if you simply tag writes with group IDs as I outlined before
> http://www.spinics.net/lists/linux-fsdevel/msg70047.html then you
> don't need explicit cache flushes, nor do you need to worry about
> transaction size limits.
>
> All you actually need is to ensure the
> ordering of a specific set of writes in relation to another specific
> set of writes, completely independent of other arbitrary writes. You
> folks are cooking up a solution for NVMe that's only practical when
> data transfer rates are fast enough that a 100MB write can be done
> in ~1ms, whereas a simple tweak of command tagging will work for
> everything from the slowest HDD to the fastest storage device.

Perhaps you'd like to outline how you avoid IO priority inversion in
a journal with such a scheme, where the current checkpoint is held
off by all other metadata writeback because, by definition, metadata
writeback must be in a lower-ordered tag group than the current
checkpoint.

> As it is, the Atomic Write mechanism will be unusable for DBs when
> the transaction size exceeds whatever limit a particular device
> supports, thus requiring DB software to still provide a fallback
> mechanism, e.g. standard WAL, which only results in more complicated
> software. That's not a solution, that's just a new problem.

Realistically, I haven't seen a single proposal coming out of the
hardware vendors that makes filesystem journalling more efficient
than it already is. Atomic writes might be able to save a journal
flush on an fsync() and so make databases go faster, but they give
up a whole heap of other optimisations that make non-database
workloads go fast, e.g. untarring a tarball.

Similarly, things like ordered writes are great until you consider
how they interact with journalling and cause priority inversion
issues. The only way to make use of ordered writes is to design the
filesystem around ordered writes from the ground up. i.e. the
soft updates complexity problem. Unlike atomic writes, this can't
easily be retrofitted to an existing filesystem, and once you have
soft updates in place you are effectively fixing the format and
features of the filesystem in stone, because if you need to change a
single operation or on-disk structure you have to work out the
dependency graph for the entire filesystem from the ground up again.

Perhaps - just perhaps - we're doing this all wrong. Bottom up
design of hardware offload features has a history of resulting in
functionality that looks good on paper but can't be used in general
production systems because it is too limited or has undesirable side
effects.  Perhaps we need to be more top down, similar to how I
proposed a "dm-atomic" layer to implement atomic writes in software.

That is, design the software layer first, then convert filesystems
to use it. Once the concept is proven (a software implementation
should be no slower than what it replaced), the hardware offload
primitives can be derived from the algorithms that the software
offload uses.

i.e. design offload algorithms that work for existing users, prove
they work, then provide those primitives in hardware knowing that
they work and will be useful....
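
To make the "software first" point concrete, the core of such a layer
is nothing more exotic than "stage a group of writes, then publish
them in one step". Here's a rough userspace sketch of that idea - all
the names are hypothetical and this is not a proposal for the actual
dm-atomic interface, just the shape of the thing:

/*
 * Hypothetical model of an atomic write group: writes are staged in
 * memory and only applied to the backing file when the group is
 * committed, so a crash before commit leaves the original data
 * untouched.  None of these names are real kernel or DM interfaces.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct staged_write {
	long offset;
	size_t len;
	unsigned char *data;
	struct staged_write *next;
};

struct atomic_group {
	struct staged_write *head;
};

/* Record a write in the group without touching the backing store. */
int group_stage(struct atomic_group *g, long off, const void *buf,
		size_t len)
{
	struct staged_write *w = malloc(sizeof(*w));

	if (!w)
		return -1;
	w->data = malloc(len);
	if (!w->data) {
		free(w);
		return -1;
	}
	memcpy(w->data, buf, len);
	w->offset = off;
	w->len = len;
	w->next = g->head;
	g->head = w;
	return 0;
}

/*
 * Apply every staged write, then flush.  Order within the group is
 * irrelevant for non-overlapping writes.  A real implementation
 * would write the staged data to a log area and flush that first,
 * so the apply step can be replayed after a crash - that replay is
 * what actually makes the group atomic.
 */
int group_commit(struct atomic_group *g, FILE *f)
{
	struct staged_write *w;

	for (w = g->head; w; w = w->next) {
		if (fseek(f, w->offset, SEEK_SET) != 0)
			return -1;
		if (fwrite(w->data, 1, w->len, f) != w->len)
			return -1;
	}
	return fflush(f);
}

Everything interesting - crash safety of the commit itself, size
limits, stacking over RAID - hides behind group_commit(), which is
exactly where a hardware offload would eventually slot in.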

You can implement this whole ordered group write scheme in a DM
module quite easily; it's trivial to extend submit_bio to take a
64-bit sequence tag for ordered group writes. All metadata IO in XFS
already has an ordered 64-bit tag associated with it (funnily enough,
called the Log Sequence Number) and you can tell XFS not to send
cache flushes simply by using the nobarrier mount option.
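
For reference, the ordering rule such a module has to enforce is
tiny. What follows is only a userspace sketch of that rule with
made-up struct and field names, not an existing bio or device-mapper
interface: a write carrying sequence tag N must not be issued to the
backing device while any write with a lower tag is still outstanding.

/*
 * Hypothetical model of ordered group writes: each write carries a
 * 64-bit sequence tag (for XFS metadata that would be the LSN).
 * Writes sharing a tag may complete in any order; a write must not
 * be issued while a write with a strictly lower tag is still in
 * flight.  That ordering guarantee is what replaces cache flushes.
 */
#include <stdbool.h>
#include <stdint.h>

struct tagged_write {
	uint64_t tag;		/* ordering group, e.g. an LSN     */
	bool completed;		/* set when the device finishes it */
	struct tagged_write *next;
};

/* Is it safe to issue 'w' given the writes already sent down? */
bool may_issue(const struct tagged_write *w,
	       const struct tagged_write *inflight)
{
	const struct tagged_write *i;

	for (i = inflight; i; i = i->next)
		if (!i->completed && i->tag < w->tag)
			return false;
	return true;
}

A DM target would hold back any bio failing that test and re-run the
check from its completion path as lower-tagged writes finish.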

So there's your proof of concept implementation - prove it works,
that priority inversion isn't a problem and that performance is
equivalent to the existing cache flush based implementation, and
then you have a proposal that we can take seriously.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx