Re: [Lsf-pc] [LSF/MM TOPIC] atomic block device

Jan Kara wrote:
On Mon 17-02-14 19:56:27, Dave Chinner wrote:
On Sat, Feb 15, 2014 at 10:47:12AM -0700, Andy Rudoff wrote:
On Sat, Feb 15, 2014 at 8:04 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

In response to Dave's call [1] and highlighting Jeff's attend request
[2] I'd like to stoke a discussion on an emulation layer for atomic
block commands.  Specifically, SNIA has laid out their position on the
command set an atomic block device may support (NVM Programming Model
[3]) and it is a good conversation piece for this effort.  The goal
would be to review the proposed operations, identify the capabilities
that would be readily useful to filesystems / existing use cases, and
tear down a straw man implementation proposal.

...

The argument for not doing this as a
device-mapper target or stacked block device driver is to ease
provisioning and make the emulation transparent.  On the other hand,
the argument for doing this as a virtual block device is that the
"failed to parse device metadata" is a known failure scenario for
dm/md, but not sd for example.


Hi Dan,

Like Jeff, I'm a member of the NVMP workgroup and I'd like to ring in here
with a couple observations.  I think the most interesting cases where
atomics provide a benefit are cases where storage is RAIDed across multiple
devices.  Part of the argument for atomic writes on SSDs is that databases
and file systems can save bandwidth and complexity by avoiding
write-ahead-logging.  But even if every SSD supported it, the majority of
production databases span across devices for either capacity, performance,
or, most likely, high availability reasons.  So in my opinion, that very
much supports the idea of doing atomics at a layer where it applies to SW
RAIDed storage (as I believe Dave and others are suggesting).

On the other side of the coin, I remember Dave talking about this during
our NVM discussion at LSF last year and I got the impression the size and
number of writes he'd need supported before he could really stop using his
journaling code was potentially large.  Dave: perhaps you can re-state the
number of writes and their total size that would have to be supported by
block level atomics in order for them to be worth using by XFS?

Hi Andy - the numbers I gave last year were at the upper end of the
number of iovecs we can dump into an atomic checkpoint in the XFS
log at a time. Because that is typically based on log size, and the
log can be up to 2GB in size, this tends to max out at somewhere
around 150,000-200,000 individual iovecs and/or roughly 100MB of
metadata.
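As a back-of-the-envelope figure derived from those numbers (not a
measurement), the average dirty region in such a checkpoint is only a few
hundred bytes:

    100 MB / 200,000 iovecs ~= 525 bytes per iovec
    100 MB / 150,000 iovecs ~= 700 bytes per iovec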

Yeah, it's a lot, but keep in mind that a workload running 250,000
file creates per second on XFS is retiring somewhere around 300,000
individual transactions per second, each of which will typically
have 10-20 dirty regions in them.  If we were to write them as
individual atomic writes at transaction commit time we'd need to
sustain somewhere in the order of 3-6 _million IOPS_ to maintain
this transaction rate with individual atomic writes for each
transaction.
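Spelling that arithmetic out:

    300,000 transactions/s * 10 dirty regions/transaction = 3,000,000 writes/s
    300,000 transactions/s * 20 dirty regions/transaction = 6,000,000 writes/s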

That would also introduce unacceptable IO latency as we can't modify
metadata while it is under IO, especially as a large number of these
regions are redirtied repeatedly during ongoing operations (e.g.
directory data and index blocks). Hence to avoid this problem with
atomic writes, we still need asynchronous transactions and
in-memory aggregation of changes.  IOWs, checkpoints are the unit
of atomic write we need to support in XFS.
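To make the shape of that requirement concrete, here is a purely
hypothetical sketch of the kind of submission interface a checkpoint-sized
atomic write implies; none of these names exist in the kernel, they are
placeholders for this discussion:

    /*
     * Hypothetical sketch only: a checkpoint-sized atomic write is a
     * vector of discontiguous regions that must become durable as a
     * single all-or-nothing unit, with no filesystem-issued cache
     * flush/FUA ordering in between.
     */
    struct atomic_write_vec {
            sector_t        sector;         /* destination LBA */
            void            *data;          /* dirty metadata region */
            size_t          len;            /* length in bytes */
    };

    /* Placeholder name, not a real kernel API. */
    int blkdev_issue_atomic_write(struct block_device *bdev,
                                  struct atomic_write_vec *vecs,
                                  unsigned int nr_vecs, gfp_t gfp_mask);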

We can limit the size of checkpoints in XFS without too much
trouble, either by amount of data or number of iovecs, but that
comes at a performance cost. To maintain current levels of
performance we need a decent amount of in-memory change aggregation
and hence we are going to need - at minimum - thousands of vectors
in each atomic write. I'd prefer tens of thousands to hundreds of
thousands of vectors because that's our typical unit of "atomic
write" at current performance levels, but several thousand vectors
and tens of MB is sufficient to start with....
   I did the math for ext4 and it worked out rather similarly. After the
transaction batching we do in memory, we have transactions which are tens
of MB in size. These go first to a physically contiguous journal during
transaction commit (that's the easy part, but it would already save us one
cache flush + FUA write) and then, during checkpoint, to their final
locations on disk, which can be physically discontiguous, so that can be
thousands to tens of thousands of different locations (this would save us
another cache flush + FUA write).
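Simplified, the current ordering looks like this; the two flush + FUA pairs
are the part an atomic commit would let us drop:

    1. write the transaction's blocks to the journal (physically contiguous)
    2. cache flush, then FUA write of the commit block        <- first saving
    3. later, checkpoint: write the blocks to their final,
       scattered on-disk locations
    4. cache flush + FUA to make the checkpoint durable       <- second saving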

As in the XFS case, it is easy to force smaller transactions in ext4,
but the smaller you make them, the larger the journaling overhead becomes...

Again, if you simply tag writes with group IDs as I outlined before
(http://www.spinics.net/lists/linux-fsdevel/msg70047.html) then you don't
need explicit cache flushes, nor do you need to worry about transaction
size limits. All you actually need is to ensure the ordering of a specific
set of writes in relation to another specific set of writes, completely
independent of other arbitrary writes.

You folks are cooking up a solution for NVMe that's only practical when
data transfer rates are fast enough that a 100MB write can be done in
~1ms, whereas a simple tweak of command tagging will work for everything
from the slowest HDD to the fastest storage device.
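For illustration only (the details of the actual proposal are in the link
above), the idea is roughly that each write carries a group tag plus an
ordering dependency on another group, instead of being ordered by cache
flushes; the names below are invented for this sketch:

    /* Invented for illustration; not an existing block-layer interface. */
    struct write_group_tag {
            u32     group_id;       /* the set of writes this one belongs to */
            u32     depends_on;     /* group that must be durable first, 0 = none */
    };

    /*
     * A journal commit block for transaction N would then simply be tagged
     * as depending on the group holding transaction N's other writes: no
     * flush, no size limit, and unrelated writes are unaffected.
     */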

As it is, the Atomic Write mechanism will be unusable for DBs when the transaction size exceeds whatever limit a particular device supports, thus requiring DB software to still provide a fallback mechanism, e.g. standard WAL, which only results in more complicated software. That's not a solution, that's just a new problem.
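In caricature (every name here is invented for this sketch, no particular
database implied), that fallback requirement means the DB keeps both code
paths and merely grows a branch:

    /* Sketch only; all identifiers are placeholders. */
    struct db_txn {
            size_t   total_bytes;   /* bytes dirtied by this transaction */
            unsigned nr_extents;    /* discontiguous regions touched */
    };

    int submit_atomic_write(struct db_txn *txn);  /* placeholder */
    int wal_commit(struct db_txn *txn);           /* placeholder */

    static int commit_txn(struct db_txn *txn,
                          size_t dev_atomic_size_limit,
                          unsigned dev_atomic_vec_limit)
    {
            if (txn->total_bytes <= dev_atomic_size_limit &&
                txn->nr_extents  <= dev_atomic_vec_limit)
                    return submit_atomic_write(txn);  /* device fast path */

            /* Over the device limit: fall back to the WAL the DB still
             * has to implement and maintain anyway. */
            return wal_commit(txn);
    }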

--
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/