Re: [LSF/MM TOPIC] atomic block device

On Mon, Feb 17, 2014 at 5:05 AM, Chris Mason <clm@xxxxxx> wrote:
> On 02/15/2014 10:04 AM, Dan Williams wrote:
>>
>> In response to Dave's call [1] and highlighting Jeff's attend request
>> [2] I'd like to stoke a discussion on an emulation layer for atomic
>> block commands.  Specifically, SNIA has laid out their position on the
>> command set an atomic block device may support (NVM Programming Model
>> [3]) and it is a good conversation piece for this effort.  The goal
>> would be to review the proposed operations, identify the capabilities
>> that would be readily useful to filesystems / existing use cases, and
>> tear down a straw man implementation proposal.
>>
>> The SNIA defined capabilities that seem the highest priority to implement
>> are:
>> * ATOMIC_MULTIWRITE - discontiguous LBA ranges, power fail atomic, no
>> ordering constraint relative to other i/o
>>
>> * ATOMIC_WRITE - contiguous LBA range, power fail atomic, no ordering
>> constraint relative to other i/o
>>
>> * EXISTS - not an atomic command, but defined in the NPM.  It is akin
>> to SEEK_{DATA|HOLE} to test whether an LBA is mapped or unmapped.  If
>> the LBA is mapped, it additionally specifies whether data is present
>> or the LBA is only allocated.
>>
>> * SCAR - again not an atomic command, but once we have metadata we can
>> implement a bad block list, analogous to the bad-block-list support in
>> md.
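
To make these concrete, here is a rough strawman of what kernel-facing
descriptors for the two atomic operations and EXISTS might look like.
None of these names exist today; they are assumptions purely for the
sake of discussion:

#include <linux/types.h>	/* sector_t */

struct page;			/* payload pages, defined elsewhere */

/* One contiguous LBA range within an atomic operation. */
struct atomic_extent {
	sector_t	lba;		/* starting LBA */
	unsigned int	nr_sectors;	/* length of the range */
};

/*
 * ATOMIC_MULTIWRITE: after a power failure either every extent is
 * visible or none is.  ATOMIC_WRITE is the degenerate single-extent
 * case.  No ordering is implied relative to other in-flight i/o.
 */
struct atomic_multiwrite {
	unsigned int		nr_extents;
	struct atomic_extent	*extents;
	struct page		**pages;	/* payload for each extent */
};

/* EXISTS: not atomic; reports whether an LBA is unmapped, mapped but
 * only allocated, or mapped with data present. */
enum lba_state { LBA_UNMAPPED, LBA_ALLOCATED, LBA_DATA };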
>>
>> Initial thought is that this functionality is better implemented as a
>> library a block device driver (bio-based or request-based) can call to
>> emulate these features.  In the case where the feature is directly
>> supported by the underlying hardware device the emulation layer will
>> stub out and pass it through.  The argument for not doing this as a
>> device-mapper target or stacked block device driver is to ease
>> provisioning and make the emulation transparent.  On the other hand,
>> the argument for doing this as a virtual block device is that a
>> "failed to parse device metadata" error is a known failure scenario
>> for dm/md, but not for sd, for example.
>
>
> Hi Dan,
>
> I'd suggest a dm device instead of a special library, mostly because the
> emulated device is likely to need some kind of cleanup action after a crash,
> and the dm model is best suited to cleanly provide that.  It's also a good
> fit for people that want to duct tape a small amount of very fast nvm onto
> relatively slower devices.

Hi Chris,

I can see that.  It would be surprising if sda failed to show up due
to metadata corruption.  Support for making the transition transparent
when the backing device supports the offloads can come later.
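
For the library flavor, the driver-facing entry point could be as
simple as the following strawman, building on the atomic_multiwrite
descriptor sketched above.  The queue flag and both helpers are made
up; nothing like them exists yet:

#include <linux/blkdev.h>

int blk_atomic_multiwrite(struct request_queue *q,
			  struct atomic_multiwrite *mw)
{
	/* Pass through when the underlying device advertises native
	 * support for power-fail atomic multiwrites (hypothetical
	 * queue flag). */
	if (blk_queue_atomic_multiwrite(q))
		return submit_atomic_native(q, mw);

	/*
	 * Emulation path: stage the extents in the library's own log /
	 * scratch area so that a crash mid-update replays or discards
	 * the whole set when the metadata is next read.
	 */
	return submit_atomic_emulated(q, mw);
}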

> The absolute minimum to provide something useful is a 16K discontig atomic.
> That won't help the filesystems much, but it will allow mysql to turn off
> double buffering.  Oracle would benefit from ~64K, mostly from a safety
> point of view since they don't double buffer.
>
> Helping the filesystems is harder, we need atomics bigger than any
> individual device is likely to provide.  But as Dave says elsewhere in the
> thread, we can limit that for specific workloads.

This sounds like a difference between "atomically handle a set of
commands up to the device's in-flight queue depth" vs "guarantee
atomic commit of transactions that may have landed on media a while
ago with the current set of in-flight requests".  Am I parsing the
difference correctly?

> I'm not sold on SCAR, since I'd expect the FTL or drive firmware to
> provide that for us.  What use case do you have in mind there?

The only use case I know for SCAR is the internal functionality RAID
firmware implements to continue an array rebuild upon encountering a bad
block.  Rather than stop the rebuild or silently corrupt data, it
"scars" the LBA ranges on the incoming rebuild target that could not be
recovered due to bad blocks on the other array members.
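
As a sketch of that flow (the types and every helper below are
hypothetical), the rebuild loop would scar the unrecoverable ranges and
keep going rather than aborting:

static int rebuild_member(struct raid_array *array,
			  struct raid_member *target,
			  sector_t nr_sectors, unsigned int chunk_sectors)
{
	sector_t lba;
	void *buf = alloc_rebuild_buffer(chunk_sectors);

	for (lba = 0; lba < nr_sectors; lba += chunk_sectors) {
		if (reconstruct_range(array, lba, chunk_sectors, buf) == 0) {
			/* Source data recovered from surviving members. */
			write_range(target, lba, chunk_sectors, buf);
		} else {
			/* Bad blocks elsewhere in the array: record the
			 * range in the target's bad block / scar list so
			 * later reads fail explicitly instead of
			 * returning stale data, then keep rebuilding. */
			scar_range(target, lba, chunk_sectors);
		}
	}
	free_rebuild_buffer(buf);
	return 0;
}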