Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical Interface, and Vector I/Os

"Theodore Ts'o" <tytso@xxxxxxx> · Thu, 5 Jan 2017 20:11:44 -0500

On Thu, Jan 05, 2017 at 10:58:57PM +0000, Slava Dubeyko wrote:
> 
> Next point is read disturbance. If BER of physical page/block achieves some threshold then
> we need to move data from one page/block into another one. What subsystem will be
> responsible for this activity? The drive-managed case expects that device's GC will manage
> read disturbance issue. But what's about host-aware or host-managed case? If the host side
> hasn't information about BER then the host's software is unable to manage this issue. Finally,
> it sounds that we will have GC subsystem as on file system side as on device side. As a result,
> it means possible unpredictable performance degradation and decreasing device lifetime.
> Let's imagine that host-aware case could be unaware about read disturbance management.
> But how host-managed case can manage this issue?

One of the ways this could be done in the ZBC specification (assuming
that erase blocks == zones) would be to set the "reset" bit in the zone
descriptor which is returned by the REPORT ZONES EXT command.  This is
a hint that a reset write pointer should be sent to the zone in
question, and it could be set when you start seeing soft ECC errors or
the flash management layer has decided that the zone should be
rewritten in the near future.  A simple way to do this is to ask the
Host OS to copy the data to another zone and then send a reset write
pointer command for the zone.
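
As a rough sketch of that host-side flow, here is a toy Python model.
Every name in it (Zone, reset_recommended, evacuate_flagged_zones) is
made up for illustration and is not a real ZBC or kernel interface; it
only models the "copy the data elsewhere, then reset the write pointer"
sequence, assuming the device exposes the hint as a per-zone flag.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Toy model of a zone; not a real ZBC zone descriptor layout."""
    data: list = field(default_factory=list)  # blocks written so far
    reset_recommended: bool = False           # hypothetical "reset" hint bit

    @property
    def write_pointer(self):
        return len(self.data)

    def reset_write_pointer(self):
        self.data.clear()
        self.reset_recommended = False


def evacuate_flagged_zones(zones):
    """Copy data out of each flagged zone into an empty zone, then reset it.

    Returns a list of (source_index, destination_index) pairs.
    """
    moves = []
    for src_idx, src in enumerate(zones):
        if not src.reset_recommended:
            continue
        # Pick any empty, unflagged zone as the destination.
        dst_idx = next(
            (i for i, z in enumerate(zones)
             if i != src_idx
             and z.write_pointer == 0
             and not z.reset_recommended),
            None,
        )
        if dst_idx is None:
            continue  # no free zone; leave the hint set and retry later
        zones[dst_idx].data.extend(src.data)  # sequential copy
        src.reset_write_pointer()
        moves.append((src_idx, dst_idx))
    return moves
```

A real host would learn about the hint from the zone report and would
have to update its own metadata to point at the new zone before issuing
the reset; the sketch leaves that part out.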

So I think it very much could be done, and done within the framework
of the ZBC model --- although whether SSD manufacturers will choose to
do this, and/or choose to engage the T10/T13 standards committees to
add the necessary extensions to the ZBC specification is a question
that we probably can't answer in this venue or by the participants on
this thread.

> Wear leveling... Device will be responsible to manage wear-leveling for the case of device-managed
> and host-aware models. It looks like that the host side should be responsible to manage wear-leveling
> for the host-managed case. But it means that the host should manage bad blocks and to have direct
> access to physical pages/blocks. Otherwise, physical erase blocks will be hidden by device's indirection
> layer and wear-leveling management will be unavailable on the host side. As a result, device will have
> internal GC and the traditional issues (possible unpredictable performance degradation and decreasing
> device lifetime).

So I can imagine a setup where the flash translation layer manages the
mapping between zone numbers and the physical erase blocks, such that
when the host OS issues a "reset write pointer", it immediately gets
a new erase block assigned to the specific zone in question.  The
original erase block would then get erased in the background, when the
flash chip in question is available for maintenance activities.
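
To make that concrete, here is a minimal sketch of such a mapping layer
in Python.  The class and method names (ZoneMappingFTL,
reset_write_pointer, background_erase) are hypothetical, and it assumes
the device keeps at least one spare erase block beyond the number of
zones it exposes.  It is a toy model, not how any particular FTL works.

```python
from collections import deque


class ZoneMappingFTL:
    """Toy FTL that maps zone numbers to physical erase blocks."""

    def __init__(self, num_zones, num_blocks):
        if num_blocks <= num_zones:
            raise ValueError("need spare erase blocks beyond the zone count")
        self.zone_to_block = {z: z for z in range(num_zones)}
        # Spare blocks that are already erased and ready for use.
        self.free_blocks = deque(range(num_zones, num_blocks))
        # Blocks released by a reset that still need to be erased.
        self.pending_erase = deque()
        self.erase_counts = [0] * num_blocks

    def reset_write_pointer(self, zone):
        """Remap the zone to a clean block right away; defer the erase."""
        if not self.free_blocks:
            # No clean spare left: fall back to erasing one block now.
            self.background_erase()
        old_block = self.zone_to_block[zone]
        self.zone_to_block[zone] = self.free_blocks.popleft()
        self.pending_erase.append(old_block)

    def background_erase(self, max_blocks=1):
        """Erase up to max_blocks queued blocks; returns how many were erased."""
        done = 0
        while self.pending_erase and done < max_blocks:
            block = self.pending_erase.popleft()
            self.erase_counts[block] += 1
            self.free_blocks.append(block)
            done += 1
        return done
```

With the erase counts kept on the device side, the same layer could
also pick the least-worn free block on each reset, which would give
wear leveling without exposing physical erase blocks to the host.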

I think you've been thinking about a model where *either* the host has
complete control over all aspects of the flash management, or the FTL
has complete control --- and it may be that there are more clever ways
that the work could be split between flash device and the host OS.

> Another interesting question... Let's imagine that we create file system volume for one device
> geometry. It means that geometry details will be stored in the file system metadata during volume
> creation for the case host-aware or host-managed case. Then we backups this volume and restore
> the volume on device with completely different geometry. So, what will we have for such case?
> Performance degradation? Or will we kill the device?

This is why I suspect that exposing the full details of the Flash
layout via LUNs is a bad, bad, BAD idea.  It's much better
to use an abstraction such as Zones, and then have an abstraction
layer that hides the low-level details of the hardware from the OS.
The trick is picking an abstraction that exposes the _right_ set of
details so that the division of labor between the Host OS and the
storage device is at a better place.  Hence my suggestion of perhaps
providing a virtual mapping layer between "Zone number" and the
low-level physical erase block.

> I would like to have access channels/LUNs/zones on file system level.
> If, for example, LUN will be associated with partition then it means
> that it will need to aggregate several partitions inside of one volume.
> First of all, not every file system is ready for the aggregation several
> partitions inside of the one volume. Secondly, what's about aggregation
> several physical devices inside of one volume? It looks like as slightly
> tricky to distinguish partitions of the same device and different devices
> on file system level. Isn't it?

Yes, this is why using LUNs is a BAD idea.  There's too much code
--- in file systems, in the block layer in terms of how we expose
block devices, etc., that assumes that different LUNs are used for
different logical containers of storage.  There have been decades of
usage of this concept by enterprise storage arrays.  Trying to
appropriate LUNs for another use case is stupid.  And maybe we can't
stop OCSSD folks if they have gone down that questionable design path,
but there's nothing that says we have to expose it as a SCSI LUN
inside of Linux!

> OK. But I assume that SMR zone "reset" is significantly cheaper than
> NAND flash block erase operation. And you can fill your SMR zone with
> data then "reset" it and to fill again with data without significant penalty.

If you have a virtual mapping layer between zones and erase blocks, a
reset write pointer could be fast for SSDs as well.  And that allows
the implementation of your suggestion below:

> Also, TRIM and zone "reset" are different, I suppose. Because, TRIM looks
> like as a hint for SSD controller. If SSD controller receives TRIM for some
> erase block then it doesn't mean  that erase operation will be done
> immediately. Usually, it should be done in the background because real
> erase operation is expensive operation.

Cheers,

					- Ted