Re: [PATCH v6 0/2] Block layer support ZAC/ZBC commands

Damien Le Moal <Damien.LeMoal@xxxxxxxx> · Wed, 3 Aug 2016 01:29:27 +0000

Hannes, Shaun,

Let me add some more comments.

> On Aug 2, 2016, at 23:35, Hannes Reinecke <hare@xxxxxxx> wrote:
> 
> On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
>> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>>> 
>>> Can you please integrate this with Hannes series so that it uses
>>> his cache of the zone information?
>> 
>> Adding Hannes and Damien to Cc.
>> 
>> Christoph,
>> 
>> I can make a patch the marshal Hannes' RB-Tree into to a block report, that is
>> quite simple. I can even have the open/close/reset zone commands update the
>> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
>> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
>> not need the RB-Tree to function with zoned media.
>> 
>> I do still have concerns with the approach which I have shared in smaller
>> forums but perhaps I have to bring them to this group.
>> 
>> First is the memory consumption. This isn't really much of a concern for large
>> servers with few drives but I think the embedded NAS market will grumble as
>> well as the large data pods trying to stuff 300+ drives in a chassis.
>> 
>> As of now the RB-Tree needs to hold ~30000 zones.
>> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
>> around 3.5 MB per zoned drive attached.
>> Which is fine if it is really needed, but most of it is fixed information
>> and it can be significantly condensed (I have proposed 8 bytes per zone held
>> in an array as more than adequate). Worse is that the crucial piece of
>> information, the current wp needed for scheduling the next write, is mostly
>> out of date because it is updated only after the write completes and zones
>> being actively written to must work off of the last location / size that was
>> submitted, not completed. The work around is for that tracking to be handled
>> in the private_data member. I am not saying that updating the wp on
>> completing a write isn’t important, I am saying that the bi_end_io hook is
>> the existing hook that works just fine.
>> 
> Which _actually_ is not true; with my patches I'll update the write
> pointer prior to submit the I/O (on the reasoning that most of the time
> I/O will succeed) and re-read the zone information if an I/O failed.
> (Which I'll have to do anyway as after an I/O failure the write pointer
> status is not clearly defined.)
> 
> I have thought about condensing the RB tree information, but then I
> figured that for 'real' SMR handling we cannot assume all zones are of
> fixed size, and hence we need all the information there.
> Any condensing method would assume a given structure of the zones, which
> the standard just doesn't provide.
> Or am I missing something here?

Indeed, the standards do not mandate any particular zone configuration,
constant zone size, etc. So writing code so that can be handled is certainly
the right way of doing things. However, if we decide to go forward with
mapping RESET WRITE POINTER command to DISCARD, then at least a constant
zone size (minus the last zone as you said) must be assumed, and that
information can be removed from the entries in the RB tree (as it will be
saved for the sysfs "zone_size" file anyway. Adding a little code to handle
that eventual last runt zone with a different size is not a big problem.

> 
> As for write pointer handling: yes, the write pointer on the zones is
> not really useful for upper-level usage.
> Where we do need it is to detect I/O which is crossing the write pointer
> (eg when doing reads over the entire zone).
> As per spec you will be getting an I/O error here, so we need to split
> the I/O on the write pointer to get valid results back.

To be precise here, the I/O splitting will be handled by the block layer
thanks to the "chunk_sectors" setting. But that relies on a constant zone
size assumption too.

The RB-tree here is most useful for reads over or after the write pointer as
this can have different behavior on different drives (URSWRZ bit). The RB-tree
allows us to hide these differences to upper layers and simplify support at
those levels.

I too agree that the write pointer value is not very useful to upper layers
(e.g. FS). What matters most of the times for these layers is an "allocation
pointer" or "submission pointer" which indicates the LBA to use for the next
BIO submission. That LBA will most of the time be in advance of the zone WP
because of request queing in the block scheduler.
Note that from what we have done already, in many cases, upper layers do not
even need this (e.g. F2FS, btrfs) and work fine without needing *any* 
zone->private_data because that "allocation pointer" information generally
already exists in a different form (e.g. a bit in a bitmap).
For these cases, not having the RB-tree of zones would force a lot
more code to be added to file systems for SMR support.

> 
>> This all tails into domain responsibility. With the RB-Tree doing half of the
>> work and the ‘responsible’ domain handling the active path via private_data
>> why have the split at all? It seems to be a double work to have second object
>> tracking the first so that I/O scheduling can function.
>> 
> Tracking the zone states via RB trees is mainly to handly host-managed
> drives seamlessly. Without an RB tree we will be subjected to I/O errors
> during boot, as eg with new drives we inevitably will try to read beyond
> the write pointer, which in turn will generate I/O errors on certain drives.
> I do agree that there is no strict need to setup an RB tree for
> host-aware drives; I can easily add an attribute/flag to disable RB
> trees for those.
> However, tracking zone information in the RB tree _dramatically_ speeds
> up device initialisation; issuing a blkdev_discard() for the entire
> drive will take _ages_ without it.

And the RB-tree will also be very useful for speeding up report zones
ioctls too. Unless we want those to force an update of the RB-tree information ?

> 
>> Finally is the error handling path when the RB-Tree encounters and error it
>> attempts to requery the drive topology virtually guaranteeing that the
>> private_data is now out-of-sync with the RB-Tree. Again this is something
>> that can be better encapsulated in the bi_end_io to be informed of the
>> failed I/O and schedule the appropriate recovery (including re-querying the
>> zone information of the affected zone(s)).
>> 
> Well, if we have an RB tree with write pointers of course we need to
> re-read the zone information on failure (and I thought I did that?).
> Plus the error information will be propagated via the usual mechanism,
> so they need to take care of updating any information in ->private_data.
> 
> I'm perfectly fine with integrating your patches for the various
> blkdev_XX callouts and associated ioctls; Jens favours that approach,
> too, so we should be combining those efforts.

Can we agree on an interface ?
Summarizing all the discussions we had, I am all in favor of the following:

1) A "zone_size" sysfs attribute to indicate that a drive is zoned:
The already existing device type file can tell us if this is an host
managed disk (type==20) or a host aware disk (type==0). Drive managed
disks could also be detected, but since no information on their zone
configuration can be obtained, lets treat those as regular drives.

2) Add ioctls for zone management:
Report zones (get information from RB tree), reset zone (simple wrapper
to ioctl for block discard), open zone, close zone and finish zone. That
will allow mkfs like tools to get zone information without having to parse
thousands of sysfs files (and can also be integrated in libzbc block backend
driver for a unified interface with the direct SG_IO path for kernels without
the ZBC code enabled).

3) Try to condense the blkzone data structure to save memory:
I think that we can at the very least remove the zone length, and also
may be the per zone spinlock too (a single spinlock and proper state flags can
be used).

Best regards.

------------------------
Damien Le Moal, Ph.D.
Sr. Manager, System Software Group, HGST Research,
HGST, a Western Digital brand
Damien.LeMoal@xxxxxxxx
(+81) 0466-98-3593 (ext. 513593)
1 kirihara-cho, Fujisawa, 
Kanagawa, 252-0888 Japan
www.hgst.com 

Western Digital Corporation (and its subsidiaries) E-mail Confidentiality Notice & Disclaimer:

This e-mail and any files transmitted with it may contain confidential or legally privileged information of WDC and/or its affiliates, and are intended solely for the use of the individual or entity to which they are addressed. If you are not the intended recipient, any disclosure, copying, distribution or any action taken or omitted to be taken in reliance on it, is prohibited. If you have received this e-mail in error, please notify the sender immediately and delete the e-mail in its entirety from your system.
��.n��������+%������w��{.n�����{����n�r������&��z�ޗ�zf���h���~����������_��+v���)ߣ�