On 08/01/2016 07:07 PM, Shaun Tancheff wrote:
> On Mon, Aug 1, 2016 at 4:41 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>>
>> Can you please integrate this with Hannes' series so that it uses
>> his cache of the zone information?
>
> Adding Hannes and Damien to Cc.
>
> Christoph,
>
> I can make a patch to marshal Hannes' RB-Tree into a block report, that is
> quite simple. I can even have the open/close/reset zone commands update the
> RB-Tree .. the non-private parts anyway. I would prefer to do this around the
> CONFIG_SD_ZBC support, offering the existing type of patch for setups that do
> not need the RB-Tree to function with zoned media.
>
> I do still have concerns with the approach, which I have shared in smaller
> forums, but perhaps I have to bring them to this group.
>
> First is the memory consumption. This isn't really much of a concern for
> large servers with few drives, but I think the embedded NAS market will
> grumble, as will the large data pods trying to stuff 300+ drives in a chassis.
>
> As of now the RB-Tree needs to hold ~30000 zones.
> sizeof() reports struct blk_zone to use 120 bytes on x86_64. This yields
> around 3.5 MB per zoned drive attached.
> Which is fine if it is really needed, but most of it is fixed information
> and it can be significantly condensed (I have proposed 8 bytes per zone held
> in an array as more than adequate). Worse is that the crucial piece of
> information, the current wp needed for scheduling the next write, is mostly
> out of date, because it is updated only after the write completes; zones
> being actively written to must work off of the last location / size that was
> submitted, not completed. The workaround is for that tracking to be handled
> in the private_data member. I am not saying that updating the wp on
> completing a write isn't important, I am saying that the bi_end_io hook is
> the existing hook that works just fine.
>
Which _actually_ is not true; with my patches I'll update the write pointer
prior to submitting the I/O (on the reasoning that most of the time the I/O
will succeed) and re-read the zone information if an I/O failed.
(Which I'll have to do anyway, as after an I/O failure the write pointer
status is not clearly defined.)

I have thought about condensing the RB-tree information, but then I figured
that for 'real' SMR handling we cannot assume all zones are of fixed size,
and hence we need all the information there.
Any condensing method would assume a given structure of the zones, which the
standard just doesn't provide.
Or am I missing something here?

As for write pointer handling: yes, the write pointer on the zones is not
really useful for upper-level usage.
Where we do need it is to detect I/O which is crossing the write pointer
(eg when doing reads over the entire zone).
As per spec you will be getting an I/O error here, so we need to split the
I/O at the write pointer to get valid results back (a rough sketch of this
is below).

> This all tails into domain responsibility. With the RB-Tree doing half of
> the work and the 'responsible' domain handling the active path via
> private_data, why have the split at all? It seems to be double work to have
> a second object tracking the first so that I/O scheduling can function.
>
Tracking the zone states via RB trees is mainly to handle host-managed drives
seamlessly. Without an RB tree we will be subjected to I/O errors during
boot, as eg with new drives we inevitably will try to read beyond the write
pointer, which in turn will generate I/O errors on certain drives.
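
To make the write-pointer splitting a bit more concrete, this is roughly what
I have in mind (untested sketch only; blk_lookup_zone() and the zone 'wp'
member are from my zone-cache series and not in mainline, so take those as
assumptions here):

/*
 * Untested sketch: blk_lookup_zone() and zone->wp are assumed to come
 * from the zone-cache series, not from mainline.
 */
#include <linux/blkdev.h>
#include <linux/bio.h>

static struct bio *zbc_split_read_at_wp(struct request_queue *q,
					struct bio *bio)
{
	struct blk_zone *zone;
	sector_t start = bio->bi_iter.bi_sector;
	sector_t len = bio_sectors(bio);

	if (bio_data_dir(bio) != READ)
		return bio;

	zone = blk_lookup_zone(q, start);	/* cached zone lookup */
	if (!zone)
		return bio;

	/*
	 * Read starts below the cached write pointer but runs past it:
	 * split at the write pointer so the front half returns valid
	 * data; the caller still has to handle the remainder (e.g.
	 * zero-fill it) instead of sending it to the drive.
	 */
	if (start < zone->wp && start + len > zone->wp)
		return bio_split(bio, zone->wp - start, GFP_NOIO,
				 q->bio_split);

	return bio;
}

The point being that with the cached zone information we can catch a read
crossing the write pointer before it ever reaches the drive, instead of
waiting for the error to come back and sorting it out in the completion path.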
I do agree that there is no strict need to set up an RB tree for host-aware
drives; I can easily add an attribute/flag to disable RB trees for those.
However, tracking zone information in the RB tree _dramatically_ speeds up
device initialisation; issuing a blkdev_discard() for the entire drive will
take _ages_ without it.

> Finally is the error handling path: when the RB-Tree encounters an error it
> attempts to requery the drive topology, virtually guaranteeing that the
> private_data is now out-of-sync with the RB-Tree. Again, this is something
> that can be better encapsulated in the bi_end_io hook, which is informed of
> the failed I/O and can schedule the appropriate recovery (including
> re-querying the zone information of the affected zone(s)).
>
Well, if we have an RB tree with write pointers, of course we need to re-read
the zone information on failure (and I thought I did that?).
Plus the error information will be propagated via the usual mechanism, so
users of ->private_data need to take care of updating any information there.

I'm perfectly fine with integrating your patches for the various blkdev_XX
callouts and associated ioctls; Jens favours that approach, too, so we should
be combining those efforts.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   Teamlead Storage & Networking
hare@xxxxxxx                                      +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)