On 2019/12/06 16:09, Hannes Reinecke wrote:
> On 12/6/19 5:37 AM, Coly Li wrote:
>> On 2019/12/6 8:30 AM, Damien Le Moal wrote:
>>> On 2019/12/06 9:22, Eric Wheeler wrote:
>>>> On Thu, 5 Dec 2019, Coly Li wrote:
>>>>> This is very basic zoned device support. With this patch, the
>>>>> bcache device is able to,
>>>>> - Export zoned device attributes via sysfs
>>>>> - Respond to report zones requests, e.g. from the command
>>>>>   'blkzone report'
>>>>> But the bcache device is still NOT able to,
>>>>> - Respond to any zoned device management request or IOCTL command
>>>>>
>>>>> Here are the tests I have done,
>>>>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>>>>> - read /sys/block/bcache0/queue/nr_zones, content is the number
>>>>>   of zones, including all zone types.
>>>>> - read /sys/block/bcache0/queue/chunk_sectors, content is the
>>>>>   zone size in sectors.
>>>>> - run 'blkzone report /dev/bcache0', all zone information is
>>>>>   displayed.
>>>>> - run 'blkzone reset /dev/bcache0', the operation is rejected
>>>>>   with the error message: "blkzone: /dev/bcache0: BLKRESETZONE
>>>>>   ioctl failed: Operation not supported"
>>>>> - sequential writes by dd; I can see some zones' write pointer
>>>>>   ('wptr') values updated.
>>>>>
>>>>> All of these are very basic tests; if you have better testing
>>>>> tools or cases, please offer me a hint.
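For reference, the attribute export and report-zones passthrough
described above boil down to something like the sketch below. This is
not the actual patch, just a minimal illustration assuming the
v5.4-era zoned block API (these helpers have been renamed and had
their signatures changed across kernel versions), and the
private_data lookup to bcache's struct cached_dev is a hypothetical
simplification:

#include <linux/blkdev.h>

/* Mirror the zoned backing device's zone layout on the bcache queue. */
static void bcache_inherit_zoned_limits(struct request_queue *q,
					struct block_device *backing)
{
	struct request_queue *bq = bdev_get_queue(backing);

	/* The zone model (e.g. BLK_ZONED_HM) is what queue/zoned shows. */
	q->limits.zoned = blk_queue_zoned_model(bq);

	/* chunk_sectors is what queue/chunk_sectors reports. */
	blk_queue_chunk_sectors(q, blk_queue_zone_sectors(bq));

	/* nr_zones feeds queue/nr_zones. */
	q->nr_zones = blkdev_nr_zones(backing);
}

/* 'blkzone report /dev/bcache0' can essentially be forwarded. */
static int bcache_report_zones(struct gendisk *disk, sector_t sector,
			       struct blk_zone *zones,
			       unsigned int *nr_zones)
{
	struct cached_dev *dc = disk->private_data;	/* hypothetical */

	/*
	 * A real version would also remap 'sector' by bcache's
	 * data_offset on the backing device.
	 */
	return blkdev_report_zones(dc->bdev, sector, zones, nr_zones);
}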
>>>> Interesting.
>>>>
>>>> 1. should_writeback() could benefit by hinting true when an IO
>>>>    would fall in a zoned region.
>>>>
>>>> 2. The writeback thread could write back such that it prefers
>>>>    fully (or mostly) populated zones when choosing what to write
>>>>    out.
>>>
>>> That would definitely be a good idea, since it would certainly
>>> benefit backend GC (which will be needed).
>>>
>>> However, I do not see the point in exposing the /dev/bcacheX block
>>> device itself as a zoned disk. In fact, I think we want exactly the
>>> opposite: expose it as a regular disk so that any FS or application
>>> can run. If the bcache backend disk is zoned, then the writeback
>>> handles sequential writes. This would in the end be a solution
>>> similar to dm-zoned, that is, a zoned disk becomes usable as a
>>> regular block device (random writes anywhere are possible), but
>>> likely far more efficient and faster. That may result in imposing
>>> some limitations on bcache operations though, e.g. it can only be
>>> set up with writeback, no writethrough allowed (not sure though...).
>>> Thoughts?
>>>
>>
>> I have come to realize this is really the opposite idea. Let me try
>> to explain what I understand; please correct me if I am wrong. The
>> idea you propose is indeed to make bcache act as something like an
>> FTL for the backend zoned SMR drive, that is, bcache may convert
>> all random writes into sequential writes onto the backend zoned SMR
>> drive. In the meantime, if there is hot data, bcache continues to
>> act as a caching device to accelerate read requests.
>>
>> Yes, if I understand your proposal correctly, writeback mode might
>> be mandatory and backend GC will be needed. The idea is
>> interesting; it looks like adding a log-structured storage layer
>> between the current bcache B+tree indexing and the zoned SMR hard
>> drive.
>>
> Well, not sure if that's required.
>
> Or, to be correct, we actually have _two_ use cases:
> 1) Have an SMR drive as a backing device. This was my primary goal
>    for handling these devices, as SMR devices are typically not
>    _that_ fast.
> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)

Yes, it can get to that with dm-zoned if one goes crazy with sustained
random writes :) The physical drive itself does a lot more than 1 IOPS
in that case though, and is as fast as any other HDD. But from the DM
logical drive side, the user can sometimes fall into 1 IOPS territory
for really nasty workloads. Tests with well-behaved users like f2fs
show that SMR and regular HDDs are on par for performance.

> So having bcache running on top of those will be a clear win.
> But in this scenario the cache device will be a normal device
> (typically an SSD), and we shouldn't need much modification here.

I agree. That should work mostly as-is, since the user will be zone
aware and already be issuing sequential writes. bcache write-through
only needs to follow the same pattern, not reordering any writes, and
write-back only has to replay the same sequence.

> In fact, a good test case would be the btrfs patches which were
> posted earlier this week. With them you should be able to create a
> btrfs filesystem on the SMR drive, and use an SSD as a cache device.
> Getting this scenario to run would indeed be my primary goal, and I
> guess your patches should be more or less sufficient for that.

Plus, we will need zone revalidation and the zone type & zone write
lock bitmaps to prevent reordering by the block IO stack, unless
bcache is a BIO driver? My knowledge of bcache is limited; I would
need to look into the details a little more to be able to comment.

> 2) Using an SMR drive as a _cache_ device. This seems to be contrary
>    to the above statement of SMR drives not being fast, but then the
>    NVMe WG is working on a similar mechanism for flash devices
>    called 'ZNS' (zoned namespaces). And for those it really would
>    make sense to have bcache be able to handle zoned devices as
>    cache devices.
>    But this is, to my understanding, really in the early stages,
>    with no real hardware being available. Damien might disagree,
>    though :-)

Yes, that would be another potential use case, and ZNS indeed could
fit this model, assuming that zone sizes align (multiples) between
the front and back devices.

> And the implementation is still in the works on the Linux side, so
> it's more of a long-term goal.
>
> But the first use case is definitely something we should be looking
> at; SMR drives are available _and_ with large capacity, so any
> speedup there would be greatly appreciated.

Yes. And what I was talking about in my earlier email is actually a
third use case:

3) SMR drive as backend + regular SSD as frontend, with the resulting
bcache device advertising itself as a regular disk, hiding all the
zone & sequential write constraints from the user.

Since bcache already has some form of indirection table for cached
blocks, I thought we could hijack it to implement a sort of FTL that
would allow serializing random writes to the backend with the help of
the frontend as a write staging buffer. Doing so, we get full random
write capability with the benefit of "hot" blocks staying in the
cache. But again, not knowing enough details about bcache, I may be
talking too lightly here. Not sure if that is reasonably feasible
with the current bcache code.

Cheers.

> Cheers,
>
> Hannes

-- 
Damien Le Moal
Western Digital Research
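As a closing illustration of the third use case above: the FTL-style
serialization could, at its crudest, look like the user-space toy
model below. This is purely made up for this write-up, not bcache or
dm-zoned code; every name in it is invented, and zone selection and
garbage collection, which do the real work, are waved away in a
comment.

#include <inttypes.h>
#include <stdio.h>

#define NR_BLOCKS	1024u	/* logical capacity, in blocks */
#define ZONE_BLOCKS	256u	/* blocks per zone */

static uint64_t map[NR_BLOCKS];	/* logical block -> physical block */
static uint64_t wp;		/* global append point on the backend */

/* Serialize a random logical write: place it at the write pointer. */
static uint64_t ftl_write(uint64_t lba)
{
	uint64_t pba = wp++;	/* the media sees only sequential writes */

	if (wp % ZONE_BLOCKS == 0) {
		/*
		 * Zone filled: a real implementation would open the
		 * next empty zone here and garbage-collect the stale
		 * blocks left behind by overwrites.
		 */
	}
	map[lba] = pba;		/* any old mapping becomes garbage */
	return pba;
}

int main(void)
{
	/* Two "random" writes land in consecutive physical blocks. */
	printf("lba 500 -> pba %" PRIu64 "\n", ftl_write(500));
	printf("lba   7 -> pba %" PRIu64 "\n", ftl_write(7));
	return 0;
}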