On 2019/12/06 16:09, Hannes Reinecke wrote:
> On 12/6/19 5:37 AM, Coly Li wrote:
>> On 2019/12/6 8:30 AM, Damien Le Moal wrote:
>>> On 2019/12/06 9:22, Eric Wheeler wrote:
>>>> On Thu, 5 Dec 2019, Coly Li wrote:
>>>>> This is very basic zoned device support. With this patch, the
>>>>> bcache device is able to,
>>>>> - Export zoned device attributes via sysfs
>>>>> - Respond to report zones requests, e.g. from the command
>>>>>   'blkzone report'
>>>>> But the bcache device is still NOT able to,
>>>>> - Respond to any zoned device management request or IOCTL command
>>>>>
>>>>> Here are the tests I have done,
>>>>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>>>>> - read /sys/block/bcache0/queue/nr_zones, content is the number
>>>>>   of zones, including all zone types.
>>>>> - read /sys/block/bcache0/queue/chunk_sectors, content is the
>>>>>   zone size in sectors.
>>>>> - run 'blkzone report /dev/bcache0', all zone information is
>>>>>   displayed.
>>>>> - run 'blkzone reset /dev/bcache0', the operation is rejected
>>>>>   with the error message: "blkzone: /dev/bcache0: BLKRESETZONE
>>>>>   ioctl failed: Operation not supported"
>>>>> - sequential writes by dd; I can see some zones' write pointer
>>>>>   ('wptr') values updated.
>>>>>
>>>>> All of these are very basic tests; if you have better testing
>>>>> tools or cases, please offer me a hint.
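For reference, the attribute export and report-zones passthrough
described above boil down to something like the sketch below. This is
not the actual patch, just a minimal illustration assuming the
v5.4-era zoned block API (these helpers have been renamed and had
their signatures changed across kernel versions), and the
private_data lookup to bcache's struct cached_dev is a hypothetical
simplification:

#include <linux/blkdev.h>

/* Mirror the zoned backing device's zone layout on the bcache queue. */
static void bcache_inherit_zoned_limits(struct request_queue *q,
					struct block_device *backing)
{
	struct request_queue *bq = bdev_get_queue(backing);

	/* The zone model (e.g. BLK_ZONED_HM) is what queue/zoned shows. */
	q->limits.zoned = blk_queue_zoned_model(bq);

	/* chunk_sectors is what queue/chunk_sectors reports. */
	blk_queue_chunk_sectors(q, blk_queue_zone_sectors(bq));

	/* nr_zones feeds queue/nr_zones. */
	q->nr_zones = blkdev_nr_zones(backing);
}

/* 'blkzone report /dev/bcache0' can essentially be forwarded. */
static int bcache_report_zones(struct gendisk *disk, sector_t sector,
			       struct blk_zone *zones,
			       unsigned int *nr_zones)
{
	struct cached_dev *dc = disk->private_data;	/* hypothetical */

	/*
	 * A real version would also remap 'sector' by bcache's
	 * data_offset on the backing device.
	 */
	return blkdev_report_zones(dc->bdev, sector, zones, nr_zones);
}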
>>>> Interesting.
>>>>
>>>> 1. should_writeback() could benefit by hinting true when an IO
>>>>    would fall in a zoned region.
>>>>
>>>> 2. The writeback thread could write back such that it prefers
>>>>    fully (or mostly) populated zones when choosing what to write
>>>>    out.
>>>
>>> That would definitely be a good idea, since it would certainly
>>> benefit backend GC (which will be needed).
>>>
>>> However, I do not see the point in exposing the /dev/bcacheX block
>>> device itself as a zoned disk. In fact, I think we want exactly the
>>> opposite: expose it as a regular disk so that any FS or application
>>> can run. If the bcache backend disk is zoned, then the writeback
>>> handles sequential writes. This would in the end be a solution
>>> similar to dm-zoned, that is, a zoned disk becomes usable as a
>>> regular block device (random writes anywhere are possible), but
>>> likely far more efficient and faster. That may result in imposing
>>> some limitations on bcache operations though, e.g. it can only be
>>> set up with writeback, no writethrough allowed (not sure though...).
>>> Thoughts?
>>>
>>
>> I have come to realize this is really the opposite idea. Let me try
>> to explain what I understand; please correct me if I am wrong. The
>> idea you propose is indeed to make bcache act as something like an
>> FTL for the backend zoned SMR drive, that is, bcache may convert
>> all random writes into sequential writes onto the backend zoned SMR
>> drive. In the meantime, if there is hot data, bcache continues to
>> act as a caching device to accelerate read requests.
>>
>> Yes, if I understand your proposal correctly, writeback mode might
>> be mandatory and backend GC will be needed. The idea is
>> interesting; it looks like adding a log-structured storage layer
>> between the current bcache B+tree indexing and the zoned SMR hard
>> drive.
>>
> Well, not sure if that's required.
>
> Or, to be correct, we actually have _two_ use cases:
> 1) Have an SMR drive as a backing device. This was my primary goal
>    for handling these devices, as SMR devices are typically not
>    _that_ fast.
> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)

Yes, it can get to that with dm-zoned if one goes crazy with sustained
random writes :) The physical drive itself does a lot more than 1 IOPS
in that case though, and is as fast as any other HDD. But from the DM
logical drive side, the user can sometimes fall into 1 IOPS territory
for really nasty workloads. Tests with well-behaved users like f2fs
show that SMR and regular HDDs are on par for performance.

> So having bcache running on top of those will be a clear win.
> But in this scenario the cache device will be a normal device
> (typically an SSD), and we shouldn't need much modification here.

I agree. That should work mostly as-is, since the user will be zone
aware and already be issuing sequential writes. bcache write-through
only needs to follow the same pattern, not reordering any writes, and
write-back only has to replay the same sequence.

> In fact, a good test case would be the btrfs patches which were
> posted earlier this week. With them you should be able to create a
> btrfs filesystem on the SMR drive, and use an SSD as a cache device.
> Getting this scenario to run would indeed be my primary goal, and I
> guess your patches should be more or less sufficient for that.

Plus, we will need zone revalidation and the zone type & zone write
lock bitmaps to prevent reordering by the block IO stack, unless
bcache is a BIO driver? My knowledge of bcache is limited; I would
need to look into the details a little more to be able to comment.

> 2) Using an SMR drive as a _cache_ device. This seems to be contrary
>    to the above statement of SMR drives not being fast, but then the
>    NVMe WG is working on a similar mechanism for flash devices
>    called 'ZNS' (zoned namespaces). And for those it really would
>    make sense to have bcache be able to handle zoned devices as
>    cache devices.
>    But this is, to my understanding, really in the early stages,
>    with no real hardware being available. Damien might disagree,
>    though :-)

Yes, that would be another potential use case, and ZNS indeed could
fit this model, assuming that zone sizes align (multiples) between
the front and back devices.

> And the implementation is still in the works on the Linux side, so
> it's more of a long-term goal.
>
> But the first use case is definitely something we should be looking
> at; SMR drives are available _and_ with large capacity, so any
> speedup there would be greatly appreciated.

Yes. And what I was talking about in my earlier email is actually a
third use case:

3) SMR drive as backend + regular SSD as frontend, with the resulting
bcache device advertising itself as a regular disk, hiding all the
zone & sequential write constraints from the user.

Since bcache already has some form of indirection table for cached
blocks, I thought we could hijack it to implement a sort of FTL that
would allow serializing random writes to the backend with the help of
the frontend as a write staging buffer. Doing so, we get full random
write capability with the benefit of "hot" blocks staying in the
cache. But again, not knowing enough details about bcache, I may be
talking too lightly here. Not sure if that is reasonably feasible
with the current bcache code.

Cheers.

> Cheers,
>
> Hannes

-- 
Damien Le Moal
Western Digital Research
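As a closing illustration of the third use case above: the FTL-style
serialization could, at its crudest, look like the user-space toy
model below. This is purely made up for this write-up, not bcache or
dm-zoned code; every name in it is invented, and zone selection and
garbage collection, which do the real work, are waved away in a
comment.

#include <inttypes.h>
#include <stdio.h>

#define NR_BLOCKS	1024u	/* logical capacity, in blocks */
#define ZONE_BLOCKS	256u	/* blocks per zone */

static uint64_t map[NR_BLOCKS];	/* logical block -> physical block */
static uint64_t wp;		/* global append point on the backend */

/* Serialize a random logical write: place it at the write pointer. */
static uint64_t ftl_write(uint64_t lba)
{
	uint64_t pba = wp++;	/* the media sees only sequential writes */

	if (wp % ZONE_BLOCKS == 0) {
		/*
		 * Zone filled: a real implementation would open the
		 * next empty zone here and garbage-collect the stale
		 * blocks left behind by overwrites.
		 */
	}
	map[lba] = pba;		/* any old mapping becomes garbage */
	return pba;
}

int main(void)
{
	/* Two "random" writes land in consecutive physical blocks. */
	printf("lba 500 -> pba %" PRIu64 "\n", ftl_write(500));
	printf("lba   7 -> pba %" PRIu64 "\n", ftl_write(7));
	return 0;
}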