On 2019/12/6 3:42 PM, Damien Le Moal wrote:
> On 2019/12/06 16:09, Hannes Reinecke wrote:
>> On 12/6/19 5:37 AM, Coly Li wrote:
>>> On 2019/12/6 8:30 AM, Damien Le Moal wrote:
>>>> On 2019/12/06 9:22, Eric Wheeler wrote:
>>>>> On Thu, 5 Dec 2019, Coly Li wrote:
>>>>>> This is a very basic zoned device support. With this patch, bcache
>>>>>> device is able to,
>>>>>> - Export zoned device attribution via sysfs
>>>>>> - Response report zones request, e.g. by command 'blkzone report'
>>>>>> But the bcache device is still NOT able to,
>>>>>> - Response any zoned device management request or IOCTL command
>>>>>>
>>>>>> Here are the testings I have done,
>>>>>> - read /sys/block/bcache0/queue/zoned, content is 'host-managed'
>>>>>> - read /sys/block/bcache0/queue/nr_zones, content is number of zones
>>>>>>   including all zone types.
>>>>>> - read /sys/block/bcache0/queue/chunk_sectors, content is zone size
>>>>>>   in sectors.
>>>>>> - run 'blkzone report /dev/bcache0', all zones information displayed.
>>>>>> - run 'blkzone reset /dev/bcache0', operation is rejected with error
>>>>>>   information: "blkzone: /dev/bcache0: BLKRESETZONE ioctl failed:
>>>>>>   Operation not supported"
>>>>>> - Sequential writes by dd, I can see some zones' write pointer 'wptr'
>>>>>>   values updated.
>>>>>>
>>>>>> All of these are very basic testings, if you have better testing
>>>>>> tools or cases, please offer me hint.
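To make the report zones test above easier to reproduce without
blkzone(8): it boils down to the BLKREPORTZONE ioctl. Below is a minimal
user space sketch, illustrative only (error handling trimmed, and the
16-zones-per-call choice is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/blkzoned.h>

#define NR_ZONES 16     /* arbitrary: zone descriptors fetched per call */

int main(int argc, char **argv)
{
        const char *dev = argc > 1 ? argv[1] : "/dev/bcache0";
        struct blk_zone_report *rep;
        unsigned int i;
        int fd;

        fd = open(dev, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        rep = calloc(1, sizeof(*rep) + NR_ZONES * sizeof(struct blk_zone));
        if (!rep)
                return 1;

        rep->sector = 0;                /* start reporting from sector 0 */
        rep->nr_zones = NR_ZONES;       /* room for this many descriptors */

        if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
                perror("BLKREPORTZONE");
                return 1;
        }

        /* the kernel updates nr_zones to the number actually reported */
        for (i = 0; i < rep->nr_zones; i++)
                printf("zone %u: start %llu len %llu wp %llu type %u cond %u\n",
                       i,
                       (unsigned long long)rep->zones[i].start,
                       (unsigned long long)rep->zones[i].len,
                       (unsigned long long)rep->zones[i].wp,
                       (unsigned int)rep->zones[i].type,
                       (unsigned int)rep->zones[i].cond);

        free(rep);
        close(fd);
        return 0;
}

The printed fields should line up with what blkzone(8) reports, so this
can also be used to double-check the wptr updates after the dd runs.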
>>>>>
>>>>> Interesting.
>>>>>
>>>>> 1. should_writeback() could benefit by hinting true when an IO would fall
>>>>> in a zoned region.
>>>>>
>>>>> 2. The writeback thread could writeback such that they prefer
>>>>> fully(mostly)-populated zones when choosing what to write out.
>>>>
>>>> That definitely would be a good idea since that would certainly benefit
>>>> backend-GC (that will be needed).
>>>>
>>>> However, I do not see the point in exposing the /dev/bcacheX block
>>>> device itself as a zoned disk. In fact, I think we want exactly the
>>>> opposite: expose it as a regular disk so that any FS or application can
>>>> run. If the bcache backend disk is zoned, then the writeback handles
>>>> sequential writes. This would be in the end a solution similar to
>>>> dm-zoned, that is, a zoned disk becomes useable as a regular block
>>>> device (random writes anywhere are possible), but likely far more
>>>> efficient and faster. That may result in imposing some limitations on
>>>> bcache operations though, e.g. it can only be setup with writeback, no
>>>> writethrough allowed (not sure though...).
>>>> Thoughts ?
>>>>
>>> I come to realize this is really an idea on the opposite. Let me try to
>>> explain what I understand, please correct me if I am wrong. The idea you
>>> proposed indeed is to make bcache act as something like FTL for the
>>> backend zoned SMR drive, that is, for all random writes, bcache may
>>> convert them into sequential write onto the backend zoned SMR drive. In
>>> the meantime, if there are hot data, bcache continues to act as a
>>> caching device to accelerate read request.
>>>
>>> Yes, if I understand your proposal correctly, writeback mode might be
>>> mandatory and backend-GC will be needed. The idea is interesting, it
>>> looks like adding a log-structure storage layer between current bcache
>>> B+tree indexing and zoned SMR hard drive.
>>>
>> Well, not sure if that's required.
>>
>> Or, to be correct, we actually have _two_ use-cases:
>> 1) Have a SMR drive as a backing device. This was my primary goal for
>> handling these devices, as SMR device are typically not _that_ fast.
>> (Damien once proudly reported getting the incredible speed of 1 IOPS :-)
>
> Yes, it can get to that with dm-zoned if one goes crazy with sustained
> random writes :) The physical drive itself does a lot more than 1 iops
> in that case though and is as fast as any other HDD. But from the DM
> logical drive side, the user can sometimes fall into the 1 iops
> territory for really nasty workloads. Tests for well behaved users like
> f2fs show that SMR and regular HDDs are on par for performance.
>
>> So having bcache running on top of those will be a clear win.
>> But in this scenario the cache device will be a normal device (typically
>> an SSD), and we shouldn't need much modification here.
>
> I agree. That should work mostly as is since the user will be zone aware
> and already be issuing sequential writes. bcache write-through only
> needs to follow the same pattern, not reordering any write, and
> write-back only has to replay the same.
>
>> In fact, a good testcase would be the btrfs patches which got posted
>> earlier this week. With them you should be able to create a btrfs
>> filesystem on the SMR drive, and use an SSD as a cache device.
>> Getting this scenario to run would indeed be my primary goal, and I
>> guess your patches should be more or less sufficient for that.
>
> + Will need the zone revalidation and zone type & write lock bitmaps to
> prevent reordering from the block IO stack, unless bcache is a BIO
> driver ? My knowledge of bcache is limited. Would need to look into the
> details a little more to be able to comment.

Hi Damien,

As far as I can tell, bcache is a bio based driver: it splits and clones
bios, and submits them to the underlying block layer with
generic_make_request(). So are zone revalidation and the zone type &
write lock bitmaps unnecessary for bcache?
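To make sure we mean the same thing by a BIO driver, the pass-through
pattern I have in mind is roughly the following. This is a heavily
simplified sketch, not the real bcache code; the example_* names are
made up and completion/error handling is reduced to the minimum:

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/gfp.h>

/* Completion for the cloned bio: propagate status to the original. */
static void example_clone_endio(struct bio *clone)
{
        struct bio *orig = clone->bi_private;

        orig->bi_status = clone->bi_status;
        bio_put(clone);
        bio_endio(orig);
}

/*
 * Clone the incoming bio, point it at the backing device and hand it
 * straight back to the block layer.  No request is built and no
 * elevator is involved inside the stacking driver itself; the clone is
 * submitted in the order the upper layer issued the original bio.
 * (Allocation failure handling is elided for brevity.)
 */
static void example_forward_to_backing(struct bio *bio,
                                       struct block_device *backing_bdev,
                                       struct bio_set *bs)
{
        struct bio *clone = bio_clone_fast(bio, GFP_NOIO, bs);

        bio_set_dev(clone, backing_bdev);
        clone->bi_private = bio;
        clone->bi_end_io = example_clone_endio;
        generic_make_request(clone);
}

If that matches your understanding of the BIO-based path, then I assume
the request-based zone write locking happens below us in the backing
device's queue, but please correct me if I miss something.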
>
>> 2) Using a SMR drive as a _cache_ device. This seems to be contrary to
>> the above statement of SMR drive not being fast, but then the NVMe WG is
>> working on a similar mechanism for flash devices called 'ZNS' (zoned
>> namespaces). And for those it really would make sense to have bcache
>> being able to handle zoned devices as a cache device.
>> But this is to my understanding really in the early stages, with no real
>> hardware being available. Damien might disagree, though :-)
>
> Yes, that would be another potential use case and ZNS indeed could fit
> this model, assuming that zone sizes align (multiples) between front and
> back devices.
>
>> And the implementation is still on the works on the linux side, so it's
>> more of a long-term goal.
>
>> But the first use-case is definitely something we should be looking at;
>> SMR drives are available _and_ with large capacity, so any speedup there
>> would be greatly appreciated.
>
> Yes. And what I was talking about in my earlier email is actually a
> third use case:
> 3) SMR drive as backend + regular SSD as frontend and the resulting
> bcache device advertising itself as a regular disk, hiding all the zone
> & sequential write constraint to the user. Since bcache already has some
> form of indirection table for cached blocks, I thought we could hijack
> this to implement a sort of FTL that would allow serializing random
> writes to the backend with the help of the frontend as a write staging
> buffer. Doing so, we get full random write capability with the benefit
> of "hot" blocks staying in the cache. But again, not knowing enough
> details about bcache, I may be talking too lightly here. Not sure if
> that is reasonably easily feasible with the current bcache code.

There are three addresses involved in the above proposal,
1) User space LBA address: the LBA of the block device combined from
   bcache + SMR drive (i.e. the LBA seen on /dev/bcacheX).
2) Cache device LBA address: where the randomly written cached data
   blocks are stored on the SSD.
3) SMR drive LBA address: where the sequentially written data blocks
   are stored on the zoned SMR drive.
Therefore we need at least two layers of mapping to connect these three
addresses together; the single mapping from the bcache B+tree is not
enough today. A very rough sketch of the two mapping layers is appended
at the end of this mail.

Maybe stacking the bcache backing device on top of a dm-zoned target is
a solution for proposal 3), let me try whether it works.

--
Coly Li
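The rough sketch mentioned above. Illustrative only: none of these
structures exist in bcache today, and all names are made up just to show
the two lookups.

#include <linux/types.h>

/*
 * Lookup 1: what the bcache B+tree already provides today,
 * user space LBA -> cache device (SSD) LBA.
 */
struct user_to_cache_map {
        sector_t        user_sector;    /* LBA seen on /dev/bcacheX     */
        sector_t        cache_sector;   /* LBA on the SSD cache device  */
};

/*
 * Lookup 2: the extra FTL-like layer that proposal 3) would need,
 * user space LBA -> SMR drive LBA.  New data would always be placed at
 * the write pointer of some open zone, so that random user writes
 * become sequential writes on the backing device.
 */
struct user_to_backend_map {
        sector_t        user_sector;    /* LBA seen on /dev/bcacheX     */
        unsigned int    zone_no;        /* zone on the SMR drive        */
        sector_t        backend_sector; /* sector inside that zone      */
};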