On 02/29/2016 10:02 AM, Damien Le Moal wrote:
>
>> On Wed 24-02-16 01:53:24, Damien Le Moal wrote:
>>>
>>>> On Tue 23-02-16 05:31:13, Damien Le Moal wrote:
>>>>>
>>>>>> On 02/22/16 18:56, Damien Le Moal wrote:
>>>>>>> 2) Write back of dirty pages to SMR block devices:
>>>>>>>
>>>>>>> Dirty pages of a block device inode are currently processed using the generic_writepages function, which can be executed simultaneously by multiple contexts (e.g. sync, fsync, msync, sync_file_range, etc.). Since mutual exclusion of the dirty page processing is achieved only at the page level (page lock & page writeback flag), multiple processes executing a "sync" of overlapping block ranges over the same zone of an SMR disk can cause an out-of-LBA-order sequence of write requests to be sent to the underlying device. On a host-managed SMR disk, where sequential writes to disk zones are mandatory, this results in errors and makes it impossible for an application using raw sequential disk write accesses to be guaranteed successful completion of its write or fsync requests.
>>>>>>>
>>>>>>> Using the zone information attached to the SMR block device queue (introduced by Hannes), calls to the generic_writepages function can be made mutually exclusive on a per-zone basis by locking the zones. This guarantees sequential request generation for each zone and avoids write errors without any modification to the generic code implementing generic_writepages.
>>>>>>>
>>>>>>> This is but one possible solution for supporting host-managed SMR devices without any major rewrite of page cache management and write-back processing. The opinion of the audience regarding this solution, and discussion of other potential solutions, would be greatly appreciated.
>>>>>>
>>>>>> Hello Damien,
>>>>>>
>>>>>> Is it sufficient to support filesystems like BTRFS on top of SMR drives, or would you also like to see filesystems like ext4 be able to use SMR drives? In the latter case: the behavior of SMR drives differs so significantly from that of other block devices that I'm not sure we should try to support them directly from infrastructure like the page cache. If we look e.g. at NAND SSDs, we see that the characteristics of NAND do not match what filesystems expect (e.g. large erase blocks). That is why every SSD vendor provides an FTL (Flash Translation Layer), either inside the SSD or as a separate software driver. An FTL implements a so-called LFS (log-structured filesystem). From what I know about SMR, this technology also looks suitable for the implementation of an LFS. Has it already been considered to implement an LFS driver for SMR drives? That would make it possible for any filesystem to access an SMR drive as any other block device. I'm not sure of this, but maybe it will be possible to share some infrastructure with the LightNVM driver (directory drivers/lightnvm in the Linux kernel tree), which itself implements an FTL.
>>>>>
>>>>> I totally agree with you that trying to support SMR disks by only modifying the page cache, so that unmodified standard file systems like BTRFS or ext4 remain operational, is not realistic at best, and more likely simply impossible.
>>>>> For this kind of use case, as you said, an FTL or a device mapper driver is much more suitable.
>>>>>
>>>>> The case I am considering for this discussion is raw block device accesses by an application (writes from user space to /dev/sdxx). This is a very likely use case scenario for high-capacity SMR disks with applications like distributed object stores / key-value stores.
>>>>>
>>>>> In this case, write-back of dirty pages in the block device file inode mapping is handled in fs/block_dev.c using the generic helper function generic_writepages. This does not guarantee the generation of the sequential write pattern per zone required by host-managed disks. As I explained, aligning calls of this function to zone boundaries while locking the zones under write-back simply solves the problem (implemented and tested). This is of course only one possible solution. Pushing modifications deeper into the code, or providing a "generic_sequential_writepages" helper function, are other potential solutions that in my opinion are worth discussing, as other types of devices may also benefit in terms of performance (e.g. regular disk drives prefer sequential writes, and SSDs as well) and/or a lighter overhead on an underlying FTL or device mapper driver.
>>>>>
>>>>> For a file system, an SMR-compliant implementation of a file inode mapping writepages method should be provided by the file system itself, as the sequentiality of the write pattern further depends on the block allocation mechanism of the file system.
>>>>>
>>>>> Note that the goal here is not to hide the sequential write constraint of SMR disks from applications. The page cache itself (the mapping of the block device inode) remains unchanged. But the proposed modification guarantees that a well-behaved application writing sequentially to zones through the page cache will see successful sync operations.
>>>>
>>>> So the easiest solution for the OS, when the application is already aware of the storage constraints, would be for the application to use direct IO. Because when using the page-cache and writeback there are all sorts of unexpected things that can happen (e.g. writeback decides to skip a page because someone else locked it temporarily). So it will work in 99.9% of cases but sometimes things will be out of order for hard-to-track-down reasons. And for ordinary drives this is not an issue because we just slow down writeback a bit, and its rareness makes it a non-issue. But for host-managed SMR the IO fails, and that is something the application does not expect.
>>>>
>>>> So I would really say just avoid using the page-cache when you are using SMR drives directly without a translation layer. For writes your throughput won't suffer anyway since you have to do big sequential writes. Using the page-cache for reads may still be beneficial, and if you are careful enough not to do direct IO writes to the same range as you do buffered reads, this will work fine.
>>>>
>>>> Thinking some more - if you want to make it foolproof, you could implement something like a read-only page cache for block devices. Any write will in fact be a direct IO write, writeable mmaps will be disallowed, and reads will honor the O_DIRECT flag.
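For illustration, the application-side change Jan is suggesting could look roughly like the sketch below: writes bypass the page cache and are issued as aligned, strictly sequential direct I/O within a zone. The device path, zone start offset, I/O size and logical block size are made-up values for the example; a real application would discover them from the device (e.g. via the zone reporting interface proposed in Hannes' patch set).

```c
#define _GNU_SOURCE         /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK_SIZE   4096UL            /* assumed logical block size */
#define IO_SIZE    (1UL << 20)       /* 1 MiB per write */
#define ZONE_START (256ULL << 20)    /* hypothetical zone start offset (bytes) */

int main(void)
{
	void *buf;
	off_t off = ZONE_START;
	int fd, i;

	/* Hypothetical host-managed SMR device node */
	fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT requires buffers aligned to the logical block size */
	if (posix_memalign(&buf, BLK_SIZE, IO_SIZE)) {
		fprintf(stderr, "posix_memalign failed\n");
		close(fd);
		return 1;
	}
	memset(buf, 0, IO_SIZE);

	/*
	 * Strictly sequential, synchronous writes within the zone: each
	 * pwrite() starts exactly where the previous one ended, so the
	 * requests reach the drive in write-pointer order.
	 */
	for (i = 0; i < 8; i++) {
		if (pwrite(fd, buf, IO_SIZE, off) != (ssize_t)IO_SIZE) {
			perror("pwrite");
			break;
		}
		off += IO_SIZE;
	}

	free(buf);
	close(fd);
	return 0;
}
```

As Jan notes, buffered reads of the same device can still go through the page cache, provided the application never mixes buffered reads and direct writes over the same range.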
>>>
>>> Hi Jan,
>>>
>>> Indeed, using O_DIRECT for raw block device writes is an obvious solution to guarantee the application successful sequential writes within a zone. However, host-managed SMR disks (and to a lesser extent host-aware drives too) already put on applications the constraint of ensuring sequential writes. Adding to this a further mandatory rewrite to support direct I/Os is in my opinion asking a lot, if not too much.
>>
>> So I don't think adding O_DIRECT to the open flags is such a burden - sequential writes are IMO much harder to do :). And furthermore this could happen magically inside the kernel, in which case the app needn't be aware of it at all (similarly to how we handle writes to persistent memory).
>>
>>> The example you mention above of writeback skipping a locked page and resulting in I/O errors is precisely what the proposed patch avoids by first locking the zone the page belongs to. In the same spirit as the writeback page locking, if the zone is already locked, it is skipped. That is, zones are treated in a sense as gigantic pages, ensuring that the actual dirty pages within each one are processed in one go, sequentially.
>>
>> But you cannot rule out the mm subsystem locking a page to do something (e.g. migrating the page to help with compaction of large order pages). These other places accessing and locking pages are what I'm worried about. Furthermore kswapd can decide to write back a particular page under memory pressure, and that will just make the SMR disk freak out.
>>
>>> This allows preserving all possible application-level accesses (buffered, direct or mmapped). The only constraint is the one the disk imposes: writes must be sequential.
>>>
>>> Granted, this view may be too simplistic and may be overlooking some hard-to-track page locking paths which will compete with this. But I think that this can be easily solved by forcing the zone-aligned generic_writepages calls to not skip any page (a flag in struct writeback_control would do the trick). And no modification is necessary on the read side (i.e. page locking only is enough) since reading an SMR disk's blocks beyond a zone write-pointer position does not make sense (in Hannes' code, this is possible, but the request does not go to the disk and returns garbage data).
>>>
>>> Bottom line: no fundamental change to the page caching mechanism, only to how it is used/controlled for writeback, makes this work. Considering the benefits on the application side, it is in my opinion a valid modification to have.
>>
>> See above, there are quite a few places which will break your assumptions. And I don't think changing them all to handle SMR is worth it. IMO caching sequential writes to SMR disks has a low effect (if any) anyway, so I would just avoid that. We can talk about how to make this as seamless to applications as possible. The only thing which I don't think is reasonably doable without dirtying the pagecache is writeable mmaps of an SMR device, so applications would have to avoid those.
>
> Jan,
>
> Thank you for your insight.
> These "few places" breaking sequential write sequences are indeed problematic for SMR drives. At the same time, I wonder how these paths would react to an I/O error generated by the "write at write pointer" check in the request submission path at the SCSI level. Could these be ignored in the case of an "unaligned write error"?
> That is, the page is left dirty and hopefully the regular writeback path catches it later in the proper sequence. This may however be dangerous, as there is no way to determine whether the unaligned error is due to kswapd or other kernel threads trying to write back the "wrong" page, or to the application having submitted an out-of-sequence write.
>
> Until now, the discussion has focused on avoiding unaligned write errors for cached writes. But this happens only on host-managed SMR disks. Another aspect of SMR support should also be to avoid random writes to zones on host-aware disks. These will not return an error on unaligned writes and will silently process them as a regular disk would. However, this can over time degrade performance, as the disk FW has to handle more and more internal zone defragmentation.
>
To chime in here, we _might_ be able to fix this via a totally different route.

If we were allowed to pass _linked_ bios to ->make_request_fn (i.e. bios where the ->bi_next field was already populated) we would have an easy marker for merging those requests. At the same time we would be able to process these linked bios as a single unit, allowing other bios to be added only to the front or the back of the linked bios.

That would guarantee in-order delivery for SMR, and at the same time allow us to get merging running for block-mq.

Alternatively one could try to use plugging here, but I'm not sure if that would be sufficient; will need to test.

> If possible, I look forward to more discussions about this at LSF/MM.
>
Same here.

Btw, I do like the idea of Online logical head depop. No idea how we could implement that, but the idea is nice.

Cheers,

Hannes
--
Dr. Hannes Reinecke                   zSeries & Storage
hare@xxxxxxx                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
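For readers following the writeback side of the thread, a rough sketch of the per-zone serialization Damien describes might look like the following. This is an illustration of the idea, not the actual patch: bdev_zone_len() and bdev_zone_lock() are hypothetical placeholders for the zone information Hannes' patches attach to the request queue, a power-of-two zone size is assumed, and nr_to_write accounting and cyclic writeback are glossed over.

```c
/*
 * Sketch only: split a writeback request at zone boundaries and take a
 * per-zone lock around each generic_writepages() call, so that two
 * contexts syncing overlapping ranges cannot interleave writes within
 * one zone. bdev_zone_len() and bdev_zone_lock() are hypothetical
 * helpers standing in for the zone data added to the request queue.
 */
static int blkdev_writepages_zoned(struct address_space *mapping,
				   struct writeback_control *wbc)
{
	struct block_device *bdev = I_BDEV(mapping->host);
	loff_t zone_len = bdev_zone_len(bdev);		/* hypothetical */
	loff_t start = wbc->range_start & ~(zone_len - 1);
	int ret = 0;

	while (start <= wbc->range_end && !ret) {
		struct writeback_control zwbc = *wbc;
		struct mutex *zlock = bdev_zone_lock(bdev, start); /* hypothetical */

		/* Restrict this writeback pass to a single zone */
		zwbc.range_start = max_t(loff_t, start, wbc->range_start);
		zwbc.range_end = min_t(loff_t, wbc->range_end,
				       start + zone_len - 1);

		/* Zones already under writeback by another context are skipped */
		if (mutex_trylock(zlock)) {
			ret = generic_writepages(mapping, &zwbc);
			mutex_unlock(zlock);
		}

		start += zone_len;
	}
	return ret;
}
```

As Jan points out above, this path alone does not cover pages written back by kswapd or touched by page migration, so it addresses the concurrent-sync case rather than answering those objections.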