On 01/25/2018 04:26 PM, Javier Gonzalez wrote:
Hi,
There are some topics that I would like to discuss at LSF/MM:
- Over the past year we have discussed at length how we can integrate the
Open-Channel SSD (OCSSD) spec with zone devices (SMR). This
discussion is both at the interface level and at an in-kernel level.
Now that Damien's and Hannes' patches are upstreamed in good shape,
it would be a good moment to discuss how we can integrate the
LightNVM subsystem with the existing code.
The ZBC-OCSSD patches
(https://github.com/OpenChannelSSD/linux/tree/zbc-support) that I made
last year are a good starting point.
Specifically, at ALPSS'17
we had discussions on how we can extend the kernel zoned device
interface with the notion of parallel units that the OCSSD geometry
builds upon. We are now bringing the OCSSD spec to standardization,
but we have time to incorporate feedback and changes into the spec.
Which spec? The OCSSD 2 spec that I hold the copyright on? I don't
believe it has been submitted to, or is under consideration by, any
standards body yet, and I don't currently plan to do that.
You might have meant "to be finalized". As you know, I am currently
soliciting feedback and change requests from vendors and partners with
respect to the specification and am planning on closing it soon. If CNEX
is doing its own new specification, please be open about it, and don't
put it under the OCSSD name.
Some of the challenges are (i) adding a vector I/O interface to the
bio structure and (ii) extending the report zone to have the notion
of parallelism. I have patches implementing the OCSSD 2.0 spec that
abstract the geometry and allow upper layers to deal with write
restrictions and the parallelism of the device, but this is still
very much OCSSD-specific.
For the vector part, one can look into Ming's work on multi-page bvec
(https://lkml.org/lkml/2017/12/18/496). When that code is in, it should
be possible to implement the rest. One nagging feeling I have is that
the block core code needs to be updated to understand vectors. That
will be complex, given that I/O checks are all range-based and cheap,
while for vectors they are significantly more expensive because each
LBA must be checked individually (one reason it is a separate
subsystem). It might not be worth it until the vector API has broader
market adoption, for example by being supported natively in the NVMe
specification.
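
To make the cost difference concrete, here is a rough user-space style
sketch (all names are made up for illustration; this is not actual
block layer code): a range-based request is bounds-checked with a
single comparison, while a vectored request has to walk every LBA.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up stand-in for a vectored request carrying arbitrary LBAs. */
struct vec_rq {
	uint64_t *lba_list;
	size_t nr_lbas;
};

/* Scalar path: one comparison covers the whole request. */
static bool range_in_bounds(uint64_t start, size_t len, uint64_t capacity)
{
	return start + len <= capacity;
}

/* Vector path: every LBA has to be checked on its own. */
static bool vec_in_bounds(const struct vec_rq *rq, uint64_t capacity)
{
	for (size_t i = 0; i < rq->nr_lbas; i++)
		if (rq->lba_list[i] >= capacity)
			return false;
	return true;
}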
For extending report zones, one can use (start LBA:end LBA) pairs
(similar to the device mapper interface), and then have a list of those
to describe the start and end of each parallel unit.
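
As a rough sketch of what such a report could carry (the struct names
below are invented for illustration; this is not an existing kernel
interface), it is essentially a list of (start LBA, end LBA) pairs,
one per parallel unit:

#include <stdint.h>

/* Invented example layout; one entry per parallel unit. */
struct pu_range {
	uint64_t start_lba;	/* first LBA of the parallel unit */
	uint64_t end_lba;	/* last LBA of the parallel unit */
};

struct pu_report {
	uint32_t nr_parallel_units;
	struct pu_range ranges[];	/* nr_parallel_units entries */
};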
- I have started to use the above to do an f2fs implementation, where
we would implement the data placement and I/O scheduling directly in
the FS, as opposed to using pblk - at least for the journaled part.
The random I/O partition necessary for metadata can either reside on
a different drive or use a pblk instance. This is very much
work in progress, so feedback from the f2fs guys (or other
journaled file systems) would help to start the work in the right
direction. Maybe this is interesting for other file systems too...
We got a lot of feedback from Jaegeuk. Based on his feedback, I did the
ZBC work with f2fs, which used a single parallel unit. To improve on
that, one solution is to extend dm-stripe to understand zones (it can
already be configured correctly... but it should expose zone entries as
well) and then use that for doing stripes across parallel units with
f2fs. This would fit into the standard codebase and wouldn't add a whole
lot of OCSSD-only bits.
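
For reference, a plain dm-stripe setup across four parallel units
exposed as separate block devices could look roughly like the following
(device names and sizes are made up; the chunk size, in 512-byte
sectors, would have to match the zone size so that zones are not split
across stripes):

dmsetup create pu-stripe --table \
  "0 33554432 striped 4 524288 \
   /dev/nvme0n1p1 0 /dev/nvme0n1p2 0 /dev/nvme0n1p3 0 /dev/nvme0n1p4 0"

The missing piece is that the resulting mapped device would also have
to expose the zone entries of the underlying parallel units so that
f2fs can see them.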
- Finally, now that pblk is becoming stable, and given the advent of
devices imposing sequential-only I/O, would it make sense to
generalize pblk as a device mapper translation layer that can be
used for random I/O partitions?
dm-zoned fills this niche. As above, combine it with zone-aware
dm-stripe and it is a pretty good solution. However, given that pblk
does a lot more than making I/Os sequential, I can see why it would be
nice to have as a device mapper. It could be the dual solution that we
previously discussed, where pblk can use either the traditional scalar
interface or the vector interface, depending on whether the drive
exposes a separate vector interface.
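
Roughly, that dual path would look something like this (user-space
style sketch, all names invented for illustration):

#include <stdbool.h>

struct io_req { int dummy; };	/* stand-in for a request/bio */

struct tl_dev {
	bool has_vector_if;	/* device exposes a separate vector interface */
};

/* Would build a vectored command (e.g. an LBA list) for the device. */
static int submit_vector(struct tl_dev *dev, struct io_req *req)
{
	return 0;
}

/* Would issue regular, range-based (sequential) I/O. */
static int submit_scalar(struct tl_dev *dev, struct io_req *req)
{
	return 0;
}

static int tl_submit(struct tl_dev *dev, struct io_req *req)
{
	return dev->has_vector_if ? submit_vector(dev, req)
				  : submit_scalar(dev, req);
}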
We have had internal use cases for
using such a translation layer for frontswap devices. Maybe others are
looking at this too...
Javier