On Tue, Mar 8, 2016 at 1:56 PM, Jens Axboe <axboe@xxxxxx> wrote:
> On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
>>>>>>> "Jens" == Jens Axboe <axboe@xxxxxx> writes:
>>
>> Jens,
>>
>>>> OK. I'm still of the opinion that we should try to make this
>>>> transparent. I could be swayed by workload descriptions and numbers
>>>> comparing approaches, though.
>>
>> Jens> You can't just wave that flag and not have a solution. Any
>> Jens> solution in that space would imply having policy in the kernel. A
>> Jens> "just use a stream per file" is never going to work.
>>
>> I totally understand the desire to have explicit, long-lived
>> "from-file-open to file-close" streams for things like database journals
>> and whatnot.
>
> That is an appealing use case.
>
>> However, I think that you are dismissing the benefits of being able to
>> group I/Os to disjoint LBA ranges within a brief period of time as
>> belonging to a single file. It's something that we know works well on
>> other types of storage. And it's also a much better heuristic for data
>> placement on SSDs than just picking the next available bucket. It does
>> require some pipelining on the drive, but they will need some front-end
>> logic to handle the proposed stream ID separation in any case.
>
> I'm not a huge fan of heuristics based exclusively on temporal and
> spatial locality. Using that as a hint for the case where no stream ID (or
> write tag) is given would be an improvement, though. And perhaps part of
> the space should be reserved for just that.
>
> But I don't think that should exclude doing this in a much more managed
> fashion; personally, I find that a lot saner than adding this sort of state
> tracking in the kernel.
>
>> Also, in our experiments we essentially got the explicit stream ID for
>> free by virtue of the journal being written often enough that it was
>> rarely, if ever, evicted as an active stream by the device. With no
>> changes whatsoever to any application.
>
> The journal would be an easy one to guess, for sure.
>
>> My gripe with the current stuff is the same as before: the protocol is
>> squarely aimed at papering over issues with current flash technology. It
>> kinda-sorta works for other types of devices, but it is very limiting. I
>> appreciate that it is a great fit for the "handful of apps sharing a
>> COTS NVMe drive on a cloud server" use case. But I think it is horrible
>> for NVMe over Fabrics and pretty much everything else. That wouldn't be
>> a big deal if the traditional storage models were going away. But I
>> don't think they are...
>
> I don't think erase blocks are going to go away in the near future. We're
> going to have better media as well, that's a given, but cheaper TLC flash is
> just going to make the current problem much worse. The patchset is really
> about tagging the writes with a stream ID, nothing else. That could
> potentially be any type of hinting; it's not tied to NVMe write directives
> at all.

Maybe I'm misunderstanding, but why does a stream ID imply anything more
than "an opaque tag set at the top of the stack that makes it down to a
driver"? Sure, NVMe can interpret it as an NVMe stream, but any other driver
can apply its own transport-specific translation of what the hint means. I
think the minute the opaque number requires specific driver behavior, we'll
fall into a rat hole of how to translate intent across usages.
In other words, I think the hint will always carry application plus
transport/driver meaning, but otherwise the kernel is just a conduit.
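
To make the "opaque tag, kernel as conduit" idea concrete, here is a minimal
userspace sketch. It assumes the per-file write-lifetime hint interface that
later landed upstream (fcntl F_SET_RW_HINT with RWH_WRITE_LIFE_* values),
which may differ from the interface in the patchset under discussion, and the
file name "journal.db" is just a hypothetical stand-in for a database
journal.

/*
 * Sketch: attach a write hint to a file from userspace. The kernel carries
 * the value down as an opaque tag; an NVMe driver may map it to a stream or
 * write directive, other transports may translate it differently or ignore
 * it entirely.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Fall back to the <linux/fcntl.h> values if the libc headers lack them. */
#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2	/* short-lived data, e.g. journal records */
#endif

int main(void)
{
	/* Hypothetical journal file used only for illustration. */
	int fd = open("journal.db", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Tag subsequent writes on this inode with a lifetime hint. */
	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("fcntl(F_SET_RW_HINT)");

	const char buf[] = "journal record\n";
	if (write(fd, buf, strlen(buf)) < 0)
		perror("write");

	close(fd);
	return 0;
}

Note the application only states intent ("these writes are short-lived"); how
that intent is expressed on the wire is left entirely to the driver, which is
the division of labor argued for above.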