Re: [PATCH 0/11] Update version of write stream ID patchset

Jens Axboe <axboe@xxxxxx> · Tue, 8 Mar 2016 14:56:31 -0700

On 03/05/2016 01:48 PM, Martin K. Petersen wrote:
"Jens" == Jens Axboe <axboe@xxxxxx> writes:

Jens,

OK.  I'm still of the opinion that we should try to make this
transparent.  I could be swayed by workload descriptions and numbers
comparing approaches, though.

Jens> You can't just wave that flag and not have a solution. Any
Jens> solution in that space would imply having policy in the kernel. A
Jens> "just use a stream per file" is never going to work.

I totally understand the desire to have explicit, long-lived
"from-file-open to file-close" streams for things like database journals
and whatnot.

That is an appealing use case.

However, I think that you are dismissing the benefits of being able to
group I/Os to disjoint LBA ranges within a brief period of time as
belonging to a single file. It's something that we know works well on
other types of storage. And it's also a much better heuristic for data
placement on SSDs than just picking the next available bucket. It does
require some pipelining on the drive but they will need some front end
logic to handle the proposed stream ID separation in any case.

I'm not a huge fan of heuristics based exclusively around temporal
and spatial locality. Using that as a hint for a case where no stream ID
(or write tag) is given would be an improvement, though. And perhaps
parts of the space should be reserved to just that.
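
To make the fallback idea above concrete, here is a rough sketch and
nothing more. It assumes a made-up ID layout (0 = no stream, 1..7 given
explicitly by the application, 8..15 reserved for a locality heuristic)
and made-up window sizes. pick_stream(), struct hint_slot and all the
constants are hypothetical names for illustration; none of them come
from the patchset or from any device specification.

```c
#include <stdint.h>

/* Hypothetical ID layout, for illustration only. */
#define STREAM_NONE          0u     /* caller gave no stream ID */
#define STREAM_EXPLICIT_MAX  7u     /* 1..7: explicit, application-managed */
#define STREAM_HINT_BASE     8u     /* 8..15: reserved for the heuristic */
#define STREAM_HINT_COUNT    8u
#define HINT_WINDOW_SECTORS  2048u  /* assumed spatial window */
#define HINT_WINDOW_TICKS    100u   /* assumed temporal window */

struct hint_slot {
	uint64_t next_lba;   /* sector just past the last write seen */
	uint64_t last_tick;  /* time of the last write seen */
	int in_use;
};

static struct hint_slot slots[STREAM_HINT_COUNT];
static unsigned int next_victim;

/*
 * Return the stream ID to tag a write with. An explicit ID in the
 * managed range always wins. Anything else (no ID, or an ID outside
 * 1..7) falls back to the heuristic: reuse a reserved slot that saw a
 * nearby write recently, otherwise recycle a slot round-robin.
 * Assumes 'now' never goes backwards.
 */
static unsigned int pick_stream(unsigned int explicit_id, uint64_t lba,
				uint32_t nr_sectors, uint64_t now)
{
	unsigned int i;

	if (explicit_id != STREAM_NONE && explicit_id <= STREAM_EXPLICIT_MAX)
		return explicit_id;

	for (i = 0; i < STREAM_HINT_COUNT; i++) {
		struct hint_slot *s = &slots[i];
		uint64_t dist;

		/* Skip unused slots and slots that have gone stale. */
		if (!s->in_use || now - s->last_tick > HINT_WINDOW_TICKS)
			continue;
		dist = lba > s->next_lba ? lba - s->next_lba
					 : s->next_lba - lba;
		if (dist <= HINT_WINDOW_SECTORS)
			goto found;
	}

	/* No recent neighbour: take over the next slot in turn. */
	i = next_victim;
	next_victim = (next_victim + 1) % STREAM_HINT_COUNT;
found:
	slots[i].next_lba = lba + nr_sectors;
	slots[i].last_tick = now;
	slots[i].in_use = 1;
	return STREAM_HINT_BASE + i;
}
```

With this layout, a write carrying explicit ID 3 is tagged 3 no matter
where it lands, while two untagged back-to-back sequential writes end
up in the same reserved slot. The point is only that the explicit and
guessed IDs never collide, so the heuristic cannot evict a managed
stream.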

But I don't think that should exclude doing this in a much more managed
fashion. Personally, I find that a lot saner than adding this sort of
state tracking in the kernel.

Also, in our experiments we essentially got the explicit stream ID for
free by virtue of the journal being written often enough that it was
rarely if ever evicted as an active stream by the device. With no
changes whatsoever to any application.

Journal would be an easy one to guess, for sure.

My gripe with the current stuff is the same as before: The protocol is
squarely aimed at papering over issues with current flash technology. It
kinda-sorta works for other types of devices but it is very limiting. I
appreciate that it is a great fit for the "handful of apps sharing a
COTS NVMe drive on a cloud server" use case. But I think it is horrible
for NVMe over Fabrics and pretty much everything else. That wouldn't be
a big deal if the traditional storage models were going away. But I
don't think they are...

I don't think erase blocks are going to go away in the near future. 
We're going to have better media as well, that's a given, but cheaper 
TLC flash is just going to make the current problem much worse. The 
patchset is really about tagging the writes with a stream ID, nothing
else. That could potentially be any type of hinting; it's not exclusive
to being used with NVMe write directives at all.

--
Jens Axboe
