On 03/04/2016 03:03 PM, Jeff Moyer wrote:
> Jens Axboe <axboe@xxxxxx> writes:
>> On 03/04/2016 02:01 PM, Jeff Moyer wrote:
>>> OK. I'm still of the opinion that we should try to make this
>>> transparent. I could be swayed by workload descriptions and numbers
>>> comparing approaches, though.
>> You can't just wave that flag and not have a solution. Any solution
>> in that space would imply having policy in the kernel. A "just use a
>> stream per file" policy is never going to work.
> Jens, I'm obviously missing a lot of the background information here.
> I want to stress that I'm not against your patches. I'm just trying to
> understand if there's a sensible way to use the write stream support in
> the kernel so that applications don't /have/ to be converted. It sounds
> like that's hard, and without any specs or hardware, I'm not going to be
> able to even try to come up with solutions to that problem.
It's not hard to update an application to do this. As an example, one
thing I tried was converting RocksDB to use streams. I used a naive
approach, where we simply mapped each compaction level to a specific
stream, and got about a 30% reduction in write amplification (WA) just
through that. The guys from Samsung have done that with RocksDB as well,
just a bit more involved, and got better results. The application change
was really no more involved than calling fadvise() on the fd after
opening it. That is it. I don't know why you think that is hard.
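To make the "one stream per compaction level" idea concrete, here is a
rough sketch of what such a mapping could look like. It is not the
RocksDB change itself: stream_for_level(), tag_fd(), NUM_DEVICE_STREAMS
and the set_stream callback are made-up names for illustration, stream 0
meaning "no stream" is an assumption, and the actual call applied to the
fd is left as a caller-supplied stand-in.

```python
# Illustrative sketch only. None of these names are RocksDB or kernel APIs.

NUM_DEVICE_STREAMS = 4  # assumed number of streams the device exposes


def stream_for_level(level: int, num_streams: int = NUM_DEVICE_STREAMS) -> int:
    """Map an LSM compaction level to a stream ID in 1..num_streams.

    Stream 0 is treated as "no stream" here (an assumption), so level 0
    maps to stream 1. Levels past the device limit share the last stream.
    """
    if level < 0:
        raise ValueError("level must be non-negative")
    if num_streams < 1:
        raise ValueError("num_streams must be at least 1")
    return min(level + 1, num_streams)


def tag_fd(fd: int, level: int, set_stream) -> int:
    """Tag a freshly opened fd with the stream for its compaction level.

    set_stream(fd, stream_id) is a hypothetical stand-in for whatever
    per-fd call the kernel interface provides.
    """
    stream_id = stream_for_level(level)
    set_stream(fd, stream_id)
    return stream_id
```

The point is that the application already knows the level of each file
it opens, so the whole policy fits in one small function and one call
per open.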
As to doing this automagically, you'll need knowledge that you do not
have. The kernel or file system has no idea if data written to file X
and file Y have similar lifetimes. You could start tracking that, of
course, but that would make you very unhappy. If I'm an application
storing files, I have a much better idea of what is related time-wise.
And you don't really need a spec to understand how this works; the spec
will just tell you the mechanics of how we pass this information to the
device, how we find out what the device can support, etc. The basic gist
of it is that we can write data with similar lifetimes to the right
place on media. For a flash disk, that would be the same erase block (EB).
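A toy model shows why placement by lifetime matters. This is only a
sketch under simplifying assumptions: blocks are plain lists of page
names, a block holding any dead page gets erased, and every still-live
page in it must be copied elsewhere first. relocation_cost() and the
page names are invented for the example; real devices are more involved.

```python
# Toy model, not a description of any real flash translation layer.

def relocation_cost(blocks, dead):
    """Count live pages that must be copied out before erasing every
    block that contains at least one dead page."""
    cost = 0
    for block in blocks:
        if any(page in dead for page in block):
            cost += sum(1 for page in block if page not in dead)
    return cost


hot = [f"hot{i}" for i in range(4)]    # short-lived data
cold = [f"cold{i}" for i in range(4)]  # long-lived data

# Without streams: hot and cold pages end up interleaved in each block.
mixed = [hot[:2] + cold[:2], hot[2:] + cold[2:]]
# With streams: each lifetime class gets its own block.
separated = [hot, cold]

dead = set(hot)  # all the short-lived data has been overwritten or deleted

print(relocation_cost(mixed, dead))      # 4 live pages copied
print(relocation_cost(separated, dead))  # 0 live pages copied
```

Those extra copies in the mixed case are the write amplification that
streams are meant to avoid.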
> I think it would make for interesting research, though. I recall a
> paper from one of the USENIX conferences that dealt with automatically
> identifying write streams on a network storage server, but alas, I
> can't find the reference right now.
Samsung released a paper on RocksDB and streams, iirc.
--
Jens Axboe