On 04/16/2015 07:38 PM, Sage Weil wrote:
On Thu, 16 Apr 2015, Mark Nelson wrote:
On 04/16/2015 01:17 AM, Somnath Roy wrote:
Here is the data with omap separated to another SSD and after 1000GB of fio
writes (same profile)..
omap writes:
-------------
Total host writes in this period = 551020111 ------ ~2101 GB
Total flash writes in this period = 1150679336
data writes:
-----------
Total host writes in this period = 302550388 --- ~1154 GB
Total flash writes in this period = 600238328
So the actual data write WA is ~1.1, but the omap overhead is ~2.1; adding
those gives ~3.2 WA overall.
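
For anyone who wants to redo the arithmetic, here is a rough sketch of how those ratios fall out of the counters. It assumes the drive counters are in 4 KiB units and that "GB" means GiB; neither is stated above, but both are consistent with the ~2101 GB and ~1154 GB figures. The helper names are made up for illustration.

```python
# Assumption: counters are in 4 KiB units and "GB" means GiB. This is
# inferred from the figures quoted in the mail, not stated in it.
UNIT_BYTES = 4096
GIB = 1 << 30
CLIENT_GIB = 1000  # fio writes issued during the measurement period


def to_gib(units):
    """Convert a raw drive counter to GiB."""
    return units * UNIT_BYTES / GIB


def write_amp(host_units, client_gib=CLIENT_GIB):
    """Host writes seen by the device divided by client (fio) writes."""
    return to_gib(host_units) / client_gib


omap_host, omap_flash = 551020111, 1150679336
data_host, data_flash = 302550388, 600238328

omap_wa = write_amp(omap_host)
data_wa = write_amp(data_host)
total_wa = omap_wa + data_wa

# Flash writes over host writes: amplification inside each SSD,
# which comes on top of the host-level numbers above.
omap_ssd_wa = omap_flash / omap_host
data_ssd_wa = data_flash / data_host

if __name__ == "__main__":
    print(f"omap WA  {omap_wa:.2f}  (ssd-internal {omap_ssd_wa:.2f})")
    print(f"data WA  {data_wa:.2f}  (ssd-internal {data_ssd_wa:.2f})")
    print(f"total WA {total_wa:.2f}")
```

Under those assumptions the total lands a little above 3.2, and the flash/host ratio is close to 2 on both devices.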
This all suggests that getting rocksdb to not rewrite the wal
entries at all will be the big win. I think Xiaoxi had tunable
suggestions for that? I didn't grok the rocksdb terms immediately so
they didn't make a lot of sense at the time.. this is probably a good
place to focus, though. The rocksdb compaction stats should help out
there.
But... today I ignored this entirely and put rocksdb in tmpfs and focused
just on the actual wal IOs done to the fragments files after the fact.
For simplicity I focused just on 128k random writes into 4mb objects.
fio can get ~18 mb/sec on my disk with iodepth=1. Interestingly, setting
iodepth=16 makes no difference *until* I also set thinktime=10 (us, or
almost any value really) and thinktime_blocks=16, at which point it goes
up with the iodepth. I'm not quite sure what is going on there but it
seems to be preventing the elevator and/or disk from reordering writes and
making more efficient sweeps across the disk. In any case, though, with
that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
Similarly, with qd 1 and thinktime of 250us, it drops to like 15mb/sec,
which is basically what I was getting from newstore. Here's my fio
config:
http://fpaste.org/212110/42923089/
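
In case the paste expires, here is a small sketch that generates a job file with the options named above (128k random writes, iodepth, thinktime, thinktime_blocks). It is not the config behind the link: ioengine=libaio, direct=1, the filename and the size are assumptions filled in only so that iodepth has an effect and the job is complete.

```python
def fio_job(iodepth=16, thinktime_us=10, thinktime_blocks=16,
            filename="/path/to/testfile", size="1g"):
    """Build a fio job file as text.

    filename and size are placeholders; ioengine and direct are
    assumptions, not taken from the original config.
    """
    lines = [
        "[wal-sim]",
        "rw=randwrite",
        "bs=128k",
        "ioengine=libaio",   # assumption: async engine so iodepth > 1 matters
        "direct=1",          # assumption: bypass the page cache
        f"iodepth={iodepth}",
        f"thinktime={thinktime_us}",             # microseconds
        f"thinktime_blocks={thinktime_blocks}",  # blocks issued between stalls
        f"filename={filename}",
        f"size={size}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the result to a file and run it with: fio <file>
    print(fio_job(), end="")
```

Calling fio_job(iodepth=1, thinktime_us=250, thinktime_blocks=1) would give something like the qd 1 case with a 250us stall after every write.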
Yikes! That is a great observation Sage!
Conclusion: we need multiple threads (or libaio) to get lots of IOs in
flight so that the block layer and/or disk can reorder and be efficient.
I added a threadpool for doing wal work (newstore wal threads = 8 by
default) and it makes a big difference. Now I am getting more like
19mb/sec w/ 4 threads and client (smalliobench) qd 16. It's not going up
much from there as I scale threads or qd, strangely; not sure why yet.
But... that's a big improvement over a few days ago (~8mb/sec). And on
this drive filestore with journal on ssd gets ~8.5mb/sec. So we're
winning, yay!
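
To illustrate the idea only (this is not the newstore code, which is C++), here is a minimal sketch of a pool of worker threads applying deferred writes to a file so that several IOs are in flight at once. The function names, the per-write fdatasync and the payload are assumptions made for the example; the 128k/4mb sizes and the thread count of 8 mirror the numbers above.

```python
import os
import random
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 128 * 1024              # 128k writes, as in the test above
OBJECT_SIZE = 4 * 1024 * 1024   # 4mb object
WAL_THREADS = 8                 # mirrors "newstore wal threads = 8"


def apply_wal_write(fd, offset, payload):
    """Apply one deferred write. pwrite is positional, so threads can share fd."""
    written = os.pwrite(fd, payload, offset)
    os.fdatasync(fd)  # assumption: sync after every write, for simplicity
    return offset, written


def run_wal_pool(path, offsets, threads=WAL_THREADS):
    """Issue all writes through a thread pool so several are in flight at once."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, OBJECT_SIZE)
        payload = b"\xab" * BLOCK
        with ThreadPoolExecutor(max_workers=threads) as pool:
            futures = [pool.submit(apply_wal_write, fd, off, payload)
                       for off in offsets]
            return [f.result() for f in futures]
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "fragment")
        blocks = random.sample(range(OBJECT_SIZE // BLOCK), 16)
        done = run_wal_pool(target, [b * BLOCK for b in blocks])
        print(f"applied {len(done)} writes")
```

With several workers blocked in pwrite/fdatasync at the same time, the block layer sees a deeper queue than a single synchronous writer would give it.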
I tabled the libaio patch for now since it was getting spurious EINVAL and
would consistently SIGBUS from io_getevents() when ceph-osd did dlopen()
on the rados plugins (weird!).
Mark, at this point it is probably worth checking that you can reproduce
these results? If so, we can redo the io size sweep. I picked 8 wal
threads since that was enough to help and going higher didn't seem to make
much difference, but at some point we'll want to be more careful about
picking that number. We could also use libaio here, but I'm not sure it's
worth it. And this approach is somewhat orthogonal to the idea of
efficiently passing the kernel things to fdatasync.
Absolutely! I'll get some tests running now. Looks like everyone is
jumping on the libaio bandwagon which naively seems like the right way
to me too. Can you talk a little bit more about how you'd see fdatasync
work in this case though vs the threaded implementation?
Anyway, next up is probably wrangling rocksdb's log!
I jumped on #rocksdb on freenode yesterday to ask about it, but I think
we'll probably just need to hit the mailing list.
sage