On Fri, 17 Apr 2015, Mark Nelson wrote:
> On 04/16/2015 07:38 PM, Sage Weil wrote:
> > On Thu, 16 Apr 2015, Mark Nelson wrote:
> > > On 04/16/2015 01:17 AM, Somnath Roy wrote:
> > > > Here is the data with omap separated to another SSD and after
> > > > 1000GB of fio writes (same profile)..
> > > >
> > > > omap writes:
> > > > -------------
> > > >
> > > > Total host writes in this period = 551020111 ------ ~2101 GB
> > > >
> > > > Total flash writes in this period = 1150679336
> > > >
> > > > data writes:
> > > > -----------
> > > >
> > > > Total host writes in this period = 302550388 --- ~1154 GB
> > > >
> > > > Total flash writes in this period = 600238328
> > > >
> > > > So, actual data write WA is ~1.1, but omap overhead is ~2.1;
> > > > adding those together gives ~3.2 WA overall.
> >
> > This all suggests that getting rocksdb to not rewrite the wal
> > entries at all will be the big win.  I think Xiaoxi had tunable
> > suggestions for that?  I didn't grok the rocksdb terms immediately so
> > they didn't make a lot of sense at the time.. this is probably a good
> > place to focus, though.  The rocksdb compaction stats should help out
> > there.
> >
> > But... today I ignored this entirely and put rocksdb in tmpfs and
> > focused just on the actual wal IOs done to the fragments files after
> > the fact.  For simplicity I focused just on 128k random writes into
> > 4mb objects.
> >
> > fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
> > setting iodepth=16 makes no difference *until* I also set
> > thinktime=10 (us, or almost any value really) and thinktime_blocks=16,
> > at which point it goes up with the iodepth.  I'm not quite sure what
> > is going on there, but something seems to be preventing the elevator
> > and/or disk from reordering writes and making more efficient sweeps
> > across the disk.  In any case, though, with that tweaked I can get up
> > to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
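[In case the fpaste link below has expired: a job file matching the parameters described in the quote (128k random writes, 4mb files, qd 16 with a tiny thinktime) would look roughly like the following. This is an illustrative sketch, not the actual pasted config; the device name and any value not mentioned in the prose are placeholders.]

```ini
; Sketch of a fio job per the description above -- NOT the original paste.
[global]
ioengine=libaio
direct=1
rw=randwrite
bs=128k
filesize=4m             ; "128k random writes into 4mb objects"
iodepth=16
thinktime=10            ; microseconds; "almost any value really"
thinktime_blocks=16

[wal-test]
filename=/dev/sdX       ; placeholder test device
```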
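[For reference, the quoted WA figures line up if one assumes the drive's host-write counters tick in 4 KiB units (an assumption on my part -- the thread doesn't state the unit, but it is consistent with the ~GB conversions given) and WA is taken as host writes divided by the 1000GB the fio job wrote:]

```python
# Sanity check of the write-amplification arithmetic quoted above.
# ASSUMPTION: the drive counters are in 4 KiB units; this is inferred
# from the ~2101 GB / ~1154 GB conversions in the quoted mail.
UNIT = 4096          # bytes per counter tick (assumed)
GiB = 1 << 30

omap_host_gib = 551020111 * UNIT / GiB   # ~2101 GiB, matches the quote
data_host_gib = 302550388 * UNIT / GiB   # ~1154 GiB, matches the quote
fio_written_gib = 1000                   # "1000GB of fio writes"

omap_wa = omap_host_gib / fio_written_gib    # ~2.1
data_wa = data_host_gib / fio_written_gib    # ~1.15 (quoted as ~1.1)
total_wa = omap_wa + data_wa                 # ~3.2

print(omap_wa, data_wa, total_wa)
```

[Under that reading, "adding those" means summing the two host-level WAs against the same 1000GB of client writes; the flash-write counters would give the drive-internal amplification on top of that.]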
> > Similarly, with qd 1 and a thinktime of 250us, it drops to like
> > 15mb/sec, which is basically what I was getting from newstore.
> > Here's my fio config:
> >
> > 	http://fpaste.org/212110/42923089/
>
> Yikes!  That is a great observation Sage!
>
> > Conclusion: we need multiple threads (or libaio) to get lots of IOs
> > in flight so that the block layer and/or disk can reorder and be
> > efficient.  I added a threadpool for doing wal work (newstore wal
> > threads = 8 by default) and it makes a big difference.  Now I am
> > getting more like 19mb/sec w/ 4 threads and client (smalliobench)
> > qd 16.  It's not going up much from there as I scale threads or qd,
> > strangely; not sure why yet.
> >
> > But... that's a big improvement over a few days ago (~8mb/sec).  And
> > on this drive filestore with journal on ssd gets ~8.5mb/sec.  So
> > we're winning, yay!
> >
> > I tabled the libaio patch for now since it was getting spurious
> > EINVAL and would consistently SIGBUS from io_getevents() when
> > ceph-osd did dlopen() on the rados plugins (weird!).
> >
> > Mark, at this point it is probably worth checking that you can
> > reproduce these results?  If so, we can redo the io size sweep.  I
> > picked 8 wal threads since that was enough to help and going higher
> > didn't seem to make much difference, but at some point we'll want to
> > be more careful about picking that number.  We could also use libaio
> > here, but I'm not sure it's worth it.  And this approach is somewhat
> > orthogonal to the idea of efficiently passing the kernel things to
> > fdatasync.
>
> Absolutely!  I'll get some tests running now.  Looks like everyone is
> jumping on the libaio bandwagon, which naively seems like the right
> way to me too.  Can you talk a little bit more about how you'd see
> fdatasync work in this case, though, vs the threaded implementation?

That I'm not certain about; I'm not sure if I need O_DSYNC or if the
libaio fsync hook actually works -- the docs are ambiguous.
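[The threadpool idea above boils down to keeping several WAL replays in flight at once so the elevator/disk can reorder them. A hypothetical sketch of that shape (this is NOT the newstore code -- the function names, 8-thread default, and per-entry fdatasync are illustrative only):]

```python
# Hypothetical sketch of the "wal threads" approach: dispatch WAL entries
# to a pool of workers so many IOs are in flight at once.  Names and the
# demo data below are invented; only the 8-thread default comes from the
# thread ("newstore wal threads = 8").
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

WAL_THREADS = 8

def apply_wal_entry(path, offset, data):
    """Replay one WAL entry into its fragment file, then make it durable."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.pwrite(fd, data, offset)
        # fdatasync per entry; whether O_DSYNC or a libaio fsync would be
        # better is exactly the open question discussed above.
        os.fdatasync(fd)
    finally:
        os.close(fd)
    return len(data)

# Demo against throwaway files: 16 entries of 128k across 4 fragment files.
with tempfile.TemporaryDirectory() as d:
    entries = [(os.path.join(d, "frag%d" % (i % 4)), (i // 4) * 131072,
                b"x" * 131072) for i in range(16)]
    with ThreadPoolExecutor(max_workers=WAL_THREADS) as pool:
        written = list(pool.map(lambda e: apply_wal_entry(*e), entries))

print(sum(written))
```

[The point of the pool is only concurrency at the block layer; with qd 1 (one worker) the disk never gets a chance to sweep, which matches the fio observation above.]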
> > Anyway, next up is probably wrangling rocksdb's log!
>
> I jumped on #rocksdb on freenode yesterday to ask about it, but I
> think we'll probably just need to hit the mailing list.

This appears to be the place to reach rocksdb folks:

	https://www.facebook.com/groups/rocksdb.dev/

sage