RE: Regarding newstore performance

Agreed... threadpool/queue/locking is generally bad for latency. Can we just make the newstore backend as synchronous as possible and get the parallelism from a higher #OSD_OP_THREAD instead? Hopefully we could then see better latency in the low-#QD case.
 

-----Original Message-----
From: Gregory Farnum [mailto:greg@xxxxxxxxxxx] 
Sent: Friday, April 17, 2015 8:48 AM
To: Sage Weil
Cc: Mark Nelson; Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
Subject: Re: Regarding newstore performance

On Thu, Apr 16, 2015 at 5:38 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Thu, 16 Apr 2015, Mark Nelson wrote:
>> On 04/16/2015 01:17 AM, Somnath Roy wrote:
>> > Here is the data with omap separated onto another SSD, after
>> > 1000GB of fio writes (same profile):
>> >
>> > omap writes:
>> > -------------
>> >
>> > Total host writes in this period = 551020111 ------ ~2101 GB
>> >
>> > Total flash writes in this period = 1150679336
>> >
>> > data writes:
>> > -----------
>> >
>> > Total host writes in this period = 302550388 --- ~1154 GB
>> >
>> > Total flash writes in this period = 600238328
>> >
>> > So the actual data-write WA is ~1.1 but the omap overhead is ~2.1;
>> > adding those gives ~3.2 WA overall.
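
For reference, the arithmetic behind those WA numbers, taking the ~1000 GB
of fio writes mentioned above as the denominator:

    data WA        ~= 1154 GB host writes / 1000 GB written   ~= 1.15
    omap overhead  ~= 2101 GB host writes / 1000 GB written   ~= 2.1
    overall WA     ~= (1154 + 2101) GB  /  1000 GB written     ~= 3.2
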
>
> This all suggests that getting rocksdb to not rewrite the wal entries 
> at all will be the big win.  I think Xiaoxi had tunable suggestions 
> for that?  I didn't grok the rocksdb terms immediately so they didn't 
> make a lot of sense at the time.. this is probably a good place to 
> focus, though.  The rocksdb compaction stats should help out there.
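
In case it helps with digging into those stats: below is a minimal
standalone sketch of dumping rocksdb's compaction/write statistics
programmatically.  This is not newstore code; the path and options are
placeholders.

  #include <iostream>
  #include <string>
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::DB* db = nullptr;
    rocksdb::Options options;
    options.create_if_missing = true;

    // Placeholder path, just to have a db instance to query.
    rocksdb::Status s =
        rocksdb::DB::Open(options, "/tmp/rocksdb-stats-demo", &db);
    if (!s.ok()) {
      std::cerr << "open failed: " << s.ToString() << std::endl;
      return 1;
    }

    // Per-level compaction stats, including bytes flushed/compacted and
    // write amplification -- the numbers relevant to the WAL-rewrite
    // question above.
    std::string stats;
    db->GetProperty("rocksdb.stats", &stats);
    std::cout << stats << std::endl;

    delete db;
    return 0;
  }
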
>
> But... today I ignored this entirely and put rocksdb in tmpfs and 
> focused just on the actual wal IOs done to the fragments files after the fact.
> For simplicity I focused just on 128k random writes into 4mb objects.
>
> fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
> setting iodepth=16 makes no difference *until* I also set thinktime=10
> (us, or almost any value really) and thinktime_blocks=16, at which
> point it goes up with the iodepth.  I'm not quite sure what is going on
> there, but something seems to be preventing the elevator and/or disk
> from reordering writes and making more efficient sweeps across the
> disk.  In any case, though, with that tweaked I can get up to ~30mb/sec
> with qd 16, ~40mb/sec with qd 64.
> Similarly, with qd 1 and a thinktime of 250us, it drops to like
> 15mb/sec, which is basically what I was getting from newstore.  Here's
> my fio config:
>
>         http://fpaste.org/212110/42923089/
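
In case the paste goes away: reconstructed purely from the parameters
described above (so not necessarily the exact file), a fio job along
these lines exercises the iodepth/thinktime behaviour; the target file
is a placeholder.

  [newstore-wal-sim]
  # 128k random writes into a 4m "object", queue depth 16
  filename=testfile
  size=4m
  rw=randwrite
  bs=128k
  ioengine=libaio
  direct=1
  iodepth=16
  # thinktime is in microseconds
  thinktime=10
  thinktime_blocks=16
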
>
> Conclusion: we need multiple threads (or libaio) to get lots of IOs in 
> flight so that the block layer and/or disk can reorder and be efficient.
> I added a threadpool for doing wal work (newstore wal threads = 8 by
> default) and it makes a big difference.  Now I am getting more like 
> 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going 
> up much from there as I scale threads or qd, strangely; not sure why yet.
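
For anyone who hasn't read the patch, the shape of the change is roughly
the following.  This is a minimal standalone sketch with invented names,
not the actual newstore code: a fixed pool of threads drains a queue of
WAL fragment writes so that several IOs are in flight at once and the
block layer can reorder them.

  #include <condition_variable>
  #include <deque>
  #include <mutex>
  #include <thread>
  #include <vector>
  #include <sys/types.h>
  #include <unistd.h>

  struct WalItem {
    int fd;
    off_t off;
    std::vector<char> data;
  };

  class WalPool {
    std::deque<WalItem> q;
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::thread> workers;
    bool stopping = false;

   public:
    explicit WalPool(int nthreads = 8) {   // cf. "newstore wal threads = 8"
      for (int i = 0; i < nthreads; ++i)
        workers.emplace_back([this] { run(); });
    }
    ~WalPool() {
      { std::lock_guard<std::mutex> l(m); stopping = true; }
      cv.notify_all();
      for (auto& t : workers) t.join();
    }
    void submit(WalItem item) {
      { std::lock_guard<std::mutex> l(m); q.push_back(std::move(item)); }
      cv.notify_one();
    }

   private:
    void run() {
      for (;;) {
        WalItem item;
        {
          std::unique_lock<std::mutex> l(m);
          cv.wait(l, [this] { return stopping || !q.empty(); });
          if (q.empty()) return;           // stopping and fully drained
          item = std::move(q.front());
          q.pop_front();
        }
        // Each worker issues its own write, so up to nthreads IOs are in
        // flight at once.  Error handling is omitted in this sketch.
        ssize_t r = ::pwrite(item.fd, item.data.data(), item.data.size(),
                             item.off);
        if (r >= 0)
          ::fdatasync(item.fd);
      }
    }
  };
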
>
> But... that's a big improvement over a few days ago (~8mb/sec).  And 
> on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're 
> winning, yay!
>
> I tabled the libaio patch for now since it was getting spurious EINVAL
> and would consistently SIGBUS in io_getevents() when ceph-osd did
> dlopen() on the rados plugins (weird!).
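
For reference, in isolation the libaio flow being discussed looks
roughly like this minimal sketch (plain libaio API, placeholder file
name, not the tabled patch itself); build with -laio.

  #include <libaio.h>
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdio>
  #include <cstdlib>
  #include <cstring>

  int main() {
    // O_DIRECT needs an aligned buffer and IO size; use 4k alignment.
    int fd = ::open("aio-demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    void* buf = nullptr;
    if (posix_memalign(&buf, 4096, 128 * 1024)) return 1;
    memset(buf, 0, 128 * 1024);

    io_context_t ctx = 0;
    if (io_setup(16, &ctx) < 0) {            // allow up to 16 IOs in flight
      fprintf(stderr, "io_setup failed\n");
      return 1;
    }

    struct iocb cb;
    struct iocb* cbs[1] = { &cb };
    io_prep_pwrite(&cb, fd, buf, 128 * 1024, 0);  // one 128k write at offset 0

    if (io_submit(ctx, 1, cbs) != 1) {
      fprintf(stderr, "io_submit failed\n");
      return 1;
    }

    // Reap the completion; this is the call that was reportedly blowing
    // up (SIGBUS) after the rados plugins were dlopen()ed.
    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, nullptr) != 1) {
      fprintf(stderr, "io_getevents failed\n");
      return 1;
    }

    io_destroy(ctx);
    ::close(fd);
    free(buf);
    return 0;
  }
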
>
> Mark, at this point it is probably worth checking that you can 
> reproduce these results?  If so, we can redo the io size sweep.  I 
> picked 8 wal threads since that was enough to help and going higher 
> didn't seem to make much difference, but at some point we'll want to 
> be more careful about picking that number.  We could also use libaio 
> here, but I'm not sure it's worth it.  And this approach is somewhat 
> orthogonal to the idea of efficiently passing the kernel things to fdatasync.

Adding another thread switch to the IO path is going to make us very sad in the future, so I think this'd be a bad prototype version to let escape into the wild. I keep hearing Sam's talk about needing to get down to 1 thread switch if we're ever to hope for 100usec writes.

So consider this one vote for making libaio work, and sooner rather than later. :) -Greg