RE: Increasing # Shards vs multi-OSDs per device


 



Thanks for taking a look!  

First, the original slides are on the Ceph slideshare here: http://www.slideshare.net/Inktank_Ceph/accelerating-cassandra-workloads-on-ceph-with-allflash-pcie-ssds   That should show the 1/2/4 partition comparison and the overall performance numbers, latency, and data set size.  I didn't provide much context in this quick deck because I meant it to be a companion to that data.

1 -- I agree we're a long way from the max DC P3700 4K RR IOPS (460K), and that number is at QD128.  In the future (some day) it would be nice if you could reach that performance with OSD knobs to increase worker threads.  We did try many shard/workers-per-shard combinations, but the real purpose of the presentation was mixed r/w performance and the "Cassandra" breakdown of IO sizes at a 50/50 mix.  Every deviation from the defaults hurt write or mixed performance, so we weren't just chasing the highest random-read numbers possible.
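For reference, the knobs we were varying are the OSD sharded-op-queue settings; a minimal sketch of what a non-default run looks like in ceph.conf (values here are illustrative of the 20-shard experiment, not our exact published config):

    [osd]
    # number of shards in the OSD op work queue (default 5)
    osd op num shards = 20
    # worker threads per shard (default 2)
    osd op num threads per shard = 2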

2 -- We used CBT for all runs; unfortunately our default collectl config didn't grab NVMe stats -- we'll fix that as we move to Infernalis. I have NVMe bandwidth only (from Zabbix monitoring).  Using iostat to spot-check the QD, it was always pretty low on the devices, at most 10 in the 4-partition 100% read case (a quick example of the spot check is below).  For 100% RR @ QD32:  1 OSD/NVMe = 34.4K reads/device; 4 OSDs/NVMe = 55.5K reads/device (~14K per OSD).  The other presentation shows that with 2 OSDs/NVMe we hit 44K IOPS/device, and we did go to 8 OSDs/device but saw no improvement over 4.
So, we determined that 4 OSDs/NVMe is worth doing over 2, and the 4-OSDs-to-1-flash-device failure boundary closely matches the generally accepted ratio of OSD journals per device.
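The spot checks were nothing fancy, just something along these lines on each storage node (device names are examples):

    # extended per-device stats every second; avgqu-sz approximates the device queue depth
    iostat -x 1 /dev/nvme0n1 /dev/nvme1n1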

3 -- Page cache effects should be negated in these numbers. As you can see in the other presentation, we did one run with 2TB of data, which showed higher performance (1.35M IOPS).  But the rest of the tests were run with 4.8TB of data (replicated twice) and a uniform random distribution.  While we did use 'norandommap' for client performance, the re-referencing of blocks in a dataset that size should be low.  Using iostat and Zabbix, we correlated the aggregate host read/write performance to the device statistics to confirm that object data wasn't coming out of cache.
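To be clear about the client I/O pattern, the relevant fio options were along these lines -- a sketch only, assuming the librbd fio engine; the pool/image names and the exact job layout are illustrative, not the published CBT config:

    [global]
    ioengine=rbd
    clientname=admin
    # illustrative pool name
    pool=rbd
    rw=randread
    bs=4k
    iodepth=32
    # skip fio's per-block map for client-side efficiency
    norandommap=1
    # uniform random over the image (fio default), no zipf/pareto skew
    random_distribution=random

    [job0]
    # illustrative image name
    rbdname=cbt-img-0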

4 -- Yeah, this multi-partitioning doesn't double or quadruple CPU utilization, throughput, or performance.  But it gives significant improvements up to a point (4 partitions, by experimentation, in this cluster).  For SSDs, our team in Shanghai found 2 partitions/SSD gave the best performance.

5 -- Good catch, I typed this in quickly when my mic wasn't working this morning :)  In all cases the number of actual worker threads is double what is stated on slide #7 in the linked presentation below.  This is because every shard defaults to 2 worker threads, and we did not change that in any of the published tests.  When we did lower it to 1, it always hurt write performance as well.
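To make the arithmetic explicit (assuming I have the per-node topology right at 4 NVMe devices per node):

    5 shards x 2 threads/shard = 10 workers/OSD;  4 OSDs/device x 4 devices = 16 OSDs/node -> 160 workers/node
    20 shards x 2 threads/shard = 40 workers/OSD; 1 OSD/device x 4 devices = 4 OSDs/node  -> 160 workers/node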

6 -- As you can see in the config, we did increase filestore_op_threads to 6 (this gave a 15% boost in mixed r/w performance).  Going higher than that didn't help.  I'm not sure whether it would have helped in the 20-shards/OSD case.
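i.e., in ceph.conf terms (the default is 2):

    [osd]
    filestore op threads = 6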

I am still really curious about the scheduler behavior for threads within an OSD.  Given the sleep/signal/wakeup mechanism between the msg dispatcher and the worker threads, is it possible that's causing the scheduler to bump threads up to a higher priority and somehow breaking fairness when there are more runnable threads than CPUs?  Each node here has 72 CPUs (with HT) but, as you note, 160 worker threads (in addition to the pipe readers/writers and msg dispatchers).
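If anyone wants to poke at this, something like the following on one of the OSD nodes should show per-thread CPU time and wakeup-to-run latencies (a sketch of how I'd approach it, not something we've run yet; <osd-pid> is a placeholder):

    # per-thread CPU usage for a given OSD process, 1-second samples
    pidstat -t -p <osd-pid> 1

    # record scheduler events for 10 seconds, then look at per-thread wakeup latencies
    perf sched record -- sleep 10
    perf sched latency --sort max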

Thanks,

Stephen




From: Somnath Roy [mailto:its.somenath@xxxxxxxxx] 
Sent: Wednesday, November 11, 2015 3:02 PM
To: Blinick, Stephen L
Cc: ceph-devel@xxxxxxxxxxxxxxx; Mark Nelson; Samuel Just; Kyle Bader; Somnath Roy
Subject: Re: Increasing # Shards vs multi-OSDs per device

Thanks for the data Stephen. Some feedback:
 
1. I don't think a single OSD is yet able to serve 460K read IOPS, irrespective of how many shards/threads you are running. I didn't have your NVMe data earlier :-).. But for 50-60K SAS SSD IOPS, a single OSD per drive is probably good enough. I hope you tried increasing the shards/threads to very high values (since you have a lot of CPU left), say 40:2 or 80:1 (try one configuration with 1 thread/shard; it should reduce contention per shard)? Or even lower ratios like 10:2 or 20:1?
 
2. Do you have any data on disk utilization? It would be good to understand how much better single-disk utilization becomes when you run multiple OSDs per drive. Back-calculating from your data: in the 4 OSDs/drive case each OSD is serving ~14K read IOPS vs ~42K read IOPS with one OSD/drive. So this suggests two OSDs/drive should be enough to serve similar IOPS in your environment. You are able to extract ~56K IOPS per drive with 4 OSDs vs ~42K in the one-OSD case.
 
3. The above calculation discards any cache effect, but that's not realistic. You have a total of 128 GB * 5 = 640 GB of RAM. What is your total working set? If there is a lot of cache effect in this run, 4 OSDs (4 XFS filesystems) will benefit more than one OSD/drive. That would be an effect of the total number of OSDs in the cluster, not of the number of OSDs needed to saturate a drive.
 
4. Also, CPU-utilization-wise, you see only ~20% more CPU util while running 4x more OSDs.
 
5. BTW, the worker thread calculation is incorrect; the default is 5:2, so each OSD is running 10 worker threads, and the total is 160 worker threads for both the 4 OSDs/drive case and the 1 OSD/drive (20:2) case.
 
6. The write data is surprising compared to the default-shard, 1-OSD case; maybe you need to increase filestore op threads since you have more data coming into the filestore?
 
Thanks & Regards
Somnath

On Wed, Nov 11, 2015 at 12:57 PM, Blinick, Stephen L <stephen.l.blinick@xxxxxxxxx> wrote:
Sorry about the microphone issues in the performance meeting today.  This is a followup to the 11/4 performance meeting where we discussed increasing the worker thread count in the OSDs vs making multiple OSDs (and partitions/filesystems) per device.  We did the high-level experiment and have some results, which I threw into a ppt/pdf and shared here:

http://www.docdroid.net/UbmvGnH/increasing-shards-vs-multiple-osds.pdf.html

Doing 20-shard OSDs vs 4 OSDs per device with the default 5 shards yielded about half of the performance improvement for random 4K reads.  For writes, performance is actually worse than just 1 OSD per device with the default number of shards.  The throttles should be large enough for the 20-shard use case as they are 10x the defaults (the usual suspects are sketched below), although if you see anything we missed let us know.
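For reference, the throttles in question are the usual filestore/journal queue limits; the values below are just illustrative of the ~10x bump over the defaults (see the shared config for the exact numbers we used):

    [osd]
    filestore queue max ops = 500
    filestore queue max bytes = 1048576000
    journal max write entries = 1000
    journal queue max ops = 3000
    journal max write bytes = 104857600
    journal queue max bytes = 335544320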

I had the cluster moved to the Infernalis release (with jemalloc) yesterday, so hopefully we'll have some early results on the same 5-node cluster soon.

Thanks,

Stephen






