RE: Increasing # Shards vs multi-OSDs per device

Hi, sure, send me the branch and we can run a quick test, at least on 4K random reads and writes.

Thanks,

Stephen


-----Original Message-----
From: Robert LeBlanc [mailto:robert@xxxxxxxxxxxxx] 
Sent: Wednesday, November 11, 2015 4:30 PM
To: Blinick, Stephen L
Cc: Somnath Roy; ceph-devel@xxxxxxxxxxxxxxx; Mark Nelson; Samuel Just; Kyle Bader; Somnath Roy
Subject: Re: Increasing # Shards vs multi-OSDs per device

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

I should have the weighted round robin queue ready in the next few days. I'm shaking out a few bugs from converting it from my Hammer patch, and I still need to write a test suite, but I can get you the branch before then. I'd be interested to see what difference it makes, as that would help decide whether this is a path worth pursuing.
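
For readers unfamiliar with the term, the general idea of a weighted round robin queue can be sketched as below. This is a generic illustration only, not the Hammer patch or the branch mentioned above; the class names ("client", "recovery") and the weights are hypothetical. Each class is served up to its weight in items per round before the queue moves on to the next class.

```python
from collections import deque


class WeightedRoundRobinQueue:
    """Serve each class up to `weight` items per round, then move on."""

    def __init__(self, weights):
        # weights: mapping of class name -> positive integer weight.
        # The class names used here are hypothetical examples.
        if not weights:
            raise ValueError("at least one class is required")
        if any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be positive")
        self._weights = dict(weights)
        self._queues = {name: deque() for name in weights}
        self._order = list(weights)   # service order of the classes
        self._index = 0               # class currently being served
        self._credit = self._weights[self._order[0]]

    def enqueue(self, name, item):
        self._queues[name].append(item)

    def __len__(self):
        return sum(len(q) for q in self._queues.values())

    def _advance(self):
        # Move to the next class and refill its per-round credit.
        self._index = (self._index + 1) % len(self._order)
        self._credit = self._weights[self._order[self._index]]

    def dequeue(self):
        if len(self) == 0:
            raise IndexError("dequeue from empty queue")
        while True:
            queue = self._queues[self._order[self._index]]
            if queue and self._credit > 0:
                self._credit -= 1
                return queue.popleft()
            # Class is empty or has used up its share for this round.
            self._advance()


q = WeightedRoundRobinQueue({"client": 3, "recovery": 1})
for i in range(6):
    q.enqueue("client", f"c{i}")
for i in range(3):
    q.enqueue("recovery", f"r{i}")
served = [q.dequeue() for _ in range(len(q))]
print(served)
```

With a 3:1 weighting, low-weight work still makes progress in every round instead of waiting for the high-weight class to drain, which is the property that distinguishes this from strict priority ordering.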
- ----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Nov 11, 2015 at 3:44 PM, Blinick, Stephen L  wrote:
> Thanks for taking a look!
>
> First, the original slides are on the Ceph slideshare here: http://www.slideshare.net/Inktank_Ceph/accelerating-cassandra-workloads-on-ceph-with-allflash-pcie-ssds  That should show the 1/2/4 partition comparison and the overall performance numbers, latency, and data set size. I didn't provide much context in this quick deck because I meant it to be a companion to that data.
>
> 1 -- I agree we're a long way from the max DC P3700 4K RR IOPS (460K), and also that figure is at QD128. In the future (some day) it would be nice, though, if you could reach that performance with OSD knobs to increase worker threads. We did try many combinations of shards and workers per shard, but the real purpose of the presentation was mixed r/w performance and the "Cassandra" breakdown of IO sizes @ 50/50 mix. Every deviation from the default hurt writes or mixed performance, so we were not just working for the highest RandRead numbers possible.
>
> 2 -- We used CBT for all runs. Unfortunately our default collectl config didn't grab NVMe stats; we'll fix that as we move to Infernalis. I have NVMe bandwidth only (from Zabbix monitoring). Spot-checking the QD with iostat, though, it was always pretty low on the devices, at most 10 in the 4-partition 100% read case. For 100% RR @ QD32: 1 OSD/NVMe = 34.4K reads/device; 4 OSDs/NVMe = 55.5K reads/device (14K per OSD). What's shown in the other presentation is that with 2 OSDs/NVMe we hit 44K IOPS/device, and we did go to 8 OSDs/device but saw no improvement over 4.
> So we determined that 4 OSDs/NVMe is worth doing over 2, and the failure boundary of 4 OSDs to 1 flash device closely matches the generally accepted ratio of OSD journals to one device.
>
> 3 -- Page cache effects should be negated in these numbers. As you can see in the other presentation, we did one run with 2TB of data, which showed higher performance (1.35M IOPS). But the rest of the tests were run with 4.8TB of data (replicated twice) and a uniform random distribution. While we did use 'norandommap' for client performance, the re-referencing of blocks in a dataset of that size should be low. Using iostat and Zabbix, we correlated the host read/write aggregate performance with the device statistics to confirm that object data wasn't coming out of cache.
>
> 4 -- Yeah, this multi-partitioning doesn't double or quadruple CPU, throughput, or performance. But it gives significant improvements up to a point (4, by experimentation, in this cluster). For SSDs, our team in Shanghai used 2 partitions/SSD for best performance.
>
> 5 -- Good catch, I typed this in quickly when my mic wasn't working this morning :)   In all cases the # of actual worker threads is double what is stated on slide #7 in the linked presentation below.  This is because every shard by default = 2 workers and we did not change that in any of the published tests.  When we did lower it to 1 it always hurt write performance as well.
>
> 6 -- As you can see in the config, we did increase filestore_op_threads to 6  (this gave a 15% boost in mixed r/w performance).  Higher than that didn't help.  I'm not sure if it would have helped in the case of 20 shards/OSD.
>
> I am still really curious about the scheduler behavior for threads within an OSD. Given the sleep/signal/wakeup mechanism between the msg dispatcher and the worker threads, is it possible that this causes the scheduler to bump threads up to a higher priority and somehow breaks fairness when there are more runnable threads than CPUs? Each node here has 72 CPUs (with HT) but, as you note, 160 worker threads (in addition to the pipe readers/writers and msg dispatchers).
>
> Thanks,
>
> Stephen
>
>
>
>
> From: Somnath Roy [mailto:its.somenath@xxxxxxxxx]
> Sent: Wednesday, November 11, 2015 3:02 PM
> To: Blinick, Stephen L
> Cc: ceph-devel@xxxxxxxxxxxxxxx; Mark Nelson; Samuel Just; Kyle Bader; 
> Somnath Roy
> Subject: Re: Increasing # Shards vs multi-OSDs per device
>
> Thanks for the data Stephen. Some feedback:
>
> 1. I don't think a single OSD can serve 460K read IOPS yet, irrespective of how many shards/threads you are running. I didn't have your NVMe data earlier :-). But for a 50/60K IOPS SAS SSD, a single OSD per drive is probably good enough. I hope you tried increasing the shards/threads to very high values (since you have a lot of CPU left), say 40:2 or 80:1 (try one configuration with 1 thread/shard; it should reduce contention per shard)? Or even lower ratios like 10:2 or 20:1?
>
> 2. Do you have any data on disk utilization? It would be good to understand how much better the utilization of a single disk becomes when you run multiple OSDs/drive. Back-calculating from your data, in the 4 OSDs/drive case each OSD is serving ~14K read IOPS vs ~42K read IOPS with one OSD/drive. So this suggests that two OSDs/drive should be good enough to serve similar IOPS in your environment. You are able to extract ~56K IOPS per drive with 4 OSDs vs 42K in the one-OSD case.
>
> 3. In the above calculation I ignored all cache effects, but that's not realistic. You have a total of 128 GB * 5 = 640 GB of RAM. What is your total working set? If there is a lot of cache effect in this run, 4 OSDs (4 XFS filesystems) will benefit from it more than one OSD/drive. This could be an effect of the total number of OSDs in the cluster rather than of the number of OSDs needed to saturate a drive.
>
> 4. Also, CPU utilization is only 20% higher while you are running 4X more OSDs.
>
> 5. BTW, the worker thread calculation is incorrect. The default is 5:2, so each OSD is running with 10 worker threads, and the total is 160 worker threads for both 4 OSDs/drive and 1 OSD/drive (20:2).
>
> 6. The write data is surprising compared to the default-shard 1 OSD case. Maybe you need to increase filestore op threads, since you have more data coming to the filestore?
>
> Thanks & Regards
> Somnath
>
> On Wed, Nov 11, 2015 at 12:57 PM, Blinick, Stephen L  wrote:
> Sorry about the microphone issues in the performance meeting today. This is a followup to the 11/4 performance meeting where we discussed increasing the worker thread count in the OSDs vs making multiple OSDs (and partitions/filesystems) per device. We did the high-level experiment and have some results, which I threw into a ppt/pdf and shared here:
>
> http://www.docdroid.net/UbmvGnH/increasing-shards-vs-multiple-osds.pdf.html
>
> Running one 20-shard OSD per device yielded about half of the performance improvement that 4 OSDs per device (with the default 5 shards each) gave for random 4K reads. For writes, performance is actually worse than with just 1 OSD per device and the default number of shards. The throttles should be large enough for the 20-shard use case, as they are 10x the defaults, although if you see anything we missed, let us know.
>
> I had the cluster moved to Infernalis release (with JEMalloc) yesterday, so hopefully we'll have some early results on the same 5-node cluster soon.
>
> Thanks,
>
> Stephen
>
>
>
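
The per-device figures and thread counts quoted in the thread above can be cross-checked with a few lines of arithmetic. This is only a sketch: the IOPS values are the QD32 100% random read numbers given in the thread, while `DEVICES_PER_NODE = 4` is an assumption inferred from the stated total of 160 worker threads per node, not a figure anyone quotes directly. The function names are made up for the example.

```python
# 4K random read IOPS per NVMe device at QD32, keyed by OSDs per device
# (values as quoted in the thread).
READS_PER_DEVICE = {1: 34_400, 2: 44_000, 4: 55_500}

# Assumption: inferred from 160 worker threads per node, not stated directly.
DEVICES_PER_NODE = 4
CPUS_PER_NODE = 72  # logical CPUs with HT, as stated in the thread


def scaling(reads_per_device):
    """Per-OSD IOPS and speedup relative to one OSD per device."""
    base = reads_per_device[1]
    return {
        n: {"per_osd": iops / n, "speedup": iops / base}
        for n, iops in reads_per_device.items()
    }


def worker_threads(osds_per_device, shards, threads_per_shard,
                   devices_per_node=DEVICES_PER_NODE):
    """Total OSD worker threads per node for a shards:threads setting."""
    return osds_per_device * shards * threads_per_shard * devices_per_node


for n, row in scaling(READS_PER_DEVICE).items():
    print(f"{n} OSD(s)/device: {row['per_osd']:.0f} IOPS per OSD, "
          f"{row['speedup']:.2f}x vs 1 OSD")

four_osds = worker_threads(4, shards=5, threads_per_shard=2)
one_osd = worker_threads(1, shards=20, threads_per_shard=2)
print(four_osds, one_osd, f"{four_osds / CPUS_PER_NODE:.2f} threads per CPU")
```

Under that assumption both configurations end up with the same number of worker threads per node, so the read difference between them is not explained by thread count alone.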

-----BEGIN PGP SIGNATURE-----
Version: Mailvelope v1.2.3
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJWQ89oCRDmVDuy+mK58QAAGm0P/igsFdpIZonJ1rnBDzn1
tMPRpVcKHvL/Yqx9UgkZwXnQ+MZP6nUj/roWzhJFM8OnXIMBg7TKvQLNN1uv
NFNX6noWcQjASrUsFbkJC78xv9GOZeltN/9sE5jDZdjPHqNtP7g9n/au1DZP
qPqlJBQxoF4p1qZUijuX3JXMmLRNNvEpdhicve1gz3WG5CdtZb1p1udPECa1
7InsacCqzc9foRv23wqcnQU/cCQyZWLRDgMSFXb4/b5JErqwAV4WNL/C6oTa
hdAIsaRVlQvhj9PlYI86FYCd0sj/B1TZlRaRBKR/Eup7Yyvlo6y+EaNua7Ou
D7ilYZCBOQ+2HUaM6Dv+SRJogK35nkurkthP1hqZi0TLYpxSefzpzf+TJuvg
r/B2f26ha4lX7i023gPTij+GkpLCTJgymKWqbLHfHsNQN1/fwrgwOJ/5ySNL
TXh1iTT8ulB6GwmkPM9MRlIW6jRCoOpjWXHjE6R7wAVOh/cpLb98ie2cmW+6
sXhCllPFwpHogYJklCW+eI6eZ7T2Y26WMA/BbwVKKlPhcaU35LVym77XeqBI
804tLumsYyBVZVIlpsn1Eqk+tgh6/aNSgMXztDTWjdCVwUhhmLDGuzdDYa+1
EM2bW4+ZIeRnhab662v8muFX8ka/ee/HX43St50LeGRcYIEICsxGSCMXhVdt
kTLC
=cj9z
-----END PGP SIGNATURE-----