Hello,

On Wed, 01 Feb 2017 15:16:15 +0100 Eugen Block wrote:

> > You've told us absolutely nothing about your cluster
>
> You're right, I'll try to provide as much information as possible.
>
> Please note that we have kind of a "special" layout... The HDDs on
> ndesan01 are in a RAID6, which is bcached by two SSDs in a RAID1:
>
I've dabbled in RAID backed OSDs with mixed success, but your setup
basically tries to do the IOPS for 5 OSDs on what essentially boils down
to a single HDD (due to the RAID6).
A HW cache or maybe your bcache setup will help with this of course, but
once they are saturated and/or your IO patterns have competing reads and
writes, performance will essentially go to hell.
I have no (successful and recent) experience with bcache, so this may or
may not be a factor, too.

Again, monitoring things with atop during a benchmark (large window, at
least 5s frequency) on your nodes and especially this one will show you
where the bottlenecks are very quickly.

> ---cut here---
> ndesan01: # cat /proc/mdstat
> Personalities : [raid1] [raid6] [raid5] [raid4]
> md126 : active raid6 sdd1[3] sde1[4] sdf1[5] sda1[0] sdc1[2] sdb1[1]
>       3906521088 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
>       bitmap: 6/8 pages [24KB], 65536KB chunk
>
> md127 : active raid1 sdh1[1] sdg1[0]
>       195228864 blocks super 1.2 [2/2] [UU]
>       bitmap: 2/2 pages [8KB], 65536KB chunk
>
> unused devices: <none>
>
> ndesan01: # pvs
>   PV           VG       Fmt  Attr PSize PFree
>   /dev/bcache0 ndesan01 lvm2 a--  3.64t 427.46g
>
> ndesan01: # lvs | grep osd
>   cephosd0 ndesan01 -wi-ao---- 250.00g
>   cephosd1 ndesan01 -wi-ao---- 250.00g
>   cephosd2 ndesan01 -wi-ao---- 250.00g
>   cephosd3 ndesan01 -wi-ao---- 250.00g
>   cephosd4 ndesan01 -wi-ao---- 250.00g
> ---cut here---
>
> ndesan02 has the same layout. nde32 is not bcached:
>
> nde32:~ # lvs | grep osd
>   cephosd0 ndesan02 -wi-ao---- 250,00g
>   cephosd1 ndesan02 -wi-ao---- 250,00g
>   cephosd2 ndesan02 -wi-ao---- 250,00g
>   cephosd3 ndesan02 -wi-ao---- 250,00g
>   cephosd4 ndesan02 -wi-ao---- 250,00g
>
> ndesan03 is a dedicated ceph node and is also bcached:
>
> ---cut here---
> ndesan03:~ # df -hl | grep osd
> /dev/bcache0    932G  286G  646G  31% /var/ceph/osd0
> /dev/bcache1    932G  288G  644G  31% /var/ceph/osd1
> /dev/bcache2    932G  298G  634G  33% /var/ceph/osd2
> /dev/bcache3    932G  289G  644G  31% /var/ceph/osd3
> /dev/bcache4    932G  285G  647G  31% /var/ceph/osd4
> ---cut here---
>
> This mixed setup is due to the pre-existing infrastructure on these
> servers; we planned to migrate the environment step by step into ceph.
>
> ceph@ndesan01:~/ceph-deploy> ceph --version
> ceph version 0.94.7-84-g8e6f430
>
Not that it matters much, but you may as well go to 0.94.9, aka the
latest Hammer.

> All 4 nodes have the same OS:
> ceph@ndesan01:~/ceph-deploy> cat /etc/os-release
> NAME="openSUSE Leap"
> VERSION="42.1"
>
> ndesan01, ndesan02 and nde32 have the same kernel:
> ceph@ndesan01:~/ceph-deploy> uname -r
> 4.1.13-5-default
>
The 4.1 to 4.4 kernels have various issues; of course SUSE might have
backported fixes.
It often pays to run the latest (4.8, 4.9) kernels with Ceph.
But again, not the cause of your problem.

> ndesan03:~ # uname -r
> 4.1.12-1-default
>
> It's a 1 Gb/s network.
>
And there's your "problem".
I read this as a single-link "1Gb/s" network, which makes your benchmark
numbers totally on par, as I suggested.
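As a rough sanity check, a single 1Gb/s link tops out at roughly
110-120MB/s of usable TCP throughput, and the replication traffic
between your OSD nodes has to share those very same links, so the
client-visible bandwidth shrinks further with every increase of the
pool size. Something along these lines (host names are just examples,
iperf3 or plain iperf will both do) confirms the raw link speed, and
"atop 5" on every node during a rados bench run will show you which
resource saturates first:

  ndesan03:~ # iperf3 -s                 # server side
  ndesan01:~ # iperf3 -c ndesan03 -t 30  # expect around 940 Mbits/sec on healthy GbE
  ndesan01:~ # atop 5                    # in a second terminal while benchmarking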
Congestion on individual OSDs (an unlucky roll of the dice that has one
OSD being primary for one IO and secondary for other concurrent ones)
will not help of course.

> ndesan01: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)
>
> ndesan02: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)
>
If you're doing 4K sync/direct randwrites you'll notice that faster cores
will be better than more cores, but for bandwidth related tests it
doesn't matter. Of course that's against SSD/NVMe backed OSDs or at
least ones with SSD journals. (See the example fio invocation below the
quoted "ceph osd df" output.)

Everything below looks nominal; the lower memory on one node will make
it inherently slower and thus the overall bottleneck in an otherwise
identical cluster, due to fewer pagecache hits and thus more real HDD
accesses.

Christian

> nde32: 2 x Intel(R) Xeon(R) CPU E5410 @ 2.33GHz (each 4 cores)
>
> ndesan03: 1 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (4 cores)
>
> ceph@ndesan01:~/ceph-deploy> ceph osd tree
> ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.14980 root default
> -2 1.19995     host ndesan01
>  0 0.23999         osd.0          up  1.00000          1.00000
>  1 0.23999         osd.1          up  1.00000          1.00000
>  2 0.23999         osd.2          up  1.00000          1.00000
> 13 0.23999         osd.13         up  1.00000          1.00000
> 19 0.23999         osd.19         up  1.00000          1.00000
> -3 1.19995     host ndesan02
>  3 0.23999         osd.3          up  1.00000          1.00000
>  4 0.23999         osd.4          up  1.00000          1.00000
>  5 0.23999         osd.5          up  1.00000          1.00000
>  9 0.23999         osd.9          up  1.00000          1.00000
> 10 0.23999         osd.10         up  1.00000          1.00000
> -4 1.19995     host nde32
>  6 0.23999         osd.6          up  1.00000          1.00000
>  7 0.23999         osd.7          up  1.00000          1.00000
>  8 0.23999         osd.8          up  1.00000          1.00000
> 12 0.23999         osd.12         up  1.00000          1.00000
> 11 0.23999         osd.11         up  1.00000          1.00000
> -5 4.54994     host ndesan03
> 14 0.90999         osd.14         up  1.00000          1.00000
> 15 0.90999         osd.15         up  1.00000          1.00000
> 16 0.90999         osd.16         up  1.00000          1.00000
> 17 0.90999         osd.17         up  1.00000          1.00000
> 18 0.90999         osd.18         up  1.00000          1.00000
>
> ceph@ndesan01:~/ceph-deploy> ceph osd df
> ID WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR
>  0 0.23999  1.00000 249G  210G  39851M 84.43 1.59
>  1 0.23999  1.00000 249G  206G  44424M 82.64 1.55
>  2 0.23999  1.00000 249G  200G  50084M 80.43 1.51
> 13 0.23999  1.00000 249G  200G  50520M 80.26 1.51
> 19 0.23999  1.00000 249G  187G  63837M 75.05 1.41
>  3 0.23999  1.00000 249G  205G  45945M 82.04 1.54
>  4 0.23999  1.00000 249G  205G  45841M 82.08 1.54
>  5 0.23999  1.00000 249G  206G  44026M 82.79 1.56
>  9 0.23999  1.00000 249G  198G  52348M 79.54 1.50
> 10 0.23999  1.00000 249G  189G  62027M 75.76 1.42
>  6 0.23999  1.00000 249G  191G  59968M 76.56 1.44
>  7 0.23999  1.00000 249G  197G  53410M 79.13 1.49
>  8 0.23999  1.00000 249G  207G  43856M 82.86 1.56
> 12 0.23999  1.00000 249G  199G  51695M 79.80 1.50
> 11 0.23999  1.00000 249G  216G  33766M 86.80 1.63
> 14 0.90999  1.00000 931G  285G    645G 30.70 0.58
> 15 0.90999  1.00000 931G  287G    643G 30.88 0.58
> 16 0.90999  1.00000 931G  297G    633G 32.01 0.60
> 17 0.90999  1.00000 931G  288G    643G 30.93 0.58
> 18 0.90999  1.00000 931G  284G    646G 30.60 0.58
>              TOTAL 8403G 4468G   3935G 53.17
> MIN/MAX VAR: 0.58/1.63  STDDEV: 26.41
>
> I know that all OSDs should have the same size, but that wasn't my
> decision ;-) And it will be changed in the near future.
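To put a number on the 4K sync/direct randwrite point made above, a fio
run along these lines would be a useful data point (the test file path
and size are just placeholders, pick whatever fits inside the VM), and
please always post the full command line together with the results:

  fio --name=4k-randwrite --filename=/root/fio.test --size=1G \
      --rw=randwrite --bs=4k --direct=1 --sync=1 \
      --iodepth=1 --numjobs=1 --runtime=60 --time_based --group_reporting

On a cluster like this, expect the IOPS figure from that run, not the
bandwidth, to be the number that hurts.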
>
> Some information on RAM:
>
> ---cut here---
> ceph@ndesan01:~/ceph-deploy> free -g
>              total  used  free  shared  buffers  cached
> Mem:            62    61     0       0        3      50
> -/+ buffers/cache:     8    53
> Swap:            0     0     0
>
> ndesan02:~ # free -g
>              total  used  free  shared  buffers  cached
> Mem:            62    61     0       3        2      50
> -/+ buffers/cache:     9    53
> Swap:            0     0     0
>
> nde32:~ # free -g
>              total  used  free  shared  buffers  cached
> Mem:            39    38     0       0        0      29
> -/+ buffers/cache:     8    30
> Swap:            1     1     0
>
> ndesan03:~ # free -g
>              total  used  free  shared  buffers  cached
> Mem:            62    62     0       0        0      48
> -/+ buffers/cache:    13    49
> Swap:            0     0     0
> ---cut here---
>
> > Since when does Xen support Ceph?
>
> I'm not sure since when exactly. We started testing Ceph as our
> OpenStack storage backend last year in April or May, according to my
> email history it's supported since libvirt 1.3.2 (we had a bug filed
> for this issue).
>
> ceph@ndesan01:~/ceph-deploy> cat ceph.conf
> [global]
> fsid = 655cb05a-435a-41ba-83d9-8549f7c36167
> mon_initial_members = ndesan01, ndesan02, nde32
> mon_host = 192.168.160.15,192.168.160.16,192.168.160.17
> public_network = 192.168.160.0/24
> osd_pool_default_size = 3
> osd pool default min size = 2
> osd_crush_chooseleaf_type = 1
> mon_pg_warn_max_per_osd = 0
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> filestore max inline xattr size = 512
> filestore max inline xattrs = 2
> rbd_default_features = 15
> mon data avail warn = 15
> mon data avail crit = 5
> osd_disk_thread_ioprio_class = "idle"
> osd_disk_thread_ioprio_priority = 7
> osd scrub begin hour = 0
> osd scrub end hour = 7
> osd_scrub_sleep = 0.1
> osd_deep_scrub_interval = 2419200
>
> [mon]
> mon osd allow primary affinity = true
>
> We changed the deep-scrub interval because we use a cronjob to execute
> deep scrubs in the evening, to avoid performance impact during working
> hours.
>
> Please let me know if I missed anything. I really appreciate you
> looking into this.
>
> Regards,
> Eugen
>
>
> Zitat von Christian Balzer <chibi@xxxxxxx>:
>
> > Hello,
> >
> > On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:
> >
> >> Hi,
> >>
> >> I haven't tracked the slow requests yet, but I ran some performance
> >> tests. Although I'm not an expert, I believe the results are quite
> >> unsatisfying.
> >>
> > You've told us absolutely nothing about your cluster that would be
> > relevant to determine if your results are actually to be expected or not.
> >
> > Anything, and I mean anything is relevant, from OS, kernel, Ceph versions
> > to a detailed description of ALL HW in there, CPU, RAM, SSD, HDD, network,
> > topology, etc.
> >
> >> I ran a couple of rados bench tests in a pool, with different
> >> replication sizes (1 to 3). The tests were executed on one of the ceph
> >> nodes, which also is a monitor.
> >>
> > Benchmarks, including rados bench run from a cluster node that also hosts
> > OSDs will give you skewed results.
> >
> >> Results for size 1:
> >>
> >> Total time run:         20.428947
> >> Total writes made:      639
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     125.117
> >>
> > This and the following results would be on par with a smallish, 1Gb/s
> > network HDD based cluster.
> > But we don't know if that's what you're running or not...
> >
> >> Results for size 2:
> >>
> >> Total time run:         20.741990
> >> Total writes made:      454
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     87.552
> >>
> >> Still looks nice, at least we would be totally fine with this throughput.
> >>
> > As mentioned countless times here, throughput is most likely your least
> > worry with Ceph, IOPS are more likely to be your bottleneck.
> >
> >> Results for size 3:
> >>
> >> Total time run:         22.934567
> >> Total writes made:      252
> >> Write size:             4194304
> >> Bandwidth (MB/sec):     43.951
> >>
> >> I guess this is still ok for a replication size of 3. But the real
> >> problem seems to be the client throughput.
> >>
> >> Running fio on a Xen-VM inside rbd:
> >>
> > Since when does Xen support Ceph?
> > As in, how are you running a Xen VM that uses Ceph?
> >
> >> Run status group 0 (all jobs):
> >>   WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
> >>   maxb=21776KB/s, mint=48151msec, maxt=48151msec
> >>
> >> And this is one of the better results, we often have values of 12MB/s
> >> or even lower, e.g. on a KVM-VM:
> >>
> >> Run status group 0 (all jobs):
> >>   WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
> >>   mint=79683msec, maxt=79683msec
> >>
> > Fio results w/o the full command line parameters are basically worthless,
> > I'm afraid.
> >
> >> This is unacceptable!
> >>
> >> I use almost all defaults in our ceph.conf,
> >
> > What did you change that could be relevant then?
> >
> >> I've been reading
> >> everything I could find on Google, but I don't understand why the
> >> clients are so slow. Is it a valid assumption that the slow requests
> >> could be caused by our slow clients?
> >>
> > Slow requests are typically caused when your cluster is not capable of
> > keeping up.
> >
> >> One of the things I tried was
> >>
> >> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
> >>
> > Only affects reads.
> >
> >> but this had no effect. I tried it with different numbers of PGs, but
> >> that also didn't have any effect.
> >>
> > Again, w/o knowing your cluster and these numbers, that information is of
> > little value.
> >
> >> What else can I do to increase the performance of my instances or to
> >> analyse the bottleneck?
> >>
> > Understand your HW/cluster.
> >
> > Monitor the nodes with atop etc, use collectd and graphite, etc, to keep
> > records of what happened when and to visualize things.
> >
> > Christian
> >
> >> Regards,
> >> Eugen
> >>
> >>
> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >>
> >> > Check the latency figures in a "perf dump". High numbers in a
> >> > particular area may help you nail it.
> >> >
> >> > I suspect though, that it may come down to enabling debug logging and
> >> > tracking a slow request through the logs.
> >> >
> >> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
> >> >> Hi,
> >> >>
> >> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight
> >> >>
> >> >> I waited for new slow request messages and dumped the historic_ops
> >> >> into a file.
> >> >> The reporting OSD shows lots of "waiting for rw locks" messages and a
> >> >> duration of more than 30 secs:
> >> >>
> >> >>         "age": 366.044746,
> >> >>         "duration": 32.491506,
> >> >>         "type_data": [
> >> >>             "commit sent; apply or cleanup",
> >> >>             {
> >> >>                 "client": "client.9664429",
> >> >>                 "tid": 130439910
> >> >>             },
> >> >>             [
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:15.227649",
> >> >>                     "event": "initiated"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:15.232310",
> >> >>                     "event": "reached_pg"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:15.232341",
> >> >>                     "event": "waiting for rw locks"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:15.268819",
> >> >>                     "event": "reached_pg"
> >> >>                 },
> >> >>                 [
> >> >>                 .
> >> >>                 .
> >> >>                 .
> >> >>                 ]
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:45.515055",
> >> >>                     "event": "waiting for rw locks"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.921095",
> >> >>                     "event": "reached_pg"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.921157",
> >> >>                     "event": "started"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.921342",
> >> >>                     "event": "waiting for subops from 9,15"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.921724",
> >> >>                     "event": "commit_queued_for_journal_write"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.922186",
> >> >>                     "event": "write_thread_in_journal_buffer"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.931103",
> >> >>                     "event": "sub_op_commit_rec"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:46.968730",
> >> >>                     "event": "sub_op_commit_rec"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:47.717770",
> >> >>                     "event": "journaled_completion_queued"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:47.718280",
> >> >>                     "event": "op_commit"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:47.718359",
> >> >>                     "event": "commit_sent"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:47.718890",
> >> >>                     "event": "op_applied"
> >> >>                 },
> >> >>                 {
> >> >>                     "time": "2017-01-12 10:38:47.719154",
> >> >>                     "event": "done"
> >> >>                 }
> >> >>
> >> >> There were about 70 "waiting for rw locks" events, so I truncated the
> >> >> output. Based on the message "waiting for subops from 9,15" I also
> >> >> dumped the historic_ops for these two OSDs.
> >> >>
> >> >> Duration on OSD.9
> >> >>
> >> >>     "initiated_at": "2017-01-12 10:38:29.258221",
> >> >>     "age": 54.069919,
> >> >>     "duration": 20.831568,
> >> >>
> >> >> Duration on OSD.15
> >> >>
> >> >>     "initiated_at": "2017-01-12 10:38:23.695098",
> >> >>     "age": 112.118210,
> >> >>     "duration": 26.452526,
> >> >>
> >> >> They also contain many "waiting for rw locks" messages, but not as many
> >> >> as the dump from the reporting OSD.
> >> >> To me it seems that because two OSDs take a lot of time to process their
> >> >> requests (only slightly less than 30 secs), it sums up to more than 30
> >> >> secs on the reporting (primary?) OSD. Is the reporting OSD always the
> >> >> primary?
> >> >>
> >> >> How can I debug this further? I searched the web for "waiting for rw
> >> >> locks", I also found Wido's blog [1] about my exact problem, but I'm not
> >> >> sure how to continue. Our admin says our network should be fine, but
> >> >> what can I do to rule that out?
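To answer that last question, since it is still open: apart from raw
throughput, link problems are easy to rule out by looking at the
error/drop counters on every node, for example along these lines
(interface and host names are just examples, substitute your own):

  ndesan01:~ # ip -s link show eth0          # RX/TX errors and drops should stay at 0
  ndesan01:~ # ethtool -S eth0 | egrep -i 'err|drop|fifo'
  ndesan01:~ # ping -c 1000 -i 0.01 ndesan02 # expect 0% loss and consistently low rtt

If those counters are clean on all nodes, the "waiting for rw locks"
entries point at OSDs that simply can't keep up rather than at a flaky
network.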
> >> >>
> >> >> I don't think I have provided information about our cluster yet:
> >> >>
> >> >> 4 nodes, 3 mons, 20 OSDs on
> >> >> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
> >> >>
> >> >> [1]
> >> >> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
> >> >>
> >> >> Thanks!
> >> >> Eugen
> >> >>
> >> >>
> >> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >> >>
> >> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:
> >> >>>>
> >> >>>> Hi,
> >> >>>>
> >> >>>> I simply grepped for "slow request" in ceph.log. What exactly do you
> >> >>>> mean by "effective OSD"?
> >> >>>>
> >> >>>> If I have this log line:
> >> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
> >> >>>> seconds old, received at 2017-01-11 [...]
> >> >>>> ack+ondisk+write+known_if_redirected e12440) currently waiting for
> >> >>>> subops from 0,12
> >> >>>>
> >> >>>> I assumed that osd.16 is the one causing problems.
> >> >>>
> >> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
> >> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
> >> >>> help to narrow down what the issue is.
> >> >>>
> >> >>>> But now that you mention the subops, I only noticed them yesterday,
> >> >>>> but didn't have the time yet to investigate further. I'll have a look
> >> >>>> into the subops messages and report back.
> >> >>>>
> >> >>>> Thanks!
> >> >>>> Eugen
> >> >>>>
> >> >>>> Zitat von Burkhard Linke
> >> >>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
> >> >>>>
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> just for clarity:
> >> >>>>>
> >> >>>>> Did you parse the slow request messages and use the effective OSD in
> >> >>>>> the statistics? Some messages may refer to other OSDs, e.g. "waiting
> >> >>>>> for sub op on OSD X,Y". The reporting OSD is not the root cause in
> >> >>>>> that case, but one of the mentioned OSDs (and I'm currently not aware
> >> >>>>> of a method to determine which of the two OSDs is the cause in the
> >> >>>>> case of 3 replicas...).
> >> >>>>>
> >> >>>>> Regards,
> >> >>>>>
> >> >>>>> Burkhard
> >> >>>>>
> >> >>>>>
> >> >>>>> _______________________________________________
> >> >>>>> ceph-users mailing list
> >> >>>>> ceph-users@xxxxxxxxxxxxxx
> >> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Eugen Block                             voice   : +49-40-559 51 75
> >> >>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> >>>> Postfach 61 03 15
> >> >>>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >> >>>>
> >> >>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> >>>> Sitz und Registergericht: Hamburg, HRB 90934
> >> >>>> Vorstand: Jens-U. Mozdzen
> >> >>>> USt-IdNr. DE 814 013 983
> >> >>>>
> >> >>>> _______________________________________________
> >> >>>> ceph-users mailing list
> >> >>>> ceph-users@xxxxxxxxxxxxxx
> >> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Cheers,
> >> >>> Brad
> >> >>
> >> >>
> >> >> --
> >> >> Eugen Block                             voice   : +49-40-559 51 75
> >> >> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> >> Postfach 61 03 15
> >> >> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >> >>
> >> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> >> Sitz und Registergericht: Hamburg, HRB 90934
> >> >> Vorstand: Jens-U. Mozdzen
> >> >> USt-IdNr. DE 814 013 983
> >> >
> >> >
> >> > --
> >> > Cheers,
> >> > Brad
> >>
> >>
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com