Hello,

On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:

> Hi,
>
> I haven't tracked the slow requests yet, but I ran some performance
> tests. Although I'm not an expert, I believe the results are quite
> unsatisfying.
>
You've told us absolutely nothing about your cluster that would be
relevant to determine if your results are actually to be expected or not.

Anything, and I mean anything, is relevant, from OS, kernel and Ceph
versions to a detailed description of ALL HW in there: CPU, RAM, SSD, HDD,
network, topology, etc.

> I ran a couple of rados bench tests in a pool, with different
> replication sizes (1 to 3). The tests were executed on one of the ceph
> nodes, which is also a monitor.
>
Benchmarks, including rados bench, run from a cluster node that also
hosts OSDs will give you skewed results.

> Results for size 1:
>
> Total time run:         20.428947
> Total writes made:      639
> Write size:             4194304
> Bandwidth (MB/sec):     125.117
>
This and the following results would be on par with a smallish, 1Gb/s
network, HDD-based cluster.
But we don't know if that's what you're running or not...

> Results for size 2:
>
> Total time run:         20.741990
> Total writes made:      454
> Write size:             4194304
> Bandwidth (MB/sec):     87.552
>
> Still looks nice, at least we would be totally fine with this throughput.
>
As mentioned countless times here, throughput is most likely your least
worry with Ceph; IOPS are more likely to be your bottleneck.

> Results for size 3:
>
> Total time run:         22.934567
> Total writes made:      252
> Write size:             4194304
> Bandwidth (MB/sec):     43.951
>
> I guess this is still ok for a replication size of 3. But the real
> problem seems to be the client throughput.
>
> Running fio on a Xen-VM inside rbd:
>
Since when does Xen support Ceph?
As in, how are you running a Xen VM that uses Ceph?

> Run status group 0 (all jobs):
>   WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
> maxb=21776KB/s, mint=48151msec, maxt=48151msec
>
> And this is one of the better results, we often have values of 12MB/s
> or even lower, e.g. on a KVM-VM:
>
> Run status group 0 (all jobs):
>   WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
> mint=79683msec, maxt=79683msec
>
Fio results w/o the full command line parameters are basically worthless,
I'm afraid.

> This is unacceptable!
>
> I use almost all defaults in our ceph.conf,
What did you change that could be relevant, then?

> I've been reading
> everything I could find on Google, but I don't understand why the
> clients are so slow. Is it a valid assumption that the slow requests
> could be caused by our slow clients?
>
Slow requests typically happen when your cluster is not capable of
keeping up.

> One of the things I tried was
>
> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
>
Only affects reads.

> but this had no effect. I tried it with different numbers of PGs, but
> that also didn't have any effect.
>
Again, w/o knowing your cluster and these numbers, that information is of
little value.

> What else can I do to increase the performance of my instances or to
> analyse the bottleneck?
>
Understand your HW/cluster.
Monitor the nodes with atop etc., use collectd, graphite, etc., to keep
records of what happened when and to visualize things.

Christian

> Regards,
> Eugen
>
>
> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
>
> > Check the latency figures in a "perf dump". High numbers in a
> > particular area may help you nail it.
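
For what it's worth, those latency figures come from the OSD admin socket
on the node hosting the OSD; a rough sketch, assuming the default socket
path and using osd.16 from the log example further down:

  # on the node hosting osd.16
  ceph daemon osd.16 perf dump
  # equivalent, addressing the admin socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok perf dump

Counters like journal_latency, op_w_latency and subop_w_latency (if
present in your version) are the usual suspects: high journal/op write
latency points at the local disks and journals, high subop latency at the
replicas or the network.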
> >
> > I suspect, though, that it may come down to enabling debug logging and
> > tracking a slow request through the logs.
> >
> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
> >> Hi,
> >>
> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight
> >>
> >>
> >> I waited for new slow request messages and dumped the historic_ops into
> >> a file. The reporting OSD shows lots of "waiting for rw locks" messages
> >> and a duration of more than 30 secs:
> >>
> >>         "age": 366.044746,
> >>         "duration": 32.491506,
> >>         "type_data": [
> >>             "commit sent; apply or cleanup",
> >>             {
> >>                 "client": "client.9664429",
> >>                 "tid": 130439910
> >>             },
> >>             [
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.227649",
> >>                     "event": "initiated"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.232310",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.232341",
> >>                     "event": "waiting for rw locks"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.268819",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 [
> >>                 .
> >>                 .
> >>                 .
> >>                 ]
> >>                 {
> >>                     "time": "2017-01-12 10:38:45.515055",
> >>                     "event": "waiting for rw locks"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921095",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921157",
> >>                     "event": "started"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921342",
> >>                     "event": "waiting for subops from 9,15"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921724",
> >>                     "event": "commit_queued_for_journal_write"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.922186",
> >>                     "event": "write_thread_in_journal_buffer"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.931103",
> >>                     "event": "sub_op_commit_rec"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.968730",
> >>                     "event": "sub_op_commit_rec"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.717770",
> >>                     "event": "journaled_completion_queued"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718280",
> >>                     "event": "op_commit"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718359",
> >>                     "event": "commit_sent"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718890",
> >>                     "event": "op_applied"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.719154",
> >>                     "event": "done"
> >>                 }
> >>
> >>
> >> There were about 70 "waiting for rw locks" events, I truncated the
> >> output. Based on the message "waiting for subops from 9,15" I also
> >> dumped the historic_ops for these two OSDs.
> >>
> >> Duration on OSD.9:
> >>
> >>         "initiated_at": "2017-01-12 10:38:29.258221",
> >>         "age": 54.069919,
> >>         "duration": 20.831568,
> >>
> >> Duration on OSD.15:
> >>
> >>         "initiated_at": "2017-01-12 10:38:23.695098",
> >>         "age": 112.118210,
> >>         "duration": 26.452526,
> >>
> >> They also contain many "waiting for rw locks" messages, but not as many
> >> as the dump from the reporting OSD.
> >> To me it seems that because two OSDs take a lot of time to process their
> >> requests (only slightly less than 30 secs), it adds up to more than 30
> >> secs on the reporting (primary?) OSD. Is the reporting OSD always the
> >> primary?
> >>
> >> How can I debug this further? I searched the web for "waiting for rw
> >> locks", I also found Wido's blog [1] about my exact problem, but I'm not
> >> sure how to continue. Our admin says our network should be fine, but
> >> what can I do to rule that out?
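
To actually rule the network out rather than take it on faith, measure it
directly between the OSD nodes; a rough sketch, assuming iperf is
available (node1, node2 and eth0 below are just placeholders, not names
from this thread):

  node1:~ # iperf -s
  node2:~ # iperf -c node1 -t 30 -P 4

and keep an eye on the NIC error/drop counters while the cluster is busy:

  node1:~ # ip -s link show dev eth0

Throughput well below line rate, or steadily growing error/drop counters,
would point at the network rather than the OSDs.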
> >>
> >> I don't think I have provided information about our cluster yet:
> >>
> >> 4 nodes, 3 mons, 20 OSDs on
> >> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
> >>
> >> [1]
> >> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
> >>
> >> Thanks!
> >> Eugen
> >>
> >>
> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >>
> >>
> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I simply grepped for "slow request" in ceph.log. What exactly do you
> >>>> mean by "effective OSD"?
> >>>>
> >>>> If I have this log line:
> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
> >>>> seconds old, received at 2017-01-11 [...]
> >>>> ack+ondisk+write+known_if_redirected e12440) currently waiting for
> >>>> subops from 0,12
> >>>>
> >>>> I assumed that osd.16 is the one causing problems.
> >>>
> >>>
> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
> >>> help to narrow down what the issue is.
> >>>
> >>>> But now that you mention the subops, I only noticed them yesterday,
> >>>> but didn't have the time yet to investigate further. I'll have a look
> >>>> into the subops messages and report back.
> >>>>
> >>>> Thanks!
> >>>> Eugen
> >>>>
> >>>> Zitat von Burkhard Linke
> >>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
> >>>>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>>
> >>>>> just for clarity:
> >>>>>
> >>>>>
> >>>>> Did you parse the slow request messages and use the effective OSD in
> >>>>> the statistics? Some messages may refer to other OSDs, e.g. "waiting
> >>>>> for sub op on OSD X,Y". The reporting OSD is not the root cause in
> >>>>> that case, but one of the mentioned OSDs (and I'm currently not aware
> >>>>> of a method to determine which of the two OSDs is the cause in the
> >>>>> case of 3 replicas...).
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Burkhard
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Eugen Block                             voice   : +49-40-559 51 75
> >>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >>>> Postfach 61 03 15
> >>>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >>>>
> >>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >>>> Sitz und Registergericht: Hamburg, HRB 90934
> >>>> Vorstand: Jens-U. Mozdzen
> >>>> USt-IdNr. DE 814 013 983
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@xxxxxxxxxxxxxx
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>>
> >>> --
> >>> Cheers,
> >>> Brad
> >>
> >>
> >>
> >>
> >> --
> >> Eugen Block                             voice   : +49-40-559 51 75
> >> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> Postfach 61 03 15
> >> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >>
> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> Sitz und Registergericht: Hamburg, HRB 90934
> >> Vorstand: Jens-U. Mozdzen
> >> USt-IdNr. DE 814 013 983
> >>
> >
> >
> > --
> > Cheers,
> > Brad
> >
>
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com