Hello,

On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:

> Hi,
>
> I haven't tracked the slow requests yet, but I ran some performance
> tests. Although I'm not an expert, I believe the results are quite
> unsatisfying.
>
You've told us absolutely nothing about your cluster that would be
relevant to determine if your results are actually to be expected or not.

Anything, and I mean anything, is relevant, from OS, kernel and Ceph
versions to a detailed description of ALL HW in there: CPU, RAM, SSD, HDD,
network, topology, etc.

> I ran a couple of rados bench tests in a pool, with different
> replication sizes (1 to 3). The tests were executed on one of the ceph
> nodes, which is also a monitor.
>
Benchmarks, including rados bench, run from a cluster node that also
hosts OSDs will give you skewed results.

> Results for size 1:
>
> Total time run:         20.428947
> Total writes made:      639
> Write size:             4194304
> Bandwidth (MB/sec):     125.117
>
This and the following results would be on par with a smallish, 1Gb/s
network, HDD-based cluster.
But we don't know if that's what you're running or not...

> Results for size 2:
>
> Total time run:         20.741990
> Total writes made:      454
> Write size:             4194304
> Bandwidth (MB/sec):     87.552
>
> Still looks nice, at least we would be totally fine with this throughput.
>
As mentioned countless times here, throughput is most likely your least
worry with Ceph; IOPS are more likely to be your bottleneck.

> Results for size 3:
>
> Total time run:         22.934567
> Total writes made:      252
> Write size:             4194304
> Bandwidth (MB/sec):     43.951
>
> I guess this is still ok for a replication size of 3. But the real
> problem seems to be the client throughput.
>
> Running fio on a Xen-VM inside rbd:
>
Since when does Xen support Ceph?
As in, how are you running a Xen VM that uses Ceph?

> Run status group 0 (all jobs):
>   WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
> maxb=21776KB/s, mint=48151msec, maxt=48151msec
>
> And this is one of the better results, we often have values of 12MB/s
> or even lower, e.g. on a KVM-VM:
>
> Run status group 0 (all jobs):
>   WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
> mint=79683msec, maxt=79683msec
>
Fio results w/o the full command line parameters are basically worthless,
I'm afraid.

> This is unacceptable!
>
> I use almost all defaults in our ceph.conf,
What did you change that could be relevant, then?

> I've been reading
> everything I could find on Google, but I don't understand why the
> clients are so slow. Is it a valid assumption that the slow requests
> could be caused by our slow clients?
>
Slow requests typically happen when your cluster is not capable of
keeping up.

> One of the things I tried was
>
> vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb
>
Only affects reads.

> but this had no effect. I tried it with different numbers of PGs, but
> that also didn't have any effect.
>
Again, w/o knowing your cluster and these numbers, that information is of
little value.

> What else can I do to increase the performance of my instances or to
> analyse the bottleneck?
>
Understand your HW/cluster.
Monitor the nodes with atop etc., use collectd, graphite, etc., to keep
records of what happened when and to visualize things.

Christian

> Regards,
> Eugen
>
>
> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
>
> > Check the latency figures in a "perf dump". High numbers in a
> > particular area may help you nail it.
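
For what it's worth, those latency figures come from the OSD admin socket
on the node hosting the OSD; a rough sketch, assuming the default socket
path and using osd.16 from the log example further down:

  # on the node hosting osd.16
  ceph daemon osd.16 perf dump
  # equivalent, addressing the admin socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.16.asok perf dump

Counters like journal_latency, op_w_latency and subop_w_latency (if
present in your version) are the usual suspects: high journal/op write
latency points at the local disks and journals, high subop latency at the
replicas or the network.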
> >
> > I suspect, though, that it may come down to enabling debug logging and
> > tracking a slow request through the logs.
> >
> > On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
> >> Hi,
> >>
> >>> Looking at the output of dump_historic_ops and dump_ops_in_flight
> >>
> >>
> >> I waited for new slow request messages and dumped the historic_ops into
> >> a file. The reporting OSD shows lots of "waiting for rw locks" messages
> >> and a duration of more than 30 secs:
> >>
> >>         "age": 366.044746,
> >>         "duration": 32.491506,
> >>         "type_data": [
> >>             "commit sent; apply or cleanup",
> >>             {
> >>                 "client": "client.9664429",
> >>                 "tid": 130439910
> >>             },
> >>             [
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.227649",
> >>                     "event": "initiated"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.232310",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.232341",
> >>                     "event": "waiting for rw locks"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:15.268819",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 [
> >>                 .
> >>                 .
> >>                 .
> >>                 ]
> >>                 {
> >>                     "time": "2017-01-12 10:38:45.515055",
> >>                     "event": "waiting for rw locks"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921095",
> >>                     "event": "reached_pg"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921157",
> >>                     "event": "started"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921342",
> >>                     "event": "waiting for subops from 9,15"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.921724",
> >>                     "event": "commit_queued_for_journal_write"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.922186",
> >>                     "event": "write_thread_in_journal_buffer"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.931103",
> >>                     "event": "sub_op_commit_rec"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:46.968730",
> >>                     "event": "sub_op_commit_rec"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.717770",
> >>                     "event": "journaled_completion_queued"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718280",
> >>                     "event": "op_commit"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718359",
> >>                     "event": "commit_sent"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.718890",
> >>                     "event": "op_applied"
> >>                 },
> >>                 {
> >>                     "time": "2017-01-12 10:38:47.719154",
> >>                     "event": "done"
> >>                 }
> >>
> >>
> >> There were about 70 "waiting for rw locks" events, I truncated the
> >> output. Based on the message "waiting for subops from 9,15" I also
> >> dumped the historic_ops for these two OSDs.
> >>
> >> Duration on OSD.9:
> >>
> >>         "initiated_at": "2017-01-12 10:38:29.258221",
> >>         "age": 54.069919,
> >>         "duration": 20.831568,
> >>
> >> Duration on OSD.15:
> >>
> >>         "initiated_at": "2017-01-12 10:38:23.695098",
> >>         "age": 112.118210,
> >>         "duration": 26.452526,
> >>
> >> They also contain many "waiting for rw locks" messages, but not as many
> >> as the dump from the reporting OSD.
> >> To me it seems that because two OSDs take a lot of time to process their
> >> requests (only slightly less than 30 secs), it adds up to more than 30
> >> secs on the reporting (primary?) OSD. Is the reporting OSD always the
> >> primary?
> >>
> >> How can I debug this further? I searched the web for "waiting for rw
> >> locks", I also found Wido's blog [1] about my exact problem, but I'm not
> >> sure how to continue. Our admin says our network should be fine, but
> >> what can I do to rule that out?
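
To actually rule the network out rather than take it on faith, measure it
directly between the OSD nodes; a rough sketch, assuming iperf is
available (node1, node2 and eth0 below are just placeholders, not names
from this thread):

  node1:~ # iperf -s
  node2:~ # iperf -c node1 -t 30 -P 4

and keep an eye on the NIC error/drop counters while the cluster is busy:

  node1:~ # ip -s link show dev eth0

Throughput well below line rate, or steadily growing error/drop counters,
would point at the network rather than the OSDs.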
> >>
> >> I don't think I have provided information about our cluster yet:
> >>
> >> 4 nodes, 3 mons, 20 OSDs on
> >> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
> >>
> >> [1]
> >> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
> >>
> >> Thanks!
> >> Eugen
> >>
> >>
> >> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
> >>
> >>
> >>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I simply grepped for "slow request" in ceph.log. What exactly do you
> >>>> mean by "effective OSD"?
> >>>>
> >>>> If I have this log line:
> >>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
> >>>> seconds old, received at 2017-01-11 [...]
> >>>> ack+ondisk+write+known_if_redirected e12440) currently waiting for
> >>>> subops from 0,12
> >>>>
> >>>> I assumed that osd.16 is the one causing problems.
> >>>
> >>>
> >>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
> >>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
> >>> help to narrow down what the issue is.
> >>>
> >>>> But now that you mention the subops, I only noticed them yesterday,
> >>>> but didn't have the time yet to investigate further. I'll have a look
> >>>> into the subops messages and report back.
> >>>>
> >>>> Thanks!
> >>>> Eugen
> >>>>
> >>>> Zitat von Burkhard Linke
> >>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
> >>>>
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>>
> >>>>> just for clarity:
> >>>>>
> >>>>>
> >>>>> Did you parse the slow request messages and use the effective OSD in
> >>>>> the statistics? Some messages may refer to other OSDs, e.g. "waiting
> >>>>> for sub op on OSD X,Y". The reporting OSD is not the root cause in
> >>>>> that case, but one of the mentioned OSDs (and I'm currently not aware
> >>>>> of a method to determine which of the two OSDs is the cause in the
> >>>>> case of 3 replicas...).
> >>>>>
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Burkhard
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> ceph-users mailing list
> >>>>> ceph-users@xxxxxxxxxxxxxx
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Eugen Block                             voice   : +49-40-559 51 75
> >>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >>>> Postfach 61 03 15
> >>>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >>>>
> >>>> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >>>> Sitz und Registergericht: Hamburg, HRB 90934
> >>>> Vorstand: Jens-U. Mozdzen
> >>>> USt-IdNr. DE 814 013 983
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list
> >>>> ceph-users@xxxxxxxxxxxxxx
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>
> >>>
> >>>
> >>> --
> >>> Cheers,
> >>> Brad
> >>
> >>
> >>
> >>
> >> --
> >> Eugen Block                             voice   : +49-40-559 51 75
> >> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
> >> Postfach 61 03 15
> >> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
> >>
> >> Vorsitzende des Aufsichtsrates: Angelika Mozdzen
> >> Sitz und Registergericht: Hamburg, HRB 90934
> >> Vorstand: Jens-U. Mozdzen
> >> USt-IdNr. DE 814 013 983
> >>
> >
> >
> > --
> > Cheers,
> > Brad
> >
>
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com