Re: slow requests break performance

Hi,

I haven't tracked down the slow requests yet, but I ran some performance tests. Although I'm not an expert, I believe the results are quite unsatisfactory.

I ran a couple of rados bench tests in a pool with different replication sizes (1 to 3). The tests were executed on one of the ceph nodes, which is also a monitor.

Results for size 1:

Total time run:         20.428947
Total writes made:      639
Write size:             4194304
Bandwidth (MB/sec):     125.117

Results for size 2:

Total time run:         20.741990
Total writes made:      454
Write size:             4194304
Bandwidth (MB/sec):     87.552

This still looks nice; we would be totally fine with this throughput.

Results for size 3:

Total time run:         22.934567
Total writes made:      252
Write size:             4194304
Bandwidth (MB/sec):     43.951
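As a sanity check on those numbers: rados bench's bandwidth is just total writes times the 4 MB write size, divided by the runtime, so the three runs are internally consistent:

```python
# Recompute the rados bench bandwidth figures: writes * 4 MB / runtime.
# (Write size 4194304 bytes = 4 MB.)
runs = [
    (20.428947, 639),  # size 1
    (20.741990, 454),  # size 2
    (22.934567, 252),  # size 3
]
for runtime, writes in runs:
    print(f"{writes * 4 / runtime:.3f} MB/s")
# -> 125.117, 87.552 and 43.951 MB/s, matching the reported figures
```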

I guess this is still ok for a replication size of 3. But the real problem seems to be the client throughput.

Running fio on a Xen-VM inside rbd:

Run status group 0 (all jobs):
WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s, maxb=21776KB/s, mint=48151msec, maxt=48151msec

And this is one of the better results; we often see values of 12 MB/s or even lower, e.g. on a KVM-VM:

Run status group 0 (all jobs):
WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s, mint=79683msec, maxt=79683msec

This is unacceptable!
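For comparison with the rados bench numbers: fio's aggrb is just total io divided by runtime, reported here in KB/s, so the KVM run works out to roughly 6 MB/s:

```python
# fio's aggrb is total io / runtime (KB/s); convert the KVM run above
# to MB/s for comparison with the rados bench results.
io_kb = 512000        # io=512000KB
runtime_s = 79.683    # maxt=79683msec
aggrb_kb = io_kb / runtime_s
print(f"{aggrb_kb:.0f} KB/s = {aggrb_kb / 1024:.1f} MB/s")
# -> 6425 KB/s = 6.3 MB/s
```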

I use almost all defaults in our ceph.conf and I've been reading everything I could find on Google, but I don't understand why the clients are so slow. Is it a valid assumption that the slow requests could be caused by our slow clients?

One of the things I tried was

vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb

but this had no effect. I also tried different numbers of PGs, which didn't make any difference either.

What else can I do to increase the performance of my instances or to analyse the bottleneck?

Regards,
Eugen


Quoting Brad Hubbard <bhubbard@xxxxxxxxxx>:

Check the latency figures in a "perf dump". High numbers in a
particular area may help you nail it.

I suspect though, that it may come down to enabling debug logging and
tracking a slow request through the logs.
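For reference, the latency counters in a perf dump come out as {avgcount, sum} pairs, at least on hammer; a minimal sketch of turning them into averages (the counter names and sample values below are illustrative):

```python
import json

# Turn perf dump latency counters ({"avgcount": N, "sum": seconds}) into
# average latencies. Feed it the output of: ceph daemon osd.N perf dump
sample = json.loads("""
{"osd": {"op_w_latency":         {"avgcount": 1000, "sum": 250.0},
         "op_w_process_latency": {"avgcount": 1000, "sum": 50.0}}}
""")

for name, counter in sample["osd"].items():
    if isinstance(counter, dict) and counter.get("avgcount"):
        avg_ms = counter["sum"] / counter["avgcount"] * 1000
        print(f"{name}: {avg_ms:.1f} ms avg")
```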

On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
Hi,

Looking at the output of dump_historic_ops and dump_ops_in_flight


I waited for new slow request messages and dumped the historic_ops into a
file. The reporting OSD shows lots of "waiting for rw locks" messages and a
duration of more than 30 secs:

             "age": 366.044746,
            "duration": 32.491506,
            "type_data": [
                "commit sent; apply or cleanup",
                {
                    "client": "client.9664429",
                    "tid": 130439910
                },
                [
                    {
                        "time": "2017-01-12 10:38:15.227649",
                        "event": "initiated"
                    },
                    {
                        "time": "2017-01-12 10:38:15.232310",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-01-12 10:38:15.232341",
                        "event": "waiting for rw locks"
                    },
                    {
                        "time": "2017-01-12 10:38:15.268819",
                        "event": "reached_pg"
                    },
                    [...]
                    {
                        "time": "2017-01-12 10:38:45.515055",
                        "event": "waiting for rw locks"
                    },
                    {
                        "time": "2017-01-12 10:38:46.921095",
                        "event": "reached_pg"
                    },
                    {
                        "time": "2017-01-12 10:38:46.921157",
                        "event": "started"
                    },
                    {
                        "time": "2017-01-12 10:38:46.921342",
                        "event": "waiting for subops from 9,15"
                    },
                    {
                        "time": "2017-01-12 10:38:46.921724",
                        "event": "commit_queued_for_journal_write"
                    },
                    {
                        "time": "2017-01-12 10:38:46.922186",
                        "event": "write_thread_in_journal_buffer"
                    },
                    {
                        "time": "2017-01-12 10:38:46.931103",
                        "event": "sub_op_commit_rec"
                    },
                    {
                        "time": "2017-01-12 10:38:46.968730",
                        "event": "sub_op_commit_rec"
                    },
                    {
                        "time": "2017-01-12 10:38:47.717770",
                        "event": "journaled_completion_queued"
                    },
                    {
                        "time": "2017-01-12 10:38:47.718280",
                        "event": "op_commit"
                    },
                    {
                        "time": "2017-01-12 10:38:47.718359",
                        "event": "commit_sent"
                    },
                    {
                        "time": "2017-01-12 10:38:47.718890",
                        "event": "op_applied"
                    },
                    {
                        "time": "2017-01-12 10:38:47.719154",
                        "event": "done"
                    }


There were about 70 "waiting for rw locks" events, so I truncated the output.
Based on the message "waiting for subops from 9,15" I also dumped the
historic_ops for those two OSDs.

Duration on OSD.9

            "initiated_at": "2017-01-12 10:38:29.258221",
            "age": 54.069919,
            "duration": 20.831568,

Duration on OSD.15

            "initiated_at": "2017-01-12 10:38:23.695098",
            "age": 112.118210,
            "duration": 26.452526,

They also contain many "waiting for rw locks" messages, but not as many as
the dump from the reporting OSD.
To me it seems that because these two OSDs each take a long time to process
their requests (only slightly less than 30 secs each), the delays add up to
more than 30 secs on the reporting (primary?) OSD. Is the reporting OSD
always the primary?
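Incidentally, the event timestamps in dump_historic_ops can be diffed directly to see where the time goes; a small sketch using three of the events quoted above:

```python
from datetime import datetime

# Diff consecutive dump_historic_ops events to see where time is spent.
FMT = "%Y-%m-%d %H:%M:%S.%f"
events = [
    ("2017-01-12 10:38:15.227649", "initiated"),
    ("2017-01-12 10:38:46.921095", "reached_pg"),  # after ~70 rw-lock waits
    ("2017-01-12 10:38:47.719154", "done"),
]
prev = None
for ts, name in events:
    t = datetime.strptime(ts, FMT)
    if prev is not None:
        print(f"+{(t - prev).total_seconds():.3f}s  {name}")
    prev = t
# -> +31.693s reached_pg, +0.798s done; i.e. the reported "duration" of
#    32.49 secs is almost entirely spent waiting for rw locks
```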

How can I debug this further? I searched the web for "waiting for rw locks"
and found Wido's blog [1] about my exact problem, but I'm not sure how to
continue. Our admin says our network should be fine, but what can I do to
rule that out?

I don't think I have provided information about our cluster yet:

4 nodes, 3 mons, 20 OSDs on
ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)

[1]
https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/

Thanks!
Eugen


Quoting Brad Hubbard <bhubbard@xxxxxxxxxx>:


On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:

Hi,

I simply grepped for "slow request" in ceph.log. What exactly do you mean by
"effective OSD"?

If I have this log line:
2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141 seconds
old, received at 2017-01-11 [...] ack+ondisk+write+known_if_redirected
e12440) currently waiting for subops from 0,12

I assumed that osd.16 is the one causing problems.


0 and/or 12 in this case. 16 is just waiting. Looking at the output of
dump_historic_ops and dump_ops_in_flight on the implicated osds may
help to narrow down what the issue is.
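Following up on that, a quick way to tally which OSDs actually appear in those "waiting for subops" warnings across the whole ceph.log (the log line below is abridged):

```python
import re
from collections import Counter

# Count how often each OSD is named in "waiting for subops from X,Y"
# slow-request warnings; those OSDs, not the reporting one, are the
# likely culprits. In practice, iterate over every line of ceph.log.
log_lines = [
    "2017-01-11 ... osd.16 ... [WRN] slow request 32.868141 seconds old, "
    "... currently waiting for subops from 0,12",
]

waiting_on = Counter()
for line in log_lines:
    for m in re.finditer(r"waiting for subops from ([\d,]+)", line):
        waiting_on.update(int(o) for o in m.group(1).split(","))

print(waiting_on.most_common())  # -> [(0, 1), (12, 1)]
```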

But now that you mention the subops: I only noticed them yesterday and
didn't have time to investigate further. I'll look into the subops messages
and report back.

Thanks!
Eugen

Quoting Burkhard Linke
<Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:


Hi,


just for clarity:


Did you parse the slow request messages and use the effective OSD in the
statistics? Some messages may refer to other OSDs, e.g. "waiting for sub op
on OSD X,Y". The reporting OSD is not the root cause in that case, but one
of the mentioned OSDs (and I'm currently not aware of a method to determine
which of those OSDs is the cause in the case of 3 replicas...).


Regards,

Burkhard


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--
Eugen Block                             voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail  : eblock@xxxxxx

        Vorsitzende des Aufsichtsrates: Angelika Mozdzen
          Sitz und Registergericht: Hamburg, HRB 90934
                  Vorstand: Jens-U. Mozdzen
                   USt-IdNr. DE 814 013 983





--
Cheers,
Brad








--
Cheers,
Brad






