Re: slow requests break performance

You've told us absolutely nothing about your cluster

You're right; I'll try to provide as much information as possible.

Please note that we have kind of a "special" layout: the HDDs on ndesan01 are in a RAID6 array, which is bcached by two SSDs in a RAID1:

---cut here---
ndesan01: # cat /proc/mdstat
Personalities : [raid1] [raid6] [raid5] [raid4]  
md126 : active raid6 sdd1[3] sde1[4] sdf1[5] sda1[0] sdc1[2] sdb1[1]
     3906521088 blocks super 1.2 level 6, 512k chunk, algorithm 2 [6/6] [UUUUUU]
      bitmap: 6/8 pages [24KB], 65536KB chunk

md127 : active raid1 sdh1[1] sdg1[0]
      195228864 blocks super 1.2 [2/2] [UU]
      bitmap: 2/2 pages [8KB], 65536KB chunk

unused devices: <none>
ndesan01: # pvs
  PV           VG       Fmt  Attr PSize PFree   
  /dev/bcache0 ndesan01 lvm2 a--  3.64t 427.46g
ndesan01: # lvs | grep osd
  cephosd0             ndesan01 -wi-ao---- 250.00g
  cephosd1             ndesan01 -wi-ao---- 250.00g
  cephosd2             ndesan01 -wi-ao---- 250.00g
  cephosd3             ndesan01 -wi-ao---- 250.00g
  cephosd4             ndesan01 -wi-ao---- 250.00g
---cut here---
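
Since bcache is involved on ndesan01, ndesan02 and ndesan03: in case the cache mode matters for the analysis (writeback vs. writethrough makes quite a difference for small writes), it can be checked via sysfs. A generic sketch, not pasted from our systems:

---cut here---
# check the active cache mode of each bcache device:
for dev in /sys/block/bcache*/bcache; do
    echo "$dev: $(cat "$dev"/cache_mode)"
done
# the active mode is shown in brackets, e.g. "writethrough [writeback] writearound none"
---cut here---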

ndesan02 has the same layout. nde32 is not bcached:

nde32:~ #  lvs | grep osd
  cephosd0             ndesan02     -wi-ao---- 250,00g
  cephosd1             ndesan02     -wi-ao---- 250,00g
  cephosd2             ndesan02     -wi-ao---- 250,00g
  cephosd3             ndesan02     -wi-ao---- 250,00g
  cephosd4             ndesan02     -wi-ao---- 250,00g

ndesan03 is a dedicated ceph node and is also bcached:

---cut here---
ndesan03:~ #  df -hl | grep osd
/dev/bcache0                     932G  286G  646G  31% /var/ceph/osd0
/dev/bcache1                     932G  288G  644G  31% /var/ceph/osd1
/dev/bcache2                     932G  298G  634G  33% /var/ceph/osd2
/dev/bcache3                     932G  289G  644G  31% /var/ceph/osd3
/dev/bcache4                     932G  285G  647G  31% /var/ceph/osd4
---cut here---

This mixed setup is due to the pre-existing infrastructure on these servers; we planned to migrate the environment into Ceph step by step.

ceph@ndesan01:~/ceph-deploy>  ceph --version
ceph version 0.94.7-84-g8e6f430

All 4 nodes have the same OS:
ceph@ndesan01:~/ceph-deploy> cat /etc/os-release
NAME="openSUSE Leap"
VERSION="42.1"

ndesan01, ndesan02 and nde32 have the same kernel:
ceph@ndesan01:~/ceph-deploy>  uname -r
4.1.13-5-default

ndesan03:~ #  uname -r
4.1.12-1-default

It's a 1 Gb/s network.
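
If it helps to rule out the network, I could run a quick throughput test between the nodes; a minimal sketch, assuming iperf3 is available on the hosts:

---cut here---
# on one node, e.g. ndesan03:
iperf3 -s

# on another node, e.g. ndesan01 (30-second TCP test against ndesan03):
iperf3 -c ndesan03 -t 30
---cut here---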

ndesan01: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)

ndesan02: 2 x Intel(R) Xeon(R) CPU E5-2609 v3 @ 1.90GHz (each 6 cores)

nde32: 2 x Intel(R) Xeon(R) CPU E5410  @ 2.33GHz (each 4 cores)

ndesan03: 1 x Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz (4 cores)

ceph@ndesan01:~/ceph-deploy> ceph osd tree
ID WEIGHT  TYPE NAME         UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.14980 root default
-2 1.19995     host ndesan01
 0 0.23999         osd.0          up  1.00000          1.00000
 1 0.23999         osd.1          up  1.00000          1.00000
 2 0.23999         osd.2          up  1.00000          1.00000
13 0.23999         osd.13         up  1.00000          1.00000
19 0.23999         osd.19         up  1.00000          1.00000
-3 1.19995     host ndesan02
 3 0.23999         osd.3          up  1.00000          1.00000
 4 0.23999         osd.4          up  1.00000          1.00000
 5 0.23999         osd.5          up  1.00000          1.00000
 9 0.23999         osd.9          up  1.00000          1.00000
10 0.23999         osd.10         up  1.00000          1.00000
-4 1.19995     host nde32
 6 0.23999         osd.6          up  1.00000          1.00000
 7 0.23999         osd.7          up  1.00000          1.00000
 8 0.23999         osd.8          up  1.00000          1.00000
12 0.23999         osd.12         up  1.00000          1.00000
11 0.23999         osd.11         up  1.00000          1.00000
-5 4.54994     host ndesan03
14 0.90999         osd.14         up  1.00000          1.00000
15 0.90999         osd.15         up  1.00000          1.00000
16 0.90999         osd.16         up  1.00000          1.00000
17 0.90999         osd.17         up  1.00000          1.00000
18 0.90999         osd.18         up  1.00000          1.00000

ceph@ndesan01:~/ceph-deploy> ceph osd df
ID WEIGHT  REWEIGHT SIZE  USE   AVAIL  %USE  VAR
 0 0.23999  1.00000  249G  210G 39851M 84.43 1.59
 1 0.23999  1.00000  249G  206G 44424M 82.64 1.55
 2 0.23999  1.00000  249G  200G 50084M 80.43 1.51
13 0.23999  1.00000  249G  200G 50520M 80.26 1.51
19 0.23999  1.00000  249G  187G 63837M 75.05 1.41
 3 0.23999  1.00000  249G  205G 45945M 82.04 1.54
 4 0.23999  1.00000  249G  205G 45841M 82.08 1.54
 5 0.23999  1.00000  249G  206G 44026M 82.79 1.56
 9 0.23999  1.00000  249G  198G 52348M 79.54 1.50
10 0.23999  1.00000  249G  189G 62027M 75.76 1.42
 6 0.23999  1.00000  249G  191G 59968M 76.56 1.44
 7 0.23999  1.00000  249G  197G 53410M 79.13 1.49
 8 0.23999  1.00000  249G  207G 43856M 82.86 1.56
12 0.23999  1.00000  249G  199G 51695M 79.80 1.50
11 0.23999  1.00000  249G  216G 33766M 86.80 1.63
14 0.90999  1.00000  931G  285G   645G 30.70 0.58
15 0.90999  1.00000  931G  287G   643G 30.88 0.58
16 0.90999  1.00000  931G  297G   633G 32.01 0.60
17 0.90999  1.00000  931G  288G   643G 30.93 0.58
18 0.90999  1.00000  931G  284G   646G 30.60 0.58
              TOTAL 8403G 4468G  3935G 53.17
MIN/MAX VAR: 0.58/1.63  STDDEV: 26.41

I know that all OSDs should have the same size, but that wasn't my decision ;-) And it will be changed in the near future.
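
When that happens, I assume the procedure per OSD will be roughly along these lines (purely hypothetical sizes and weights; the exact steps depend on the filesystem and mount layout):

---cut here---
# hypothetical example for one filestore OSD on ndesan01:
lvextend -L 500G /dev/ndesan01/cephosd0      # grow the LV (size made up)
# ...then grow the filesystem on it (xfs_growfs or resize2fs, depending on the FS)...
ceph osd crush reweight osd.0 0.45           # adjust the CRUSH weight to the new size (in TiB)
---cut here---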

Some information on RAM:

---cut here---
ceph@ndesan01:~/ceph-deploy> free -g
             total       used       free     shared    buffers     cached
Mem:            62         61          0          0          3         50
-/+ buffers/cache:          8         53
Swap:            0          0          0

ndesan02:~ # free -g
             total       used       free     shared    buffers     cached
Mem:            62         61          0          3          2         50
-/+ buffers/cache:          9         53
Swap:            0          0          0

nde32:~ # free -g
             total       used       free     shared    buffers     cached
Mem:            39         38          0          0          0         29
-/+ buffers/cache:          8         30
Swap:            1          1          0

ndesan03:~ # free -g
             total       used       free     shared    buffers     cached
Mem:            62         62          0          0          0         48
-/+ buffers/cache:         13         49
Swap:            0          0          0
---cut here---

Since when does Xen support Ceph?

I'm not sure since when exactly. We started testing Ceph as our OpenStack storage backend last year in April or May; according to my email history, it has been supported since libvirt 1.3.2 (we had a bug filed for this issue).
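
For reference, this is roughly how I would verify it on a Xen host (just a sketch, not output from our systems; vm1 stands for any of our guests):

---cut here---
# libvirt needs to be >= 1.3.2 for rbd disks with the libxl driver:
virsh version

# check that the guest disk is really defined as an rbd network disk:
virsh dumpxml vm1 | grep -A3 "protocol='rbd'"
---cut here---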

ceph@ndesan01:~/ceph-deploy> cat ceph.conf
[global]
fsid = 655cb05a-435a-41ba-83d9-8549f7c36167
mon_initial_members = ndesan01, ndesan02, nde32
mon_host = 192.168.160.15,192.168.160.16,192.168.160.17
public_network = 192.168.160.0/24
osd_pool_default_size = 3
osd pool default min size = 2
osd_crush_chooseleaf_type = 1
mon_pg_warn_max_per_osd = 0
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
filestore max inline xattr size = 512
filestore max inline xattrs = 2
rbd_default_features = 15
mon data avail warn = 15
mon data avail crit = 5
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7
osd scrub begin hour = 0
osd scrub end hour = 7
osd_scrub_sleep = 0.1
osd_deep_scrub_interval = 2419200

[mon]
mon osd allow primary affinity = true
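
In case it's useful, the effective values can also be double-checked on a running OSD via the admin socket, for example (run on the host where the OSD lives):

---cut here---
# e.g. on ndesan01:
ceph daemon osd.0 config show | egrep "scrub|disk_thread_ioprio"
---cut here---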

We changed the deep-scrub interval because we use a cron job to run deep scrubs in the evening, to avoid performance impact during working hours.
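
For completeness, such a cron job can be as simple as iterating over the OSDs; a simplified sketch, not our exact script:

---cut here---
# /etc/cron.d/ceph-deep-scrub (sketch): deep-scrub all OSDs at 20:00 on weekdays
0 20 * * 1-5  root  for osd in $(ceph osd ls); do ceph osd deep-scrub "$osd"; done
---cut here---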

Please let me know if I missed anything. I really appreciate you looking into this.

Regards,
Eugen


Zitat von Christian Balzer <chibi@xxxxxxx>:

Hello,

On Wed, 01 Feb 2017 11:43:02 +0100 Eugen Block wrote:

Hi,

I haven't tracked the slow requests yet, but I ran some performance
tests. Although I'm not an expert, I believe the results are quite
unsatisfactory.

You've told us absolutely nothing about your cluster that would be
relevant to determine if your results are actually to be expected or not.

Anything, and I mean anything, is relevant, from OS, kernel and Ceph versions
to a detailed description of ALL the HW in there: CPU, RAM, SSD, HDD, network,
topology, etc.

I ran a couple of rados bench tests in a pool, with different
replication sizes (1 to 3). The tests were executed on one of the ceph
nodes, which also is a monitor.

Benchmarks (including rados bench) run from a cluster node that also hosts
OSDs will give you skewed results.

Results for size 1:

Total time run:         20.428947
Total writes made:      639
Write size:             4194304
Bandwidth (MB/sec):     125.117

This and the following results would be on par with a smallish, HDD-based
cluster on a 1 Gb/s network.
But we don't know if that's what you're running or not...

Results for size 2:

Total time run:         20.741990
Total writes made:      454
Write size:             4194304
Bandwidth (MB/sec):     87.552

Still looks nice; at least we would be totally fine with this throughput.

As mentioned countless times here, throughput is most likely your least
worry with Ceph; IOPS are more likely to be your bottleneck.

Results for size 3:

Total time run:         22.934567
Total writes made:      252
Write size:             4194304
Bandwidth (MB/sec):     43.951

I guess this is still ok for a replication size of 3. But the real
problem seems to be the client throughput.

Running fio on a Xen-VM inside rbd:

Since when does Xen support Ceph?
As in, how are you running a Xen VM that uses Ceph?

Run status group 0 (all jobs):
   WRITE: io=1024.0MB, aggrb=21776KB/s, minb=21776KB/s,
maxb=21776KB/s, mint=48151msec, maxt=48151msec

And this is one of the better results; we often see values of 12 MB/s
or even lower, e.g. on a KVM-VM:

Run status group 0 (all jobs):
   WRITE: io=512000KB, aggrb=6425KB/s, minb=6425KB/s, maxb=6425KB/s,
mint=79683msec, maxt=79683msec


Fio results w/o the full command line parameters are basically worthless,
I'm afraid.

This is unacceptable!

I use almost all defaults in our ceph.conf,

What did you change that could be relevant then?

I've been reading
everything I could find on Google, but I don't understand why the
clients are so slow. Is it a valid assumption that the slow requests
could be caused by our slow clients?


Slow requests typically occur when your cluster is not capable of
keeping up.

One of the things I tried was

vm1:~ # echo 4096 > /sys/block/xvda/queue/read_ahead_kb

Only affects reads.

but this had no effect. I tried it with different numbers of PGs, but
that also didn't have any effect.

Again, w/o knowing your cluster and these numbers, that information is of
little value.

What else can I do to increase the performance of my instances or to
analyse the bottleneck?

Understand your HW/cluster.

Monitor the nodes with atop etc., and use collectd, graphite, etc., to keep
records of what happened when and to visualize things.

Christian

Regards,
Eugen


Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:

> Check the latency figures in a "perf dump". High numbers in a
> particular area may help you nail it.
>
> I suspect though, that it may come down to enabling debug logging and
> tracking a slow request through the logs.
>
> On Thu, Jan 12, 2017 at 8:41 PM, Eugen Block <eblock@xxxxxx> wrote:
>> Hi,
>>
>>> Looking at the output of dump_historic_ops and dump_ops_in_flight
>>
>>
>> I waited for new slow request messages and dumped the historic_ops into a
>> file. The reporting OSD shows lots of "waiting for rw locks" messages and a
>> duration of more than 30 secs:
>>
>>              "age": 366.044746,
>>             "duration": 32.491506,
>>             "type_data": [
>>                 "commit sent; apply or cleanup",
>>                 {
>>                     "client": "client.9664429",
>>                     "tid": 130439910
>>                 },
>>                 [
>>                     {
>>                         "time": "2017-01-12 10:38:15.227649",
>>                         "event": "initiated"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:15.232310",
>>                         "event": "reached_pg"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:15.232341",
>>                         "event": "waiting for rw locks"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:15.268819",
>>                         "event": "reached_pg"
>>                     },
>> [
>> .
>> .
>> .
>> ]
>>                     {
>>                         "time": "2017-01-12 10:38:45.515055",
>>                         "event": "waiting for rw locks"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.921095",
>>                         "event": "reached_pg"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.921157",
>>                         "event": "started"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.921342",
>>                         "event": "waiting for subops from 9,15"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.921724",
>>                         "event": "commit_queued_for_journal_write"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.922186",
>>                         "event": "write_thread_in_journal_buffer"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.931103",
>>                         "event": "sub_op_commit_rec"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:46.968730",
>>                         "event": "sub_op_commit_rec"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:47.717770",
>>                         "event": "journaled_completion_queued"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:47.718280",
>>                         "event": "op_commit"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:47.718359",
>>                         "event": "commit_sent"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:47.718890",
>>                         "event": "op_applied"
>>                     },
>>                     {
>>                         "time": "2017-01-12 10:38:47.719154",
>>                         "event": "done"
>>                     }
>>
>>
>> There were about 70 events "waiting for rw locks", I truncated the output.
>> Based on the message "waiting for subops from 9,15" I also dumped the
>> historic_ops for these two OSDs.
>>
>> Duration on OSD.9
>>
>>             "initiated_at": "2017-01-12 10:38:29.258221",
>>             "age": 54.069919,
>>             "duration": 20.831568,
>>
>> Duration on OSD.15
>>
>>             "initiated_at": "2017-01-12 10:38:23.695098",
>>             "age": 112.118210,
>>             "duration": 26.452526,
>>
>> They also contain many "waiting for rw locks" messages, but not as many as
>> the dump from the reporting OSD.
>> To me it seems that because two OSDs take a lot of time to process their
>> requests (only slightly less than 30 secs), it sums up to more than 30 secs
>> on the reporting (primary?) OSD. Is the reporting OSD always the primary?
>>
>> How can I debug this further? I searched the web for "waiting for rw locks"
>> and also found Wido's blog [1] about my exact problem, but I'm not sure how to
>> continue. Our admin says our network should be fine, but what can I do to
>> rule that out?
>>
>> I don't think I have provided information about our cluster yet:
>>
>> 4 nodes, 3 mons, 20 OSDs on
>> ceph version 0.94.7-84-g8e6f430 (8e6f430683e4d8293e31fd4eb6cb09be96960cfa)
>>
>> [1]
>> https://blog.widodh.nl/2016/01/slow-requests-with-ceph-waiting-for-rw-locks/
>>
>> Thanks!
>> Eugen
>>
>>
>> Zitat von Brad Hubbard <bhubbard@xxxxxxxxxx>:
>>
>>
>>> On Thu, Jan 12, 2017 at 2:19 AM, Eugen Block <eblock@xxxxxx> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I simply grepped for "slow request" in ceph.log. What exactly do you mean
>>>> by
>>>> "effective OSD"?
>>>>
>>>> If I have this log line:
>>>> 2017-01-11 [...] osd.16 [...] cluster [WRN] slow request 32.868141
>>>> seconds
>>>> old, received at 2017-01-11 [...] ack+ondisk+write+known_if_redirected
>>>> e12440) currently waiting for subops from 0,12
>>>>
>>>> I assumed that osd.16 is the one causing problems.
>>>
>>>
>>> 0 and/or 12 in this case. 16 is just waiting. Looking at the output of
>>> dump_historic_ops and dump_ops_in_flight on the implicated osds may
>>> help to narrow down what the issue is.
>>>
>>>> But now that you mention
>>>> the subops, I only noticed them yesterday, but didn't have the time yet
>>>> to
>>>> investigate further. I'll have a look into the subops messages and report
>>>> back.
>>>>
>>>> Thanks!
>>>> Eugen
>>>>
>>>> Zitat von Burkhard Linke
>>>> <Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx>:
>>>>
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> just for clarity:
>>>>>
>>>>>
>>>>> Did you parse the slow request messages and use the effective OSD in the
>>>>> statistics? Some messages may refer to other OSDs, e.g. "waiting for sub
>>>>> op on OSD X,Y". The reporting OSD is not the root cause in that case, but
>>>>> one of the mentioned OSDs (and I'm currently not aware of a method to
>>>>> determine which of the two OSDs is the cause in the case of 3 replicas...).
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>> Burkhard
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users@xxxxxxxxxxxxxx
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Eugen Block                             voice   : +49-40-559 51 75
>>>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
>>>> Postfach 61 03 15
>>>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
>>>>
>>>>         Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>>>           Sitz und Registergericht: Hamburg, HRB 90934
>>>>                   Vorstand: Jens-U. Mozdzen
>>>>                    USt-IdNr. DE 814 013 983
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>>
>>>
>>> --
>>> Cheers,
>>> Brad
>>
>>
>>
>>
>> --
>> Eugen Block                             voice   : +49-40-559 51 75
>> NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
>> Postfach 61 03 15
>> D-22423 Hamburg                         e-mail  : eblock@xxxxxx
>>
>>         Vorsitzende des Aufsichtsrates: Angelika Mozdzen
>>           Sitz und Registergericht: Hamburg, HRB 90934
>>                   Vorstand: Jens-U. Mozdzen
>>                    USt-IdNr. DE 814 013 983
>>
>
>
>
> --
> Cheers,
> Brad





--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/



--
Eugen Block                             voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG      fax     : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg                         e-mail  : eblock@xxxxxx

        Vorsitzende des Aufsichtsrates: Angelika Mozdzen
          Sitz und Registergericht: Hamburg, HRB 90934
                  Vorstand: Jens-U. Mozdzen
                   USt-IdNr. DE 814 013 983

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



