Re: slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)

Hello Robert,
My disks did not reach 100% on the last warning; they climb to 70-80% utilization. But I see the rrqm/wrqm counters increasing...

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     4.00    0.00   16.00     0.00   104.00    13.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     2.00    1.00 3456.00     8.00 25996.00    15.04     5.76    1.67    0.00    1.67   0.03   9.20
sdd               4.00     0.00 41462.00 1119.00 331272.00  7996.00    15.94    19.89    0.47    0.48    0.21   0.02  66.00
dm-0              0.00     0.00 6825.00  503.00 330856.00  7996.00    92.48     4.00    0.55    0.56    0.30   0.09  66.80
dm-1              0.00     0.00    1.00 1129.00     8.00 25996.00    46.02     1.03    0.91    0.00    0.91   0.09  10.00


sda is my system disk (SAMSUNG MZILS480HEGR/007 GXL0); sdb and sdd are my OSDs.
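
For reference, a minimal way to keep watching just the OSD disks over time (device names taken from the listing above, 2-second interval as in Robert's suggestion):

    iostat -xd sdb sdd 2

Besides %util, the avgqu-sz (average queue length) and await columns show whether requests are queueing at the device even when utilization stays below 100%.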

would "osd op queue = wpq" help in this case ?
Regards

On Sat, Jun 8, 2019 at 07:44, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
With the low number of OSDs, you are probably saturating the disks. Check with `iostat -xd 2` and see what the utilization of your disks is. A lot of SSDs don't perform well with Ceph's heavy sync writes, and performance is terrible.

If some of your drives are at 100% while others show lower utilization, you can possibly get more performance, and greatly reduce the blocked I/O, with the WPQ scheduler. In ceph.conf, add this to the [osd] section and restart the OSD processes:

osd op queue = wpq
osd op queue cut off = high

This has helped our clusters with fairness between OSDs and made backfills much less disruptive.
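
For reference, a minimal sketch of the full change (the OSD id is a placeholder; as noted above, these options take effect only after the OSD processes are restarted):

    # /etc/ceph/ceph.conf
    [osd]
    osd op queue = wpq
    osd op queue cut off = high

    # then restart each OSD in turn, letting the cluster settle in between
    systemctl restart ceph-osd@<id>
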
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Jun 6, 2019 at 1:43 AM BASSAGET Cédric <cedric.bassaget.ml@xxxxxxxxx> wrote:
Hello,

I see messages related to REQUEST_SLOW a few times per day.
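
(For reference, the implicated OSDs named in the subject come from the health warning itself; a minimal sketch of how to inspect them, with the osd id as a placeholder:)

    ceph health detail
    # on the host holding the implicated OSD:
    ceph daemon osd.<id> dump_historic_ops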

Here's my ceph -s:

root@ceph-pa2-1:/etc/ceph# ceph -s
  cluster:
    id:     72d94815-f057-4127-8914-448dfd25f5bc
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
    mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
    osd: 6 osds: 6 up, 6 in
 
  data:
    pools:   1 pools, 256 pgs
    objects: 408.79k objects, 1.49TiB
    usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
    pgs:     256 active+clean
 
  io:
    client:   8.00KiB/s rd, 17.2MiB/s wr, 1op/s rd, 546op/s wr
 

Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)

I've checked:
- all my network stack: OK (2*10G LAG)
- memory usage: OK (256G on each host, about 2% used per OSD)
- CPU usage: OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
- disk status: OK (SAMSUNG AREA7680S5xnNTRI 3P04 => Samsung DC series)

I heard on IRC that it can be related to the Samsung PM/SM series.

Is anybody here facing the same problem? What can I do to solve it?
Regards,
Cédric
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
