On Sat, Aug 13, 2022 at 1:35 Robert W. Eckert <rob@xxxxxxxxxxxxxxx> wrote:
> Interesting, a few weeks ago I added a new disk to each node of my 3-node
> cluster and saw the same 2 MB/s recovery. What I noticed was that one OSD
> was using very high CPU and seemed to be the primary for the affected PGs.
> I couldn't find anything obviously wrong with the OSD, network, etc.
>
> You may want to look at the output of
>
>     ceph pg ls
>
> to see if the recovery is sourced from one specific OSD or one host, then
> check that host/OSD for high CPU/memory.

You probably hit this bug:

https://tracker.ceph.com/issues/56530

It can be worked around by setting the "osd_op_queue=wpq" configuration
option.

Best,
Satoru

> -----Original Message-----
> From: Torkil Svensgaard <torkil@xxxxxxxx>
> Sent: Friday, August 12, 2022 7:50 AM
> To: ceph-users@xxxxxxx
> Cc: Ruben Vestergaard <rkv@xxxxxxxx>
> Subject: Recovery very slow after upgrade to quincy
>
> 6 hosts with 2 x 10G NICs, data in a 2+2 EC pool. 17.2.0, upgraded from
> pacific.
>
>   cluster:
>     id:
>     health: HEALTH_WARN
>             2 host(s) running different kernel versions
>             2071 pgs not deep-scrubbed in time
>             837 pgs not scrubbed in time
>
>   services:
>     mon:        5 daemons, quorum
>                 test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
>     mgr:        dcn-ceph-01.dzercj (active, since 6h), standbys: dcn-ceph-03.lrhaxo
>     mds:        1/1 daemons up, 2 standby
>     osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
>     rbd-mirror: 2 daemons active (2 hosts)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   9 pools, 2737 pgs
>     objects: 246.02M objects, 337 TiB
>     usage:   665 TiB used, 688 TiB / 1.3 PiB avail
>     pgs:     42128281/978408875 objects misplaced (4.306%)
>              2332 active+clean
>              281  active+clean+snaptrim_wait
>              66   active+remapped+backfilling
>              36   active+clean+snaptrim
>              11   active+clean+scrubbing+deep
>              8    active+clean+scrubbing
>              1    active+clean+scrubbing+deep+snaptrim_wait
>              1    active+clean+scrubbing+deep+snaptrim
>              1    active+clean+scrubbing+snaptrim
>
>   io:
>     client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
>     recovery: 2.0 MiB/s, 3 objects/s
>
> Low load, low latency, low network traffic. Tried
> osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs and
> snaptrim made no difference either.
>
> Am I missing something obvious I should have done after the upgrade?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: torkil@xxxxxxxx
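
To act on the suggestions above, here is a minimal sketch of the commands
involved, assuming the standard ceph CLI and a cephadm-managed cluster
(daemon names such as osd.<id> are placeholders to adjust for your
deployment):

    # Robert's check: list the backfilling PGs and look for a common
    # primary OSD or host in the UP/ACTING columns
    ceph pg ls backfilling

    # Satoru's workaround for https://tracker.ceph.com/issues/56530:
    # switch the OSD op scheduler from mclock back to wpq
    ceph config set osd osd_op_queue wpq

    # osd_op_queue is only read at OSD startup, so restart the OSDs,
    # e.g. per daemon with cephadm:
    ceph orch daemon restart osd.<id>

    # confirm the setting stored in the cluster config database
    ceph config get osd osd_op_queue

wpq was the default scheduler before Quincy, so this effectively reverts
the Quincy scheduler change until the mclock recovery issue is resolved.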