Re: Recovery very slow after upgrade to quincy

"Robert W. Eckert" <rob@xxxxxxxxxxxxxxx> · Fri, 12 Aug 2022 16:35:06 +0000

Interesting. A few weeks ago I added a new disk to each node of my 3-node cluster and saw the same 2 MiB/s recovery rate. I noticed that one OSD was using very high CPU and seemed to be the primary OSD for the affected PGs. I couldn’t find anything obviously wrong with the OSD, the network, etc.

You may want to look at the output of 

ceph pg ls

to see whether the recovery is sourced from one specific OSD or one host, then check that host/OSD for high CPU or memory usage.
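One way to do that check without reading the table by eye is to tally the primaries from the JSON form of the listing. The sketch below is only an illustration: it assumes the output was saved with something like `ceph pg ls backfilling -f json > pgs.json`, and that each PG entry carries an `acting_primary` field, either in a bare list or under a top-level `pg_stats` key. The function name and the sample data are made up for the example.

```python
import json
from collections import Counter


def primaries_by_osd(pg_ls_json: str) -> Counter:
    """Count how many listed PGs each OSD is acting primary for."""
    data = json.loads(pg_ls_json)
    # Assumption: the listing is either {"pg_stats": [...]} or a bare list.
    stats = data.get("pg_stats", []) if isinstance(data, dict) else data
    return Counter(pg["acting_primary"] for pg in stats if "acting_primary" in pg)


# Hypothetical sample, shaped like the assumed JSON output (not real cluster data).
SAMPLE = json.dumps({
    "pg_stats": [
        {"pgid": "7.1a", "state": "active+remapped+backfilling", "acting_primary": 12},
        {"pgid": "7.2b", "state": "active+remapped+backfilling", "acting_primary": 12},
        {"pgid": "7.3c", "state": "active+remapped+backfilling", "acting_primary": 40},
    ]
})

if __name__ == "__main__":
    # Most frequent primaries first; a single dominant OSD is the one to inspect.
    for osd, count in primaries_by_osd(SAMPLE).most_common():
        print(f"osd.{osd}: primary for {count} backfilling PGs")
```

If one OSD dominates the count, that is the one to check for high CPU or memory.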

-----Original Message-----
From: Torkil Svensgaard <torkil@xxxxxxxx> 
Sent: Friday, August 12, 2022 7:50 AM
To: ceph-users@xxxxxxx
Cc: Ruben Vestergaard <rkv@xxxxxxxx>
Subject:  Recovery very slow after upgrade to quincy

6 hosts with 2 x 10G NICs each, data in a 2+2 EC pool. Running 17.2.0, upgraded from Pacific.

   cluster:
     id:
     health: HEALTH_WARN
             2 host(s) running different kernel versions
             2071 pgs not deep-scrubbed in time
             837 pgs not scrubbed in time

   services:
     mon:        5 daemons, quorum test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
     mgr:        dcn-ceph-01.dzercj(active, since 6h), standbys: dcn-ceph-03.lrhaxo
     mds:        1/1 daemons up, 2 standby
     osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
     rbd-mirror: 2 daemons active (2 hosts)

   data:
     volumes: 1/1 healthy
     pools:   9 pools, 2737 pgs
     objects: 246.02M objects, 337 TiB
     usage:   665 TiB used, 688 TiB / 1.3 PiB avail
     pgs:     42128281/978408875 objects misplaced (4.306%)
              2332 active+clean
              281  active+clean+snaptrim_wait
              66   active+remapped+backfilling
              36   active+clean+snaptrim
              11   active+clean+scrubbing+deep
              8    active+clean+scrubbing
              1    active+clean+scrubbing+deep+snaptrim_wait
              1    active+clean+scrubbing+deep+snaptrim
              1    active+clean+scrubbing+snaptrim

   io:
     client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
     recovery: 2.0 MiB/s, 3 objects/s

Load, latency and network traffic are all low. I tried setting osd_mclock_profile=high_recovery_ops, which made no difference. Disabling scrubs and snaptrim made no difference either.
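To put the recovery rate in the status output above in perspective, here is a rough back-of-the-envelope estimate. It assumes the rate stays constant at 3 objects/s and that the "objects misplaced" counter and the recovery rate count the same unit, which may not hold exactly for an EC pool. The function name is just for illustration.

```python
SECONDS_PER_DAY = 86_400


def days_to_drain(misplaced_objects: int, objects_per_second: float) -> float:
    """Days needed to move all misplaced objects at a constant rate."""
    return misplaced_objects / objects_per_second / SECONDS_PER_DAY


# Figures taken from the status output above.
MISPLACED = 42_128_281
RATE = 3  # objects/s

if __name__ == "__main__":
    print(f"{days_to_drain(MISPLACED, RATE):.1f} days")  # prints "162.5 days"
```

Under those assumptions the backfill would take on the order of 160 days, so this is not just a slow recovery but an effectively stalled one.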

Am I missing something obvious I should have done after the upgrade?

Best regards,

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx