On Sat, Aug 13, 2022 at 1:35 Robert W. Eckert <rob@xxxxxxxxxxxxxxx> wrote:
> Interesting, a few weeks ago I added a new disk to each node of my 3-node
> cluster and saw the same 2 MB/s recovery. What I noticed was that one OSD
> was using very high CPU and seemed to be the primary for the affected PGs.
> I couldn't find anything obviously wrong with the OSD, network, etc.
>
> You may want to look at the output of
>
>     ceph pg ls
>
> to see if the recovery is sourced from one specific OSD or one host, then
> check that host/OSD for high CPU/memory.

You probably hit this bug:

https://tracker.ceph.com/issues/56530

It can be worked around by setting the "osd_op_queue=wpq" configuration
option.

Best,
Satoru

> -----Original Message-----
> From: Torkil Svensgaard <torkil@xxxxxxxx>
> Sent: Friday, August 12, 2022 7:50 AM
> To: ceph-users@xxxxxxx
> Cc: Ruben Vestergaard <rkv@xxxxxxxx>
> Subject: Recovery very slow after upgrade to quincy
>
> 6 hosts with 2 x 10G NICs, data in a 2+2 EC pool. 17.2.0, upgraded from
> pacific.
>
>   cluster:
>     id:
>     health: HEALTH_WARN
>             2 host(s) running different kernel versions
>             2071 pgs not deep-scrubbed in time
>             837 pgs not scrubbed in time
>
>   services:
>     mon:        5 daemons, quorum
>                 test-ceph-03,test-ceph-04,dcn-ceph-03,dcn-ceph-02,dcn-ceph-01 (age 116s)
>     mgr:        dcn-ceph-01.dzercj (active, since 6h), standbys: dcn-ceph-03.lrhaxo
>     mds:        1/1 daemons up, 2 standby
>     osd:        118 osds: 118 up (since 6d), 118 in (since 6d); 66 remapped pgs
>     rbd-mirror: 2 daemons active (2 hosts)
>
>   data:
>     volumes: 1/1 healthy
>     pools:   9 pools, 2737 pgs
>     objects: 246.02M objects, 337 TiB
>     usage:   665 TiB used, 688 TiB / 1.3 PiB avail
>     pgs:     42128281/978408875 objects misplaced (4.306%)
>              2332 active+clean
>              281  active+clean+snaptrim_wait
>              66   active+remapped+backfilling
>              36   active+clean+snaptrim
>              11   active+clean+scrubbing+deep
>              8    active+clean+scrubbing
>              1    active+clean+scrubbing+deep+snaptrim_wait
>              1    active+clean+scrubbing+deep+snaptrim
>              1    active+clean+scrubbing+snaptrim
>
>   io:
>     client:   159 MiB/s rd, 86 MiB/s wr, 17.14k op/s rd, 326 op/s wr
>     recovery: 2.0 MiB/s, 3 objects/s
>
> Low load, low latency, low network traffic. Tried
> osd_mclock_profile=high_recovery_ops, no difference. Disabling scrubs and
> snaptrim made no difference either.
>
> Am I missing something obvious I should have done after the upgrade?
>
> Mvh.
>
> Torkil
>
> --
> Torkil Svensgaard
> Sysadmin
> MR-Forskningssektionen, afs. 714
> DRCMR, Danish Research Centre for Magnetic Resonance
> Hvidovre Hospital
> Kettegård Allé 30
> DK-2650 Hvidovre
> Denmark
> Tel: +45 386 22828
> E-mail: torkil@xxxxxxxx
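
To act on the suggestions above, here is a minimal sketch of the commands
involved, assuming the standard ceph CLI and a cephadm-managed cluster
(daemon names such as osd.<id> are placeholders to adjust for your
deployment):

    # Robert's check: list the backfilling PGs and look for a common
    # primary OSD or host in the UP/ACTING columns
    ceph pg ls backfilling

    # Satoru's workaround for https://tracker.ceph.com/issues/56530:
    # switch the OSD op scheduler from mclock back to wpq
    ceph config set osd osd_op_queue wpq

    # osd_op_queue is only read at OSD startup, so restart the OSDs,
    # e.g. per daemon with cephadm:
    ceph orch daemon restart osd.<id>

    # confirm the setting stored in the cluster config database
    ceph config get osd osd_op_queue

wpq was the default scheduler before Quincy, so this effectively reverts
the Quincy scheduler change until the mclock recovery issue is resolved.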