Hi,

We run 3 production clusters in a multi-site setup. They were deployed with Ceph-Ansible but were recently switched to cephadm while on the Pacific release, and shortly after migrating to cephadm they were upgraded to Quincy. Since moving to Quincy, recovery on one of the replica sites has tanked quite badly. We use 8 TB SAS HDDs for OSDs, with WAL and DB on NVMe drives. Before the upgrade it would take 1-2 days to resilver an OSD, but we recently replaced a drive and it took 6 days to resilver. Nothing else has changed in the cluster's configuration, and as far as we can see the other two clusters are performing recovery fine. Does anyone have any ideas what the issue could be here, or anywhere we can check what is going on?

Also, while recovering we've noticed inaccurate recovery information in ceph -s. The recovery section reports that it is running at less than 10 keys/s, yet the degraded object count in the "Degraded data redundancy" warning drops by several hundred between runs of ceph -s only a few seconds apart. I've pasted two pairs of back-to-back runs below to show what we mean. Does anyone have any advice they can offer on this too, please?

Cheers,
Iain

[ceph: root@gb4-li-cephgw-001 /]# ceph -s; sleep 3; ceph -s
  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 59610/371260626 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     59610/371260626 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   249 MiB/s rd, 611 KiB/s wr, 355 op/s rd, 402 op/s wr
    recovery: 13 KiB/s, 3 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 59116/371260644 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     59116/371260644 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   258 MiB/s rd, 595 KiB/s wr, 346 op/s rd, 387 op/s wr
    recovery: 15 KiB/s, 2 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

[ceph: root@gb4-li-cephgw-001 /]# ceph -s; sleep 3; ceph -s
  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 58503/371260638 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     58503/371260638 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   245 MiB/s rd, 278 KiB/s wr, 247 op/s rd, 183 op/s wr
    recovery: 16 KiB/s, 2 keys/s, 2 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

  cluster:
    id:     6dabcf41-90d7-4e90-b259-1cc0bf298052
    health: HEALTH_WARN
            noout,norebalance flag(s) set
            Degraded data redundancy: 58157/371260644 objects degraded (0.016%), 1 pg degraded, 1 pg undersized

  services:
    mon: 3 daemons, quorum gb4-li-cephgw-001,gb4-li-cephgw-002,gb4-li-cephgw-003 (age 2h)
    mgr: gb4-li-cephgw-003(active, since 2h), standbys: gb4-li-cephgw-002.iqmxgu, gb4-li-cephgw-001
    osd: 72 osds: 72 up (since 39m), 72 in (since 3d); 1 remapped pgs
         flags noout,norebalance
    rgw: 3 daemons active (3 hosts, 1 zones)

  data:
    pools:   11 pools, 1457 pgs
    objects: 63.76M objects, 173 TiB
    usage:   251 TiB used, 275 TiB / 526 TiB avail
    pgs:     58157/371260644 objects degraded (0.016%)
             1452 active+clean
             4    active+clean+scrubbing+deep
             1    active+undersized+degraded+remapped+backfilling

  io:
    client:   243 MiB/s rd, 285 KiB/s wr, 252 op/s rd, 197 op/s wr
    recovery: 13 KiB/s, 0 keys/s, 1 objects/s

  progress:
    Global Recovery Event (39m)
      [===========================.] (remaining: 1s)

[ceph: root@gb4-li-cephgw-001 /]#
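For reference, the loop below is roughly what we have been doing by hand with the outputs above: diff the degraded object count between two samples to get an effective recovery rate, and compare it with what the recovery line claims. It is only a rough, untested sketch; it assumes jq is available in the cephadm shell and that pgmap.degraded_objects and pgmap.recovering_objects_per_sec are the field names in the JSON status output on this release.

#!/bin/bash
# Sketch: sample the degraded-object counter from 'ceph -s -f json' every
# $interval seconds and print the measured drop per second alongside the
# rate that the status output itself reports.
interval=10
prev_count=""
while true; do
    json=$(ceph -s -f json)
    count=$(echo "$json" | jq '.pgmap.degraded_objects // 0')
    reported=$(echo "$json" | jq '.pgmap.recovering_objects_per_sec // 0')
    if [ -n "$prev_count" ]; then
        echo "$(date -u +%H:%M:%S)  degraded=$count  measured=$(( (prev_count - count) / interval )) obj/s  reported=$reported obj/s"
    fi
    prev_count=$count
    sleep "$interval"
done

Against the numbers above (59610 -> 59116 between two runs a few seconds apart), that works out to well over 100 objects/s, while the recovery line was reporting 2-3 objects/s and 2-3 keys/s at the same time.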
Iain Stott
OpenStack Engineer
Iain.Stott@xxxxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx