Hi Stefan,

many thanks for your good advice.

We are using ceph version 14.2.11.

There is an issue with full OSDs - I'm not sure whether it's causing these
jumps in misplaced objects. I've been reweighting the fullest OSDs on
several consecutive days to reduce the number of nearfull OSDs, and it
seems to have had no effect on the jumps. I've not done a reweight for a
few days, so we have a lot of very full OSDs.

The OSD balance is way off, so something is amiss. Out of 550 OSDs we see
this spread in use (sorted least full to most full):

ID   WEIGHT    REWEIGHT  %USE   VAR   PGS
430   7.27730  1.00000   46.81  0.68  171
 73   7.27730  1.00000   46.91  0.68  170
189   7.27730  1.00000   47.15  0.68  173
199   7.27730  1.00000   47.24  0.69  172
 86   7.27730  1.00000   48.62  0.71  176
234   7.27730  1.00000   48.73  0.71  178
437   7.27730  1.00000   49.65  0.72  182
288   7.27730  1.00000   50.12  0.73  184

(SNIP)

ID   WEIGHT    REWEIGHT  %USE   VAR   PGS
455  14.55299  1.00000   84.39  1.23  619
541  14.55299  1.00000   84.40  1.23  620
456  14.55299  1.00000   84.73  1.23  621
487  14.55299  0.90002   85.56  1.24  620
527  14.55299  1.00000   86.61  1.26  638
466  14.55299  0.90002   86.78  1.26  639
501  14.55299  1.00000   87.39  1.27  645
542  14.55299  1.00000   88.06  1.28  645
462  14.55299  0.95001   91.23  1.32  670
549  14.55299  1.00000   91.45  1.33  676

I like your idea of remapping the PGs back to their original location and
then re-balancing the OSDs to a sensible arrangement. I'll see if this
works and report back...

best regards,

Jake

On 28/09/2020 11:08, Stefan Kooman wrote:
> On 2020-09-28 11:45, Jake Grimmett wrote:
>
>> To show the cluster before and immediately after an "episode"
>>
>> ***************************************************
>>
>> [root@ceph7 ceph]# ceph -s
>>   cluster:
>>     id:     36ed7113-080c-49b8-80e2-4947cc456f2a
>>     health: HEALTH_WARN
>>             7 nearfull osd(s)
>>             2 pool(s) nearfull
>>             Low space hindering backfill (add storage if this doesn't
>>             resolve itself): 11 pgs backfill_toofull
>
> What version are you running? I'm worried the nearfull OSDs might be the
> culprit here. There has been a bug with respect to nearfull OSDs [1] that
> has since been fixed. You might or might not be hitting it. Check with
> "ceph osd df" to see whether any OSDs really are too full.
>
> You can use Dan's upmap-remapped.py [2] to remap the PGs back to their
> original location and get the cluster to HEALTH_OK again. You might also
> want to select PGs to deep-scrub by hand, so that deep-scrubbing is done
> in the most efficient order (instead of PGs being chosen at random).
>
> Gr. Stefan
>
> [1]: https://tracker.ceph.com/issues/39555
> [2]:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge CB2 0QH, UK.
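
As an aside to the thread: the fill spread quoted above can be pulled out of
"ceph osd df" programmatically rather than by eye. The sketch below is a
minimal example, not part of the original exchange, and it assumes the JSON
layout emitted by "ceph osd df --format json" on Nautilus (a top-level
"nodes" list with per-OSD "id", "utilization", "var", "reweight" and "pgs"
fields); check those field names against your own release before relying
on it.

#!/usr/bin/env python3
# Minimal sketch: summarise the OSD fill spread from
# "ceph osd df --format json".
# Assumption: the JSON has a "nodes" list whose entries carry
# "id", "utilization", "var", "reweight" and "pgs" (Nautilus layout).
import json
import subprocess

raw = subprocess.check_output(["ceph", "osd", "df", "--format", "json"])
nodes = json.loads(raw)["nodes"]

# Sort OSDs from least to most full.
osds = sorted(nodes, key=lambda n: n["utilization"])

def show(osd):
    print("osd.{:<4} use={:6.2f}%  var={:.2f}  reweight={:.5f}  pgs={}".format(
        osd["id"], osd["utilization"], osd["var"],
        osd["reweight"], osd["pgs"]))

print("-- least full --")
for osd in osds[:8]:
    show(osd)

print("-- most full --")
for osd in osds[-8:]:
    show(osd)

spread = osds[-1]["utilization"] - osds[0]["utilization"]
print("spread between emptiest and fullest OSD: {:.2f}%".format(spread))

On the figures quoted above, that spread is roughly 44.6 percentage points
(46.81% on osd.430 versus 91.45% on osd.549).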