Hi all,

Do the Mons store any crushmap history, and if so, how does one get at it, please?

I ask because we've recently encountered an issue in a medium-scale (~5 PB raw), EC-based, RGW-focused cluster where "something" happened -- we still don't know what -- that suddenly caused us to see 94% of objects (5.4 billion of them) misplaced.

We've tracked down the first log message of that pgmap state change:

Mar 29 10:30:31 mon1 bash[5804]: debug 2024-03-29T10:30:31.152+0000 7f3b6e378700 0 log_channel(cluster) log [DBG] : pgmap v44327: 2273 pgs: 225 active+clean, 2038 active+remapped+backfill_wait, 10 active+remapped+backfilling; 1.6 PiB data, 2.1 PiB used, 2.2 PiB / 4.3 PiB avail; 5426274136/5752755429 objects misplaced (94.325%); 248 MiB/s, 109 objects/s recovering

This appears to have been preceded (aside from a single HTTP HEAD request coming into RGW) by a five-minute gap in the logs, during which either journald couldn't keep up with the debug messages or the Mons were stuck. The last log before that gap seems to be a RocksDB compaction event kicking off:

mon1 bash[25927]: Int      0/0    0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.00   0.00   0   0.000   0   0
Mar 29 10:24:14 mon1 bash[25927]: ** Compaction Stats [L] **
Mar 29 10:24:14 mon1 bash[25927]: Priority   Files   Size   Score   Read(GB)   Rn(GB)   Rnp1(GB)   Write(GB)   Wnew(GB)   Moved(GB)   W-Amp   Rd(MB/s)   Wr(MB/s)   Comp(sec)   CompMergeCPU(sec)   Comp(cnt)   Avg(sec)   KeyIn   KeyDrop
Mar 29 10:24:14 mon1 bash[25927]: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mar 29 10:24:14 mon1 bash[25927]: Low      0/0    0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   116.0   11.4   0.02   0.01   7   0.003   490   462
Mar 29 10:24:14 mon1 bash[25927]: High     0/0    0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.9   1.23   1.20   28   0.044   0   0
Mar 29 10:24:14 mon1 bash[25927]: User     0/0    0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   16.4   0.00   0.00   1   0.001   0   0

We're left wondering what the heck happened to cause such a huge redistribution of data in the cluster when we haven't made any corresponding changes, so we'd like to see whether there are any breadcrumbs we can find. Appreciate any pointers!

--
Cheers,
~Blairo
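
P.S. For concreteness, here's the sort of thing we're hoping is possible -- assuming the Mons do keep a window of historical OSD maps (each of which embeds the CRUSH map for that epoch) and that the usual getmap/osdmaptool/crushtool route applies; the epoch numbers below (12340/12345) are only placeholders:

  # find the current OSD map epoch
  ceph osd stat

  # fetch two historical OSD maps by epoch, if the Mons still hold them
  ceph osd getmap 12340 -o osdmap.12340
  ceph osd getmap 12345 -o osdmap.12345

  # extract the CRUSH map embedded in each epoch and decompile it to text
  osdmaptool osdmap.12340 --export-crush crush.12340
  osdmaptool osdmap.12345 --export-crush crush.12345
  crushtool -d crush.12340 -o crush.12340.txt
  crushtool -d crush.12345 -o crush.12345.txt

  # see whether the CRUSH map actually changed between those epochs
  diff -u crush.12340.txt crush.12345.txt

What we don't know is how far back the Mons keep those maps, or whether anything from around the incident has already been trimmed -- hence the question.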