Thanks Eugen,

That indeed looks like it should be relevant. Will take a look at what it
gives us on our cluster/s.
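For anyone landing on this thread later, another way to dig at crush map
history (a sketch only, untested on our clusters so far; the epoch number
below is a placeholder, and the mons only retain a limited window of OSD
map epochs, so it may not reach back far enough) is to pull older OSD maps
and extract the crush map from each:

  ceph osd stat                          # note the current OSD map epoch
  ceph osd getmap 5000 -o osdmap.5000    # fetch an older epoch, if the mons still have it
  osdmaptool osdmap.5000 --export-crush crush.5000
  crushtool -d crush.5000 -o crush.5000.txt
  ceph osd getcrushmap -o crush.now
  crushtool -d crush.now -o crush.now.txt
  diff crush.5000.txt crush.now.txt

If the epochs around 29 March have already been trimmed this won't help,
but it should at least show whether the crush map differs from what we
expect today.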
Cheers,
Blair

On Wed, 17 Apr 2024, 18:29 Eugen Block, <eblock@xxxxxx> wrote:

> Hi,
>
> I'm not sure if and how that could help, but there's a get-crushmap
> command for the ceph-monstore-tool:
>
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ show-versions -- --map-type crushmap > show-versions
>
> [ceph: root@host1 /]# cat show-versions
> first committed: 0
> last committed: 0
>
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ get-crushmap --version 0 > crushmap-version-0
>
> [ceph: root@host1 /]# cat crushmap-version-0
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>
> I don't have the option to shut down a MON in production right now to
> compare if there are more committed versions or something. And obviously,
> the result is not what I would usually expect from a crushmap. I also
> injected a modified crushmap to provoke a new version:
>
> # ceph osd setcrushmap -i 20240417-crushmap.new
> 363
>
> But the result doesn't really change, so I'm not sure how that can help:
>
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ get-crushmap --version 363
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>
> It seems that all the commands print the same output:
>
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ get-crushmap --version 5885
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ get-osdmap --version 5885
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
> [ceph: root@host1 /]# ceph-monstore-tool /var/lib/ceph/mon/ceph-host1/ get-monmap --version 5885
> ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
>
> Maybe one of the devs can shed some light on whether there's a way.
>
> Regards,
> Eugen
>
> Zitat von Blair Bethwaite <blair.bethwaite@xxxxxxxxx>:
>
> > Hi all,
> >
> > Do the Mons store any crushmap history, and if so, how does one get at
> > it, please?
> >
> > I ask because we've recently encountered an issue in a medium-scale
> > (~5PB raw), EC-based, RGW-focused cluster where "something" happened
> > (we still don't know what) that suddenly caused us to see 94% of
> > objects (5.4 billion of them) misplaced. We've tracked down the first
> > log message of that pgmap state change:
> >
> > Mar 29 10:30:31 mon1 bash[5804]: debug 2024-03-29T10:30:31.152+0000 7f3b6e378700 0 log_channel(cluster) log [DBG] : pgmap v44327: 2273 pgs: 225 active+clean, 2038 active+remapped+backfill_wait, 10 active+remapped+backfilling; 1.6 PiB data, 2.1 PiB used, 2.2 PiB / 4.3 PiB avail; 5426274136/5752755429 objects misplaced (94.325%); 248 MiB/s, 109 objects/s recovering
> >
> > This appears to have been preceded (aside from a single HTTP HEAD
> > request coming into RGW) by a 5-minute gap in logs, where either
> > journald couldn't keep up with debug messages or the Mons were stuck.
> > The last log before that gap seems to be a compaction event kicking
> > off:
> >
> > mon1 bash[25927]: Int    0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.00   0.00   0   0.000   0   0
> > Mar 29 10:24:14 mon1 bash[25927]: ** Compaction Stats [L] **
> > Mar 29 10:24:14 mon1 bash[25927]: Priority   Files   Size   Score   Read(GB)   Rn(GB)   Rnp1(GB)   Write(GB)   Wnew(GB)   Moved(GB)   W-Amp   Rd(MB/s)   Wr(MB/s)   Comp(sec)   CompMergeCPU(sec)   Comp(cnt)   Avg(sec)   KeyIn   KeyDrop
> > Mar 29 10:24:14 mon1 bash[25927]: ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> > Mar 29 10:24:14 mon1 bash[25927]: Low    0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   116.0   11.4   0.02   0.01   7    0.003   490   462
> > Mar 29 10:24:14 mon1 bash[25927]: High   0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     1.9    1.23   1.20   28   0.044   0     0
> > Mar 29 10:24:14 mon1 bash[25927]: User   0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0     16.4   0.00   0.00   1    0.001   0     0
> >
> > We're left wondering what the heck has happened to cause such a huge
> > redistribution of data in the cluster when we've not made any
> > corresponding changes, so we want to see whether there are any
> > breadcrumbs we can find.
> >
> > Appreciate any pointers!
> >
> > --
> > Cheers,
> > ~Blairo
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx