Hi Wido, thanks for the explanation. I think the root cause is that the
disks are too slow for compaction. I added two new mons with SSDs to the
cluster to speed things up and the issue is resolved. That's good advice
and I plan to migrate my mons to bigger SSD disks. Thanks again.

On Fri, Oct 30, 2020 at 4:39 PM Wido den Hollander <wido@xxxxxxxx> wrote:

>
>
> On 29/10/2020 19:29, Zhenshi Zhou wrote:
> > Hi Alex,
> >
> > We found that there were a huge number of keys in the "logm" and
> > "osdmap" tables while using ceph-monstore-tool. I think that could be
> > the root cause.
> >
>
> But that is exactly how Ceph works. It might need that very old OSDMap
> to get all the PGs clean again. An OSD which has been gone for a very
> long time needs those old maps to catch up and make its PGs clean.
>
> If not all PGs are active+clean you can see the MON databases grow
> rapidly.
>
> Therefore I always deploy 1TB SSDs in all Monitors. They are not
> expensive anymore and they give breathing room.
>
> I always deploy physical and dedicated machines for Monitors just to
> prevent these cases.
>
> Wido
>
> > Well, some pages also say that disabling the 'insights' module can
> > resolve this issue, but I checked our cluster and we didn't enable
> > this module. See this page:
> > <https://tracker.ceph.com/issues/39955>.
> >
> > Anyway, our cluster is still unhealthy; it just needs time to keep
> > recovering data :)
> >
> > Thanks
> >
> > On Thu, Oct 29, 2020 at 10:57 PM Alex Gracie
> > <alexandergracie17@xxxxxxxxx> wrote:
> >
> >> We hit this issue over the weekend on our HDD-backed EC Nautilus
> >> cluster while removing a single OSD. We also did not have any luck
> >> using compaction. The mon logs filled up our entire root disk on the
> >> mon servers and we were running on a single monitor for hours while
> >> we tried to finish recovery and reclaim space. The past couple of
> >> weeks we also noticed "pg not scrubbed in time" errors but are
> >> unsure if they are related. I'm still unsure of the exact cause of
> >> this (other than the general misplaced/degraded objects) and what
> >> kind of growth is acceptable for these store.db files.
> >>
> >> In order to get our downed mons restarted, we ended up backing up
> >> and copying the /var/lib/ceph/mon/* contents to a remote host,
> >> setting up an sshfs mount to that new host with large NVMe and SSD
> >> drives, ensuring the mount paths were owned by ceph, then clearing
> >> up enough space on the monitor host to start the service. This
> >> allowed our store.db directory to grow freely until the
> >> misplaced/degraded objects could recover and the monitors all
> >> rejoined eventually.
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
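
For anyone hitting the same growth, a minimal sketch of the checks discussed
in this thread, assuming a hypothetical mon id "mon1"; ceph-monstore-tool
should only be pointed at a stopped mon or at a copy of its store.db:

    # Count keys per prefix (e.g. logm, osdmap) to see which one dominates the store
    ceph-monstore-tool /var/lib/ceph/mon/ceph-mon1 dump-keys | awk '{print $1}' | sort | uniq -c | sort -rn

    # Ask a running monitor to compact its store
    ceph tell mon.mon1 compact

    # Disable the insights mgr module if it turns out to be enabled (see the tracker issue above)
    ceph mgr module disable insights

As noted above, compaction only reclaims space once the maps can be trimmed;
while PGs are not active+clean the mons keep the old OSDMaps and the store
will keep growing regardless.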