Hi Wido, thanks for the explanation. I think the root cause is that the
disks are too slow for compaction. I added two new mons with SSDs to the
cluster to speed things up and the issue is resolved. That's good advice
and I plan to migrate my mons to bigger SSD disks. Thanks again.

On Fri, Oct 30, 2020 at 4:39 PM Wido den Hollander <wido@xxxxxxxx> wrote:

>
>
> On 29/10/2020 19:29, Zhenshi Zhou wrote:
> > Hi Alex,
> >
> > We found that there were a huge number of keys in the "logm" and
> > "osdmap" tables while using ceph-monstore-tool. I think that could be
> > the root cause.
> >
>
> But that is exactly how Ceph works. It might need that very old OSDMap
> to get all the PGs clean again. An OSD which has been gone for a very
> long time needs those old maps to catch up and make its PGs clean.
>
> If not all PGs are active+clean you can see the MON databases grow
> rapidly.
>
> Therefore I always deploy 1TB SSDs in all Monitors. They are not
> expensive anymore and they give breathing room.
>
> I always deploy physical and dedicated machines for Monitors just to
> prevent these cases.
>
> Wido
>
> > Well, some pages also say that disabling the 'insights' module can
> > resolve this issue, but I checked our cluster and we didn't enable
> > this module. See this page:
> > <https://tracker.ceph.com/issues/39955>.
> >
> > Anyway, our cluster is still unhealthy; it just needs time to keep
> > recovering data :)
> >
> > Thanks
> >
> > On Thu, Oct 29, 2020 at 10:57 PM Alex Gracie
> > <alexandergracie17@xxxxxxxxx> wrote:
> >
> >> We hit this issue over the weekend on our HDD-backed EC Nautilus
> >> cluster while removing a single OSD. We also did not have any luck
> >> using compaction. The mon logs filled up our entire root disk on the
> >> mon servers and we were running on a single monitor for hours while
> >> we tried to finish recovery and reclaim space. The past couple of
> >> weeks we also noticed "pg not scrubbed in time" errors but are
> >> unsure if they are related. I'm still unsure of the exact cause of
> >> this (other than the general misplaced/degraded objects) and what
> >> kind of growth is acceptable for these store.db files.
> >>
> >> In order to get our downed mons restarted, we ended up backing up
> >> and copying the /var/lib/ceph/mon/* contents to a remote host,
> >> setting up an sshfs mount to that new host with large NVMe and SSD
> >> drives, ensuring the mount paths were owned by ceph, then clearing
> >> up enough space on the monitor host to start the service. This
> >> allowed our store.db directory to grow freely until the
> >> misplaced/degraded objects could recover and the monitors all
> >> rejoined eventually.
> >>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
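
For anyone hitting the same growth, a minimal sketch of the checks discussed
in this thread, assuming a hypothetical mon id "mon1"; ceph-monstore-tool
should only be pointed at a stopped mon or at a copy of its store.db:

    # Count keys per prefix (e.g. logm, osdmap) to see which one dominates the store
    ceph-monstore-tool /var/lib/ceph/mon/ceph-mon1 dump-keys | awk '{print $1}' | sort | uniq -c | sort -rn

    # Ask a running monitor to compact its store
    ceph tell mon.mon1 compact

    # Disable the insights mgr module if it turns out to be enabled (see the tracker issue above)
    ceph mgr module disable insights

As noted above, compaction only reclaims space once the maps can be trimmed;
while PGs are not active+clean the mons keep the old OSDMaps and the store
will keep growing regardless.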