Hi, I was quite anxious a few hours ago because the sst files were growing so
fast that I didn't think the free space on the mon servers could keep up. Let
me tell the story from the beginning. I have a cluster with OSDs deployed on
SATA disks (7200 rpm), 10 TB per OSD, and I use an EC pool to get more usable
space. I added new OSDs to the cluster last week and recovery had been going
well so far. After that, while the cluster was still recovering, I increased
pg_num. On top of that, clients were writing data to the cluster the whole
time. The cluster became unhealthy last night: some OSDs went down and one
mon went down. Then I found that the root filesystems on the mon servers were
running out of free space, and the sst files in
/var/lib/ceph/mon/ceph-xxx/store.db/ were growing rapidly.

On Thu, Oct 29, 2020 at 7:15 PM Frank Schilder <frans@xxxxxx> wrote:
> I think you really need to sit down and explain the full story. Dropping
> one-liners with new information will not work via e-mail.
>
> I have never heard of the problem you are facing, so you did something
> that possibly no-one else has done before. Unless we know the full history
> from the last time the cluster was health_ok until now, it will almost
> certainly not be possible to figure out what is going on via e-mail.
>
> Usually, setting "norebalance" and "norecovery" should stop any recovery
> IO and allow the PGs to peer. If they do not become active, something is
> wrong and the information we got so far does not give a clue what this
> could be.
>
> Please post the output of "ceph health detail", "ceph osd pool stats" and
> "ceph osd pool ls detail", plus a log of actions and results since the
> last health_ok status; maybe it gives a clue what is going on.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 09:44:14
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: monitor sst files continue growing
>
> I reset pg_num after adding the OSDs; it made some PGs inactive (stuck in
> the activating state).
>
> On Thu, Oct 29, 2020 at 3:56 PM Frank Schilder <frans@xxxxxx> wrote:
> This does not explain incomplete and inactive PGs. Are you hitting
> https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not
> recover from OSD restart")? In that case, temporarily stopping and
> restarting all new OSDs might help.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 08:30:25
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: monitor sst files continue growing
>
> After adding the OSDs to the cluster, the recovery and backfill progress
> has not finished yet.
>
> On Thu, Oct 29, 2020 at 3:29 PM Zhenshi Zhou <deaderzzs@xxxxxxxxx> wrote:
> The MGR was stopped by me because it took too much memory.
> As for the PG status, I added some OSDs to this cluster, and it
>
> On Thu, Oct 29, 2020 at 3:27 PM Frank Schilder <frans@xxxxxx> wrote:
> Your problem is the overall cluster health. The MONs store cluster history
> information that will be trimmed once the cluster reaches HEALTH_OK.
> Restarting the MONs only makes things worse right now. The health status
> is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to
> resolve. How did your cluster end up like this?
>
> It looks like all OSDs are up and in. You need to find out
>
> - why there are inactive PGs
> - why there are incomplete PGs
>
> This usually happens when OSDs go missing.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 07:37:19
> To: ceph-users
> Subject: monitor sst files continue growing
>
> Hi all,
>
> My cluster is in a bad state. The SST files in
> /var/lib/ceph/mon/xxx/store.db keep growing, and the cluster complains
> that the mons are using a lot of disk space.
>
> I set "mon compact on start = true" and restarted one of the monitors,
> but it has been starting up and compacting for a long time, with no end
> in sight.
>
> [image.png]

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
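For anyone hitting the same problem, here is a minimal sketch of the
store-size check and on-line compaction discussed in the thread, assuming the
default mon data path and using "xxx" as a placeholder for the mon id (both
placeholders are taken from the thread, not real names). Note that, as Frank
points out, the mon store is only trimmed once the cluster is back to
HEALTH_OK, so compaction alone will not stop the growth:

    # On each mon host: how large is the monitor's RocksDB store?
    du -sh /var/lib/ceph/mon/ceph-xxx/store.db

    # Ask a running monitor to compact its store without restarting it
    ceph tell mon.xxx compact

    # Equivalent of the setting mentioned above, applied at daemon start,
    # in ceph.conf under the [mon] section:
    #   mon compact on start = true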
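And a sketch of the recovery-pause and diagnostic commands Frank suggests, as
they would be typed on the CLI; note that the OSD flag is spelled "norecover"
on the command line (the "norecovery" in the mail refers to the same flag):

    # Pause data movement so the PGs get a chance to peer
    ceph osd set norebalance
    ceph osd set norecover

    # Information requested in the thread
    ceph health detail
    ceph osd pool stats
    ceph osd pool ls detail

    # Once the PGs are active again, re-enable rebalance and recovery
    ceph osd unset norebalance
    ceph osd unset norecover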