Hi, I was quite anxious a few hours ago because the sst files were growing so
fast that I didn't think the free space on the mon servers could keep up. Let
me tell the story from the beginning. I have a cluster with OSDs deployed on
SATA disks (7200 rpm), 10 TB per OSD, and I use an EC pool to get more usable
space. I added new OSDs to the cluster last week and recovery had been going
well so far. After that, while the cluster was still recovering, I increased
pg_num. On top of that, clients were writing data to the cluster the whole
time. The cluster became unhealthy last night: some OSDs went down and one
mon went down. Then I found that the root filesystems on the mon servers were
running out of free space, and the sst files in
/var/lib/ceph/mon/ceph-xxx/store.db/ were growing rapidly.

On Thu, Oct 29, 2020 at 7:15 PM Frank Schilder <frans@xxxxxx> wrote:
> I think you really need to sit down and explain the full story. Dropping
> one-liners with new information will not work via e-mail.
>
> I have never heard of the problem you are facing, so you did something
> that possibly no-one else has done before. Unless we know the full history
> from the last time the cluster was health_ok until now, it will almost
> certainly not be possible to figure out what is going on via e-mail.
>
> Usually, setting "norebalance" and "norecovery" should stop any recovery
> IO and allow the PGs to peer. If they do not become active, something is
> wrong and the information we got so far does not give a clue what this
> could be.
>
> Please post the output of "ceph health detail", "ceph osd pool stats" and
> "ceph osd pool ls detail", plus a log of actions and results since the
> last health_ok status; maybe it gives a clue what is going on.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 09:44:14
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: monitor sst files continue growing
>
> I reset pg_num after adding the OSDs; it made some PGs inactive (stuck in
> the activating state).
>
> On Thu, Oct 29, 2020 at 3:56 PM Frank Schilder <frans@xxxxxx> wrote:
> This does not explain incomplete and inactive PGs. Are you hitting
> https://tracker.ceph.com/issues/46847 (see also the thread "Ceph does not
> recover from OSD restart")? In that case, temporarily stopping and
> restarting all new OSDs might help.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 08:30:25
> To: Frank Schilder
> Cc: ceph-users
> Subject: Re: monitor sst files continue growing
>
> After adding the OSDs to the cluster, the recovery and backfill progress
> has not finished yet.
>
> On Thu, Oct 29, 2020 at 3:29 PM Zhenshi Zhou <deaderzzs@xxxxxxxxx> wrote:
> The MGR was stopped by me because it took too much memory.
> As for the PG status, I added some OSDs to this cluster, and it
>
> On Thu, Oct 29, 2020 at 3:27 PM Frank Schilder <frans@xxxxxx> wrote:
> Your problem is the overall cluster health. The MONs store cluster history
> information that will be trimmed once the cluster reaches HEALTH_OK.
> Restarting the MONs only makes things worse right now. The health status
> is a mess: no MGR, a bunch of PGs inactive, etc. This is what you need to
> resolve. How did your cluster end up like this?
>
> It looks like all OSDs are up and in. You need to find out
>
> - why there are inactive PGs
> - why there are incomplete PGs
>
> This usually happens when OSDs go missing.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Zhenshi Zhou <deaderzzs@xxxxxxxxx>
> Sent: 29 October 2020 07:37:19
> To: ceph-users
> Subject: monitor sst files continue growing
>
> Hi all,
>
> My cluster is in a bad state. The SST files in
> /var/lib/ceph/mon/xxx/store.db keep growing, and the cluster complains
> that the mons are using a lot of disk space.
>
> I set "mon compact on start = true" and restarted one of the monitors,
> but it has been starting up and compacting for a long time, with no end
> in sight.
>
> [image.png]

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
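For anyone hitting the same problem, here is a minimal sketch of the
store-size check and on-line compaction discussed in the thread, assuming the
default mon data path and using "xxx" as a placeholder for the mon id (both
placeholders are taken from the thread, not real names). Note that, as Frank
points out, the mon store is only trimmed once the cluster is back to
HEALTH_OK, so compaction alone will not stop the growth:

    # On each mon host: how large is the monitor's RocksDB store?
    du -sh /var/lib/ceph/mon/ceph-xxx/store.db

    # Ask a running monitor to compact its store without restarting it
    ceph tell mon.xxx compact

    # Equivalent of the setting mentioned above, applied at daemon start,
    # in ceph.conf under the [mon] section:
    #   mon compact on start = true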
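And a sketch of the recovery-pause and diagnostic commands Frank suggests, as
they would be typed on the CLI; note that the OSD flag is spelled "norecover"
on the command line (the "norecovery" in the mail refers to the same flag):

    # Pause data movement so the PGs get a chance to peer
    ceph osd set norebalance
    ceph osd set norecover

    # Information requested in the thread
    ceph health detail
    ceph osd pool stats
    ceph osd pool ls detail

    # Once the PGs are active again, re-enable rebalance and recovery
    ceph osd unset norebalance
    ceph osd unset norecover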