Re: CephFS metadata outgrow DISASTER during recovery

Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> · Wed, 16 Aug 2023 14:36:39 +0200 (CEST)

Seems it could be the same issue somewhere deep under the hood. 

Do you remember anything abnormal before this issue? Any reweighting, balancer runs, OSD restarts, anything that could cause PG peering? 

However, we were far from the nearfull state (as visible in the screenshot). We were at about 10% used and it went EXPONENTIALLY to 100% in very little time. 

I don't know what happened either. 

Thank you very much for letting me know that I am not the only one with an issue like this :) 

Kind regards, 

Jakub Petrzilka 

From: "Anh Phan Tuan" <anhphan.net@xxxxxxxxx> 
To: "Jakub Petrzilka" <jakub.petrzilka@xxxxxxxxx> 
Cc: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxx> 
Sent: Wednesday, August 9, 2023 2:49:29 PM 
Subject: Re:  Re: CephFS metadata outgrow DISASTER during recovery 

Hi All, 
It seems I also faced a similar case last year. I have about 160 HDDs of mixed sizes and 12 x 480GB NVMe SSDs for the metadata pool. 

I became aware of the incident when the SSD OSDs went into the nearfull state. I increased the nearfull ratio, but these OSDs continued to grow for an unknown reason. 

This is production, so I changed all the rules to use HDD-class OSDs. After migrating the rules, the SSD OSDs still continued to grow. I had to take all of these OSDs out. 

The cluster has worked well from then until now (I have not fully restarted the cluster since then). 

As I remember, the cluster ran Octopus 15.2.11 at the time of the incident (now 15.2.15). 

(I also recreated some of the SSD OSDs in the cluster before I removed all SSD OSDs; the recreated ones also grew fast. After I removed all SSD OSDs and recreated some, those OSDs have stayed at size 0 until now.) 

I still don't know what happened. 

PS: my metadata pool is now only 4.6GB stored with 4.4GB omap (my store is mostly media so the metadata pool is small), the total data pool is about 500TB. 

Regards, 

On Fri, Jul 28, 2023 at 12:46 AM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> wrote: 

Hi Kotresh, 

It seems the screenshot didn't get to the website or got lost somewhere on the way. 
It can be downloaded from: https://uloz.to/tam/0d273456-a02e-41a8-9fae-f06caba0a71d . 

None of the metadata disks failed. One of the data HDDs failed and the cluster started recovering. Or one of the data HDDs was failing and one of our operators ran 'ceph osd crush reweight <osd> 0' to drain it. 
I think it does not really matter how the recovery started. It happens every time any recovery is running, but on a much smaller scale (+5-10% space). In this incident the growth was huge. 

"You mean to say that the size of the mds metadata pool grew exponentially" > YES, it grew more and more, as shown in the screenshot, until there was no free space left in the CRUSH root dedicated to metadata and all OSDs died with -ENOSPC (not sure about the exact code; I mean the one where BlueStore can't allocate space). 
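To put "grew exponentially" into numbers, here is a minimal sketch that estimates a doubling time and the remaining time to 100% from two usage readings. It assumes a pure exponential model, which may well not match what the OSDs actually did, and the two readings in the example are hypothetical values, not taken from the screenshot.

```python
import math


def exponential_time_to_full(t0, used0, t1, used1):
    """Estimate doubling time and remaining time until 100% usage.

    Assumes used(t) = used0 * exp(rate * (t - t0)); this is only a model.
    Times are in minutes, usage values are fractions between 0 and 1.
    Returns (doubling_time, remaining_time_from_t1), both in minutes.
    """
    if not (0 < used0 < used1 <= 1):
        raise ValueError("need 0 < used0 < used1 <= 1")
    if t1 <= t0:
        raise ValueError("need t1 > t0")
    rate = math.log(used1 / used0) / (t1 - t0)   # growth rate per minute
    doubling = math.log(2) / rate                # time for usage to double
    remaining = math.log(1.0 / used1) / rate     # time from t1 until 100%
    return doubling, remaining


# Hypothetical readings: 10% used at t=0 min, 20% used at t=30 min.
doubling_min, remaining_min = exponential_time_to_full(0, 0.10, 30, 0.20)
print(f"doubling time: {doubling_min:.1f} min, time to full: {remaining_min:.1f} min")
```

With a doubling time this short, raising the nearfull ratio buys very little time, so an alert on growth rate may be more useful than one on absolute usage.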

Kind regards, 
Jakub 

From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx> 
To: "Jakub Petrzilka" <jakub.petrzilka@xxxxxxxxx> 
Cc: "ceph-users" <ceph-users@xxxxxxx>, "jakub petrzilka" <jakub.petrzilka@xxxxxxxxxxxxx> 
Sent: Wednesday, July 26, 2023 7:47:08 AM 
Subject: Re:  CephFS metadata outgrow DISASTER during recovery 

Hi Jakub, 

Comments inline. 

On Tue, Jul 25, 2023 at 11:03 PM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> wrote: 

Hello everyone! 

Recently we had a very nasty incident with one of our Ceph storage clusters. 

During a basic backfill recovery operation caused by a faulty disk, the CephFS metadata started growing exponentially until it used all available space and the whole cluster DIED. A screenshot of the usage graph is attached. 

Did you miss attaching the screenshot? 
So there were 12 x 240 GB SSDs backing the metadata pool, and one of these disks failed? 
Could you please share the recovery steps you took after the disk failed? 

BQ_BEGIN 

Everything happened very fast. Even after the OSDs were marked full, they tripped the failsafe and ate all the free blocks while still trying to allocate space, then died completely with no possibility of even starting them again. 

Do you mean that the MDS metadata pool grew exponentially beyond its allocated size and the MDS process eventually died? 

BQ_BEGIN 

The only solution was to copy the whole BlueStore to a bigger SSD and resize the underlying BlueStore device. Only about 1/3 of the OSDs were able to start after the move, but that was enough since we have very redundant settings for CephFS metadata. Basically, the metadata was moved from 12x 240 GB SSDs to 12x 500 GB SSDs to have enough space to start again. 

Brief info about the cluster: 
- CephFS data is stored on ~500x 8TB SAS HDDs using 10+2 erasure coding (EC) across 18 hosts. 
- CephFS metadata is stored on ~12x 500GB SAS/SATA SSDs using 5x replication on 6 hosts. 
- The version was one of the latest 16.x.x Pacific releases at the time of the incident. 
- 3x MON+MGR, plus 2 active and 2 hot-standby MDS, run on separate virtual servers. 
- The typical file size is from hundreds of MBs to tens of GBs. 
- This cluster is not the biggest and does not have the most HDDs or any special config; I simply see nothing special about it. 
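As a rough sanity check on the figures above, the sketch below works out nominal usable capacity for the metadata pool before and after the move, and for the EC data pool. It is back-of-the-envelope arithmetic only: it assumes decimal units, an even data distribution, and ignores BlueStore overhead and full ratios, so real numbers will be lower. The function names are illustrative.

```python
def replicated_usable(num_osds: int, osd_size: float, replicas: int) -> float:
    """Nominal usable capacity of a replicated pool: raw capacity / replica count."""
    return num_osds * osd_size / replicas


def ec_usable(num_osds: int, osd_size: float, k: int, m: int) -> float:
    """Nominal usable capacity of a k+m erasure-coded pool: raw capacity * k / (k + m)."""
    return num_osds * osd_size * k / (k + m)


# Metadata pool, 5x replication, sizes in GB.
meta_before_gb = replicated_usable(12, 240, 5)   # 12 x 240 GB SSDs
meta_after_gb = replicated_usable(12, 500, 5)    # 12 x 500 GB SSDs

# Data pool, 10+2 EC on ~500 x 8 TB HDDs, sizes in TB.
data_tb = ec_usable(500, 8, 10, 2)

print(f"metadata usable before: {meta_before_gb:.0f} GB")
print(f"metadata usable after:  {meta_after_gb:.0f} GB")
print(f"data usable:            {data_tb:.0f} TB")
```

Under these assumptions the metadata pool had roughly 576 GB of usable space on the 240 GB SSDs, so at about 10% used it held on the order of tens of GB before the growth started.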

During investigation I found out the following: 
- Metadata grows whenever recovery is running on any of the clusters we maintain (~15 clusters of different usages and sizes), but not this much; this was an extreme situation. 
- After recovery finishes, the size goes back to normal. 
- I think there is a slight correlation with recovery width (the number of objects recovery has to touch in order to recover everything) and recovery length (time). But I have no proof. 
- nothing much else 
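One way to collect evidence for the correlation described above would be to record metadata pool usage before and during each recovery. The sketch below is only an illustration: it assumes the JSON printed by `ceph df --format json` contains a `pools` list whose entries have `name` and `stats.bytes_used` (please verify this against your release), and the pool name `cephfs_metadata` and the sample numbers are made up.

```python
import json


def pool_bytes_used(ceph_df_json: str, pool_name: str) -> int:
    """Return bytes_used for one pool from `ceph df --format json` text.

    The field names used here are an assumption about the JSON layout.
    """
    report = json.loads(ceph_df_json)
    for pool in report["pools"]:
        if pool["name"] == pool_name:
            return pool["stats"]["bytes_used"]
    raise KeyError(f"pool {pool_name!r} not found")


def growth_percent(baseline_bytes: int, current_bytes: int) -> float:
    """Growth of current usage relative to the pre-recovery baseline, in percent."""
    return (current_bytes - baseline_bytes) * 100.0 / baseline_bytes


# Made-up sample standing in for real `ceph df --format json` output.
sample = json.dumps({
    "pools": [
        {"name": "cephfs_data", "stats": {"bytes_used": 5000}},
        {"name": "cephfs_metadata", "stats": {"bytes_used": 110}},
    ]
})

baseline = 100  # hypothetical bytes_used recorded before recovery started
current = pool_bytes_used(sample, "cephfs_metadata")
print(f"metadata pool growth during recovery: {growth_percent(baseline, current):.1f}%")
```

Logging this value together with the number of degraded or misplaced objects over the course of a recovery would give something concrete to compare across clusters.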

I would like to find out why this happened, because I think it can happen again sometime and someone might lose data if they are less lucky. 
Any ideas are appreciated, as is any info on whether anyone has seen similar behavior or whether I am the only one struggling with an issue like this :) 

Kind regards, 

Jakub Petrzilka 
_______________________________________________ 
ceph-users mailing list -- ceph-users@xxxxxxx 
To unsubscribe send an email to ceph-users-leave@xxxxxxx 

BQ_END 

Thanks and Regards, 
Kotresh H R 


BQ_END
