Re: CephFS metadata outgrow DISASTER during recovery


 



Hi Kotresh, 

It seems the screenshot didn't make it to the website or got lost somewhere along the way.
It can be downloaded from: https://uloz.to/tam/0d273456-a02e-41a8-9fae-f06caba0a71d

None of the metadata disks failed. Either one of the data HDDs failed and the cluster started recovering, or one of the data HDDs was failing and one of our operators ran 'ceph osd crush reweight <osd> 0' to drain it.
I don't think it really matters how the recovery started. This happens every time any recovery is running, but on a much smaller scale (+5-10% space). In this incident the growth was huge.

"You mean to say that the size of the mds metadata pool grew exponentially": YES, it kept growing, as shown in the screenshot, until there was no free space left in the CRUSH root dedicated to metadata and all OSDs died with -ENOSPC (I'm not sure about the exact code; I mean the error raised when BlueStore can't allocate space).
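
One way to catch this kind of growth while a recovery is running is to poll the pool statistics and warn well before the OSDs hit their full ratio. The following is only a minimal sketch: it assumes the JSON printed by `ceph df --format json` contains a `pools` list whose entries carry `name` and a `stats` object with `percent_used` (as a fraction) and `max_avail`; please verify the field names on your release. The pool name `cephfs_metadata`, the 70% threshold and the numbers in `SAMPLE` are made up for illustration and are not taken from the incident.

```python
import json

# Illustrative input only: assumed shape of `ceph df --format json` output.
SAMPLE = json.dumps({
    "pools": [
        {
            "name": "cephfs_metadata",  # hypothetical pool name
            "stats": {
                "stored": 150 * 10**9,
                "max_avail": 50 * 10**9,
                "percent_used": 0.75,   # assumed to be a fraction, not a percentage
            },
        },
    ]
})


def metadata_pool_status(ceph_df_json, pool_name, warn_at=0.70):
    """Return usage of one pool and whether it crossed the warning threshold."""
    report = json.loads(ceph_df_json)
    for pool in report.get("pools", []):
        if pool.get("name") == pool_name:
            stats = pool["stats"]
            fraction = float(stats["percent_used"])
            return {
                "pool": pool_name,
                "fraction_used": fraction,
                "max_avail_bytes": int(stats["max_avail"]),
                "warn": fraction >= warn_at,
            }
    raise KeyError(f"pool {pool_name!r} not found in ceph df output")


if __name__ == "__main__":
    print(metadata_pool_status(SAMPLE, "cephfs_metadata"))
```

In practice the JSON string would come from running the command periodically (for example via `subprocess.run`) rather than from `SAMPLE`, and the threshold should sit comfortably below the cluster's nearfull ratio.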

Kind regards, 
Jakub 


From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx> 
To: "Jakub Petrzilka" <jakub.petrzilka@xxxxxxxxx> 
Cc: "ceph-users" <ceph-users@xxxxxxx>, "jakub petrzilka" <jakub.petrzilka@xxxxxxxxxxxxx> 
Sent: Wednesday, July 26, 2023 7:47:08 AM 
Subject: Re:  CephFS metadata outgrow DISASTER during recovery 

Hi Jakub, 

Comments inline. 

On Tue, Jul 25, 2023 at 11:03 PM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> wrote:


Hello everyone! 

Recently we had a very nasty incident with one of our Ceph storage clusters.

During a basic backfill recovery operation caused by a faulty disk, the CephFS metadata started growing exponentially until it used all available space and the whole cluster DIED. A usage graph screenshot is attached.



Did you miss attaching the screenshot?
So there were 12 x 240 GB SSDs backing the metadata pool, and one of these disks failed?
Could you please share the recovery steps you took after the disk failed?


BQ_BEGIN

Everything happened very fast. Even after the OSDs were marked full, they tripped the failsafe and ate all the free blocks, still trying to allocate space, and then died completely with no possibility of even starting them again.

BQ_END

You mean to say that the mds metadata pool grew exponentially beyond its allocated size and the mds process eventually died?


BQ_BEGIN

The only solution was to copy each whole BlueStore to a bigger SSD and resize the underlying BlueStore device. Only about 1/3 of the OSDs were able to start after the move, but that was enough since we have very redundant settings for the CephFS metadata. Basically, the metadata was moved from 12x 240 GB SSDs to 12x 500 GB SSDs to have enough space to start again.

Brief info about the cluster: 
- CephFS data is stored on ~500x 8 TB SAS HDDs using 10+2 erasure coding across 18 hosts.
- CephFS metadata is stored on ~12x 500 GB SAS/SATA SSDs using 5x replication across 6 hosts.
- The version was one of the latest 16.x.x Pacific releases at the time of the incident.
- 3x Mon+mgr, 2 active MDS and 2 hot-standby MDS run on separate virtual servers.
- The typical size of a stored file ranges from hundreds of MB to tens of GB.
- This cluster is not the biggest and does not have the most HDDs or any special config; I simply see nothing special about it.
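
To put the figures above in perspective, here is a rough back-of-the-envelope sketch of the usable capacity they imply. It is only an approximation under stated assumptions: the disk counts are the approximate ones quoted above, and the calculation ignores full ratios, BlueStore overhead, uneven data distribution and the difference between decimal and binary units. The helper names are hypothetical.

```python
def replicated_usable(disks, disk_size, replicas):
    """Usable capacity of a replicated pool: raw capacity divided by replica count."""
    return disks * disk_size / replicas


def ec_usable(disks, disk_size, k, m):
    """Usable capacity of a k+m erasure-coded pool: raw capacity times k / (k + m)."""
    return disks * disk_size * k / (k + m)


# Metadata pool, 5x replication, before and after the move (GB).
old_metadata_gb = replicated_usable(12, 240, 5)   # 2880 GB raw
new_metadata_gb = replicated_usable(12, 500, 5)   # 6000 GB raw

# Data pool, 10+2 erasure coding (TB).
data_tb = ec_usable(500, 8, 10, 2)                # 4000 TB raw

if __name__ == "__main__":
    print(f"metadata before: {old_metadata_gb:.0f} GB usable")
    print(f"metadata after:  {new_metadata_gb:.0f} GB usable")
    print(f"data:            {data_tb:.0f} TB usable")
```

Under these assumptions the metadata pool had roughly 576 GB of usable space before the move and roughly 1200 GB after it, against about 3333 TB of usable data capacity.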

During investigation I found out the following: 
- The metadata grows whenever recovery is running on any of the clusters we maintain (~15 clusters of different usages and sizes), but never this much; this was an extreme situation.
- After the recovery finished, the size went back to normal.
- I think there is a slight correlation with recovery width (the number of objects recovery has to touch in order to recover everything) and recovery length (time), but I have no proof.
- Nothing much else.
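
The suspected correlation in the list above could be checked once a few recoveries have been recorded. The following is only a sketch of how that might be done: it computes a plain Pearson correlation coefficient between two equally long series, for example recovery duration per incident and peak metadata growth per incident. Both series and how they are collected are assumptions; no such measurements are given in this thread.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)
```

A value close to 1 across many recoveries would support the hypothesis, while a value near 0 would not. With only a handful of incidents the coefficient is not very meaningful, so it should be read as a hint rather than as proof.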

I would like to find out why this happened, because I think it can happen again and someone with less luck might lose data.
Any ideas are appreciated, as is any information on whether anyone has seen similar behavior or whether I am the only one struggling with an issue like this :)

Kind regards, 

Jakub Petrzilka 
_______________________________________________ 
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


BQ_END

Thanks and Regards, 
Kotresh H R 



