CephFS metadata outgrow DISASTER during recovery

Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> · Tue, 25 Jul 2023 19:31:16 +0200 (CEST)

Hello everyone!

Recently we had a very nasty incident with one of our Ceph storage clusters.

During a basic backfill recovery operation caused by a faulty disk, the CephFS metadata started growing exponentially until they used all available space and the whole cluster DIED. A usage graph screenshot is attached.

Everything happened very fast. Even after the OSDs were marked full, they tripped the failsafe and ate all the free blocks while still trying to allocate space, then died completely with no possibility of even starting them again.

The only solution was to copy each whole BlueStore to a bigger SSD and resize the underlying BlueStore device. Only about 1/3 of the OSDs were able to start after the move, but that was enough since we have very redundant settings for the CephFS metadata. Basically, the metadata were moved from 12x 240 GB SSDs to 12x 500 GB SSDs to have enough space to start again.

Brief info about the cluster:
- CephFS data are stored on ~500x 8 TB SAS HDDs using 10+2 erasure coding (EC) across 18 hosts.
- CephFS metadata are stored on ~12x 500 GB SAS/SATA SSDs using 5x replication across 6 hosts.
- The version was one of the latest 16.x.x Pacific releases at the time of the incident.
- 3x MON+MGR, 2 active MDS and 2 hot-standby MDS run on separate virtual servers.
- Typical file sizes range from hundreds of MB to tens of GB.
- This cluster is not the biggest, does not have the most HDDs and has no special config. I simply see nothing special about it.
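
For a rough sense of how much metadata headroom this layout gives, a back-of-the-envelope calculation can be sketched as below. It assumes the 5x replicated pool spreads evenly over the 12 SSDs and ignores BlueStore overhead, full ratios and anything else stored on those disks; the function name `usable_capacity_gb` is only illustrative.

```python
def usable_capacity_gb(num_osds: int, osd_size_gb: float, replicas: int) -> float:
    """Rough usable capacity of a replicated pool, ignoring all overheads."""
    return num_osds * osd_size_gb / replicas


# Disk counts, sizes and replication factor are the ones from this mail.
old = usable_capacity_gb(num_osds=12, osd_size_gb=240, replicas=5)
new = usable_capacity_gb(num_osds=12, osd_size_gb=500, replicas=5)

print(f"before the move: ~{old:.0f} GB usable for metadata")
print(f"after the move:  ~{new:.0f} GB usable for metadata")
```

The real numbers will be lower, because OSDs stop accepting writes well before the raw device is 100% full.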

During investigation I found out the following:
- Metadata grow whenever recovery is running on any of the clusters we maintain (~15 clusters of different usages and sizes), but not this much; this was an extreme situation.
- After recovery finished, the size went back to normal.
- I think there is a slight correlation with recovery width (the number of objects recovery has to touch in order to recover everything) and recovery length (time). But I have no proof.
- Nothing much else.
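
One way to check whether the growth during a recovery is really exponential, and how much time is left before the pool fills, is to fit a log-linear model to usage samples. The following is a minimal sketch under that assumption; the function names are hypothetical and the sample values are invented for illustration, not taken from the incident.

```python
import math


def fit_exponential_rate(samples):
    """Least-squares fit of ln(used) = a + r * t; returns r in 1/second.

    samples: list of (timestamp_seconds, used_bytes) with used_bytes > 0.
    """
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [math.log(used) for _, used in samples]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    den = sum((t - t_mean) ** 2 for t in ts)
    return num / den


def seconds_until_full(samples, capacity_bytes):
    """Estimated time until usage reaches capacity_bytes if growth continues."""
    rate = fit_exponential_rate(samples)
    if rate <= 0:
        return math.inf  # not growing, so no estimated fill time
    last_used = samples[-1][1]
    return max(0.0, math.log(capacity_bytes / last_used) / rate)


# Invented samples: usage doubles every 10 minutes.
samples = [(0, 10e9), (600, 20e9), (1200, 40e9)]
doubling_time = math.log(2) / fit_exponential_rate(samples)
eta = seconds_until_full(samples, capacity_bytes=160e9)
print(f"doubling time ~{doubling_time:.0f} s, full in ~{eta:.0f} s")
```

If the fitted doubling time stays roughly constant over a recovery, the growth is exponential; if it keeps increasing, the growth is slower than that.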

I would like to find out why this happened, because I think it can happen again and someone with less luck might lose data.
Any ideas are appreciated, as is any info on whether anyone has seen similar behavior or whether I am the only one struggling with an issue like this :)

Kind regards,

Jakub Petrzilka
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx