Re: CephFS metadata outgrow DISASTER during recovery


Hi Jakub,

Comments inline.

On Tue, Jul 25, 2023 at 11:03 PM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx> wrote:

> Hello everyone!
> Recently we had a very nasty incident with one of our Ceph storage
> clusters. During a basic backfill recovery operation caused by a faulty
> disk, the CephFS metadata started growing exponentially until it used all
> available space and the whole cluster DIED. Usage graph screenshot is in
> the attachment.

Did you miss attaching the screenshot?
So there were 12 x 240 GB SSDs backing the metadata pool, and one of these
disks failed?
Could you please share the recovery steps you took after the disk failed?

> Everything happened very fast. Even after the OSDs were marked full, they
> tripped the failsafe and ate all the free blocks while still trying to
> allocate space, then died completely with no way to even start them again.

Do you mean that the MDS metadata pool grew beyond its allocated size and
the MDS process eventually died?
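
For reference, here is a rough sketch of how OSD utilisation maps onto the
full thresholds you mention. The ratios are the stock defaults as far as I
know (nearfull 0.85, backfillfull 0.90, full 0.95, osd_failsafe_full_ratio
0.97); your cluster may override them, and the helper name classify_osd is
made up for illustration only.

```python
# Assumed stock defaults, not values taken from the affected cluster.
# nearfull/backfillfull/full are shown by `ceph osd dump | grep ratio`;
# osd_failsafe_full_ratio is an OSD config option.
THRESHOLDS = [
    ("failsafe_full", 0.97),
    ("full", 0.95),
    ("backfillfull", 0.90),
    ("nearfull", 0.85),
]


def classify_osd(used_bytes: int, total_bytes: int) -> str:
    """Return the highest threshold an OSD has crossed, or 'ok'."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    # Thresholds are ordered from highest to lowest, so the first hit wins.
    for name, limit in THRESHOLDS:
        if ratio >= limit:
            return name
    return "ok"
```

An OSD that is already past the failsafe ratio has very little room left for
BlueStore to allocate into, which would be consistent with OSDs that cannot
start again, but that is only my guess until we see the logs.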

> The only solution was to copy the whole BlueStore to a bigger SSD and
> resize the underlying BlueStore device. Only about 1/3 of the OSDs were
> able to start after the move, but that was enough since we have very
> redundant settings for CephFS metadata.
> Basically, the metadata was moved from 12x 240GB SSD to 12x 500GB SSD to
> have enough space to start again.
>
> Brief info about the cluster:
> - CephFS data is stored on ~500x 8TB SAS HDD using 10+2 EC in 18 hosts.
> - CephFS metadata is stored on ~12x 500GB SAS/SATA SSD using 5x
> replication on 6 hosts.
> - The version was one of the latest 16.x.x Pacific releases at the time of
> the incident.
> - 3x Mon+mgr plus 2 active and 2 hot standby MDS run on separate virtual
> servers.
> - Typical file size ranges from hundreds of MBs to tens of GBs.
> - This cluster is not the biggest, does not have the most HDDs, and has no
> special config. I simply see nothing special about this cluster.
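
To put numbers on the metadata capacity, here is a back-of-the-envelope
sketch using the figures you quoted (12 SSDs, 5x replication). It ignores
BlueStore overhead, uneven PG distribution and the full ratios, so the real
usable space will be lower. The function name usable_gb is only
illustrative.

```python
def usable_gb(num_osds: int, osd_size_gb: float, replicas: int) -> float:
    """Rough usable capacity of a replicated pool, in GB.

    Simplification: raw capacity divided by the replica count, with no
    allowance for BlueStore overhead, imbalance, or full ratios.
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    return num_osds * osd_size_gb / replicas


before = usable_gb(12, 240, 5)  # original 240GB SSDs
after = usable_gb(12, 500, 5)   # after moving to 500GB SSDs
print(f"before: {before} GB usable, after: {after} GB usable")
```

So with 5x replication the original SSDs gave roughly 576 GB of usable
metadata space at best, and any temporary growth during recovery is
multiplied by five in raw terms.
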
>
> During the investigation I found out the following:
> - Metadata grows whenever recovery is running on any of the clusters we
> maintain (~15 clusters of different usages and sizes), but not this much.
> This was an extreme situation.
> - After recovery finished, the size went back to normal.
> - I think there is a slight correlation with recovery width (the number of
> objects recovery has to touch in order to recover everything) and recovery
> length (time), but I have no proof.
> - Nothing much else.
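
If you want proof for the correlation, it may help to record the metadata
pool usage while recovery runs. Below is a minimal sketch that pulls one
pool's numbers out of `ceph df --format json` output. The field names
(pools, name, stats, stored, bytes_used, max_avail) are what I would expect
from Pacific, so please verify them against your own output. The function
name pool_usage is hypothetical.

```python
import json


def pool_usage(ceph_df_json: str, pool_name: str) -> dict:
    """Extract usage numbers for one pool from `ceph df --format json` text.

    Assumes the JSON has a top-level "pools" list whose entries carry a
    "name" and a "stats" object; missing fields come back as None.
    """
    report = json.loads(ceph_df_json)
    for pool in report.get("pools", []):
        if pool.get("name") == pool_name:
            stats = pool.get("stats", {})
            return {
                "stored": stats.get("stored"),
                "bytes_used": stats.get("bytes_used"),
                "max_avail": stats.get("max_avail"),
            }
    raise KeyError(f"pool {pool_name!r} not found")
```

Feeding it the command output every minute or so during a backfill (with
your actual metadata pool name) and logging the result next to the recovery
progress should show whether growth tracks the number of objects being
recovered.
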
>
> I would like to find out why this happened, because I think it can happen
> again and someone might lose data if they are less lucky.
> Any ideas are appreciated, as is any info on whether anyone has seen
> similar behavior, or whether I am the only one struggling with an issue
> like this :)
>
> Kind regards,
> Jakub Petrzilka
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
Thanks and Regards,
Kotresh H R