Re: CephFS metadata outgrow DISASTER during recovery

Anh Phan Tuan <anhphan.net@xxxxxxxxx> · Wed, 9 Aug 2023 19:49:29 +0700

Hi All,

I seem to have hit a similar case last year. The cluster has about 160 HDDs
of mixed sizes and 12 x 480GB NVMe SSDs for the metadata pool.

I became aware of the incident when the SSD OSDs reached the nearfull state.
I increased the nearfull ratio, but these OSDs continued to grow for an
unknown reason.
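
As a rough illustration of how little room the full-ratio thresholds leave on
a 480GB OSD, here is a minimal sketch. It assumes the stock Ceph defaults
(nearfull 0.85, backfillfull 0.90, full 0.95); the ratios actually set on this
cluster are not stated, and the helper name and the 400 GB usage figure are
hypothetical.

```python
# Stock Ceph default ratios (assumption: the cluster may have been tuned).
DEFAULT_RATIOS = {"nearfull": 0.85, "backfillfull": 0.90, "full": 0.95}


def headroom_gb(capacity_gb, used_gb, ratios=DEFAULT_RATIOS):
    """Return the GB left before each threshold is crossed (negative = already past)."""
    return {
        name: round(capacity_gb * ratio - used_gb, 1)
        for name, ratio in ratios.items()
    }


if __name__ == "__main__":
    # Hypothetical OSD: 480 GB device with 400 GB already used.
    for name, left in headroom_gb(480, 400).items():
        print(f"{name}: {left} GB left")
```

Under these assumptions, raising nearfull from 0.85 to 0.90 only moves the
warning point by 24 GB per OSD, so it delays the warning but does not address
the growth itself.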

This is a production cluster, so I changed all the CRUSH rules to use
HDD-class OSDs. After the rule migration, the SSD OSDs still continued to
grow, so I had to take all of them out.
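
The rule change described above could be sketched roughly like this. The
snippet only builds the argument lists for two ceph CLI calls
(`ceph osd crush rule create-replicated` and `ceph osd pool set <pool>
crush_rule <rule>`) and does not run them. The rule name, CRUSH root, failure
domain, and pool name are hypothetical placeholders, not the values used on
this cluster, and it covers replicated pools only.

```python
def hdd_rule_commands(pool, rule="replicated_hdd", root="default",
                      failure_domain="host"):
    """Build (not run) the ceph CLI calls that pin a replicated pool to HDD-class OSDs."""
    return [
        # Create a replicated CRUSH rule restricted to the "hdd" device class.
        ["ceph", "osd", "crush", "rule", "create-replicated",
         rule, root, failure_domain, "hdd"],
        # Point the pool at the new rule; its PGs then move off the SSD OSDs.
        ["ceph", "osd", "pool", "set", pool, "crush_rule", rule],
    ]


if __name__ == "__main__":
    # "cephfs_metadata" is an assumed pool name for illustration.
    for argv in hdd_rule_commands("cephfs_metadata"):
        print(" ".join(argv))
```

Printing the commands first, rather than executing them, leaves room to review
them before anything touches a production cluster.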

The cluster has worked well from then until now (I have not fully restarted
the cluster since then).

As I remember, the cluster ran Octopus 15.2.11 at the time of the incident
(it now runs 15.2.15).

(I also recreated some of the SSD OSDs before I removed all of them, and the
recreated ones also grew fast. After I removed all the SSD OSDs and recreated
some, those OSDs have stayed at size 0 until now.)

I still don't know what happened.

PS: my metadata pool now has only 4.6GB stored, of which 4.4GB is omap (my
store is mostly media, so the metadata pool is small). The data pool totals
about 500TB.
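
To put the PS figures in proportion, here is a small sketch that is pure
arithmetic on the numbers quoted in this message. It treats GB as decimal and
ignores replication, since the pool's replica count is not stated; the
function name is made up for illustration.

```python
def omap_share(stored_gb, omap_gb):
    """Fraction of the pool's stored bytes that is omap data."""
    return omap_gb / stored_gb


# Figures quoted in this message.
STORED_GB = 4.6          # metadata pool, stored
OMAP_GB = 4.4            # of which omap
RAW_SSD_GB = 12 * 480    # raw capacity of the former SSD tier

if __name__ == "__main__":
    print(f"omap share of metadata pool: {omap_share(STORED_GB, OMAP_GB):.1%}")
    print(f"raw capacity of the 12 SSDs: {RAW_SSD_GB} GB")
    print(f"stored metadata vs raw SSD capacity: {STORED_GB / RAW_SSD_GB:.3%}")
```

Even allowing for several replicas, metadata of this size is a tiny fraction
of what the SSD tier could hold, which is why OSDs growing to nearfull was so
unexpected.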

Regards,

On Fri, Jul 28, 2023 at 12:46 AM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx>
wrote:

> Hi Kotresh,
>
> it seems the screenshot didn't get to the website or got lost somewhere on
> the way.
> It can be downloaded from:
> https://uloz.to/tam/0d273456-a02e-41a8-9fae-f06caba0a71d
>
> None of the metadata disks failed. Either one of the data HDDs failed and
> the cluster started recovering, or one of the data HDDs was failing and one
> of our operators ran 'ceph osd crush reweight <osd> 0' to drain it.
> I think it does not really matter how the recovery started. The growth
> happens every time any recovery is running, but on a much smaller scale
> (+5-10% space). In this incident the growth was huge.
>
> > You mean to say that the size of the mds metadata pool grew exponentially
>
> YES, it grew more and more, as shown on the screenshot, until no free space
> was left in the CRUSH root dedicated to metadata and all OSDs died with
> -ENOSPC (not sure about the exact code; I mean the one where BlueStore
> can't allocate space).
>
> Kind regards,
> Jakub
>
>
> From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>
> To: "Jakub Petrzilka" <jakub.petrzilka@xxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxx>, "jakub petrzilka" <
> jakub.petrzilka@xxxxxxxxxxxxx>
> Sent: Wednesday, July 26, 2023 7:47:08 AM
> Subject: Re:  CephFS metadata outgrow DISASTER during recovery
>
> Hi Jakub,
>
> Comments inline.
>
> On Tue, Jul 25, 2023 at 11:03 PM Jakub Petrzilka <jakub.petrzilka@xxxxxxxxx>
> wrote:
>
>
> > Hello everyone!
> >
> > Recently we had a very nasty incident with one of our Ceph storage
> > clusters.
> >
> > During a basic backfill recovery operation caused by a faulty disk, the
> > CephFS metadata started growing exponentially until it used all available
> > space and the whole cluster DIED. A usage graph screenshot is attached.
>
>
>
> Did you miss attaching the screenshot?
> So there were 12 x 240GB SSDs backing the metadata pool, and one of these
> disks failed?
> Could you please share the recovery steps you took after the disk failed?
>
>
> > Everything was very fast, and even when the OSDs were marked full they
> > tripped the failsafe and ate all the free blocks, still trying to
> > allocate space, and then died completely without any possibility of even
> > starting them again.
>
> Do you mean to say that the size of the MDS metadata pool grew
> exponentially beyond the allocated size and the MDS process eventually
> died?
>
>
> > The only solution was to copy each whole BlueStore to a bigger SSD and
> > resize the underlying BlueStore device. Only about 1/3 of the OSDs were
> > able to start after the move, but that was enough since we have very
> > redundant settings for CephFS metadata.
> > Basically, the metadata were moved from 12x 240GB SSDs to 12x 500GB SSDs
> > to have enough space to start again.
> >
> > Brief info about the cluster:
> > - CephFS data are stored on ~500x 8TB SAS HDDs using 10+2 erasure coding
> > across 18 hosts.
> > - CephFS metadata are stored on ~12x 500GB SAS/SATA SSDs using 5x
> > replication on 6 hosts.
> > - The version was one of the latest 16.x.x Pacific releases at the time
> > of the incident.
> > - 3x Mon+mgr, 2 active MDS, and 2 hot standby MDS run on separate virtual
> > servers.
> > - The typical file size ranges from hundreds of MBs to tens of GBs.
> > - This cluster is not the biggest, does not have the most HDDs, and has
> > no special config; I simply see nothing special about it.
> >
> > During the investigation I found out the following:
> > - Metadata grow any time recovery is running on any of the clusters we
> > maintain (~15 clusters of different usages and sizes), but not this
> > much; this was an extreme situation.
> > - After the recovery finished, the size went back to normal.
> > - I think there is a slight correlation with recovery width (the objects
> > to be touched by recovery in order to recover everything) and recovery
> > length (time), but I have no proof.
> > - Nothing much else.
> >
> > I would like to find out why this happened, because I think it can
> > happen again sometime and someone with less luck might lose data.
> > Any ideas are appreciated, as is any info on whether anyone has seen
> > similar behavior or whether I am the only one struggling with an issue
> > like this :)
> >
> > Kind regards,
> >
> > Jakub Petrzilka
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
> Thanks and Regards,
> Kotresh H R
>