Re: Cephfs metadata pool suddenly full (100%) ! [SOLVED but no explanation at this time!]

Hervé Ballans <herve.ballans@xxxxxxxxxxxxx> · Tue, 1 Jun 2021 16:57:04 +0200

Thank you Dan and Sebastian for trying to help me.

We managed to get back to a normal situation, but we still don't
understand how the problem happened...

How did we get back to a healthy state?
"Fortunately", we had 3 other "spare" NVMe devices on the cluster that we
were not using yet. We added them to the metadata pool to spread the
metadata. Once this was done, no OSD was full any more, but the MDS
daemons were in a failed state (1 up, 1 replay and 1 failed). There was
very significant trimming activity and, when it finished, we restarted
one of the MDS servers; the MDS status was then OK (active/active/standby).
After that, the usage of the metadata OSDs decreased back to a
normal level (close to 3%...)! OK, that was really fine but...
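
For anyone hitting the same situation, here is a minimal sketch of the read-only checks that show the state described above (full OSDs, MDS ranks in replay/failed, then usage going back down). It assumes the standard `ceph` CLI with an admin keyring on the node; the `CEPH` variable and the function name are only illustrative and are not part of Ceph.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): print the cluster views that
# matter while a full metadata pool is recovering.
# CEPH can be overridden to point at another binary; it defaults to "ceph".
CEPH="${CEPH:-ceph}"

show_metadata_state() {
    "$CEPH" health detail   # which OSDs/pools are reported full or nearfull
    "$CEPH" osd df          # per-OSD utilisation, including the NVMe OSDs
    "$CEPH" fs status       # MDS ranks and their states (active, replay, ...)
}
```

Running `show_metadata_state` before and after adding the OSDs, and again once trimming has finished, gives a simple record of how the usage evolved.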

...what happened in the first place? (And the crucial question: how can
we be sure that it will not happen again?)
No answer yet!
We have no explanation for why, within a few hours, the metadata pool
grew to 100% (without any specific activity in the data pool).

Indeed, today's Ceph log is huge (compared to other
days).

Today's MDS log seems to show something unusual at 04:10 am (see
here: https://pastebin.com/0CCdLMat)

We currently run Nautilus 14.2.16.
We plan to update it soon to the latest Nautilus version, 14.2.21,
and afterwards to upgrade to a newer Ceph release (Octopus, or even Pacific?)

If you have any ideas about this issue, don't hesitate to comment. Thanks.

Regards,
Hervé

On 01/06/2021 at 12:24, Hervé Ballans wrote:
Hi all,

Ceph  Nautilus 14.2.16.

We have been encountering a strange and critical problem since this morning.

Our cephfs metadata pool suddenly grew from 2.7% to 100% (in less
than 5 hours), while there was no significant activity on the data OSDs!

Here are some numbers:

# ceph df
RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       205 TiB     103 TiB     102 TiB     102 TiB      49.68
    nvme      4.4 TiB     2.2 TiB     2.1 TiB     2.2 TiB      49.63
    TOTAL     210 TiB     105 TiB     104 TiB     104 TiB      49.68

POOLS:
    POOL                     ID     PGS      STORED      OBJECTS     USED        %USED      MAX AVAIL
    cephfs_data_home         7      512      11 TiB      22.58M      11 TiB      18.31      17 TiB
    cephfs_metadata_home     8      128      724 GiB     2.32M       724 GiB     100.00     0 B
    rbd_backup_vms           9      1024     19 TiB      5.00M       19 TiB      37.08      11 TiB

The cephfs_data pool uses less than half of the storage space, and
there was no significant increase during (or before) the period when
the metadata pool became full.
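
Since the pool filled up within a few hours, a periodic check on %USED could at least raise the alarm earlier. Below is a minimal sketch that reads the plain-text output of `ceph df` on stdin and lists the pools at or above a threshold. It assumes the Nautilus POOLS layout shown in the output above (POOL, ID, PGS, STORED, OBJECTS, USED, %USED, MAX AVAIL, one pool per line, sizes printed as "<number> <unit>"), which makes %USED the 9th whitespace-separated field. The function name and the default threshold of 80 are illustrative choices, not Ceph settings.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): print "<pool> <%USED>" for every
# pool whose %USED is >= the given threshold (default: 80).
# Assumption: input is unwrapped `ceph df` text in the Nautilus layout,
# so %USED is field 9 and each pool row has 11 fields.
pools_over_threshold() {
    threshold="${1:-80}"
    awk -v limit="$threshold" '
        /^ *POOLS:/  { in_pools = 1; next }   # start of the pool table
        !in_pools    { next }                 # ignore the RAW STORAGE part
        $1 == "POOL" { next }                 # ignore the header row
        NF >= 11 && $9 + 0 >= limit + 0 { printf "%s %s\n", $1, $9 }
    '
}
```

Typical use would be `ceph df | pools_over_threshold 90`, for example from cron. Other Ceph releases print this table differently, so the field position should be checked before relying on it.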

Has anyone already encountered this?

Currently, I have no idea how to solve this problem. Restarting the
associated OSD and MDS services did not help.

Let me know if you want more information or logs.

Thank you for your help.

Regards,
Hervé

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
