Re: Full OSD's on cephfs_metadata pool

Hi Robert,

There was a thread named "bluefs enospc" a couple of days ago where Derek shared steps to bring in a standalone DB volume and get rid of the "enospc" error.
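Roughly, that approach boils down to the following. This is only a sketch of the general idea, not Derek's exact commands; the OSD id, VG/LV names and sizes are examples you would need to adapt:

    # Stop the affected OSD first.
    systemctl stop ceph-osd@100

    # Create an empty LV to hold the new DB (name and size are examples).
    lvcreate -L 50G -n db-100 cephdb-vg

    # Attach it to the OSD as a dedicated BlueFS DB device.
    ceph-bluestore-tool bluefs-bdev-new-db \
        --path /var/lib/ceph/osd/ceph-100 \
        --dev-target /dev/cephdb-vg/db-100

    # Then migrate the existing RocksDB data off the full main device.
    ceph-bluestore-tool bluefs-bdev-migrate \
        --path /var/lib/ceph/osd/ceph-100 \
        --devs-source /var/lib/ceph/osd/ceph-100/block \
        --dev-target /var/lib/ceph/osd/ceph-100/block.db

Some Nautilus builds also expect bluestore_block_db_size to be set for bluefs-bdev-new-db, so check the tool's output.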


I'm currently working on a fix which will hopefully allow recovering from this failure, but it might take some time before it lands in Nautilus.


Thanks,

Igor

On 3/19/2020 6:10 AM, Robert Ruge wrote:
Hi All.

Nautilus 14.2.8.

I came in this morning to find that six of my eight NVMe OSDs housing the cephfs_metadata pool had mysteriously filled up and crashed overnight, and they won't come back up. These OSDs are all single-logical-volume devices with no separate WAL or DB.
I have tried extending the LV of one of the OSDs, but it can't make use of the extra space, and I have also added a separate DB volume, but that didn't help (a rough sketch of what I tried is below).
In the meantime I have told the cluster to move cephfs_metadata back to HDD, which it has kindly done, emptying my two live NVMe OSDs, but I am left with 10 PGs inactive.
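For reference, the LV extension attempt looked roughly like this (the OSD id, VG/LV names and size below are placeholders, not the real ones from my cluster):

    # The OSD is already down, but make sure it is stopped.
    systemctl stop ceph-osd@100

    # Grow the logical volume backing the OSD.
    lvextend -L +20G /dev/cephnvme-vg/osd-100

    # Ask BlueStore/BlueFS to pick up the extra space.
    ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-100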

BLUEFS_SPILLOVER BlueFS spillover detected on 6 OSD(s)
      osd.93 spilled over 521 MiB metadata from 'db' device (26 GiB used of 50 GiB) to slow device
      osd.95 spilled over 456 MiB metadata from 'db' device (26 GiB used of 50 GiB) to slow device
      osd.100 spilled over 2.1 GiB metadata from 'db' device (26 GiB used of 50 GiB) to slow device
      osd.107 spilled over 782 MiB metadata from 'db' device (26 GiB used of 50 GiB) to slow device
      osd.112 spilled over 1.3 GiB metadata from 'db' device (27 GiB used of 50 GiB) to slow device
      osd.115 spilled over 1.4 GiB metadata from 'db' device (27 GiB used of 50 GiB) to slow device
PG_AVAILABILITY Reduced data availability: 10 pgs inactive, 10 pgs down
     pg 2.4e is down, acting [60,6,120]
     pg 2.60 is down, acting [105,132,15]
     pg 2.61 is down, acting [8,13,112]
     pg 2.72 is down, acting [93,112,0]
     pg 2.9f is down, acting [117,1,35]
     pg 2.b9 is down, acting [95,25,6]
     pg 2.c3 is down, acting [97,139,5]
     pg 2.c6 is down, acting [95,7,127]
     pg 2.d1 is down, acting [36,107,17]
     pg 2.f4 is down, acting [23,117,138]

Can I back up and recreate an OSD on a larger volume?
Can I remove a good PG from an offline OSD to free up some space?
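For the second question, the sort of operation I have in mind is something like the following (untested on this cluster; the OSD id, PG id and export path are just examples):

    # With the OSD stopped, take an export copy of a healthy PG ...
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-95 \
        --pgid 2.c6 --op export --file /root/pg-2.c6.export

    # ... or export and remove it in one step to actually free space on that OSD.
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-95 \
        --pgid 2.c6 --op export-remove --file /root/pg-2.c6.export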

ceph-bluestore-tool repair fails.
"bluefs enospc" seems to be the critical error.

My CephFS is currently unavailable, so any help would be greatly appreciated.

Regards
Robert Ruge


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


