OSDs in pool full: can't restart to clean

Hey all

We landed in a bad place (tm) with our NVMe metadata tier.  I'll root-cause how we got here after it's all back up; I suspect a pool got misconfigured and just filled the whole tier up.

Short version: the OSDs are all full (or close enough to full) that I can't get them to spin back up; they crash with ENOSPC.  Average fragmentation for block is in the 0.8 range and bluefs-db is slightly better (measured with ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412 free-score).  I've tried all sorts of things.  I was able to get a few to spin up, but once they came up and rejoined they tried to pull in MORE data and crashed out again.
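
(For completeness, the per-allocator checks were along these lines; the --allocator selector is the one from the ceph-bluestore-tool man page, so treat the exact form as approximate:)

# fragmentation score for the main block allocator and the bluefs-db allocator
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412 --allocator block free-score
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412 --allocator bluefs-db free-score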

I changed the crush_rule for the pool I care about to a much larger (and slower) set of disks, so that if I do get anything else to come up I'm not just making things worse.
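
(Roughly what that looked like, with placeholder names for the new rule and the pool:)

# create a replicated rule on the big/slow hdd class, then repoint the pool at it
ceph osd crush rule create-replicated bulk-hdd default host hdd
ceph osd pool set <pool> crush_rule bulk-hdd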

I increased the size of the backing LV for one of the OSDs to see if I could get ceph-bluestore-tool to expand it, but that too crashes with ENOSPC.
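
(The sequence was basically this, with placeholder VG/LV names for ours:)

# grow the LV backing the OSD's block device, then ask bluestore to take up the new space
lvextend -L +172G /dev/ceph-nvme-vg/osd-412-block
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412 bluefs-bdev-expand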

In theory there are a few pools on there I don't care about as much, and I could delete them to make space, but I can't get the OSDs up long enough -or- get the offline tools to do it.
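
(If I could get those OSDs serving again, the deletion itself would just be the usual, with a placeholder pool name:)

# allow pool deletion on the mons, then drop the pool I can live without
ceph config set mon mon_allow_pool_delete true
ceph osd pool rm <expendable-pool> <expendable-pool> --yes-i-really-really-mean-it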

Some logs from the failed expansion attempt:



[root@ceph-b-07 ceph-412]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-412  bluefs-bdev-expand
inferring bluefs devices from bluestore path
1 : device size 0x44aa000000 : own 0x[520000~20000,23e0000~620000,2ae0000~4d20000,78d0000~f30000,8900000~1600000,9fc0000~30000,a000000~5d00000,fe00000~3b00000,139e0000~5420000,19000000~100000,
::snip::
4f0000~20000,25c17c0000~10000,25c2ea0000~20000,25c9f20000~10000,25d0860000~10000,25d50e0000~20000,25d5170000~10000,25ded20000~20000,25f4fc0000~20000] = 0x59c5b0000 : using 0x58f220000(22 GiB) : bluestore has 0x10260000(258 MiB) available
Expanding DB/WAL...
Expanding Main...
2021-01-13 16:40:46.481 7f33d1998ec0 -1 bluestore(/var/lib/ceph/osd/ceph-412) allocate_bluefs_freespace failed to allocate on 0x32c70000 min_size 0xf700000 > allocated total 0x1e80000 bluefs_shared_alloc_size 0x10000 allocated 0x1e80000 available 0x 90210000
2021-01-13 16:40:46.482 7f33d1998ec0 -1 bluefs _allocate failed to expand slow device to fit +0xf6f0def
2021-01-13 16:40:46.482 7f33d1998ec0 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0xf6f0def
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.15/rpm/el7/BUILD/ceph-14.2.15/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f33d1998ec0 time 2021-01-13 16:40:46.482978
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.15/rpm/el7/BUILD/ceph-14.2.15/src/os/bluestore/BlueFS.cc: 2351: ceph_abort_msg("bluefs enospc")
 ceph version 14.2.15 (afdd217ae5fb1ed3f60e16bd62357ca58cc650e5) nautilus (stable)

The original LV under that OSD is 172 GB and the new LV is double that.


I'm going to keep poking at this, but I'm really hoping for some new ideas.  Whether that's increasing the size of the OSDs enough to get them back up so I can rebuild them with a different layout, deleting some data I don't care about, or pulling the data off and putting it back to defragment it... I don't care which, so long as it all comes back up.

Thanks
-paul