Re: bluefs enospc

Hi Derek,

On 3/16/2020 7:17 PM, Derek Yarnell wrote:
Hi Igor,

On 3/16/20 10:34 AM, Igor Fedotov wrote:
I can suggest the following non-straightforward way for now:

1) Check the OSD startup log for the following line:

2020-03-15 14:43:27.845 7f41bb6baa80  1
bluestore(/var/lib/ceph/osd/ceph-681) _open_alloc loaded 23 GiB in 97
extents

Note the 23 GiB loaded.

2) Then retrieve the BlueFS used space for the main device from the
"bluefs-bdev-sizes" output:

1 : device size 0x1a80000000 : own
...

= 0x582550000 : using 0x56d090000(22 GiB)

3) The actual available space would then be approximately 23 GiB - 22 GiB = 1 GiB.
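
For reference, the two figures can be pulled roughly like this (a sketch only, assuming the default OSD log location and the ceph-681 paths from the example above; adjust to the OSD in question):

# total space the allocator loaded at startup
grep "_open_alloc loaded" /var/log/ceph/ceph-osd.681.log

# space BlueFS currently uses on the main (slow) device
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-681 --command bluefs-bdev-sizes

# available ~ 23 GiB (loaded) - 22 GiB (used by BlueFS) = about 1 GiB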
OK, so essentially the available space is then the delta of what it
reports in the following line? Why the discrepancy between the
_open_alloc loaded information and the bluefs-bdev-sizes output?
Probably - due to value rounding.

bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-681/block
size 106 GiB

[root@obj21 ceph-709]# ceph-bluestore-tool --log-level 30 --path
/var/lib/ceph/osd/ceph-709 --command fsck
2020-03-16 08:02:16.590 7f5faaa11c00 -1
bluestore(/var/lib/ceph/osd/ceph-709) fsck error: bluefs_extents
inconsistency, downgrade to previous releases might be broken.
fsck found 1 error(s)

[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709-fsck-deep
The fsck-deep log suffers from the same lack of space. Could you please
collect a log for a regular fsck?
That is the log for the regular fsck; it only says that it found 1
error(s) and doesn't give any further information (even with
--log-level 30 specified).
Missed attachment?
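
In case the attachment went missing: the fsck output can also be written straight to a file, e.g. (just a sketch; I'm assuming the tool's -l/--log-file option here, and the /tmp path is arbitrary):

ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --log-level 30 -l /tmp/ceph-709-fsck.log --command fsck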
So it looks like checksum errors appeared after the initial failure and
triggered a recovery, which requires additional space...


I think the summary of the issue is as follows:

The cluster had been in a 'near full' state when some OSDs started to
crash due to lack of free space.

An attempt to extend the device may or may not succeed, depending on the
state RocksDB was in when the first crash happened.

a) If the DB is corrupted and needs recovery (which is triggered on each
non-read-only DB open), it asks for more space, which fails again, and
the OSD falls into a "deadlock" state:

to extend the main device one needs DB access, which in turn needs more space.

b) If the DB isn't corrupted, expansion succeeds and the OSD starts to
receive more data due to peering, which eventually fills it up again and
the OSD tends to end up in case a).

Some OSDs will presumably allow another expansion, though.


Unfortunately I don't know any fix/workaround for the "deadlock" case at
the moment.
I am trying to find creative ways to increase the space significantly on
an OSD without stranding it, so I can continue to provide new space to
more of the OSDs. LVM is helpful here.

Migrating the DB to a standalone volume (using ceph-bluestore-tool's
bluefs-bdev-migrate command) will probably help, but I need to
double-check that.
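
Roughly, the sequence I have in mind would look something like the following (an untested sketch only, with a hypothetical new LV as the target; please don't run it until I confirm):

# attach a new standalone DB volume to the OSD
# (the DB size may need to be supplied via the bluestore_block_db_size option)
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --command bluefs-bdev-new-db --dev-target /dev/<new-vg>/<new-db-lv>

# then move the existing BlueFS/RocksDB data off the main device onto it
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --command bluefs-bdev-migrate \
    --devs-source /var/lib/ceph/osd/ceph-709/block \
    --dev-target /var/lib/ceph/osd/ceph-709/block.db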

And it will definitely expose data to a risk of loss, so please hold off
until you get my additional recommendations.

Most probably you will need an additional 30 GB of free space per OSD
if going this way, so please let me know if you can afford that.
Well, I had already increased 709's initial space from 106 GB to 200 GB,
and now I gave it 10 GB more, but it still cannot actually resize. Here
is the relevant information, I think, but the full log is here [0]. I
then did it with 30 GB (now a total of 240 GB) and it still failed [1].
I am out of space without some additional hardware in this node, though
I have an idea. If I knew what size it is (and what space it needs for
recovery), that would be very helpful.
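
For reference, the grow-and-expand sequence I'm using is essentially the following (LV names are placeholders):

# grow the LV backing the OSD's block device
lvextend -L +30G /dev/<vg>/<osd-709-block-lv>

# then ask BlueStore/BlueFS to pick up the new size
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 --command bluefs-bdev-expand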

There is not much sense in increasing the main device for this specific OSD (or similarly failing ones, i.e. OSDs mentioning RocksDB recovery in the backtrace) at this point.

It's in the "deadlock" state I mentioned before, and hence the expansion is unable to proceed.

I'm checking some workarounds to get out of this state at the moment. Still in progress though.

What I meant before is that you would need more available space if the workaround is to assign a new standalone DB volume. It's a questionable approach, so I'm trying other ways for now.


# ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-709
--command bluefs-bdev-expand

     -4> 2020-03-16 11:33:34.181 7f41d5940c00 -1
bluestore(/var/lib/ceph/osd/ceph-709) allocate_bluefs_freespace failed
to allocate on 0xb000000 min_size 0xb000000 > allocated total 0x80000
bluefs_shared_alloc_size 0x10000 allocated 0x80000 available 0x 8000
     -3> 2020-03-16 11:33:34.181 7f41d5940c00 -1 bluefs _allocate failed
to expand slow device to fit +0xaffa895
     -2> 2020-03-16 11:33:34.181 7f41d5940c00 -1 bluefs _flush_range
allocated: 0x0 offset: 0x0 length: 0xaffa895
     -1> 2020-03-16 11:33:34.184 7f41d5940c00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t,
uint64_t)' thread 7f41d5940c00 time 2020-03-16 11:33:34.181884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueFS.cc:
2269: ceph_abort_msg("bluefs enospc")


[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709.bluefs-bdev-expand
[1] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709.bluefs-bdev-expand-2

Thanks,
derek

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



