Re: bluefs enospc

Hi Igor,

On 3/16/20 10:34 AM, Igor Fedotov wrote:
> I can suggest the following non-straightforward way for now:
> 
> 1) Check osd startup log for the following line:
> 
> 2020-03-15 14:43:27.845 7f41bb6baa80  1
> bluestore(/var/lib/ceph/osd/ceph-681) _open_alloc loaded 23 GiB in 97
> extents
> 
> Note 23GiB loaded.
> 
> 2) Then retrieve the bluefs used space for the main device from the
> "bluefs-bdev-sizes" output:
> 
> 1 : device size 0x1a80000000 : own
> ...
> 
> = 0x582550000 : using 0x56d090000(22 GiB)
> 
> 3) Actual available space would be around 1 GiB = 23 GiB - 22 GiB

OK, so essentially the available space is the delta of what it reports
in the following line?  Why the discrepancy between the _open_alloc
loaded information and the bluefs-bdev-sizes output?

bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-681/block
size 106 GiB
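For my own sanity I converted the two figures by hand; a small shell
sketch (the hex value is copied from the bluefs-bdev-sizes output
quoted above):

```shell
# "using" figure for the main device, from bluefs-bdev-sizes.
using_bytes=$((0x56d090000))
echo "bluefs using: ${using_bytes} bytes"
# ~21.7 GiB; the log rounds this to 22 GiB.
awk -v b="$using_bytes" 'BEGIN { printf "~%.1f GiB\n", b / (1024 ^ 3) }'
# _open_alloc loaded 23 GiB of free extents, so roughly
# 23 GiB - 22 GiB = 1 GiB is actually left.
```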

>> [root@obj21 ceph-709]# ceph-bluestore-tool --log-level 30 --path
>> /var/lib/ceph/osd/ceph-709 --command fsck
>> 2020-03-16 08:02:16.590 7f5faaa11c00 -1
>> bluestore(/var/lib/ceph/osd/ceph-709) fsck error: bluefs_extents
>> inconsistency, downgrade to previous releases might be broken.
>> fsck found 1 error(s)
>>
>> [0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709-fsck-deep
> 
> fsck-deep suffers from the same lack of space. Could you please
> collect a log for a regular fsck?

That is the log from the regular fsck; it only says that it found 1
error(s) and doesn't give any further information (even with
--log-level 30 specified).

> So it looks like checksum errors appeared after the initial failure
> and triggered recovery, which requires additional space...
> 
> 
> I think the summary of the issue is as follows:
> 
> The cluster had been in a 'near full' state when some OSDs started to
> crash due to a lack of free space.
> 
> An attempt to extend the device may or may not succeed, depending on
> the state RocksDB was in when the first crash happened.
> 
> a) If the DB is corrupted and needs recovery (which is triggered on
> each non-read-only DB open), it asks for more space, which fails
> again, and the OSD falls into a "deadlock" state:
> 
> to extend the main device one needs DB access, which in turn needs
> more space.
> 
> b) If the DB isn't corrupted, expansion succeeds and the OSD starts to
> receive more data due to peering, which eventually fills it up, and
> the OSD tends to end up in (a).
> 
> Some OSDs will presumably allow another expansion, though.
> 
> 
> Unfortunately I don't know any fix/workaround for the "deadlock" case at
> the moment.

I am trying to find creative ways to increase the space on an OSD
significantly without stranding it, so I can continue to provide new
space to more of the OSDs.  LVM is helpful here.
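For the record, the LVM route I'm experimenting with looks roughly like
this (a sketch only; the VG/LV names are hypothetical and the OSD has
to be stopped first):

```shell
# Hypothetical VG/LV names -- substitute your own layout.
systemctl stop ceph-osd@709
lvextend -L +30G /dev/ceph-vg/osd-709-block
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --command bluefs-bdev-expand
```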

> Probably migrating the DB to a standalone volume (using
> ceph-bluestore-tool's bluefs-bdev-migrate command) will help, but I
> need to double-check that.
> 
> And it will definitely expose data to a risk of loss, so please hold
> off until my additional recommendations.
> 
> Most probably you will need an additional 30 GB of free space per OSD
> if going this way. So please let me know if you can afford this.
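If we do end up going that way, I assume the sequence would be roughly
the following (a sketch only, per your caution above; I have not
attempted it, and the standalone DB volume path is hypothetical):

```shell
# Sketch only -- holding off pending Igor's recommendations.
# /dev/ceph-vg/osd-709-db is a hypothetical ~30 GB volume.
systemctl stop ceph-osd@709
# Attach a fresh standalone DB volume...
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --dev-target /dev/ceph-vg/osd-709-db \
    --command bluefs-bdev-new-db
# ...then migrate the bluefs data off the main device onto it.
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 \
    --devs-source /var/lib/ceph/osd/ceph-709/block \
    --dev-target /var/lib/ceph/osd/ceph-709/block.db \
    --command bluefs-bdev-migrate
```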

Well, I had already increased 709's initial space from 106 GB to
200 GB, and now I gave it 10 GB more, but it still cannot actually
resize.  Here is the relevant information, I think, but the full log is
here [0].  I then tried it with 30 GB more (now a total of 240 GB) and
it still failed [1].  I am out of space without additional hardware in
this node, though I have an idea.  If I knew what size it is (and what
space it needs for recovery), that would be very helpful.

# ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-709
--command bluefs-bdev-expand

    -4> 2020-03-16 11:33:34.181 7f41d5940c00 -1
bluestore(/var/lib/ceph/osd/ceph-709) allocate_bluefs_freespace failed
to allocate on 0xb000000 min_size 0xb000000 > allocated total 0x80000
bluefs_shared_alloc_size 0x10000 allocated 0x80000 available 0x 8000
    -3> 2020-03-16 11:33:34.181 7f41d5940c00 -1 bluefs _allocate failed
to expand slow device to fit +0xaffa895
    -2> 2020-03-16 11:33:34.181 7f41d5940c00 -1 bluefs _flush_range
allocated: 0x0 offset: 0x0 length: 0xaffa895
    -1> 2020-03-16 11:33:34.184 7f41d5940c00 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueFS.cc:
In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t,
uint64_t)' thread 7f41d5940c00 time 2020-03-16 11:33:34.181884
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueFS.cc:
2269: ceph_abort_msg("bluefs enospc")


[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709.bluefs-bdev-expand
[1] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709.bluefs-bdev-expand-2
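For what it's worth, decoding the hex values in that abort shows how
large the shortfall is (values copied verbatim from the log above):

```shell
min_size=$((0xb000000))    # what bluefs _allocate asked for
available=$((0x8000))      # what the allocator could actually offer
echo "needed:    $((min_size / 1048576)) MiB"   # 176 MiB
echo "available: $((available / 1024)) KiB"     # 32 KiB
```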

> 
>> Thanks,
>> derek
>>

-- 
Derek T. Yarnell
Director of Computing Facilities
University of Maryland
Institute for Advanced Computer Studies
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



