Derek,
On 3/16/2020 3:26 PM, Derek Yarnell wrote:
Hi Igor,
Thank you for the help.
On 3/16/20 7:47 AM, Igor Fedotov wrote:
OSD-709 has already been expanded, right?
Correct, with 'ceph-bluestore-tool --log-level 30 --path
/var/lib/ceph/osd/ceph-709 --command bluefs-bdev-expand'. Does this
expand bluefs and the data allocation?
Is there a way to ask how much data is used by the non-bluefs portion?
For now I can suggest the following somewhat roundabout way:
1) Check osd startup log for the following line:
2020-03-15 14:43:27.845 7f41bb6baa80 1
bluestore(/var/lib/ceph/osd/ceph-681) _open_alloc loaded 23 GiB in 97
extents
Note the 23 GiB loaded.
2) Then retrieve the BlueFS space usage for the main device from the
"bluefs-bdev-sizes" output:
1 : device size 0x1a80000000 : own
...
= 0x582550000 : using 0x56d090000(22 GiB)
3) The actual available space would then be around 23 GiB - 22 GiB = 1 GiB.
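For example, something along these lines should pull out both numbers (the OSD id and log path are just illustrative, adjust to your layout; ceph-bluestore-tool should be run with the OSD stopped):
  grep '_open_alloc loaded' /var/log/ceph/ceph-osd.709.log
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 --command bluefs-bdev-sizes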
[root@obj21 ceph-709]# pwd
/var/lib/ceph/osd/ceph-709
[root@obj21 ceph-709]# ls -la
total 28
drwxrwxrwt. 2 ceph ceph 180 Mar 15 22:21 .
drwxr-x---. 52 ceph ceph 4096 Jan 8 14:44 ..
lrwxrwxrwx. 1 ceph ceph 30 Mar 15 22:21 block ->
/dev/ceph-data-vol20/data-20-0
-rw-------. 1 ceph ceph 37 Mar 15 22:21 ceph_fsid
-rw-------. 1 ceph ceph 37 Mar 15 22:21 fsid
-rw-------. 1 ceph ceph 57 Mar 15 22:21 keyring
-rw-------. 1 ceph ceph 6 Mar 15 22:21 ready
-rw-------. 1 ceph ceph 10 Mar 15 22:21 type
-rw-------. 1 ceph ceph 4 Mar 15 22:21 whoami
[root@obj21 ceph-709]# lvdisplay /dev/ceph-data-vol20/data-20-0
--- Logical volume ---
LV Path /dev/ceph-data-vol20/data-20-0
LV Name data-20-0
VG Name ceph-data-vol20
LV UUID idjDxJ-CbzN-fb1n-Bhzx-86vP-mSAD-HvIq5n
LV Write Access read/write
LV Creation host, time obj21.umiacs.umd.edu, 2018-10-10 08:15:20 -0400
LV Status available
# open 0
LV Size 200.00 GiB
Current LE 51200
Segments 4
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:33
What's the error reported by fsck?
It doesn't say, but when I try to run the deep fsck it produces
[root@obj21 ceph-709]# ceph-bluestore-tool --log-level 30 --path
/var/lib/ceph/osd/ceph-709 --command fsck
2020-03-16 08:02:16.590 7f5faaa11c00 -1
bluestore(/var/lib/ceph/osd/ceph-709) fsck error: bluefs_extents
inconsistency, downgrade to previous releases might be broken.
fsck found 1 error(s)
[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709-fsck-deep
The deep fsck suffers from the same lack of space. Could you please collect
a log for a regular fsck instead?
The latter opens the DB in read-only mode and hence doesn't need additional space.
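Something like the following should do (the -l output file name is just an example):
  ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-709 --command fsck -l ceph-osd.709-fsck.log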
4) OSD.681 has a number of checksum verification errors when reading DB
data:
2020-03-15 14:03:52.890 7f6311ffa700 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 2324967102 in db/012948.sst offset
18446744073709551615 size 18446744073709551615
I can't say whether this is related to the space shortage or not. I wonder if
other OSDs have reported (or are reporting) something similar?
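For instance, something like this should show which OSD logs contain similar errors (assuming the default log location):
  grep -l 'block checksum mismatch' /var/log/ceph/ceph-osd.*.log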
Here is another node which, at around '2020-03-15 13:51', looks like it starts
peering a few PGs; then at '2020-03-15 14:40' OSD 716 fails, and then, for
example, OSD 719 fails one minute later at '2020-03-15 14:41'.
[1] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.716.log-20200316.gz
[2] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.717.log-20200316.gz
[3] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.718.log-20200316.gz
[4] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.719.log-20200316.gz
[5] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.720.log-20200316.gz
[6] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.721.log-20200316.gz
[7] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.722.log-20200316.gz
So it looks like the checksum errors appeared after the initial failure. And
the failures trigger recovery, which requires additional space...
I think the summary of the issue is as follows:
The cluster had been in a 'near full' state when some OSDs started to crash
due to lack of free space.
An attempt to extend the device may or may not succeed, depending on the state
RocksDB was in when the first crash happened.
a) If the DB is corrupted and needs recovery (which is triggered on each
non-read-only DB open), it asks for more space, which fails again, and the OSD
falls into a "deadlock" state:
to extend the main device one needs DB access, which in turn needs more space.
b) If the DB isn't corrupted, expansion succeeds and the OSD starts to receive
more data due to peering, which eventually fills it up again, and the OSD tends
to end up in case a).
Some OSDs will presumably allow another expansion though.
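As a rough sketch of such an expansion (with the OSD stopped, and with the +30G increment being just an example): grow the LV first, then let BlueStore pick up the new size, e.g.
  lvextend -L +30G /dev/ceph-data-vol20/data-20-0
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 --command bluefs-bdev-expand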
Unfortunately I don't know of any fix/workaround for the "deadlock" case at
the moment.
Migrating the DB to a standalone volume (using ceph-bluestore-tool's
bluefs-bdev-migrate command) will probably help, but I need to double-check
that.
And it will definitely expose data to a risk of loss, so please hold off until
I send additional recommendations.
Most probably you will need an additional 30 GB of free space per OSD if going
this way, so please let me know whether you can afford that.
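For reference only (per the caveat above, please don't run this yet), the migration would look roughly like the following, where the target LV is purely hypothetical and would need to be created beforehand:
  ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-709 --devs-source /var/lib/ceph/osd/ceph-709/block --dev-target /dev/<new-db-vg>/<new-db-lv>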
Thanks,
derek
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx