Derek,
On 3/16/2020 3:26 PM, Derek Yarnell wrote:
Hi Igor,
Thank you for the help.
On 3/16/20 7:47 AM, Igor Fedotov wrote:
OSD-709 has already been expanded, right?
Correct, with 'ceph-bluestore-tool --log-level 30 --path
/var/lib/ceph/osd/ceph-709 --command bluefs-bdev-expand'. Does this
expand bluefs and the data allocation?
Is there a way to ask how much data is used by the non-bluefs portion?
For now I can suggest the following somewhat roundabout way:
1) Check osd startup log for the following line:
2020-03-15 14:43:27.845 7f41bb6baa80 1
bluestore(/var/lib/ceph/osd/ceph-681) _open_alloc loaded 23 GiB in 97
extents
Note the 23 GiB loaded.
2) Then retrieve the BlueFS space usage for the main device from the
"bluefs-bdev-sizes" output:
1 : device size 0x1a80000000 : own
...
= 0x582550000 : using 0x56d090000(22 GiB)
3) The actual available space would then be around 23 GiB - 22 GiB = 1 GiB.
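For example, something along these lines should pull out both numbers (the OSD id and log path are just illustrative, adjust to your layout; ceph-bluestore-tool should be run with the OSD stopped):
  grep '_open_alloc loaded' /var/log/ceph/ceph-osd.709.log
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 --command bluefs-bdev-sizes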
[root@obj21 ceph-709]# pwd
/var/lib/ceph/osd/ceph-709
[root@obj21 ceph-709]# ls -la
total 28
drwxrwxrwt. 2 ceph ceph 180 Mar 15 22:21 .
drwxr-x---. 52 ceph ceph 4096 Jan 8 14:44 ..
lrwxrwxrwx. 1 ceph ceph 30 Mar 15 22:21 block ->
/dev/ceph-data-vol20/data-20-0
-rw-------. 1 ceph ceph 37 Mar 15 22:21 ceph_fsid
-rw-------. 1 ceph ceph 37 Mar 15 22:21 fsid
-rw-------. 1 ceph ceph 57 Mar 15 22:21 keyring
-rw-------. 1 ceph ceph 6 Mar 15 22:21 ready
-rw-------. 1 ceph ceph 10 Mar 15 22:21 type
-rw-------. 1 ceph ceph 4 Mar 15 22:21 whoami
[root@obj21 ceph-709]# lvdisplay /dev/ceph-data-vol20/data-20-0
--- Logical volume ---
LV Path /dev/ceph-data-vol20/data-20-0
LV Name data-20-0
VG Name ceph-data-vol20
LV UUID idjDxJ-CbzN-fb1n-Bhzx-86vP-mSAD-HvIq5n
LV Write Access read/write
LV Creation host, time obj21.umiacs.umd.edu, 2018-10-10 08:15:20 -0400
LV Status available
# open 0
LV Size 200.00 GiB
Current LE 51200
Segments 4
Allocation inherit
Read ahead sectors auto
- currently set to 8192
Block device 253:33
What's the error reported by fsck?
It doesn't say, but when I try to run the deep fsck it produces
[root@obj21 ceph-709]# ceph-bluestore-tool --log-level 30 --path
/var/lib/ceph/osd/ceph-709 --command fsck
2020-03-16 08:02:16.590 7f5faaa11c00 -1
bluestore(/var/lib/ceph/osd/ceph-709) fsck error: bluefs_extents
inconsistency, downgrade to previous releases might be broken.
fsck found 1 error(s)
[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709-fsck-deep
The deep fsck suffers from the same lack of space. Could you please collect
a log for a regular fsck instead?
The latter opens the DB in read-only mode and hence doesn't need additional space.
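Something like the following should do (the -l output file name is just an example):
  ceph-bluestore-tool --log-level 30 --path /var/lib/ceph/osd/ceph-709 --command fsck -l ceph-osd.709-fsck.log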
4) OSD.681 has a number of checksum verification errors when reading DB
data:
2020-03-15 14:03:52.890 7f6311ffa700 3 rocksdb:
[table/block_based_table_reader.cc:1117] Encountered error while reading
data from compression dictionary block Corruption: block checksum
mismatch: expected 0, got 2324967102 in db/012948.sst offset
18446744073709551615 size 18446744073709551615
I can't say whether this is related to the space shortage or not. I wonder if
other OSDs have reported (or are reporting) something similar?
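For instance, something like this should show which OSD logs contain similar errors (assuming the default log location):
  grep -l 'block checksum mismatch' /var/log/ceph/ceph-osd.*.log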
Here is another node which, at around '2020-03-15 13:51', looks like it starts
peering a few PGs; then at '2020-03-15 14:40' OSD 716 fails, and then, for
example, OSD 719 fails one minute later at '2020-03-15 14:41'.
[1] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.716.log-20200316.gz
[2] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.717.log-20200316.gz
[3] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.718.log-20200316.gz
[4] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.719.log-20200316.gz
[5] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.720.log-20200316.gz
[6] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.721.log-20200316.gz
[7] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.722.log-20200316.gz
So it looks like the checksum errors appeared after the initial failure. And
the failures trigger recovery, which requires additional space...
I think the summary of the issue is as follows:
The cluster had been in a 'near full' state when some OSDs started to crash
due to lack of free space.
An attempt to extend the device may or may not succeed, depending on the state
RocksDB was in when the first crash happened.
a) If the DB is corrupted and needs recovery (which is triggered on each
non-read-only DB open), it asks for more space, which fails again, and the OSD
falls into a "deadlock" state:
to extend the main device one needs DB access, which in turn needs more space.
b) If the DB isn't corrupted, expansion succeeds and the OSD starts to receive
more data due to peering, which eventually fills it up again, and the OSD tends
to end up in case a).
Some OSDs will presumably allow another expansion though.
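As a rough sketch of such an expansion (with the OSD stopped, and with the +30G increment being just an example): grow the LV first, then let BlueStore pick up the new size, e.g.
  lvextend -L +30G /dev/ceph-data-vol20/data-20-0
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-709 --command bluefs-bdev-expand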
Unfortunately I don't know of any fix/workaround for the "deadlock" case at
the moment.
Migrating the DB to a standalone volume (using ceph-bluestore-tool's
bluefs-bdev-migrate command) will probably help, but I need to double-check
that.
And it will definitely expose data to a risk of loss, so please hold off until
I send additional recommendations.
Most probably you will need an additional 30 GB of free space per OSD if going
this way, so please let me know whether you can afford that.
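For reference only (per the caveat above, please don't run this yet), the migration would look roughly like the following, where the target LV is purely hypothetical and would need to be created beforehand:
  ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-709 --devs-source /var/lib/ceph/osd/ceph-709/block --dev-target /dev/<new-db-vg>/<new-db-lv>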
Thanks,
derek
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx