Hi,
We have a production cluster that just suffered a failure of multiple
NVMe OSDs. More than 12 of them, spread over 4 nodes, died with an
'ENOSPC from bluestore, misconfigured cluster' error, i.e. reporting
that they had run out of space. These are all simple single-device
bluestore OSDs.
ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus
(stable)
This is an example[0] of one of the logs. In this case each of the 8
NVMe OSDs on the node has 106GiB of space allocated to it, yet the
ceph-bluestore-tool bluefs-bdev-sizes output only lists 22GiB for
osd 681. I extended the space of bluestore on a few of the OSDs via LVM
and then the bluefs-bdev-expand command. This worked for a few and not
for others.
Some of the OSDs for which it did work recovered for a bit and then
re-entered the error state. Trying to extend the allocation again
didn't work after that. When they failed again I ran fsck, which
reported 1 error, and when I then ran repair I got a rather long stack
trace[1].
# ceph-bluestore-tool --log-level 30 --command bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-681
inferring bluefs devices from bluestore path
slot 1 /var/lib/ceph/osd/ceph-681/block -> /dev/dm-33
1 : device size 0x1a80000000 : own
0x[2480000~10000,24a0000~10000,2520000~60000,25f0000~c0000,2720000~50000,28a0000~110000,2a20000~230000,2cc0000~260000,2f30000~220000,31c0000~6b0000,38a0000~10000,3990000~3e0000,3d80000~530000,42d0000~590000,48d0000~400000,4d00000~7d0000,54f0000~c50000,6150000~10000,6190000~150000,6350000~c0000,6480000~160000,6640000~1e0000,6870000~c0000,6a00000~30000,6a40000~240000,6dd0000~310000,7210000~b0000,73a0000~b0000,76a0000~180000,7830000~80000,78e0000~240000,7b70000~90000,7c50000~b0000,7ef0000~140000,8040000~30000,8180000~250000,8440000~50000,84b0000~110000,8610000~c0000,9e20000~20000,9e50000~b0000,9f10000~60000,9f80000~30000,dd80000~180000,df70000~6a0000,e620000~5ae0000,15200000~3510000,187f0000~bf0000,19490000~1070000,1ab70000~4c0000,1b400000~7d0000,1bbe0000~c20000,1cd10000~340000,1d3a0000~860000,1dd00000~2e00000,20c00000~3f00000,24d00000~700000,25600000~700000,26100000~200000,26400000~300000,26b00000~600000,27400000~400000,27ba0000~6e0000,28500000~1d00000,2a400000~700000,2ac00000~100000,2
b100000~300000,2b470000~120000,2b700000~500000,2c000000~200000,2c400000~400000,2ca00000~100000,2cf00000~300000,2d340000~39b0000,30d00000~1f00000,32e00000~4bf0000,380a0000~3c0000,38500000~c0000,38bd0000~400000,390b0000~340000,39400000~100000,39900000~1000000,3ac00000~5d00000,40b90000~400000,41280000~db50000,4ee00000~700000,4f900000~4500000,54390000~100000,54e00000~18400000,6d800000~20d0000,6f8f0000~1a10000,71400000~4500000,76100000~300000,766e0000~6860000,7dd00000~c00000,7eac0000~a0000,7ef90000~f190000,8e1f0000~80000,8e410000~60000,8e480000~20000,8e4b0000~20000,8e5c0000~50000,8e7e0000~50000,8f160000~60000,8f240000~a0000,90000000~15e90000,a6200000~c3a0000,b25d0000~630000,b3000000~c00000,b3ee0000~90000,b4200000~d00000,b5a70000~160000,b63f0000~2a0000,b6720000~2820000,bab00000~400000,bbf60000~10ad0000,ccb90000~2300000,cf000000~2b00000,d1ca0000~10000,d1e00000~1400000,d3230000~1df0000,d5200000~1a00000,d6d00000~800000,d75e0000~6f0000,d7f00000~d00000,d9100000~400000,d9900000~d00000,da800000~
600000,daf10000~400000,db700000~1600000,dd280000~20000,dd670000~390000,dda30000~400000,de190000~70000,de2a0000~370000,de660000~20000,de700000~14770000,f3600000~700000,f3db0000~960000,f49e0000~5b00000,fa600000~c00000,fb300000~510000,fbb00000~100000,fbeb0000~450000,fc400000~2b0000,fd400000~400000,fde00000~c00000,ff0b0000~50000,ff200000~800000,ffd60000~10000,fff00000~a0000,100200000~300000,101600000~100000,101750000~300000,102120000~1e0000,1027f0000~a00000,103600000~330000,103b00000~200000,103e60000~4a0000,104310000~c00000,105030000~1200000,106800000~100000,106b20000~400000,107000000~300000,1073e0000~400000,107950000~86b0000,110140000~d0000,110350000~2e0000,110e20000~20000,110eb0000~a0000,110f60000~60000,110fd0000~1f0000,1112a0000~f0000,111420000~30000,1115b0000~30000,111620000~150000,111790000~40000,112560000~180000,112730000~180000,1129b0000~50000,112f90000~4a0000,113840000~c0000,113ea0000~40000,113fb0000~130000,114100000~310000,114470000~10000,114620000~120000,114810000~120000,114a0
0000~20000,114a90000~f0000,114c60000~e0000,114e80000~20000,114f70000~140000,1150c0000~50000,1151f0000~320000,1155f0000~10000,115670000~226f0000,137e30000~800000,138b50000~400000,139400000~1500000,13ae00000~4500000,13f480000~400000,13f950000~1b6e0000,15b700000~400000,15c000000~300000,15c600000~700000,15d9e0000~1820000,15f400000~c00000,160400000~d00000,1613a0000~630000,1619e0000~d20000,162800000~1b00000,164600000~7550000,170d00000~1800000,172580000~100000,172d70000~1c190000,18f100000~40e0000,193700000~400000,193c90000~6970000,19aa90000~188d0000,1b3770000~c220000,1bff00000~1200000,1c12f0000~400000,1c2c00000~400000,1c3f80000~400000,1c5300000~4200000,1c95c0000~b9a40000,283800000~1800000,289000000~3800000,28d000000~6000000,293800000~2000000,296000000~e000000,2a4800000~465d0000,4b6100000~4c560000,502800000~32500000,cb8500000~10f600000,1139800000~7a000000,1668d30000~3d290000,1794000000~3e800000,191c000000~3ee00000,1a49800000~4110000,1a4f800000~40f0000,1a70800000~4100000,1a76000000~3c10000]
= 0x582550000 : using 0x56d090000(22 GiB)
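As a sanity check on the numbers in that output, here is a minimal shell
sketch that converts the hex byte counts to GiB and sums the lengths in
an "offset~length" extent list. It is not part of the tool output, and
the helper names hex_to_gib and sum_extents are made up for
illustration. Only the first three extents are passed in the example
call, so its result is not the full 0x582550000 total.

```shell
# Convert a hex byte count (e.g. 0x1a80000000) to whole GiB, rounding down.
hex_to_gib() {
  echo $(( $1 / (1024 * 1024 * 1024) ))
}

# Sum the lengths of an extent list of the form "off~len,off~len,...",
# i.e. the hex values printed inside the 0x[...] brackets (no 0x prefix).
sum_extents() {
  local total=0 ext
  local IFS=,
  for ext in $1; do
    # Keep only the part after '~' (the length) and read it as hex.
    total=$(( total + 0x${ext#*~} ))
  done
  printf '0x%x\n' "$total"
}

hex_to_gib 0x1a80000000   # device size              -> 106
hex_to_gib 0x582550000    # space owned by BlueFS    -> 22
sum_extents "2480000~10000,24a0000~10000,2520000~60000"   # -> 0x80000
```

So the device is 106GiB, of which about 22GiB is owned by BlueFS.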
Any help here would be appreciated. I have stopped our CephFS file
system, but our radosgw is also impacted.
[0] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.681.log
[1] - ftp://ftp.umiacs.umd.edu/pub/derek/ceph-osd.709.repair
Thanks,
derek