Simon, Harald
thanks for the information. Got your log offline too.
Here is a brief analysis:
1) Your DB is pretty large - 27 GB on the DB device (filling it up) and
~298 GB more on the main spinning device. I.e. RocksDB is experiencing huge
spillover to the slow main device - expect a performance drop. And
generally the DB is highly under-provisioned.
2) Free space on the main device is highly fragmented - a rating of ~0.84,
where 1.0 is the maximum. I can't say for sure, but I presume that device
is pretty full as well.
The above are only indirect factors in the current failure; I mainly want
to make you aware of them since they might cause other issues later on.
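Once the cluster is reachable again you can keep an eye on both of these;
for instance (exact output varies by release - recent Nautilus+ builds
report spillover as a BLUEFS_SPILLOVER health warning):

  ceph health detail
  ceph osd df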
The major reason preventing the OSD from starting properly is a BlueFS
attempt to claim additional space (~52 GB), see in the log:
2020-05-29 16:26:53.507 7f0a9f78ec00 10 bluefs _expand_slow_device
expanding slow device by 0xc36040000
This results in a lot of pretty short allocations (remember the fragmented
space noted above), which in turn cause a pretty large write to the bluefs
log. The latter has preallocated spare space of 4 MB, which isn't enough
to hold the whole update, hence the assertion.
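Just to illustrate the scale - rough numbers only; the extent size and
per-record size below are assumptions for illustration, not the actual
BlueFS encoding:

  # back-of-envelope sketch (Python); all sizes are assumed, not measured
  claim = 0xc36040000              # ~52 GB claimed by _expand_slow_device (from your log)
  avg_extent = 64 * 1024           # assume fragmentation limits allocations to ~64 KiB each
  record = 24                      # assumed bytes per extent record in the log update
  extents = claim // avg_extent    # ~800,000 extents
  print(extents, extents * record / 2**20, "MiB of log update vs the 4 MiB runway")

I.e. even with optimistic per-extent overhead, a single log update of that
kind comfortably exceeds the preallocated runway.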
I can suggest the following workarounds to start the OSD for now (a rough
command sketch follows the list):
1) Switch the allocator to 'stupid' by setting the 'bluestore allocator'
parameter to 'stupid'; I presume you have the default setting of 'bitmap'
now. This will allow more contiguous allocations for the bluefs space
claim and hence a shorter log write. But given the high main-disk
fragmentation this might not be enough. The 'stupid' allocator has some
issues of its own as well (e.g. high RAM utilization over time in some
cases), but they're rather irrelevant for OSD startup.
2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the default
is 4 MB).
I suggest starting with 1) and then additionally applying 2) if the first
one alone doesn't help.
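Something along these lines should do it - I'm taking osd.46 from your
output below, and assuming you use the centralized config; plain ceph.conf
entries in the [osd] section work just as well:

  ceph config set osd.46 bluestore_allocator stupid
  ceph config set osd.46 bluefs_max_log_runway 12582912   # 12 MB, value is in bytes
  systemctl restart ceph-osd@46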
Once the OSD is up and the cluster is healthy, please consider adding more
DB space and/or more OSDs to your cluster to address the risk factors I
started with.
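If you end up enlarging the DB volumes (assuming they are on LVM, as the
dm-* devices in your output suggest), the rough sequence per OSD, with
that OSD stopped, would be something like the following - the VG/LV names
are placeholders:

  lvextend -L +30G <vg>/<db-lv-for-osd-46>   # grow the LV backing block.db
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-46 --command bluefs-bdev-expand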
BTW, I'm wondering what the primary payload for your cluster is - RGW or
something else?
Hope this helps.
Thanks,
Igor
On 5/29/2020 5:38 PM, Simon Leinen wrote:
Dear Igor,
thanks a lot for your assistance. We're still trying to bring OSDs back
up... the cluster is not in great shape right now.
> In the log from the ticket I can see a huge (400+ MB) bluefs log
> kept over many small non-adjacent extents.
> Presumably it was caused by either a small bluefs_alloc_size setting or
> high disk space fragmentation, or both. Now I'd like more details on
> your OSDs.
> Could you please collect an OSD startup log with debug_bluefs set to 20?
Yes, I now have such a log from an OSD that crashed with the assertion
in the subject after about 30 seconds. The log file is about 850'000
lines / 100 MB in size. How can I make it available to you?
> Also please run the following commands for the broken OSD (results only
> are needed, no need to collect the log unless they're failing):
> ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes
----------------------------------------------------------------------
inferring bluefs devices from bluestore path
slot 2 /var/lib/ceph/osd/ceph-46/block -> /dev/dm-7
slot 1 /var/lib/ceph/osd/ceph-46/block.db -> /dev/dm-17
1 : device size 0xa74c00000 : own 0x[2000~6b4bfe000] = 0x6b4bfe000 : using 0x6b4bfe000(27 GiB)
2 : device size 0x74702000000 : own 0x[37e3e600000~4a85400000] = 0x4a85400000 : using 0x4a85400000(298 GiB)
----------------------------------------------------------------------
> ceph-bluestore-tool --path <path-to-osd> --command free-score
----------------------------------------------------------------------
block:
{
"fragmentation_rating": 0.84012572151981013
}
bluefs-db:
{
"fragmentation_rating": -nan
}
failure querying 'bluefs-wal'
2020-05-29 16:31:54.882 7fec3c89cd80 -1 asok(0x55c4ec574000) AdminSocket: request '{"prefix": "bluestore allocator score bluefs-wal"}' not defined
----------------------------------------------------------------------
See anything interesting?