Re: crashing OSDs: ceph_assert(h->file->fnode.ino != 1)

Simon, Harald

thanks for the information. Got your log offline too.

Here is a brief analysis:

1) Your DB is pretty large - 27GB on the DB device (filling it completely) and 279GB on the main spinning one. I.e. RocksDB is experiencing a huge spillover to the slow main device - expect a performance drop. And generally the DB is highly under-provisioned.
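
As an aside: on OSDs that are still running, the DB vs. slow device usage can be read from the BlueFS perf counters (the counter names below are the standard bluefs ones, please verify them on your release):

ceph daemon osd.<id> perf dump | grep -E '"(db|slow)_(used|total)_bytes"'

db_used_bytes close to db_total_bytes together with a non-zero slow_used_bytes means RocksDB has spilled over onto the main device.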

2) Main device space is highly fragmented - a rating of 0.84012572151981013, where 1.0 is the maximum. I can't say for sure, but I presume it's pretty full as well.

The above are only indirect factors in the current failure, though; I mainly want to make you aware of them since they might cause other issues later on.


The major reason the OSD fails to start properly is a BlueFS attempt to claim additional space (~52GB), as seen in the log:

2020-05-29 16:26:53.507 7f0a9f78ec00 10 bluefs _expand_slow_device expanding slow device by 0xc36040000

This results in a lot of pretty short allocations (remember the fragmented space mentioned above), which in turn cause a pretty large write to the BlueFS log. The latter has preallocated spare space of 4MB, which isn't enough to hold all the updates, hence the assert.

I can suggest the following workarounds to start the OSD for now:

1) Switch the allocator to stupid by setting the 'bluestore allocator' parameter to 'stupid' (I presume you have the default setting of 'bitmap' now). This will allow more contiguous allocations for the BlueFS space claim and hence a shorter log write. But given the high main-disk fragmentation this might not be enough. The 'stupid' allocator has some issues as well (e.g. high RAM utilization over time in some cases), but they're rather irrelevant for OSD startup.

2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the default value is 4MB).

I suggest starting with 1) and then additionally proceeding with 2) if the first one doesn't help; a rough ceph.conf sketch of both settings follows below.
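
For reference, applied via ceph.conf the two workarounds would look roughly like this (the option names are the standard BlueStore ones, but please double-check them for your release; the OSD needs to be restarted to pick them up):

[osd]
# 1) use the stupid allocator instead of the default bitmap one
bluestore_allocator = stupid
# 2) grow the preallocated BlueFS log runway from 4MB to 12MB
bluefs_max_log_runway = 12582912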


Once the OSD is up and the cluster is healthy, please consider adding more DB space and/or OSDs to your cluster to fight the dangerous factors I started with.
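
If you do enlarge the DB volume, a rough sequence (assuming an LVM-backed block.db, so the VG/LV names below are just placeholders) would be: stop the OSD, grow the LV, then let BlueFS claim the extra space with bluefs-bdev-expand, e.g.:

systemctl stop ceph-osd@46
lvextend -L +30G /dev/<vg>/<db-lv>
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-46 --command bluefs-bdev-expand
systemctl start ceph-osd@46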

BTW, I'm wondering what the primary payload for your cluster is - RGW or something else?


Hope this helps.

Thanks,

Igor

On 5/29/2020 5:38 PM, Simon Leinen wrote:
Dear Igor,

thanks a lot for your assistance.  We're still trying to bring OSDs back
up... the cluster is not in great shape right now.

In the log from the ticket I can see a huge (400+ MB) bluefs log
kept over many small non-adjacent extents.
Presumably it was caused by either a small bluefs_alloc_size setting or
high disk space fragmentation, or both. Now I'd like more details on
your OSDs.
Could you please collect an OSD startup log with debug_bluefs set to 20?
Yes, I now have such a log from an OSD that crashed with the assertion
in the subject after about 30 seconds.  The log file is about 850'000
lines / 100 MB in size.  How can I make it available to you?

Also please run the following commands for the broken OSD (I need the results
only, no need to collect the log unless they're failing):
ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes
----------------------------------------------------------------------
inferring bluefs devices from bluestore path
  slot 2 /var/lib/ceph/osd/ceph-46/block -> /dev/dm-7
  slot 1 /var/lib/ceph/osd/ceph-46/block.db -> /dev/dm-17
1 : device size 0xa74c00000 : own 0x[2000~6b4bfe000] = 0x6b4bfe000 : using 0x6b4bfe000(27 GiB)
2 : device size 0x74702000000 : own 0x[37e3e600000~4a85400000] = 0x4a85400000 : using 0x4a85400000(298 GiB)
----------------------------------------------------------------------

ceph-bluestore-tool --path <path-to-osd> --command free-score
----------------------------------------------------------------------
block:
{
     "fragmentation_rating": 0.84012572151981013
}

bluefs-db:
{
     "fragmentation_rating": -nan
}

failure querying 'bluefs-wal'
2020-05-29 16:31:54.882 7fec3c89cd80 -1 asok(0x55c4ec574000) AdminSocket: request '{"prefix": "bluestore allocator score bluefs-wal"}' not defined
----------------------------------------------------------------------

See anything interesting?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



