Re: crashing OSDs: ceph_assert(h->file->fnode.ino != 1)

Simon, Harald

thanks for the information. Got your log offline too.

Here is a brief analysis:

1) Your DB is pretty large - 27GB on the DB device (filling it completely) and 279GB on the main spinning one. I.e. RocksDB is experiencing a huge spillover to the slow main device - expect a performance drop. And generally the DB is highly under-provisioned.
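
As an aside: on OSDs that are still running, the DB vs. slow device usage can be read from the BlueFS perf counters (the counter names below are the standard bluefs ones, please verify them on your release):

ceph daemon osd.<id> perf dump | grep -E '"(db|slow)_(used|total)_bytes"'

db_used_bytes close to db_total_bytes together with a non-zero slow_used_bytes means RocksDB has spilled over onto the main device.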

2) Main device space is highly fragmented - a rating of 0.84012572151981013, where 1.0 is the maximum. I can't say for sure, but I presume it's pretty full as well.

The above are only indirect factors in the current failure, though; I mainly want to make you aware of them since they might cause other issues later on.


The major reason the OSD fails to start properly is a BlueFS attempt to claim additional space (~52GB), as seen in the log:

2020-05-29 16:26:53.507 7f0a9f78ec00 10 bluefs _expand_slow_device expanding slow device by 0xc36040000

This results in a lot of pretty short allocations (remember the fragmented space mentioned above), which in turn cause a pretty large write to the BlueFS log. The latter has preallocated spare space of 4MB, which isn't enough to hold all the updates, hence the assert.

I can suggest the following workarounds to start the OSD for now:

1) Switch the allocator to stupid by setting the 'bluestore allocator' parameter to 'stupid' (I presume you have the default setting of 'bitmap' now). This will allow more contiguous allocations for the BlueFS space claim and hence a shorter log write. But given the high main-disk fragmentation this might not be enough. The 'stupid' allocator has some issues as well (e.g. high RAM utilization over time in some cases), but they're rather irrelevant for OSD startup.

2) Increase the 'bluefs_max_log_runway' parameter to 8-12 MB (the default value is 4MB).

I suggest starting with 1) and then additionally proceeding with 2) if the first one doesn't help; a rough ceph.conf sketch of both settings follows below.
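
For reference, applied via ceph.conf the two workarounds would look roughly like this (the option names are the standard BlueStore ones, but please double-check them for your release; the OSD needs to be restarted to pick them up):

[osd]
# 1) use the stupid allocator instead of the default bitmap one
bluestore_allocator = stupid
# 2) grow the preallocated BlueFS log runway from 4MB to 12MB
bluefs_max_log_runway = 12582912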


Once the OSD is up and the cluster is healthy, please consider adding more DB space and/or OSDs to your cluster to fight the dangerous factors I started with.
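
If you do enlarge the DB volume, a rough sequence (assuming an LVM-backed block.db, so the VG/LV names below are just placeholders) would be: stop the OSD, grow the LV, then let BlueFS claim the extra space with bluefs-bdev-expand, e.g.:

systemctl stop ceph-osd@46
lvextend -L +30G /dev/<vg>/<db-lv>
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-46 --command bluefs-bdev-expand
systemctl start ceph-osd@46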

BTW, I'm wondering what the primary payload for your cluster is - RGW or something else?


Hope this helps.

Thanks,

Igor

On 5/29/2020 5:38 PM, Simon Leinen wrote:
Dear Igor,

thanks a lot for your assistance.  We're still trying to bring OSDs back
up... the cluster is not in great shape right now.

In the log from the ticket I can see a huge (400+ MB) bluefs log
kept over many small non-adjacent extents.
Presumably it was caused by either a small bluefs_alloc_size setting or
high disk space fragmentation, or both. Now I'd like more details on
your OSDs.
Could you please collect an OSD startup log with debug_bluefs set to 20?
Yes, I now have such a log from an OSD that crashed with the assertion
in the subject after about 30 seconds.  The log file is about 850'000
lines / 100 MB in size.  How can I make it available to you?

Also please run the following commands for the broken OSD (I need the results
only, no need to collect the log unless they're failing):
ceph-bluestore-tool --path <path-to-osd> --command bluefs-bdev-sizes
----------------------------------------------------------------------
inferring bluefs devices from bluestore path
  slot 2 /var/lib/ceph/osd/ceph-46/block -> /dev/dm-7
  slot 1 /var/lib/ceph/osd/ceph-46/block.db -> /dev/dm-17
1 : device size 0xa74c00000 : own 0x[2000~6b4bfe000] = 0x6b4bfe000 : using 0x6b4bfe000(27 GiB)
2 : device size 0x74702000000 : own 0x[37e3e600000~4a85400000] = 0x4a85400000 : using 0x4a85400000(298 GiB)
----------------------------------------------------------------------

ceph-bluestore-tool --path <path-to-osd> --command free-score
----------------------------------------------------------------------
block:
{
     "fragmentation_rating": 0.84012572151981013
}

bluefs-db:
{
     "fragmentation_rating": -nan
}

failure querying 'bluefs-wal'
2020-05-29 16:31:54.882 7fec3c89cd80 -1 asok(0x55c4ec574000) AdminSocket: request '{"prefix": "bluestore allocator score bluefs-wal"}' not defined
----------------------------------------------------------------------

See anything interesting?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



