Re: [EXTERNAL] Re: OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Dave,

Some notes first:

1) The following behavior is fine, BlueStore mounts in two stages - the first one is read-only and among other things it loads allocation map from DB. And that's exactly the case here.

Jul 26 08:55:31 condor_sc0 docker[15282]: 2021-07-26T08:55:31.703+0000 7f0e15b3df40  1 bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 132 GiB in 2930776 extents available 113 GiB Jul 26 08:55:31 condor_sc0 docker[15282]: 2021-07-26T08:55:31.703+0000 7f0e15b3df40  4 rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background work Jul 26 08:55:31 condor_sc0 docker[15282]: 2021-07-26T08:55:31.704+0000 7f0e15b3df40  4 rocksdb: [db/db_impl.cc:563] Shutdown complete

2) What's really broken is the following allocation attempt:

Jul 26 08:55:34 condor_sc0 docker[15282]: 2021-07-26T08:55:34.767+0000 7f0e15b3df40  1 bluefs _allocate failed to allocate 0x100716 on bdev 1, free 0xd0000; fallback to bdev 2 Jul 26 08:55:34 condor_sc0 docker[15282]: 2021-07-26T08:55:34.767+0000 7f0e15b3df40  1 bluefs _allocate unable to allocate 0x100716 on bdev 2, free 0xffffffffffffffff; fallback to slow device expander Jul 26 08:55:35 condor_sc0 docker[15282]: 2021-07-26T08:55:35.042+0000 7f0e15b3df40 -1 bluestore(/var/lib/ceph/osd/ceph-1) allocate_bluefs_freespace failed to allocate on 0x40000000 min_size 0x110000 > allocated total 0x0 bluefs_shared_alloc_size 0x10000 allocated 0x0 available 0x 1c09738000 Jul 26 08:55:35 condor_sc0 docker[15282]: 2021-07-26T08:55:35.044+0000 7f0e15b3df40 -1 bluefs _allocate failed to expand slow device to fit +0x100716 Jul 26 08:55:35 condor_sc0 docker[15282]: 2021-07-26T08:55:35.044+0000 7f0e15b3df40 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x100716

This occurs during BlueFS recovery and that's an attempt to get more space to write out the bluefs log. This shouldn't fail given the plenty of free space:

... available 0x 1c09738000 ...


So to get more verbose but less log one can set both debug-bluestore and debug-bluefs to 1/20. This way just last 10000 lines of the log preceeding the crash would be at level 20. Which seems sufficient for the troubleshooting.

It would be also great to collect the output for the following commands:

ceph-bluestore-tool --path <osd-dir> --command bluefs-bdev-sizes

ceph-bluestore-tool --path <osd-dir> --command bluefs-stats


And finally you can try to switch to bitmap allocator as a workaround - we've fixed a couple of issues in Hybrid one which prevented from proper allcoations under some circumstances. The fixes were made after v15.2.11 release hence this might be the case. So please try setting:

bluestore_allocator = bitmap

bluefs_allocator = bitmap


Thanks,

Igor


On 7/26/2021 12:14 PM, Dave Piper wrote:
Hi Igor,

Thanks for your time looking into this.

I've attached a 5 minute window of OSD logs, which includes several restart attempt (each one takes ~25 seconds).

When I said it looked like we were starting up in a different state, I'm referring to how "Recovered from manifest file" log appears twice, with different logs afterwards. This behaviour seems to repeat reliably on each restart of the OSD. My interpretation of this was that when the initial recovery attempt leads to the rocksdb shutdown, ceph is automatically trying to start the OSD in some alternative state but that this is also failing (with the bdev errors I copied in).  Possibly I'm inferring too much.

I tried turning up the logging levels for rocksdb and bluestore but they're both very spammy so I've not included this in the attached logs. Let me know if you think that would be helpful.

My ceph version is 15.2.11. We're running a containerized deployment using docker image ceph-daemon:v5.0.10-stable-5.0-octopus-centos-8 .

[qs-admin@condor_sc0 metaswitch]$ sudo docker exec b732f9135b42 ceph version
ceph version 15.2.11 (e3523634d9c2227df9af89a4eac33d16738c49cb) octopus (stable)

Cheers,

Dave



-----Original Message-----
From: Igor Fedotov <ifedotov@xxxxxxx>
Sent: 23 July 2021 20:45
To: Dave Piper <david.piper@xxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: [EXTERNAL] Re:  OSDs flapping with "_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB"

Hi Dave,

The follow log line indicates that allocator has just completed loading information about free disk blocks into memory.  And it looks perfectly fine.

_open_alloc loaded 132 GiB in 2930776 extents available 113 GiB
Subsequent rocksdb shutdown looks weird without any other log output indicating the issue.
Curious what do you mean under "

After that we seem to try starting up in a slightly different state and get a different set of errors:

"
The resulted errors show lack of disk space at some point but I'd definitely like to get the full startup log.

Please also specify which Octopus version do you have?

Thanks,
Igor

On 7/23/2021 6:48 PM, Dave Piper wrote:
Hi all,

We've got a containerized test cluster with 3 OSDs and ~ 220GiB of data. Shortly after upgrading from nautilus -> octopus, 2 of the 3 OSDs have started flapping. I've also got alarms about the MDS being damaged, which we've seen elsewhere and have a recovery process for, but I'm unable to run this (I suspect because I've only got 1 functioning OSD). My RGWs are also failing to start, again I suspect because of the bad state of OSDs. I've tried restarting all OSDs, rebooting all servers, checked auth (all looks fine) - but I'm still in the same state.

My OSDs seem to be failing at the  "_open_alloc opening allocation metadata" step; looking at logs for each OSD restart, the OSD writes this log, then no logs for a few minutes and then logs:

      bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 132 GiB in 2930776 extents available 113 GiB
      rocksdb: [db/db_impl.cc:390] Shutdown: canceling all background
work

After that we seem to try starting up in a slightly different state and get a different set of errors:

      bluefs _allocate failed to allocate 0x100716 on bdev 1, free 0xd0000; fallback to bdev 2
      bluefs _allocate unable to allocate 0x100716 on bdev 2, free
0xffffffffffffffff; fallback to slow device expander

and eventually crash and log a heap of stack dumps.

I don't know what extents are but I seem to have a lot of them, and more than I've got capacity for? Maybe I'm running out of RAM or disk space somewhere, but I've got 21GB of free RAM on the server, and each OSD has a 350GiB device attached to it.



I'm wondering if anyone has seen anything like this before or can suggest next debug steps to take?

Cheers,

Dave



Full OSD logs surrounding the "_open_alloc opening allocation metadata" step:


Jul 23 00:07:13 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:13.818+0000 7f3de111bf40  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1626998833819439, "job": 1, "event":
"recovery_started", "log_files": [392088, 392132]}

Jul 23 00:07:13 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:13.818+0000 7f3de111bf40  4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #392088 mode 0

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.240+0000 7f3de111bf40  4 rocksdb:
[db/db_impl_open.cc:583] Recovering log #392132 mode 0

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.486+0000 7f3de111bf40  4 rocksdb: EVENT_LOG_v1
{"time_micros": 1626998837486404, "job": 1, "event":
"recovery_finished"}

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.486+0000 7f3de111bf40  1
bluestore(/var/lib/ceph/osd/ceph-1) _open_db opened rocksdb path db
options
compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_
number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,w
ritable_file_max_buffer_size=0,compaction_readahead_size=2097152,max_b
ackground_compactions=2

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.524+0000 7f3de111bf40  1 freelist init

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.524+0000 7f3de111bf40  1 freelist _init_from_label

Jul 23 00:07:17 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:17.529+0000 7f3de111bf40  1
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc opening allocation
metadata

Jul 23 00:07:18 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:18.238+0000 7f3de111bf40  1 HybridAllocator
_spillover_range constructing fallback allocator

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:20.563+0000 7f3de111bf40  1
bluestore(/var/lib/ceph/osd/ceph-1) _open_alloc loaded 132 GiB in
2930776 extents available 113 GiB

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:20.563+0000 7f3de111bf40  4 rocksdb:
[db/db_impl.cc:390] Shutdown: canceling all background work

Jul 23 00:07:20 condor_sc0 container_name/ceph-osd-1[1709]:
2021-07-23T00:07:20.565+0000 7f3de111bf40  4 rocksdb:
[db/db_impl.cc:563] Shutdown complete

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an
email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux