Re: LVM osds loose connection to disk

Igor Fedotov <igor.fedotov@xxxxxxxx> · Thu, 10 Nov 2022 17:48:23 +0300

Hi Frank,

unfortunately IMO it's not an easy task to identify what are the 
relevant difference between mimic and octopus in this respect.. At least 
the question would be what minor Ceph releases are/were in use.

I recall there were some tricks with setting/clearing bluefs_buffered_io 
somewhere in that period. But I can hardly recall anything else... 
Curious if OS/Container or other third-party software was upgraded with 
Ceph upgrade as well? Just in case I presume you were using containers 
in mimic too, right?

Instead I'd rather approach the issue from a different side:

1) Learn how to [easily] reproduce the issue, preferably in a test lab 
rather than in the field. Better use exactly the same disk(s), H/W, OS 
and S/W versions as in the production.

2) (can be done in prod as well) - once an OSD stuck - set debug-bluefs 
and debug-bdev to 20 and collect verbose log - check what's happening 
there.  Meanwhile monitor disk activity - is there any load to the disk 
while in this state at all? Do disk reads (e.g. via dd) out of OSD 
container succeed at this point?

Thanks,

Igor

On 11/10/2022 5:23 PM, Frank Schilder wrote:
Hi all,

I have some kind of update on the matter of stuck OSDs. It seems not to be an LVM issue and it also seems not to be connected to the OSD size.

After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed the 400G disks with 1 OSD per disk and started to move data from the slow spare SSDs back to the fast ones. After moving about 100 out of 1024 PGs of the pool fast OSDs started failing again. It is kind of the same observation as before, I can't stop a container with a failed OSD. However, when I restart docker, everything comes up clean.

When I look at the logs, I see that the OSD aborts with suicide timeout after many osd_op_tp thread timeouts. However, the OSD process and with it the container does not terminate because of a hanging thread. The syslog has the message (all messages below in chronological order):

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds

about 30 seconds before the OSD aborts with suicide timeout with

ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal (Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 ceph version 15.2.17 ...

What I see in the syslog is, that the thread bstore_kv_sync seems not to be terminated with the abort. These messages continue to show up:

kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds.
kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds.

On docker stop container the launch-script receives the term signal, but the OSD cannot be deactivated due to this thread:

journal: osd_lvm_start: deactivating OSD 959
journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959
journal: umount: /var/lib/ceph/osd/ceph-959: target is busy.

Its probably busy because of bstore_kv_sync thread hanging. As a consequence, the container is still running, has a ceph-osd process shown with docker top and these messages continue to show up:

INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds.
INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds.

Although the bstore_kv_sync thread is unkillable, a restart of docker clears everything out and the OSD restarts fine. I'm somewhat hesitant to accept the simple "must be the firmware" statement, because these disks worked fine for 4 years with mimic. The only thing that changed was the ceph version from mimic to octopus, everything else stayed the same: OS version, kernel version, docker version, firmware version.

Since it happens only on this type of disks, it could very well have to do with firmware, but not without ceph having had a serious change in low-level disk access between mimic and cotopus. So, I'm wondering what features of the firmware octopus is using that mimic was not. Would be great if somebody has some pointers for what part of the software stack I should look at, I would like to avoid hunting ghosts.

Many thanks and best regards!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 10 October 2022 23:33:32
To: Igor Fedotov; ceph-users@xxxxxxx
Subject:  Re: LVM osds loose connection to disk

Hi Igor.

The problem of OSD crashes was resolved after migrating just a little bit of the meta-data pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). Therefore, I don't think its an LVM or disk issue. The cluster is working perfectly now after migrating some data away from the small OSDs. I rather believe that its tightly related to "OSD crashes during upgrade mimic->octopus", it happens only on OSDs where the repair command errs out with abort on enospc.

My hypothesis is now more along the lines of a dead-lock occurring as a consequence of an aborted daemon-thread. Is there any part of the bluestore code that acquires an exclusive device lock that gets passed through to the pv and can lead to a device freeze if not released? I'm wondering if something like this happened here as a consequence of the allocator fail. I saw a lot of lock-up warnings related to OSD threads in the syslog.

Regarding the 2 minutes time difference and heartbeats. The OSD seems to have been responding to heartbeats all the time, even after the suicide time-out; see description below. I executed docker stop container at 16:15:39. Until this moment, the OSD was considered up+in by the MONs.

Here is a recall of events from memory together with a description of how the 4 OSDs on 1 disk are executed in a singe container. I will send detailed logs and scripts via our private communication. If anyone else is interested as well, I'm happy to make it available.

Being in the situation over a weekend and at night, we didn't take precise minutes. Our priority was to get everything working. I'm afraid this here is as accurate as it gets.

Let's start with how the processes are started inside the container. We have a main script M executed as the entry-point to the container. For each OSD found on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD process:

M -> M1 -> OSD1
M -> M2 -> OSD2
M -> M3 -> OSD3
M -> M4 -> OSD4

At the end, we have 5 instances of the main script and 4 instances of OSDs running. This somewhat cumbersome looking startup is required to be able to forward signals sent by the docker daemon, most notably, SIGINT on docker stop container. In addition, all instances of M trap a number of signals, including SIGCHLD. If just one OSD dies, the entire container should stop and restart. On a disk fail all OSDs on that disk go down and will be rebuild in the background simultaneously.

Executing docker top container on the above situation gives:

M
M1
M2
M3
M4
OSD1
OSD2
OSD3
OSD4

After the crash of, say, OSD1, I saw something like this (docker top container):

M
M1
M2
M3
M4
OSD2
OSD3
OSD4

The OSD processes were reported to be in Sl-state by ps.

At this point, OSD1 was gone from the list but M1 was still running. There was no SIGCHILD! At the same time, OSDs 2-4 were marked down by the MONs, but not OSD1! Due to this, any IO targeting OSD1 got stuck and corresponding slow ops warnings started piling up.

My best bet is that not all threads of OSD1 were terminated and, therefore, no SIGCHLD was sent to M1. For some reason OSD1 was not marked down and I wonder if its left-overs might have responded to heartbeats.

At the same time the disk was not accessible to LVM commands any more. A "ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try to access the raw device with dd. I was thinking about it, but attended to more pressing issues. I actually don't think the raw device was locked up, but that's just a guess.

In an attempt to clean-up the OSD down situation, I executed "docker stop container" (to be followed by docker start). The stop took a long time (I use an increased SIGKILL time-out) and resulted in this state (docker top container):

OSD2
OSD3
OSD4

The OSD processes were now reported in D-state by ps and the container was still reported to be running by docker. However, at this point all 4 OSDs were marked down, PGs peered and IO started again.

I'm wondering if a failed allocation attempt lead to a device/lvm lock being acquired but not released, leading to an LVM device freeze. There were thread lockup messages in the syslog. It smells a lot like a dead-lock situation created by not releasing a critical resource on SIGABRT. Unfortunately, there seem to be no log messages from the thread that got locked up.

Hope this makes some sense when interpreting the logs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 09 October 2022 22:07:16
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  LVM osds loose connection to disk

Hi Frank,

can't advise much on the disk issue  - just an obvious thought about
upgrading the firmware and/or contacting the vendor. IIUC disk is
totally inaccessible at this point, e.g. you're unable to read from it
bypassing LVM as well, right? If so this definitely looks like a
low-level problem.

As for OSD down issue - may I have some clarification please - did this
osd.975 never go down or it was just a few minutes later? In the log
snippet you shared I can see a 2 min gap between operation timeouts
indication and the final OSD suicide. I presume it had been able to
response heartbeats prior to that suicide and hence stayed online... But
mostly speculating so far...

Thanks,

Igor
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx