Hi Igor.

The problem of OSD crashes was resolved after migrating just a small part of the meta-data pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). Therefore, I don't think it's an LVM or disk issue. The cluster has been working perfectly since migrating some data away from the small OSDs. I rather believe that it's tightly related to "OSD crashes during upgrade mimic->octopus": it happens only on OSDs where the repair command errs out with an abort on enospc. My hypothesis is now more along the lines of a dead-lock occurring as a consequence of an aborted daemon thread. Is there any part of the bluestore code that acquires an exclusive device lock that gets passed through to the PV and could lead to a device freeze if not released? I'm wondering if something like this happened here as a consequence of the allocator fail. I saw a lot of lock-up warnings related to OSD threads in the syslog.

Regarding the 2 minutes time difference and heartbeats: the OSD seems to have been responding to heartbeats the entire time, even after the suicide time-out; see the description below. I executed "docker stop container" at 16:15:39. Until this moment, the OSD was considered up+in by the MONs.

Here is a recollection of events from memory, together with a description of how the 4 OSDs on 1 disk are executed in a single container. I will send detailed logs and scripts via our private communication. If anyone else is interested as well, I'm happy to make them available. Being in this situation over a weekend and at night, we didn't take precise minutes. Our priority was to get everything working again. I'm afraid this is as accurate as it gets.

Let's start with how the processes are started inside the container. We have a main script M executed as the entry-point to the container.
For each OSD found on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD process:

  M -> M1 -> OSD1
  M -> M2 -> OSD2
  M -> M3 -> OSD3
  M -> M4 -> OSD4

In the end, we have 5 instances of the main script and 4 instances of OSDs running. This somewhat cumbersome-looking startup is required to be able to forward signals sent by the docker daemon, most notably SIGINT on "docker stop container". In addition, all instances of M trap a number of signals, including SIGCHLD. If just one OSD dies, the entire container should stop and restart. On a disk failure, all OSDs on that disk go down and will be rebuilt in the background simultaneously.

Executing "docker top container" in the above situation gives:

  M M1 M2 M3 M4 OSD1 OSD2 OSD3 OSD4

After the crash of, say, OSD1, I saw something like this ("docker top container"):

  M M1 M2 M3 M4 OSD2 OSD3 OSD4

The OSD processes were reported to be in Sl-state by ps. At this point, OSD1 was gone from the list, but M1 was still running. There was no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, but not OSD1! Due to this, any IO targeting OSD1 got stuck, and the corresponding slow-ops warnings started piling up. My best bet is that not all threads of OSD1 were terminated and, therefore, no SIGCHLD was sent to M1. For some reason OSD1 was not marked down, and I wonder if its left-overs might have responded to heartbeats.

At the same time, the disk was no longer accessible to LVM commands. A "ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try to access the raw device with dd. I was thinking about it, but attended to more pressing issues. I actually don't think the raw device was locked up, but that's just a guess.

In an attempt to clean up the OSDs-down situation, I executed "docker stop container" (to be followed by "docker start").
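To make the M -> Mn -> OSDn layout above concrete, here is a minimal bash sketch of that fork-and-trap pattern. This is an illustration under assumptions, not our actual production script: the function names are made up, the OSD ids are placeholders, and OSD_CMD stands in for the real "ceph-osd -f --id <n>" invocation so the sketch can run anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the container entry-point described above.
# OSD_CMD is a stand-in for "ceph-osd -f --id <n>" (assumption).
OSD_CMD=${OSD_CMD:-"sleep 300"}

mn() {  # Mn: fork one OSD, forward stop signals, live as long as the OSD
    trap 'kill -TERM "$pid" 2>/dev/null' INT TERM
    $OSD_CMD &        # OSDn (placeholder for the real ceph-osd call)
    pid=$!
    wait "$pid"       # Mn exits when its OSD exits
}

run_container() {  # M: one Mn per OSD found on the drive
    local pids=()
    for id in 1 2 3 4; do          # OSD ids are placeholders
        mn "$id" &                 # fork Mn, which forks OSDn
        pids+=($!)
    done
    # M traps SIGCHLD: if any Mn dies (because its OSD exited), stop
    # all the others so docker restarts the whole container.  Docker's
    # stop signal is forwarded the same way.
    trap 'kill -TERM "${pids[@]}" 2>/dev/null; exit 1' CHLD
    trap 'kill -TERM "${pids[@]}" 2>/dev/null; exit 0' INT TERM
    wait
}

# In the container, the entry-point would simply call:  run_container
```

The crash symptom I described fits this layout: the CHLD trap in M1 only fires when the OSD process fully exits, so if some OSD1 threads survive the abort, M1 keeps waiting and the container never restarts.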
The stop took a long time (I use an increased SIGKILL time-out) and resulted in this state ("docker top container"):

  OSD2 OSD3 OSD4

The OSD processes were now reported in D-state by ps, and the container was still reported as running by docker. However, at this point all 4 OSDs were marked down, PGs peered, and IO started again.

I'm wondering if a failed allocation attempt led to a device/LVM lock being acquired but not released, leading to an LVM device freeze. There were thread lock-up messages in the syslog. It smells a lot like a dead-lock situation created by not releasing a critical resource on SIGABRT. Unfortunately, there seem to be no log messages from the thread that got locked up.

Hope this makes some sense when interpreting the logs.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 09 October 2022 22:07:16
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: LVM osds loose connection to disk

Hi Frank,

I can't advise much on the disk issue - just an obvious thought about upgrading the firmware and/or contacting the vendor. IIUC the disk is totally inaccessible at this point, e.g. you're unable to read from it bypassing LVM as well, right? If so, this definitely looks like a low-level problem.

As for the OSD-down issue - may I have some clarification, please: did this osd.975 never go down, or did it go down just a few minutes later? In the log snippet you shared, I can see a 2 min gap between the operation timeout indication and the final OSD suicide. I presume it had been able to respond to heartbeats prior to that suicide and hence stayed online... But I'm mostly speculating so far...

Thanks,
Igor
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx