Hi Dan and Igor, it looks very much like BFQ is indeed the culprit. I rolled back everything to none (high-performance SAS SSDs) and mq-deadline (low- to medium-performance SATA SSDs) and started a full-speed data movement from the slow to the fast disks. The cluster has been operating as well as it did in the past for at least an hour now, with really nice recovery performance again. I will keep watching, but I don't really expect problems; an OSD should have failed by now under the current rebalance load. This disk-scheduler change went a bit under the radar. We started moving to bfq a long time ago, but didn't go all-in at once; we started with a few disk types and hosts. In between, there was an OS upgrade after which we suddenly observed OSDs getting stuck for some time. However, this happened for a single disk type only (Kingston SSDs, if anyone wants to search this list for the symptoms) and, therefore, we didn't make the connection to bfq. With the upgrade to octopus, the change to bfq was rolled out to a few more disk types on other hosts. Again, we observed problems on only one of the disk types. It wasn't until yesterday that I remembered this and realised that I should check. We are back to normal, and these stuck OSDs had been our major headache since the octopus upgrade. Now we only face the usual paranormal activity, like broken usage reporting after reducing pg_num of a pool, broken mappings after osd reweight, and mysterious slow export_dir ops. Thanks and best regards, ================= Frank Schilder AIT Risø Campus Bygning 109, rum S14 ________________________________________ From: Dan van der Ster <dvanders@xxxxxxxxx> Sent: 18 November 2022 10:43:12 To: Frank Schilder Cc: Igor Fedotov; ceph-users@xxxxxxx Subject: Re: Re: LVM osds loose connection to disk Hi Frank, bfq was definitely broken, deadlocking IO for a few CentOS Stream 8 kernels between EL 8.5 and 8.6 -- we also hit that in production and switched over to `none`. I don't recall exactly when the upstream kernel was also broken, but apparently this was the fix: https://marc.info/?l=linux-block&m=164366111512992&w=2 In any case you might want to just use `none` with flash devs -- I'm not sure fair scheduling and mq are very meaningful anymore for high-IOPS devices and ceph. Cheers, Dan On Thu, Nov 17, 2022 at 1:23 PM Frank Schilder <frans@xxxxxx> wrote: > > Hi Igor, > > I might have a smoking gun. Could it be that ceph (or the kernel??) has issues with certain disk schedulers? There was a recommendation on this list to use bfq with bluestore. This was actually the one change during the upgrade other than the ceph version: making bfq the default. Now, this might be a problem with certain drives whose preferred scheduler is different from bfq. Here is my observation: > > I managed to get one of the OSDs to hang today. It was not the usual abort; I don't know why the op_thread_timeout and suicide_timeout didn't trigger. The OSD's worker thread was unresponsive for a bit more than 10 minutes before I took action. Hence, nothing in the log (I should maybe have used kill with SIGABRT). This time I tried to check whether I could access the disk with dd. And, I could not. 
A > > dd if=/dev/sdn of=disk-dump bs=4096 count=100 > > got stuck right away in D-state: > > 1652472 D+ dd if=/dev/sdn of=disk-dump bs=4096 count=100 > > This time, since I was curious about the disk scheduler, I went to another terminal on the same machine and did: > > # cat /sys/block/sdn/queue/scheduler > mq-deadline kyber [bfq] none > # echo none >> /sys/block/sdn/queue/scheduler > # cat /sys/block/sdn/queue/scheduler > [none] mq-deadline kyber bfq > > Going back to the stuck session, I see now (you can see my attempts to interrupt the dd): > > # dd if=/dev/sdn of=disk-dump bs=4096 count=100 > ^C^C3+0 records in > 2+0 records out > 8192 bytes (8.2 kB) copied, 336.712 s, 0.0 kB/s > > Suddenly, the disk responds again! Also, the ceph container stopped (a docker stop container returned without the container stopping - as before in this situation). > > Could it be that recommendations for disk scheduler choice should be reconsidered, or is this pointing towards a bug in either how ceph or the kernel schedules disk IO? To confirm this hypothesis, I will retry the stress test with the scheduler set to the default kernel choice. > > I did day-long fio benchmarks with all schedulers and all sorts of workloads on our drives and could not find anything like that. It looks like it is very difficult to impossible to reproduce a realistic ceph-osd IO pattern for testing. Is there any tool available for this? > > Best regards, > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ________________________________________ > From: Frank Schilder <frans@xxxxxx> > Sent: 14 November 2022 13:03:58 > To: Igor Fedotov; ceph-users@xxxxxxx > Subject: Re: LVM osds loose connection to disk > > I can't reproduce the problem with artificial workloads, I need to get one of these OSDs running in the meta-data pool until it crashes. My plan is to reduce time-outs and increase log level for these specific OSDs to capture what happened before an abort in the memory log. I can spare about 100G of RAM for log entries. I found the following relevant options with settings I think will work for my case: > > osd_op_thread_suicide_timeout 30 # default 150 > osd_op_thread_timeout 10 # default 15 > debug_bluefs 1/20 # default 1/5 > debug_bluestore 1/20 # default 1/5 > bluestore_kv_sync_util_logging_s 3 # default 10 > log_max_recent 100000 # default 10000 > > It would be great if someone could confirm that these settings will achieve what I want (or what is missing). I would like to capture at least 1 minute worth of log entries in RAM with high debug settings. Does anyone have a good estimate for how many log-entries are created per second with these settings for tuning log_max_recent? > > Thanks for your help! > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ________________________________________ > From: Frank Schilder <frans@xxxxxx> > Sent: 11 November 2022 10:25:17 > To: Igor Fedotov; ceph-users@xxxxxxx > Subject: Re: LVM osds loose connection to disk > > Hi Igor, > > thanks for your reply. We only exchanged the mimic containers with the octopus ones. We didn't even reboot the servers during upgrade, only later for trouble shooting. The only change since the upgrade is the ceph container. > > I'm trying to go down the stack and run a benchmark on the OSD directly. Unfortunately, it seems that osd bench is too simplistic. I don't think we have a problem with the disk, I rather think there is a race condition that gets the bstore_kv_sync thread stuck. 
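[A short aside on the scheduler switch described above: an echo into /sys/block/<dev>/queue/scheduler does not survive a reboot. A minimal sketch of making the choice persistent with a udev rule is below. The rule file name and the rotational-based matching are illustrative only - in a setup like the one in this thread, where different SSD models get none vs. mq-deadline, one would match on the specific disk models (e.g. ATTRS{model}) instead.]

    # hypothetical rule file; match keys are placeholders
    cat > /etc/udev/rules.d/60-io-scheduler.rules <<'EOF'
    # non-rotational block devices (SSDs): no scheduler
    ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="none"
    # rotational block devices (HDDs): mq-deadline
    ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="mq-deadline"
    EOF
    # apply to already-present disks without a reboot
    udevadm control --reload-rules
    udevadm trigger --subsystem-match=block --action=change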
I would need to reproduce a realistic workload on the OSD itself. Standard benchmarks directly on the disks did not show any problems (even when running for days). Is there a way to address an OSD directly with IO, for example, via low-level rados commands? > > I'm almost at the point of thinking that the only way to create a realistic workload is to create a size-1 pool on a single OSD, create an rbd image on this pool and run fio with the rbd engine. > > Another question. The osd bench command accepts 4 arguments: > > bench [<count:int>] [<size:int>] [<object_size:int>] [<object_num:int>] : OSD benchmark: write <count> <size>-byte objects(with <obj_size> <obj_num>), (default count=1G default size=4MB). Results in log. > > Only the first 2 are explained in the docs. Any idea what the other 2 are doing? > > Best regards, > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ________________________________________ > From: Igor Fedotov <igor.fedotov@xxxxxxxx> > Sent: 10 November 2022 15:48:23 > To: Frank Schilder; ceph-users@xxxxxxx > Subject: Re: Re: LVM osds loose connection to disk > > Hi Frank, > > unfortunately, IMO it's not an easy task to identify the > relevant differences between mimic and octopus in this respect. At the very least, > the question would be which minor Ceph releases are/were in use. > > I recall there were some tricks with setting/clearing bluefs_buffered_io > somewhere in that period. But I can hardly recall anything else... > Curious whether the OS/container or other third-party software was upgraded along with > Ceph as well? Just in case, I presume you were using containers > in mimic too, right? > > > Instead, I'd rather approach the issue from a different side: > > 1) Learn how to [easily] reproduce the issue, preferably in a test lab > rather than in the field. Better use exactly the same disk(s), H/W, OS > and S/W versions as in production. > > 2) (can be done in prod as well) - once an OSD is stuck, set debug-bluefs > and debug-bdev to 20, collect a verbose log and check what's happening > there. Meanwhile monitor disk activity - is there any load on the disk > while in this state at all? Do disk reads (e.g. via dd) from outside the OSD > container succeed at this point? > > > Thanks, > > Igor > > On 11/10/2022 5:23 PM, Frank Schilder wrote: > > Hi all, > > > > I have some kind of update on the matter of stuck OSDs. It seems not to be an LVM issue, and it also seems not to be connected to the OSD size. > > > > After moving all data from the tiny 100G OSDs to spare SSDs, I redeployed the 400G disks with 1 OSD per disk and started to move data from the slow spare SSDs back to the fast ones. After moving about 100 out of 1024 PGs of the pool, fast OSDs started failing again. It is pretty much the same observation as before: I can't stop a container with a failed OSD. However, when I restart docker, everything comes up clean. > > > > When I look at the logs, I see that the OSD aborts with a suicide timeout after many osd_op_tp thread timeouts. However, the OSD process, and with it the container, does not terminate, because of a hanging thread. The syslog has the message (all messages below in chronological order): > > > > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 122 seconds > > > > about 30 seconds before the OSD aborts with suicide timeout with > > > > ceph-osd: 2022-11-09T17:36:53.691+0100 7f663a23c700 -1 *** Caught signal (Aborted) **#012 in thread 7f663a23c700 thread_name:msgr-worker-2#012#012 ceph version 15.2.17 ... 
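[A sketch of the size-1 pool plus fio/rbd idea from the 11 November message above. Pool and image names, sizes and iodepth are made-up placeholders; allowing size 1 in octopus also needs mon_allow_pool_size_one, and actually pinning the pool to one specific OSD would additionally need a dedicated CRUSH rule, which is not shown here.]

    # throwaway test pool and image (names are hypothetical)
    ceph osd pool create fio-test 32 32
    ceph config set global mon_allow_pool_size_one true
    ceph osd pool set fio-test size 1 --yes-i-really-mean-it
    ceph osd pool application enable fio-test rbd
    rbd create fio-test/fio-img --size 20G

    # small random writes against the image through librbd
    fio --name=osd-stress --ioengine=rbd --clientname=admin \
        --pool=fio-test --rbdname=fio-img \
        --rw=randwrite --bs=4k --iodepth=64 --numjobs=4 \
        --time_based --runtime=3600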
> > > > What I see in the syslog is that the bstore_kv_sync thread does not seem to be terminated by the abort. These messages continue to show up: > > > > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 368 seconds. > > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 491 seconds. > > kernel: INFO: task bstore_kv_sync:1283156 blocked for more than 614 seconds. > > > > On docker stop container, the launch script receives the TERM signal, but the OSD cannot be deactivated because of this thread: > > > > journal: osd_lvm_start: deactivating OSD 959 > > journal: osd_lvm_start: unmounting /var/lib/ceph/osd/ceph-959 > > journal: umount: /var/lib/ceph/osd/ceph-959: target is busy. > > > > It's probably busy because of the hanging bstore_kv_sync thread. As a consequence, the container is still running, still shows a ceph-osd process in docker top, and these messages continue to appear: > > > > INFO: task bstore_kv_sync:1283156 blocked for more than 737 seconds. > > INFO: task bstore_kv_sync:1283156 blocked for more than 860 seconds. > > > > Although the bstore_kv_sync thread is unkillable, a restart of docker clears everything out and the OSD restarts fine. I'm somewhat hesitant to accept the simple "must be the firmware" statement, because these disks worked fine for 4 years with mimic. The only thing that changed was the ceph version from mimic to octopus; everything else stayed the same: OS version, kernel version, docker version, firmware version. > > > > Since it happens only on this type of disk, it could very well have to do with firmware, but not without a serious change in ceph's low-level disk access between mimic and octopus. So, I'm wondering what features of the firmware octopus is using that mimic was not. It would be great if somebody had some pointers as to which part of the software stack I should look at; I would like to avoid hunting ghosts. > > > > Many thanks and best regards! > > ================= > > Frank Schilder > > AIT Risø Campus > > Bygning 109, rum S14 > > > > ________________________________________ > > From: Frank Schilder <frans@xxxxxx> > > Sent: 10 October 2022 23:33:32 > > To: Igor Fedotov; ceph-users@xxxxxxx > > Subject: Re: LVM osds loose connection to disk > > > > Hi Igor. > > > > The problem of OSD crashes was resolved after migrating just a little bit of the meta-data pool to other disks (we decided to evacuate the small OSDs onto larger disks to make space). Therefore, I don't think it's an LVM or disk issue. The cluster is working perfectly now after migrating some data away from the small OSDs. I rather believe that it's tightly related to "OSD crashes during upgrade mimic->octopus"; it happens only on OSDs where the repair command errs out with an abort on enospc. > > > > My hypothesis is now more along the lines of a deadlock occurring as a consequence of an aborted daemon thread. Is there any part of the bluestore code that acquires an exclusive device lock that gets passed through to the pv and can lead to a device freeze if not released? I'm wondering if something like this happened here as a consequence of the allocator failure. I saw a lot of lock-up warnings related to OSD threads in the syslog. > > > > Regarding the 2-minute time difference and heartbeats: the OSD seems to have been responding to heartbeats the whole time, even after the suicide timeout; see the description below. I executed docker stop container at 16:15:39. Until this moment, the OSD was considered up+in by the MONs. 
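[When a thread like bstore_kv_sync is stuck in this way, it can be worth capturing where in the kernel it is blocked before restarting anything. A minimal diagnostic sketch, run as root on the OSD host; <pid> and <tid> are placeholders for the values found in the first command:]

    # list all threads in uninterruptible sleep (D) and the kernel symbol they are waiting in
    ps -eLo pid,tid,stat,wchan:32,comm | awk '$3 ~ /^D/'

    # kernel stack of one stuck thread, e.g. the bstore_kv_sync TID found above
    cat /proc/<pid>/task/<tid>/stack

    # or dump all blocked tasks into the kernel log (same info as the hung-task messages), if sysrq is enabled
    echo w > /proc/sysrq-trigger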
> > > > Here is a recall of events from memory, together with a description of how the 4 OSDs on 1 disk are executed in a single container. I will send detailed logs and scripts via our private communication. If anyone else is interested as well, I'm happy to make it available. > > > > Being in that situation over a weekend and at night, we didn't take precise minutes. Our priority was to get everything working. I'm afraid this here is as accurate as it gets. > > > > Let's start with how the processes are started inside the container. We have a main script M executed as the entry point to the container. For each OSD found on a drive, M forks off a copy Mn of itself, which in turn forks off the OSD process: > > > > M -> M1 -> OSD1 > > M -> M2 -> OSD2 > > M -> M3 -> OSD3 > > M -> M4 -> OSD4 > > > > In the end, we have 5 instances of the main script and 4 instances of OSDs running. This somewhat cumbersome-looking startup is required to be able to forward signals sent by the docker daemon, most notably SIGINT on docker stop container. In addition, all instances of M trap a number of signals, including SIGCHLD. If just one OSD dies, the entire container should stop and restart. On a disk failure, all OSDs on that disk go down and will be rebuilt in the background simultaneously. > > > > Executing docker top container in the above situation gives: > > > > M > > M1 > > M2 > > M3 > > M4 > > OSD1 > > OSD2 > > OSD3 > > OSD4 > > > > After the crash of, say, OSD1, I saw something like this (docker top container): > > > > M > > M1 > > M2 > > M3 > > M4 > > OSD2 > > OSD3 > > OSD4 > > > > The OSD processes were reported to be in Sl-state by ps. > > > > At this point, OSD1 was gone from the list but M1 was still running. There was no SIGCHLD! At the same time, OSDs 2-4 were marked down by the MONs, but not OSD1! Due to this, any IO targeting OSD1 got stuck and the corresponding slow-ops warnings started piling up. > > > > My best bet is that not all threads of OSD1 were terminated and, therefore, no SIGCHLD was sent to M1. For some reason OSD1 was not marked down, and I wonder if its left-overs might have responded to heartbeats. > > > > At the same time, the disk was not accessible to LVM commands any more. A "ceph-volume inventory /dev/sda" got stuck in "lvs" (in D-state). I did not try to access the raw device with dd. I was thinking about it, but attended to more pressing issues. I actually don't think the raw device was locked up, but that's just a guess. > > > > In an attempt to clean up the OSD-down situation, I executed "docker stop container" (to be followed by docker start). The stop took a long time (I use an increased SIGKILL timeout) and resulted in this state (docker top container): > > > > OSD2 > > OSD3 > > OSD4 > > > > The OSD processes were now reported in D-state by ps, and the container was still reported as running by docker. However, at this point all 4 OSDs were marked down, PGs peered and IO started again. > > > > I'm wondering if a failed allocation attempt led to a device/LVM lock being acquired but not released, leading to an LVM device freeze. There were thread lock-up messages in the syslog. It smells a lot like a deadlock created by not releasing a critical resource on SIGABRT. Unfortunately, there seem to be no log messages from the thread that got locked up. > > > > Hope this makes some sense when interpreting the logs. 
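[For readers trying to picture the M -> Mn -> OSDn startup described in the message above: a minimal sketch of such a signal-forwarding entry point. This is not the actual script from the cluster; it collapses the two levels into one, reacts to the first child exit via wait -n (bash >= 4.3) instead of a SIGCHLD trap, and the ceph-osd invocation is only illustrative.]

    #!/bin/bash
    # Start one ceph-osd per OSD id given on the command line, forward docker's
    # stop signal to them, and let one dying OSD take the whole container down.
    pids=()
    stop_all() { kill -TERM "${pids[@]}" 2>/dev/null; }
    trap stop_all TERM INT

    for id in "$@"; do
        ceph-osd -f --id "$id" &   # -f: stay in the foreground so we can supervise it
        pids+=("$!")
    done

    wait -n        # returns as soon as any single OSD exits
    stop_all       # stop the remaining OSDs so docker restarts the container
    wait           # reap all children before PID 1 exits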
> > > > Best regards, > > ================= > > Frank Schilder > > AIT Risø Campus > > Bygning 109, rum S14 > > > > ________________________________________ > > From: Igor Fedotov <igor.fedotov@xxxxxxxx> > > Sent: 09 October 2022 22:07:16 > > To: Frank Schilder; ceph-users@xxxxxxx > > Subject: Re: LVM osds loose connection to disk > > > > Hi Frank, > > > > I can't advise much on the disk issue - just an obvious thought about > > upgrading the firmware and/or contacting the vendor. IIUC the disk is > > totally inaccessible at this point, i.e. you're unable to read from it > > bypassing LVM as well, right? If so, this definitely looks like a > > low-level problem. > > > > > > As for the OSD-down issue - may I have some clarification, please: did this > > osd.975 never go down, or did it go down just a few minutes later? In the log > > snippet you shared I can see a 2-minute gap between the operation-timeout > > indications and the final OSD suicide. I presume it had been able to > > respond to heartbeats prior to that suicide and hence stayed online... But I'm > > mostly speculating so far... > > > > > > Thanks, > > > > Igor > > -- > Igor Fedotov > Ceph Lead Developer > > Looking for help with your Ceph cluster? Contact us at https://croit.io > > croit GmbH, Freseniusstr. 31h, 81247 Munich > CEO: Martin Verges - VAT-ID: DE310638492 > Com. register: Amtsgericht Munich HRB 231263 > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
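[As a postscript to the 14 November message earlier in this thread: one way the listed timeout/debug settings could be scoped to just the OSDs under test is the centralized config store. osd.959 is only the example id taken from the logs above, and some of these values may still need an OSD restart to fully take effect.]

    ceph config set osd.959 osd_op_thread_suicide_timeout 30
    ceph config set osd.959 osd_op_thread_timeout 10
    ceph config set osd.959 debug_bluefs 1/20
    ceph config set osd.959 debug_bluestore 1/20
    ceph config set osd.959 bluestore_kv_sync_util_logging_s 3
    ceph config set osd.959 log_max_recent 100000

    # or push the debug levels into the running daemon without touching the config store
    ceph tell osd.959 injectargs '--debug_bluefs 1/20 --debug_bluestore 1/20'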