Hi Igor,

the problematic disk holds OSDs 16, 17, 18 and 19. OSD 16 is the one crashing the show. I collected its startup log here: https://pastebin.com/25D3piS6 . The line sticking out is line 603:

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/15.2.17/rpm/el8/BUILD/ceph-15.2.17/src/os/bluestore/BlueFS.cc: 2931: ceph_abort_msg("bluefs enospc")

This smells a lot like rocksdb corruption. Can I do something about that? I still need to convert most of our OSDs and I cannot afford to lose more. The rebuild simply takes too long in the current situation.

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 17:03:53
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re: OSD crashes during upgrade mimic->octopus

Sorry - no clue about CephFS related questions...

But could you please share the full OSD startup log for any one which is unable to restart after host reboot?

On 10/6/2022 5:12 PM, Frank Schilder wrote:
> Hi Igor and Stefan.
>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an
>> EC pool. Seems to affect all sorts of pools.
>
> I have to take this one back. It is indeed an EC pool that is also on these SSD OSDs that is affected. The meta-data pool was active all the time until we lost the 3rd host. So, the bug reported is confirmed to affect EC pools.
>
>> If not - does any dead OSD unconditionally mean its underlying disk is
>> unavailable any more?
> Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system off-line to stop client IO and started rebooting hosts in reverse order of failing. This brought back the OSDs on the still un-converted hosts. We rebooted the converted host with the original OSD failures last. Unfortunately, here it seems we lost a drive for good. It looks like the OSDs crashed while the conversion was going on or something. They don't boot up and I need to look into that in more detail.
>
> We are currently trying to encourage fs clients to reconnect to the file system. Unfortunately, on many we get
>
> # ls /shares/nfs/ait_pnora01   # this *is* a ceph-fs mount point
> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>
> Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried remounts without success.
>
> Before executing the next conversion, I will compact the rocksdb on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.
>
> Thanks for your help,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 06 October 2022 14:39
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: OSD crashes during upgrade mimic->octopus
>
> Are crashing OSDs still bound to two hosts?
>
> If not - does any dead OSD unconditionally mean its underlying disk is
> unavailable any more?
>
>
> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>> Hi Igor.
>>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>
>> I just lost another disk; we have PGs down now. I really hope the stuck bstore_kv_sync thread does not lead to rocksdb corruption.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 06 October 2022 14:26
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>
>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>> Hi Igor,
>>>
>>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly.
>> Got it.
>>
>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even inactive here? This "feature" is new in octopus, I reported it about 2 months ago as a bug. Testing with mimic I cannot reproduce this problem: https://tracker.ceph.com/issues/56995
>> Not sure why you're talking about a replicated(!) 4(2) pool. In the above
>> ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile
>> ec-4-2...). Which means 6 shards per object, and maybe this setup has
>> some issues with mapping to unique osds within a host (just 3 hosts are
>> available!) ... One can see that pg 4.* are marked as inactive only.
>> Not a big expert in this stuff so mostly just speculating....
>>
>>
>> Do you have the same setup in the production cluster in question? If so
>> - then you lack 2 of 6 shards and IMO the cluster properly marks the
>> relevant PGs as inactive. The same would apply to 3x replicated PGs as
>> well though since two replicas are down..
>>
>>
>>> I found this in the syslog, maybe it helps:
>>>
>>> kernel: task:bstore_kv_sync state:D stack: 0 pid:3646032 ppid:3645340 flags:0x00000000
>>> kernel: Call Trace:
>>> kernel: __schedule+0x2a2/0x7e0
>>> kernel: schedule+0x4e/0xb0
>>> kernel: io_schedule+0x16/0x40
>>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>>> kernel: ? __page_cache_alloc+0xb0/0xb0
>>> kernel: wait_on_page_bit+0x3f/0x50
>>> kernel: wait_on_page_writeback+0x26/0x70
>>> kernel: __filemap_fdatawait_range+0x98/0x100
>>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>>> kernel: file_fdatawait_range+0x1a/0x30
>>> kernel: sync_file_range+0xc2/0xf0
>>> kernel: ksys_sync_file_range+0x41/0x80
>>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>>> kernel: do_syscall_64+0x3b/0x90
>>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> kernel: RIP: 0033:0x7ffbb6f77ae7
>>> kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
>>> kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
>>> kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
>>> kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
>>> kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
>>> kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
>>> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
>>> kernel: Tainted: G E 5.14.13-1.el7.elrepo.x86_64 #1
>>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
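The trace above shows the bstore_kv_sync thread blocked in sync_file_range for over two minutes, which points at the device or controller rather than at Ceph itself. As a rough, hedged sketch of OS-side checks for a disk behind a hung OSD (the device name /dev/sdX is a placeholder, not taken from this thread):

  dmesg -T | grep -iE 'sdX|reset|timeout'        # controller/disk resets or aborts around the time of the hang
  smartctl -a /dev/sdX                           # SMART health and error counters, if the disk still answers
  cat /sys/block/sdX/device/state                # should normally report "running"
  ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'  # processes stuck in uninterruptible sleep (D-state)

If the device no longer answers at this level, raising Ceph timeouts or compacting RocksDB will not bring the OSD back; that distinction is what the rest of the thread circles around.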
>>>
>>> It is quite possible that this was the moment when these OSDs got stuck and were marked down. The time stamp is about right.
>> Right, this is a primary thread which submits transactions to the DB. And it
>> got stuck for >123 seconds. Given that the disk is completely unresponsive I
>> presume something has happened at a lower level (controller or disk FW)
>> though.. Maybe this was somehow caused by "fragmented" DB access and
>> compaction would heal this. On the other hand the compaction had to be
>> applied after the omap upgrade, so I'm not sure another one would change the
>> state...
>>
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> Sent: 06 October 2022 13:45:17
>>> To: Frank Schilder; ceph-users@xxxxxxx
>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>
>>> From your response to Stefan I'm getting that one of the two damaged hosts
>>> has all OSDs down and unable to start. Is that correct? If so, you can
>>> reboot it with no problem and proceed with manual compaction [and other
>>> experiments] quite "safely" for the rest of the cluster.
>>>
>>>
>>> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>>>> Hi Igor,
>>>>
>>>> I can't access these drives. They have an OSD or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well.
>>>>
>>>> I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly.
>>>>
>>>> After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>> Sent: 06 October 2022 13:28:11
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>
>>>> IIUC the OSDs that expose "had timed out after 15" are failing to start
>>>> up. Is that correct, or did I miss something? I meant trying compaction
>>>> for them...
>>>>
>>>>
>>>> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>>>>> Hi Igor,
>>>>>
>>>>> thanks for your response.
>>>>>
>>>>>> And what's the target Octopus release?
>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>
>>>>> I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a way to make the OSDs more crash tolerant until I have full redundancy again. Is there a setting that increases the OPS timeout, or is there a way to restrict the load to tolerable levels?
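Regarding the question above about the OPS timeout: the "had timed out after 15" messages match the default osd_op_thread_timeout of 15 seconds. A rough sketch of knobs that could be adjusted temporarily is below; the values are arbitrary examples rather than recommendations from this thread, and raising timeouts only papers over a disk that is genuinely unresponsive:

  ceph config set osd osd_op_thread_timeout 60           # op thread heartbeat timeout, default 15 s
  ceph config set osd osd_op_thread_suicide_timeout 300  # OSD aborts if a thread exceeds this, default 150 s
  ceph config set osd osd_max_backfills 1                # throttle backfill work per OSD
  ceph config set osd osd_recovery_max_active 1          # throttle concurrent recovery ops per OSD

These are runtime settings and can be reverted with "ceph config rm osd <option>" once full redundancy is restored.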
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>> Sent: 06 October 2022 13:15
>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>>
>>>>> Hi Frank,
>>>>>
>>>>> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
>>>>> which are showing
>>>>>
>>>>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>>>>>
>>>>>
>>>>> I could see such an error after bulk data removal and following severe
>>>>> DB performance drop pretty often.
>>>>>
>>>>> Thanks,
>>>>> Igor
>>>> --
>>>> Igor Fedotov
>>>> Ceph Lead Developer
>>>>
>>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>>
>>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>>> CEO: Martin Verges - VAT-ID: DE310638492
>>>> Com. register: Amtsgericht Munich HRB 231263
>>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>>
>>> --
>>> Igor Fedotov
>>> Ceph Lead Developer
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
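For reference, a minimal sketch of the compaction routes mentioned in this thread (offline via ceph-kvstore-tool, online via ceph tell, and osd_compact_on_start at boot); the OSD id NN and the data path are placeholders, and the offline variant requires the OSD to be stopped first:

  # Offline RocksDB compaction of a stopped OSD:
  systemctl stop ceph-osd@NN
  ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-NN compact
  systemctl start ceph-osd@NN

  # Online compaction of a running OSD (adds load while it runs):
  ceph tell osd.NN compact

  # Compact automatically at OSD start, as mentioned earlier in the thread:
  ceph config set osd osd_compact_on_start true

Offline compaction avoids adding client-visible load but requires taking the OSD down, which is exactly the trade-off debated above.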