Hi Igor and Stefan.

>> Not sure why you're talking about replicated(!) 4(2) pool.
>
> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.

I have to take this one back. It is indeed an EC pool, which is also on these SSD OSDs, that is affected. The meta-data pool stayed all-active the whole time until we lost the 3rd host. So, the reported bug is confirmed to affect EC pools.

> If not - does any dead OSD unconditionally mean its underlying disk is
> unavailable any more?

Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system off-line to stop client IO and started rebooting hosts in reverse order of failure. This brought back the OSDs on the still-unconverted hosts. We rebooted the converted host with the original OSD failures last. Unfortunately, here it seems we lost a drive for good. It looks like the OSDs crashed while the conversion was going on, or something like that. They don't boot up and I need to look into that in more detail.

We are currently trying to encourage fs clients to reconnect to the file system. Unfortunately, on many we get

# ls /shares/nfs/ait_pnora01    # this *is* a ceph-fs mount point
ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle

Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried a remount without success (rough commands further down).

Before executing the next conversion, I will compact the RocksDB on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.
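For completeness, this is roughly what I plan to run per SSD OSD before the next conversion (offline compaction with the OSD stopped, assuming our usual ceph-osd@ systemd units and the default data paths; <ID> is a placeholder, so treat this as a sketch and correct me if there is a better way):

# ceph osd set noout                                   # avoid rebalancing while the OSD is down
# systemctl stop ceph-osd@<ID>                         # OSD must be stopped for offline compaction
# ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<ID> compact
# systemctl start ceph-osd@<ID>
# ceph osd unset noout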
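Regarding the stale mounts above, this is roughly what I tried on one of the clients, without luck so far (mount point is from our setup):

# umount -f /shares/nfs/ait_pnora01                    # force-unmount the stale ceph-fs mount
# mount /shares/nfs/ait_pnora01                        # remount via the existing fstab entry

I am not sure whether evicting the stale sessions on the MDS side and then remounting is the right (or a safe) approach, but that is what I am considering next; mds.0 and the client id are placeholders:

# ceph tell mds.0 client ls                            # find the id of the stale client session
# ceph tell mds.0 client evict id=<client-id>          # evicts (and by default blacklists) that client

Please tell me if this is a bad idea.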
Thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 14:39
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re: OSD crashes during upgrade mimic->octopus

Are crashing OSDs still bound to two hosts? If not - does any dead OSD unconditionally mean its underlying disk is unavailable any more?

On 10/6/2022 3:35 PM, Frank Schilder wrote:
> Hi Igor.
>
>> Not sure why you're talking about replicated(!) 4(2) pool.
> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>
> I just lost another disk, we have PGs down now. I really hope the stuck bstore_kv_sync thread does not lead to rocksdb corruption.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 06 October 2022 14:26
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re: OSD crashes during upgrade mimic->octopus
>
> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>> Hi Igor,
>>
>> it has the SSD OSDs down, the HDD OSDs are running just fine. I don't want to make a bad situation worse for now and wait for recovery to finish. The inactive PGs are activating very slowly.
> Got it.
>
>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even inactive here? This "feature" is new in octopus, I reported it about 2 months ago as a bug. Testing with mimic I cannot reproduce this problem: https://tracker.ceph.com/issues/56995
> Not sure why you're talking about replicated(!) 4(2) pool. In the above ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile ec-4-2...). Which means 6 shards per object, and maybe this setup has some issues with mapping to unique OSDs within a host (just 3 hosts are available!) ... One can see that PGs 4.* are marked as inactive only. Not a big expert in this stuff so mostly just speculating....
>
> Do you have the same setup in the production cluster in question? If so - then you lack 2 of 6 shards and IMO the cluster properly marks the relevant PGs as inactive. The same would apply to 3x replicated PGs as well though, since two replicas are down..
>
>> I found this in the syslog, maybe it helps:
>>
>> kernel: task:bstore_kv_sync state:D stack: 0 pid:3646032 ppid:3645340 flags:0x00000000
>> kernel: Call Trace:
>> kernel: __schedule+0x2a2/0x7e0
>> kernel: schedule+0x4e/0xb0
>> kernel: io_schedule+0x16/0x40
>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>> kernel: ? __page_cache_alloc+0xb0/0xb0
>> kernel: wait_on_page_bit+0x3f/0x50
>> kernel: wait_on_page_writeback+0x26/0x70
>> kernel: __filemap_fdatawait_range+0x98/0x100
>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>> kernel: file_fdatawait_range+0x1a/0x30
>> kernel: sync_file_range+0xc2/0xf0
>> kernel: ksys_sync_file_range+0x41/0x80
>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>> kernel: do_syscall_64+0x3b/0x90
>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>> kernel: RIP: 0033:0x7ffbb6f77ae7
>> kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
>> kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
>> kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
>> kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
>> kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
>> kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
>> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
>> kernel: Tainted: G E 5.14.13-1.el7.elrepo.x86_64 #1
>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>
>> It is quite possible that this was the moment when these OSDs got stuck and were marked down. The time stamp is about right.
> Right, this is a primary thread which submits transactions to the DB. And it got stuck for >123 seconds. Given that the disk is completely unresponsive, I presume something has happened at a lower level (controller or disk FW) though.. Maybe this was somehow caused by "fragmented" DB access and compaction would heal this. On the other hand, the compaction had to be applied after the omap upgrade, so I'm not sure another one would change the state...
>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 06 October 2022 13:45:17
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>
>> From your response to Stefan I'm getting that one of the two damaged hosts has all OSDs down and unable to start. Is that correct? If so, you can reboot it with no problem and proceed with manual compaction [and other experiments] quite "safely" for the rest of the cluster.
>>
>> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>>> Hi Igor,
>>>
>>> I can't access these drives. They have an OSD- or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well.
>>>
>>> I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly.
>>>
>>> After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> Sent: 06 October 2022 13:28:11
>>> To: Frank Schilder; ceph-users@xxxxxxx
>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>
>>> IIUC the OSDs that expose "had timed out after 15" are failing to start up. Is that correct or I missed something? I meant trying compaction for them...
>>>
>>> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>>>> Hi Igor,
>>>>
>>>> thanks for your response.
>>>>
>>>>> And what's the target Octopus release?
>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>
>>>> I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a way to make the OSDs more crash tolerant until I have full redundancy again. Is there a setting that increases the OPS timeout or is there a way to restrict the load to tolerable levels?
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>> Sent: 06 October 2022 13:15
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re: OSD crashes during upgrade mimic->octopus
>>>>
>>>> Hi Frank,
>>>>
>>>> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs which are showing
>>>>
>>>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>>>>
>>>> I could see such an error after bulk data removal and following severe DB performance drop pretty often.
>>>>
>>>> Thanks,
>>>> Igor
>>> --
>>> Igor Fedotov
>>> Ceph Lead Developer
>>>
>>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>>
>>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>>> CEO: Martin Verges - VAT-ID: DE310638492
>>> Com. register: Amtsgericht Munich HRB 231263
>>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>>
>> --
>> Igor Fedotov
>> Ceph Lead Developer
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>>
> --
> Igor Fedotov
> Ceph Lead Developer
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
>
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx