Re: OSD crashes during upgrade mimic->octopus

Hi Igor.

> But could you please share the full OSD startup log for any OSD which is
> unable to restart after the host reboot?

Will do. I would also like to know what happened here and whether it is possible to recover these OSDs. The rebuild takes ages with the current throttled recovery settings.
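
For completeness, I plan to pull the startup log roughly like this (<ID> is a placeholder for the OSD id; the raised debug levels are just my guess at what would be most useful):

journalctl -u ceph-osd@<ID> --no-pager > osd-<ID>-startup.log   # or /var/log/ceph/ceph-osd.<ID>.log
ceph config set osd.<ID> debug_osd 10          # optional: more verbose output on the next start attempt
ceph config set osd.<ID> debug_bluestore 20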

> Sorry - no clue about the CephFS-related questions...

Just for the general audience: in the past we did cluster maintenance by setting "ceph fs set FS down true" (freezing all client IO in D-state), waiting for all MDSes to become standby and then doing the job. Afterwards we set "ceph fs set FS down false", the MDSes started again, and all clients reconnected more or less instantaneously and continued exactly at the point where they had been frozen.
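
In commands, the sequence is roughly the following (FS is the file system name; using "ceph fs status" to wait for the MDSes is my assumption of how one would check):

ceph fs set FS down true     # ranks stop, client IO freezes in D-state
ceph fs status FS            # repeat until all MDSes are listed as standby
# ... perform the maintenance ...
ceph fs set FS down false    # ranks restart, clients resume where they froze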

This time, a huge number of clients simply crashed instead of freezing, and of the few that remained up only a small number reconnected. In our experience this is very unusual behaviour. Was there a change, or are we looking at a potential bug here?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 17:03
To: Frank Schilder; ceph-users@xxxxxxx
Cc: Stefan Kooman
Subject: Re:  OSD crashes during upgrade mimic->octopus

Sorry - no clue about the CephFS-related questions...

But could you please share the full OSD startup log for any OSD which is
unable to restart after the host reboot?


On 10/6/2022 5:12 PM, Frank Schilder wrote:
> Hi Igor and Stefan.
>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
> I have to take this one back. It is indeed an EC pool, also on these SSD OSDs, that is affected. The meta-data pool was fully active the whole time until we lost the 3rd host. So the reported bug is confirmed to affect EC pools.
>
>> If not - does a dead OSD unconditionally mean that its underlying disk is
>> no longer available?
> Fortunately not. After losing disks on the 3rd host, we had to start taking somewhat more desperate measures. We set the file system off-line to stop client IO and started rebooting hosts in reverse order of failure. This brought back the OSDs on the still-unconverted hosts. We rebooted the converted host with the original OSD failures last. Unfortunately, it seems we lost a drive there for good. It looks like the OSDs crashed while the conversion was in progress. They don't boot up and I need to look into that in more detail.
>
> We are currently trying to encourage fs clients to reconnect to the file system. Unfortunately, on many we get
>
> # ls /shares/nfs/ait_pnora01 # this *is* a ceph-fs mount point
> ls: cannot access '/shares/nfs/ait_pnora01': Stale file handle
>
> Is there a server-side way to encourage the FS clients to reconnect to the cluster? What is a clean way to get them back onto the file system? I tried remounts without success.
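>
> For the record, what I tried on a client was roughly the following (mount point as above, assuming an fstab entry for the cephfs mount):
>
> umount -f /shares/nfs/ait_pnora01
> mount /shares/nfs/ait_pnora01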
>
> Before executing the next conversion, I will compact the rocksdb on all SSD OSDs. The HDDs seem to be entirely unaffected. The SSDs have a very high number of objects per PG, which is potentially the main reason for our observations.
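>
> (Per OSD, with the daemon stopped; <ID> is a placeholder; something like:)
>
> systemctl stop ceph-osd@<ID>
> ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<ID> compact
> systemctl start ceph-osd@<ID>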
>
> Thanks for your help,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 06 October 2022 14:39
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re:  OSD crashes during upgrade mimic->octopus
>
> Are crashing OSDs still bound to two hosts?
>
> If not - does a dead OSD unconditionally mean that its underlying disk is
> no longer available?
>
>
> On 10/6/2022 3:35 PM, Frank Schilder wrote:
>> Hi Igor.
>>
>>> Not sure why you're talking about replicated(!) 4(2) pool.
>> It's because in the production cluster it's the 4(2) pool that has that problem. On the test cluster it was an EC pool. Seems to affect all sorts of pools.
>>
>> I just lost another disk; we have PGs down now. I really hope the stuck bstore_kv_sync thread does not lead to RocksDB corruption.
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>> Sent: 06 October 2022 14:26
>> To: Frank Schilder; ceph-users@xxxxxxx
>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>
>> On 10/6/2022 2:55 PM, Frank Schilder wrote:
>>> Hi Igor,
>>>
>>> it has the SSD OSDs down; the HDD OSDs are running just fine. I don't want to make a bad situation worse, so for now I will wait for recovery to finish. The inactive PGs are activating very slowly.
>> Got it.
>>
>>> By the way, there are 2 out of 4 OSDs up in the replicated 4(2) pool. Why are PGs even inactive here? This "feature" is new in octopus; I reported it as a bug about 2 months ago. Testing with mimic, I cannot reproduce this problem: https://tracker.ceph.com/issues/56995
>> Not sure why you're talking about replicated(!) 4(2) pool. In the above
>> ticket I can see an EC 4+2 one (pool 4 'fs-data' erasure profile
>> ec-4-2...), which means 6 shards per object, and maybe this setup has
>> some issues with mapping to unique OSDs within a host (just 3 hosts are
>> available!)... One can see that only pgs 4.* are marked as inactive.
>> I'm not a big expert in this stuff, so mostly just speculating...
>>
>>
>> Do you have the same setup in the production cluster in question? If so,
>> then you lack 2 of 6 shards and IMO the cluster properly marks the
>> relevant PGs as inactive. The same would apply to 3x replicated PGs as
>> well, though, since two replicas are down.
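>>
>> (If I'm not mistaken, the EC pool's min_size defaults to k+1 = 5 here, so losing
>> 2 of 6 shards drops the affected PGs below min_size. Something like the following
>> should show the pool's min_size and the inactive PGs:)
>>
>> ceph osd pool ls detail
>> ceph pg dump_stuck inactive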
>>
>>
>>> I found this in the syslog, maybe it helps:
>>>
>>> kernel: task:bstore_kv_sync  state:D stack:    0 pid:3646032 ppid:3645340 flags:0x00000000
>>> kernel: Call Trace:
>>> kernel: __schedule+0x2a2/0x7e0
>>> kernel: schedule+0x4e/0xb0
>>> kernel: io_schedule+0x16/0x40
>>> kernel: wait_on_page_bit_common+0x15c/0x3e0
>>> kernel: ? __page_cache_alloc+0xb0/0xb0
>>> kernel: wait_on_page_bit+0x3f/0x50
>>> kernel: wait_on_page_writeback+0x26/0x70
>>> kernel: __filemap_fdatawait_range+0x98/0x100
>>> kernel: ? __filemap_fdatawrite_range+0xd8/0x110
>>> kernel: file_fdatawait_range+0x1a/0x30
>>> kernel: sync_file_range+0xc2/0xf0
>>> kernel: ksys_sync_file_range+0x41/0x80
>>> kernel: __x64_sys_sync_file_range+0x1e/0x30
>>> kernel: do_syscall_64+0x3b/0x90
>>> kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> kernel: RIP: 0033:0x7ffbb6f77ae7
>>> kernel: RSP: 002b:00007ffba478c3c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000115
>>> kernel: RAX: ffffffffffffffda RBX: 000000000000002d RCX: 00007ffbb6f77ae7
>>> kernel: RDX: 0000000000002000 RSI: 000000015f849000 RDI: 000000000000002d
>>> kernel: RBP: 000000015f849000 R08: 0000000000000000 R09: 0000000000002000
>>> kernel: R10: 0000000000000007 R11: 0000000000000293 R12: 0000000000002000
>>> kernel: R13: 0000000000000007 R14: 0000000000000001 R15: 0000560a1ae20380
>>> kernel: INFO: task bstore_kv_sync:3646117 blocked for more than 123 seconds.
>>> kernel:      Tainted: G            E     5.14.13-1.el7.elrepo.x86_64 #1
>>> kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>
>>> It is quite possible that this was the moment when these OSDs got stuck and were marked down. The time stamp is about right.
>> Right, this is the primary thread which submits transactions to the DB, and it
>> has been stuck for >123 seconds. Given that the disk is completely unresponsive,
>> I presume something happened at a lower level (controller or disk FW)
>> though... Maybe this was somehow caused by "fragmented" DB access and
>> compaction would heal it. On the other hand, the compaction had to be
>> applied after the omap upgrade, so I'm not sure another one would change
>> the state...
>>
>>
>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>> Sent: 06 October 2022 13:45:17
>>> To: Frank Schilder; ceph-users@xxxxxxx
>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>
>>> From your response to Stefan I gather that one of the two damaged hosts
>>> has all OSDs down and unable to start. Is that correct? If so, you can
>>> reboot it with no problem and proceed with manual compaction [and other
>>> experiments] quite "safely" for the rest of the cluster.
>>>
>>>
>>> On 10/6/2022 2:35 PM, Frank Schilder wrote:
>>>> Hi Igor,
>>>>
>>>> I can't access these drives. They have an OSD or LVM process hanging in D-state. Any attempt to do something with them gets stuck as well.
>>>>
>>>> I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly.
>>>>
>>>> After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.
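>>>>
>>>> (Roughly, via the central config, assuming the option name as above:)
>>>>
>>>> ceph config set osd osd_compact_on_start true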
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>> Sent: 06 October 2022 13:28:11
>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>
>>>> IIUC the OSDs that report "had timed out after 15" are failing to start
>>>> up. Is that correct, or did I miss something? I meant trying compaction
>>>> for them...
>>>>
>>>>
>>>> On 10/6/2022 2:27 PM, Frank Schilder wrote:
>>>>> Hi Igor,
>>>>>
>>>>> thanks for your response.
>>>>>
>>>>>> And what's the target Octopus release?
>>>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>>>>>
>>>>> I'm afraid I don't have the luxury right now to take OSDs down or to add extra load with an online compaction. I would really appreciate a way to make the OSDs more crash-tolerant until I have full redundancy again. Is there a setting that increases the ops timeout, or is there a way to restrict the load to tolerable levels?
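>>>>>
>>>>> (To illustrate what I mean - the concrete options and values are only my guess
>>>>> at what such knobs would look like:)
>>>>>
>>>>> ceph config set osd osd_op_thread_timeout 60            # default 15, matches the "timed out after 15" messages
>>>>> ceph config set osd osd_op_thread_suicide_timeout 300   # default 150
>>>>> ceph config set osd osd_max_backfills 1                 # keep recovery/backfill load low
>>>>> ceph config set osd osd_recovery_sleep_hdd 0.2          # throttle recovery further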
>>>>>
>>>>> Best regards,
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
>>>>> Sent: 06 October 2022 13:15
>>>>> To: Frank Schilder; ceph-users@xxxxxxx
>>>>> Subject: Re:  OSD crashes during upgrade mimic->octopus
>>>>>
>>>>> Hi Frank,
>>>>>
>>>>> you might want to compact RocksDB with ceph-kvstore-tool for those OSDs
>>>>> which are showing
>>>>>
>>>>> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>>>>>
>>>>>
>>>>>
>>>>> I could see such an error pretty often after bulk data removal and the
>>>>> subsequent severe DB performance drop.
>>>>>
>>>>> Thanks,
>>>>> Igor
--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx




