Re: OSD crashes during upgrade mimic->octopus

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Igor,

I can't access these drives. They have an OSD- or LVM process hanging in D-state. Any attempt to do something with these gets stuck as well.

I somehow need to wait for recovery to finish and protect the still running OSDs from crashing similarly badly.

After we have full redundancy again and service is back, I can add the setting osd_compact_on_start=true and start rebooting servers. Right now I need to prevent the ship from sinking.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Igor Fedotov <igor.fedotov@xxxxxxxx>
Sent: 06 October 2022 13:28:11
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  OSD crashes during upgrade mimic->octopus

IIUC the OSDs that expose "had timed out after 15" are failing to start
up. Is that correct or I missed something?  I meant trying compaction
for them...


On 10/6/2022 2:27 PM, Frank Schilder wrote:
> Hi Igor,
>
> thanks for your response.
>
>> And what's the target Octopus release?
> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
>
> I'm afraid I don't have the luxury right now to take OSDs down or add extra load with an on-line compaction. I would really appreciate a way to make the OSDs more crash tolerant until I have full redundancy again. Is there a setting that increases the OPS timeout or is there a way to restrict the load to tolerable levels?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Igor Fedotov <igor.fedotov@xxxxxxxx>
> Sent: 06 October 2022 13:15
> To: Frank Schilder; ceph-users@xxxxxxx
> Subject: Re:  OSD crashes during upgrade mimic->octopus
>
> Hi Frank,
>
> you might want to compact RocksDB by ceph-kvstore-tool for those OSDs
> which are showing
>
> "heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
>
>
>
> I could see such an error after bulk data removal and following severe
> DB performance drop pretty often.
>
> Thanks,
> Igor

--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux