Hello Abdelillah,

Thanks for replying. Since the kernel was in a hung state, we were not able
to log in to the node until we rebooted it.

On previous occasions we saw the logs below whenever an OSD was marked down
by the monitor nodes:

blk_update_request: I/O error, dev nvme0n1, sector 745934488
blk_update_request: I/O error, dev nvme0n1, sector 745934136
blk_update_request: I/O error, dev nvme0n1, sector 745934168
blk_update_request: I/O error, dev nvme0n1, sector 745933912
blk_update_request: I/O error, dev nvme0n1, sector 745933656
blk_update_request: I/O error, dev nvme0n1, sector 745934216
blk_update_request: I/O error, dev nvme0n1, sector 745934088
blk_update_request: I/O error, dev nvme0n1, sector 745933640
blk_update_request: I/O error, dev nvme0n1, sector 745934104
blk_update_request: I/O error, dev nvme0n1, sector 745933528
Buffer I/O error on dev dm-0, logical block 1802612857, lost async page write
Buffer I/O error on dev dm-0, logical block 1802612857, lost async page write
Buffer I/O error on dev dm-0, logical block 3750729712, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read
Buffer I/O error on dev dm-0, logical block 0, async page read

We are not logging to /var/log/messages due to storage constraints, and
because the node was hung and then rebooted, nothing got recorded this time
in any of the logs, be it the OSD logs or dmesg.

Regards
Prayank Saxena

On Thu, 3 Feb 2022 at 03:43, Abdelillah Asraoui <aasraoui@xxxxxxxxx> wrote:

> What kind of disk failure are you seeing on disk01?
>
> On Wed, Feb 2, 2022 at 10:27 AM Prayank Saxena <pr31189@xxxxxxxxx> wrote:
>
>> Hello Everyone,
>>
>> We have run into an issue where the OS hanging on an OSD host causes the
>> cluster to stop ingesting data.
>>
>> Ceph cluster details:
>>
>> Ceph Object Storage v14.2.22
>> No. of monitor nodes: 5
>> No. of RGW nodes: 5
>> No. of OSDs: 252 (all NVMe)
>> OS: CentOS 7.9
>> Kernel: 3.10.0-1160.45.1.el7.x86_64
>> Network: one 25G NIC port (used as both public and cluster network)
>>
>> Data node hardware spec:
>> FW: EDB8007Q
>> Model: SAMSUNG MZ4LB15THMLA-00003
>>
>> Our architecture has two OSDs per host, and disk01 is shared with the
>> kernel (OS):
>>
>>                 ┌──────┐
>>          ┌──────┤disk01│
>> ┌────────┤      └──────┘
>> │osd-host│
>> └────────┤
>>          │      ┌──────┐
>>          └──────┤disk02│
>>                 └──────┘
>>
>> disk01 is detected as failed and is out of the cluster, but disk02 is not.
>>
>> The OSD host's kernel suddenly hangs (we cannot ssh in). This happens from
>> time to time, and the OSDs residing on that host have always been flagged
>> as down.
>>
>> This time, when the host hung, disk01 was flagged as down but disk02 was
>> not. At the same time all RGW services alerted as down, yet on checking,
>> the services appeared to be up. This caused the cluster to stop ingesting
>> data; only after the OSD host was shut down did we regain cluster
>> ingestion. No other OSDs were down and out, and the cluster was in OK
>> state before we had this incident.
>>
>> *From the logs:*
>> --------
>> 2022-01-25 15:42:10.994227 mon.cluster01-mon-001 (mon.0) 25030 : cluster [DBG] osd.57 reported failed by osd.118
>> 2022-01-25 15:42:11.340162 mon.cluster01-mon-001 (mon.0) 25031 : cluster [DBG] osd.57 reported failed by osd.248
>> 2022-01-25 15:42:11.340435 mon.cluster01-mon-001 (mon.0) 25032 : cluster [INF] osd.57 failed (root=default,rack=rack004,host=cluster01-osd-1029) (2 reporters from different host after 23.000120 >= grace 20.000000)
>> 2022-01-25 15:42:11.395065 mon.cluster01-mon-001 (mon.0) 25033 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
>> --------
>>
>> We have encountered OS hang situations before, but every time the OSDs
>> were marked down by the cluster after the grace period, and client I/O was
>> never impacted. This time, because of the issue above, the RGW nodes were
>> not able to take any requests.
>>
>> Has anyone encountered a similar case where a kernel hang does not flag
>> the OSD as down? If yes, are there any proactive measures we can take to
>> avoid such an incident? (see the sketches after this message)
>>
>> Also, if any more information is needed, please do let me know. Any help
>> would be appreciated.
>>
>> Regards
>> Prayank Saxena
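
A minimal health-check sketch for the blk_update_request / Buffer I/O errors
quoted at the top of this thread, assuming nvme-cli and smartmontools are
installed on the OSD host and that /dev/nvme0n1 (taken from the log lines) is
the affected device:

# Controller-level SMART data: critical_warning, media_errors and
# percentage_used are the first fields to watch for failing media.
nvme smart-log /dev/nvme0n1 | egrep 'critical_warning|media_errors|percentage_used'

# Full SMART/health output plus the drive's error information log.
smartctl -a /dev/nvme0n1
nvme error-log /dev/nvme0n1 | head -40

# Scan the kernel ring buffer for the same I/O errors without relying on
# /var/log/messages.
dmesg -T | egrep -i 'blk_update_request|buffer i/o error|nvme'

Nautilus also has cluster-wide device health tracking (ceph device ls,
ceph device get-health-metrics <devid>), which should expose the same SMART
counters from the monitors, assuming the devicehealth mgr module is enabled.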
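
On the proactive-measures question: a sketch of the knobs involved, assuming
the defaults visible in the osd.57 log line (grace 20 s, two reporters from
different hosts). Since the hung host apparently kept answering heartbeats
for disk02's OSD, making the node panic and reboot on a hung task may matter
more than the grace value itself; the sysctl numbers below are illustrative,
not recommendations, and osd.57 / mon.cluster01-mon-001 are used only because
they appear in the log excerpt above.

# Failure-detection settings actually in effect (ceph daemon talks to the
# admin socket, so run each command on the host of that daemon).
ceph daemon osd.57 config get osd_heartbeat_grace
ceph daemon mon.cluster01-mon-001 config get mon_osd_min_down_reporters
ceph daemon mon.cluster01-mon-001 config get mon_osd_reporter_subtree_level

# The centralized config database can adjust them cluster-wide, e.g.:
# ceph config set osd osd_heartbeat_grace 20

# Let a hung node fail fast instead of lingering half-alive, so its peers
# and the monitors see it as down (example values; test before rolling out):
sysctl -w kernel.hung_task_timeout_secs=120
sysctl -w kernel.hung_task_panic=1
sysctl -w kernel.panic=10    # auto-reboot 10 s after a panic

If a hung node reliably panics and reboots, the peer heartbeat failures and
the two-reporter rule should mark its OSDs down within the grace period,
which is what happened in the earlier incidents described above.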