Troels;

This sounds like a failure domain issue. If I remember correctly, Ceph defaults to a failure domain of disk (osd), while you need a failure domain of host.

Could you do a ceph -s while one of the hosts is offline? You're looking for the HEALTH_ flag, and any errors other than slow ops.

Also, what major version of Ceph are you running?

Thank you,

Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com

-----Original Message-----
From: Troels Hansen [mailto:tha@xxxxxxxxxx]
Sent: Monday, October 25, 2021 12:55 AM
To: ceph-users@xxxxxxx
Subject: Rebooting one node immediately blocks IO via RGW

I have a strange issue.

It's a 3-node cluster, deployed on Ubuntu in containers, running version 15.2.4 (docker.io/ceph/ceph:v15).

It's only running RGW, and everything seems fine and works. No errors, and the cluster is healthy.

As soon as one node is restarted, all IO is blocked, apparently because of slow ops, but I see no reason for it. It's running as simply as possible, with a replica count of 3.

The second the OSDs on the halted node disappear I see slow ops, but it blocks everything, and there is no IO to the cluster. The slow requests are spread across all of the remaining OSDs:

2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed

When the node comes back up, the slow ops disappear and IO resumes.

I have tried to replicate it in a test environment using the same Ceph version, as the other cluster is now running in production, but have not succeeded in reproducing it.

Any insights or ideas would be appreciated.

--
Best regards

Troels Hansen
Senior Linux Consultant
Tel.: 22 43 71 57
tha@xxxxxxxxxx
www.miracle.dk

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
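
A minimal sketch of the checks Dominic suggests above, assuming the ceph CLI is reachable on one of the nodes (for example via "cephadm shell" or a "docker exec" into a mon container); every command below is read-only:

# Failure domain: for a replicated rule, the chooseleaf step should use
# "type": "host" rather than "type": "osd" if data is meant to survive
# the loss of a whole node.
ceph osd crush rule dump

# Replica count, min_size, and crush_rule per pool.
ceph osd pool ls detail

# While one host is offline: overall status, health flags, and any PGs
# that have gone inactive (inactive PGs are what block client IO).
ceph -s
ceph health detail
ceph pg dump_stuck inactive

With size 3 / min_size 2 and a host failure domain, PGs should stay active through a single-node reboot; if min_size equals size, or the failure domain is osd, writes can block until the missing OSDs return, which would match the symptom described.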