Troels;

This sounds like a failure domain issue. If I remember correctly, Ceph defaults to a failure domain of disk (osd), while you need a failure domain of host.

Could you do a ceph -s while one of the hosts is offline? You're looking for the HEALTH_ flag, and any errors other than slow ops.

Also, what major version of Ceph are you running?

Thank you,

Dominic L. Hilsbos, MBA
Vice President - Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com

-----Original Message-----
From: Troels Hansen [mailto:tha@xxxxxxxxxx]
Sent: Monday, October 25, 2021 12:55 AM
To: ceph-users@xxxxxxx
Subject: Rebooting one node immediately blocks IO via RGW

I have a strange issue.

It's a 3-node cluster, deployed on Ubuntu in containers, running version 15.2.4 (docker.io/ceph/ceph:v15).

It's only running RGW, and everything seems fine and works. No errors, and the cluster is healthy.

As soon as one node is restarted, all IO is blocked, apparently because of slow ops, but I see no reason for it. It's running as simply as possible, with a replica count of 3.

The second the OSDs on the halted node disappear I see slow ops, but it blocks everything, and there is no IO to the cluster. The slow requests are spread across all of the remaining OSDs:

2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed

When the node comes back up, the slow ops disappear and IO resumes.

I have tried to replicate it in a test environment using the same Ceph version, as the other cluster is now running in production, but have not succeeded in reproducing it.

Any insights or ideas would be appreciated.

--
Best regards

Troels Hansen
Senior Linux Consultant
Tel.: 22 43 71 57
tha@xxxxxxxxxx
www.miracle.dk

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
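
A minimal sketch of the checks Dominic suggests above, assuming the ceph CLI is reachable on one of the nodes (for example via "cephadm shell" or a "docker exec" into a mon container); every command below is read-only:

# Failure domain: for a replicated rule, the chooseleaf step should use
# "type": "host" rather than "type": "osd" if data is meant to survive
# the loss of a whole node.
ceph osd crush rule dump

# Replica count, min_size, and crush_rule per pool.
ceph osd pool ls detail

# While one host is offline: overall status, health flags, and any PGs
# that have gone inactive (inactive PGs are what block client IO).
ceph -s
ceph health detail
ceph pg dump_stuck inactive

With size 3 / min_size 2 and a host failure domain, PGs should stay active through a single-node reboot; if min_size equals size, or the failure domain is osd, writes can block until the missing OSDs return, which would match the symptom described.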