I have a strange issue. It's a 3-node cluster, deployed on Ubuntu, in containers, running version 15.2.4 (docker.io/ceph/ceph:v15). It only runs RGW, and everything seems fine and everything works: no errors, and the cluster is healthy. As soon as one node is restarted, all IO is blocked, apparently because of slow ops, but I see no reason for it. It's running as simply as possible, with a replica count of 3. The second the OSDs on the halted node disappear I see slow ops, but they block everything, and there is no IO to the cluster. The slow requests are spread across all of the remaining OSDs.

2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed

When the node comes back up, the slow ops disappear and IO resumes. I have tried to replicate this in a test environment using the same Ceph version, as the other cluster is now in production, but have not succeeded in reproducing it. Any insights or ideas would be appreciated.
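For reference, this is roughly what I check during the outage; with size=3 and the default min_size=2, losing one node should still leave enough replicas for IO. (The commands are illustrative, the osd.13 target is just one of the OSDs reporting slow requests.)

```shell
# Replication settings per pool - output lines contain "size 3 min_size 2"
ceph osd pool ls detail | grep -E 'size|min_size'

# Cluster health and OSD up/down state while the node is halted
ceph health detail
ceph osd tree

# Inspect the ops stuck on one of the affected OSDs via its admin socket
ceph daemon osd.13 dump_ops_in_flight
```

Nothing in this output looks wrong to me while all nodes are up.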
--
Kind regards,

Troels Hansen
Senior Linux Consultant
Phone: 22 43 71 57
tha@xxxxxxxxxx
www.miracle.dk
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx