All pools are replicated, size 3, min_size 2, failure domain host.

On Mon, Oct 25, 2021 at 11:07 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> what's the pool's min_size?
>
> ceph osd pool ls detail
>
>
> Zitat von Troels Hansen <tha@xxxxxxxxxx>:
>
> > I have a strange issue...
> > It's a 3-node cluster, deployed on Ubuntu, in containers, running
> > version 15.2.4, docker.io/ceph/ceph:v15
> >
> > It's only running RGW, and everything seems fine, and everything works.
> > No errors, and the cluster is healthy.
> >
> > As soon as one node is restarted, all IO is blocked, apparently because
> > of slow ops, but I see no reason for it.
> >
> > It's running as simply as possible, with a replica count of 3.
> >
> > The second the OSDs on the halted node disappear I see slow ops, but it's
> > blocking everything, and there is no IO to the cluster.
> >
> > The slow requests are spread across all of the remaining OSDs.
> >
> > 2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
> > 2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
> > 2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
> > 2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> >
> >
> > When the node comes back up the slow ops disappear and IO resumes.
> >
> > I have tried to replicate this in a test environment running the same
> > Ceph version, since the other cluster is now in production, but I have
> > not been able to reproduce it.
> >
> > Any insights or ideas would be appreciated.
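
For anyone following along, the pool settings quoted at the top can be double-checked with the command Eugen mentioned. The pool name below is only an example (the default RGW data pool name), not taken from this cluster; the real names come from "ceph osd pool ls":

    # list every pool with its size, min_size and crush rule
    ceph osd pool ls detail

    # or query a single pool explicitly (pool name is an example)
    ceph osd pool get default.rgw.buckets.data size
    ceph osd pool get default.rgw.buckets.data min_size

    # dump the CRUSH rules to confirm the failure domain is host
    ceph osd crush rule dump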
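
If it helps to see exactly what is stuck while the node is down, the blocked requests can be dumped from the OSDs that report them. A rough sketch, assuming osd.13 from the log above and a containerized deployment where the admin socket is only reachable inside the OSD container (the container name is a placeholder):

    # confirm which OSDs currently report slow ops
    ceph health detail

    # on the host running osd.13, dump its in-flight ops from inside the container
    docker exec <osd.13-container> ceph daemon osd.13 dump_ops_in_flight

    # recently completed ops, including the slow ones, are kept as well
    docker exec <osd.13-container> ceph daemon osd.13 dump_historic_ops

Many of the delayed write ops above are "watch ping" requests against notify.4, which looks like one of the RGW watch/notify control objects, so the radosgw daemons themselves appear to be among the blocked clients.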
--
Kind regards

Troels Hansen
Senior Linux Consultant

Tel.: 22 43 71 57
tha@xxxxxxxxxx
www.miracle.dk
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx