Re: Rebooting one node immediately blocks IO via RGW

All pools are replicated, size 3, min_size 2, with failure domain host.
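
For reference, this is roughly how those settings can be confirmed (a minimal sketch, assuming shell access to an admin/mon node; the comments describe what the output is expected to contain, not literal output from this cluster):

  ceph osd pool ls detail    # each pool line should include "replicated size 3 min_size 2"
  ceph osd crush rule dump   # the replicated rule's chooseleaf step should show "type": "host"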

On Mon, Oct 25, 2021 at 11:07 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> what's the pool's min_size?
>
> ceph osd pool ls detail
>
>
> Zitat von Troels Hansen <tha@xxxxxxxxxx>:
>
> > I have a strange issue.
> > It's a 3-node cluster, deployed on Ubuntu, in containers, running version
> > 15.2.4, docker.io/ceph/ceph:v15.
> >
> > It's only running RGW, everything seems fine, and everything works.
> > No errors, and the cluster is healthy.
> >
> > As soon as one node is restarted, all IO is blocked, apparently because of
> > slow ops, but I see no reason for it.
> >
> > It's running as simply as possible, with a replica count of 3.
> >
> > The second the OSDs on the halted node disappear I see slow ops, but it's
> > blocking everything, and there is no IO to the cluster.
> >
> > The slow requests are spread across all of the remaining OSDs.
> >
> > 2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
> > 2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> > 2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> > 2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> > 2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
> > 2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> > 2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
> > 2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> >
> >
> > When the node comes back up, the slow ops disappear and IO resumes.
> >
> > I have tried to replicate this in a test environment using the same Ceph
> > version, as the other cluster is now running in production, but have not
> > succeeded in reproducing it.
> >
> > Any insights or ideas would be appreciated.
> >
> > --
> > Best regards
> >
> >
> > *Troels Hansen*
> > Senior Linux Consultant
> >
> > Tel.: 22 43 71 57
> > tha@xxxxxxxxxx
> > www.miracle.dk
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 
Best regards


*Troels Hansen*
Senior Linux Consultant

Tel.: 22 43 71 57
tha@xxxxxxxxxx
www.miracle.dk
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


