Can you share more details about that cluster, like the applied CRUSH
rules and the output of 'ceph -s' and 'ceph osd tree'?
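Something along these lines should cover it (a quick sketch; rule and
pool names will of course differ per cluster):

  ceph -s
  ceph osd tree
  ceph osd crush rule dump
  ceph osd pool ls detail
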
Quoting Troels Hansen <tha@xxxxxxxxxx>:
All pools are:
replicated size 3 min_size 2
failure domain host.
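For reference, that corresponds to roughly the following (a sketch only,
assuming a replicated CRUSH rule with host as the failure domain; the
actual rule and pool names on the cluster will differ):

  ceph osd crush rule create-replicated replicated_host default host
  ceph osd pool set <pool> crush_rule replicated_host
  ceph osd pool set <pool> size 3
  ceph osd pool set <pool> min_size 2
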
On Mon, Oct 25, 2021 at 11:07 AM Eugen Block <eblock@xxxxxx> wrote:
Hi,
what's the pool's min_size?
ceph osd pool ls detail
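That should show, per pool, something like this (illustrative line only,
pool name and numbers assumed):

  pool 7 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 ...
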
Quoting Troels Hansen <tha@xxxxxxxxxx>:
> I have a strange issue...
> It's a 3-node cluster, deployed on Ubuntu in containers, running version
> 15.2.4 (docker.io/ceph/ceph:v15).
>
> It's only running RGW, everything seems fine, and everything works.
> No errors, and the cluster is healthy.
>
> As soon as one node is restarted, all IO is blocked, apparently because
> of slow ops, but I see no reason for it.
>
> It's running as simply as possible, with a replica count of 3.
>
> The second the OSDs on the halted node disappear I see slow ops, but it
> blocks everything, and there is no IO to the cluster.
>
> The slow requests are spread across all of the remaining OSDs.
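> 
> (When this happens, the blocked ops can also be inspected directly on an
> affected OSD, e.g. -- a sketch, using osd.13 from the log below, run from
> inside that OSD's container:
> 
>   ceph daemon osd.13 dump_ops_in_flight
>   ceph daemon osd.13 dump_historic_ops
> )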
>
> 2021-10-20T05:07:02.554282+0200 mon.prodceph-mon1 [WRN] Health check failed: 0 slow ops, oldest one blocked for 30 sec, osd.4 has slow ops (SLOW_OPS)
> 2021-10-20T05:07:04.652756+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> 2021-10-20T05:07:05.585995+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> 2021-10-20T05:07:05.629622+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> 2021-10-20T05:07:05.629660+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> 2021-10-20T05:07:05.629690+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> 2021-10-20T05:07:06.555735+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> 2021-10-20T05:07:06.677696+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> 2021-10-20T05:07:06.677732+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> 2021-10-20T05:07:06.677750+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> 2021-10-20T05:07:07.553717+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> 2021-10-20T05:07:07.643135+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
> 2021-10-20T05:07:07.643159+0200 osd.13 [WRN] slow request osd_op(client.394158.0:62776924 4.d 4:b4812045:::notify.4:head [watch ping cookie 94141521019648] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.165999+0000 currently delayed
> 2021-10-20T05:07:07.643175+0200 osd.13 [WRN] slow request osd_op(client.305099.0:3244269 4.d 4:b4812045:::notify.4:head [watch ping cookie 94522369776384] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:35.402403+0000 currently delayed
> 2021-10-20T05:07:08.368877+0200 mon.prodceph-mon1 [WRN] Health check update: 0 slow ops, oldest one blocked for 35 sec, osd.4 has slow ops (SLOW_OPS)
> 2021-10-20T05:07:08.570167+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776921 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:35.106815+0000 currently delayed
> 2021-10-20T05:07:08.570200+0200 osd.25 [WRN] slow request osd_op(client.394158.0:62776930 7.1f3 7:cfb51b5f:::5a288701-a65a-45c0-97c7-edfb38f2f487.124110.147864_b19283e9-c7bd-448e-952d-2f172467fa5c:head [getxattrs,stat,read 0~4194304] snapc 0=[] ondisk+read+known_if_redirected e18084) initiated 2021-10-20T03:06:38.518576+0000 currently delayed
> 2021-10-20T05:07:08.598671+0200 osd.13 [WRN] slow request osd_op(client.394115.0:2994408 4.d 4:b4812045:::notify.4:head [watch ping cookie 94796974922496] snapc 0=[] ondisk+write+known_if_redirected e18084) initiated 2021-10-20T03:06:34.010528+0000 currently delayed
>
>
> When the node comes back up the slow ops disappear, and IO resumes.
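> 
> (For a planned restart the usual precaution would be something like the
> following -- a general sketch, not specific to this problem:
> 
>   ceph osd set noout
>   ... reboot the node ...
>   ceph osd unset noout
> 
> although with size 3 / min_size 2 and failure domain host, the PGs should
> stay active through a single host being down either way.)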
>
> I have tried to replicate it in a test environment using the same Ceph
> version, as the other cluster is now running in production, but have not
> succeeded in reproducing it.
>
> Any insights or ideas would be appreciated.
>
> --
> Kind regards
>
> *Troels Hansen*
> Senior Linux Consultant
>
> Tel.: 22 43 71 57
> tha@xxxxxxxxxx
> www.miracle.dk
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx