Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Out of curiosity, couldn't waiting_for_readable also mean that our placement group distribution is not optimal yet?
Maybe we have too much data per PG, even if the cluster is not yet complaining that the PG count should be increased?

In my case I have 42x 15.3TB OSDs across 7 servers with a host-based 4:2 EC pool.
It currently stores 85TB of data, which is 1.4 billion objects, and the placement group count on the data pool is currently 128.

The maximum could be 512 in my setup, but the data volume is not yet at the point where the autoscaler starts to complain.
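
As a rough sanity check (just the usual ~100-PG-shards-per-OSD rule of thumb, not something from this thread): with a 4+2 EC profile each PG has 6 shards, so 512 PGs on 42 OSDs works out to about 512 * 6 / 42 ≈ 73 shards per OSD, which fits with 512 being the ceiling here. Assuming the pg_autoscaler mgr module is enabled, the current recommendation can be checked with something like:

  ceph osd pool autoscale-status
  ceph osd pool get data pg_num   # "data" is a placeholder pool name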

Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

-----Original Message-----
From: Manuel Lausch <manuel.lausch@xxxxxxxx> 
Sent: Thursday, November 4, 2021 4:15 PM
To: Sage Weil <sage@xxxxxxxxxxxx>
Cc: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>; Ceph Users <ceph-users@xxxxxxx>; Peter Lieven <pl@xxxxxxx>
Subject: Re:  Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart


On Tue, 2 Nov 2021 09:02:31 -0500
Sage Weil <sage@xxxxxxxxxxxx> wrote:


>
> Just to be clear, you should try
>   osd_fast_shutdown = true
>   osd_fast_shutdown_notify_mon = false

I added some logs to the tracker ticket with these options set.
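
For reference (my sketch, not something discussed in the thread): on clusters with the centralized config store, these can be set at runtime along these lines:

  ceph config set osd osd_fast_shutdown true
  ceph config set osd osd_fast_shutdown_notify_mon false
  ceph config show osd.0 | grep fast_shutdown   # osd.0 is a placeholder ID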


> > You write: if the OSD rejects messenger connections because it is
> > stopped, the peering process will skip the read_lease timeout. If
> > the OSD announces its shutdown, can we not skip this read_lease
> > timeout as well?
> >
>
> If memory serves, yes, but the notify_mon process can take more time 
> than a peer OSD getting ECONNREFUSED.  The combination above is the 
> recommended combination (and the default).

In my tests yesterday I saw again that it took about 2 seconds between stopping an OSD and the first blame in the ceph.log. With the notification enabled, I got the down message immediately.
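
For anyone wanting to reproduce the measurement, roughly (osd.12 and the non-cephadm unit name are placeholders, and the grep pattern is only a guess at the cluster-log wording):

  date; systemctl stop ceph-osd@12
  grep -E 'osd\.12.*(failed|down)' /var/log/ceph/ceph.log | head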







