Hi Peter,

your suggestion pointed me to the right spot. I didn't know about the
feature that Ceph will read from replica PGs. Following that hint, I found
two functions in osd/PrimaryLogPG.cc: "check_laggy" and
"check_laggy_requeue". Both start with a check whether all peers have the
Octopus feature; if not, the rest of the function is skipped. This explains
why the problem started once about half of the cluster was updated.

To verify this, I added "return true" as the first line of both functions
(roughly as in the sketch at the end of this mail). With that change the
issue is gone, but I don't know what problems it could trigger, and I know
the root cause is not fixed by it. I think I will open a bug ticket with
this knowledge.

osd_op_queue_cut_off is set to high, and ICMP rate limiting should not be
happening here.

Thanks
Manuel


On Thu, 10 Jun 2021 11:28:48 +0200
Peter Lieven <pl@xxxxxxx> wrote:

> On 10.06.21 at 11:08, Manuel Lausch wrote:
> > Hi,
> >
> > has no one an idea what could cause this issue, or how I could debug
> > it?
> >
> > In a few days I have to go live with this cluster. If I don't have a
> > solution, I will have to go live with Nautilus.
>
> Hi Manuel,
>
> I had similar issues with Octopus and am thus stuck with Nautilus.
>
> Can you debug the slow ops and see if they are caused by the status
> "waiting for readable"?
>
> I suspected that it has something to do with the new feature in
> Octopus to read from all OSDs regardless of whether they are master
> for a PG or not.
>
> Can you also verify that osd_op_queue_cut_off is set to high and that
> icmp rate limiting is disabled on your hosts?
>
> Peter
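

For reference, here is a small self-contained sketch of the logic I am
describing above. It is NOT the real code from osd/PrimaryLogPG.cc; the
names (PeerInfo, FEATURE_OCTOPUS, the check_laggy signature) are simplified
placeholders I made up for illustration. It only shows how a feature gate
like this skips the laggy/readable check while the cluster is mixed-version,
and where the "return true" test bypass goes:

// Illustrative sketch only -- not the actual Ceph implementation.
// Models a check_laggy()-style function that is gated on every peer
// advertising the Octopus feature bit.
#include <cstdint>
#include <iostream>
#include <vector>

constexpr uint64_t FEATURE_OCTOPUS = 1ULL << 57;   // placeholder feature bit

struct PeerInfo {
  uint64_t features;   // feature bits advertised by this OSD (made up)
  bool     laggy;      // lease expired / not readable (made up)
};

// Returns true if the op may proceed, false if it must wait for readable.
bool check_laggy(const std::vector<PeerInfo>& acting)
{
  // return true;   // <-- the test bypass: skip the laggy gate entirely

  // Mixed-version gate: if any peer lacks the Octopus feature, the
  // read-lease machinery is skipped and the op proceeds immediately.
  for (const auto& p : acting)
    if (!(p.features & FEATURE_OCTOPUS))
      return true;

  // Otherwise the op is held back while a lease is not readable.
  for (const auto& p : acting)
    if (p.laggy)
      return false;        // would be queued "waiting for readable"
  return true;
}

int main()
{
  std::vector<PeerInfo> mixed    = {{FEATURE_OCTOPUS, true}, {0, true}};
  std::vector<PeerInfo> upgraded = {{FEATURE_OCTOPUS, true},
                                    {FEATURE_OCTOPUS, false}};

  std::cout << "mixed cluster, laggy peer -> proceed? "
            << check_laggy(mixed) << "\n";      // prints 1: gate skipped
  std::cout << "all-Octopus, laggy peer   -> proceed? "
            << check_laggy(upgraded) << "\n";   // prints 0: op would wait
}

Compiled with g++ -std=c++17, the mixed-version case prints 1 (the gate is
skipped, ops proceed even with a laggy peer), while the fully upgraded case
prints 0 (the op would wait for readable) -- which matches the behaviour we
only started seeing once roughly half of the OSDs were on Octopus.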