Re: OSD spend too much time on "waiting for readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart

Hi Manuel,

Before giving up and putting in an off switch, I'd like to understand why
it is taking as long as it is for the PGs to go active.

Would you consider enabling debug_osd=10 and debug_ms=1 on your OSDs, and
debug_mon=10 + debug_ms=1 on the mons, and reproducing this (without the
patch applied this time of course!)?  The logging will slow things down a
bit but hopefully the behavior will be close enough to what you see
normally that we can tell what is going on (and presumably picking out the
pg that was most laggy will highlight the source(s) of the delay).
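
For reference, the levels above could be set cluster-wide through the
centralized config database. This is only a minimal sketch, assuming a
release that provides `ceph config set` / `ceph config rm` and that dropping
the overrides afterwards is acceptable. The function names and the CEPH
override variable are illustrative helpers, not part of Ceph.

```shell
#!/bin/sh
# Hypothetical helpers: raise the debug levels asked for above, then undo them.
# Set CEPH=echo to print the commands instead of running them (dry run).

enable_debug() {
    # OSD side: debug_osd=10, debug_ms=1
    ${CEPH:-ceph} config set osd debug_osd 10
    ${CEPH:-ceph} config set osd debug_ms 1
    # Monitor side: debug_mon=10, debug_ms=1
    ${CEPH:-ceph} config set mon debug_mon 10
    ${CEPH:-ceph} config set mon debug_ms 1
}

restore_debug() {
    # Removing the overrides falls back to the previous/default levels.
    ${CEPH:-ceph} config rm osd debug_osd
    ${CEPH:-ceph} config rm osd debug_ms
    ${CEPH:-ceph} config rm mon debug_mon
    ${CEPH:-ceph} config rm mon debug_ms
}
```

Calling `enable_debug` right before reproducing and `restore_debug` right
after keeps the window of verbose logging short; `CEPH=echo enable_debug`
previews the commands without touching the cluster.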

sage

On Wed, Nov 10, 2021 at 4:41 AM Manuel Lausch <manuel.lausch@xxxxxxxx>
wrote:

> This is the patch I made. I think this is the wrong place to do this, but
> as a first attempt it worked.
>
>
> diff --git a/src/osd/PrimaryLogPG.cc b/src/osd/PrimaryLogPG.cc
> index 9fb22e0f9ee..69341840153 100644
> --- a/src/osd/PrimaryLogPG.cc
> +++ b/src/osd/PrimaryLogPG.cc
> @@ -798,6 +798,10 @@ void PrimaryLogPG::maybe_force_recovery()
>
>  bool PrimaryLogPG::check_laggy(OpRequestRef& op)
>  {
> +  if (!cct->_conf->osd_read_lease_enabled) {
> +    // possibility to deactivate this feature.
> +    return true;
> +  }
>    if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
>                     SERVER_OCTOPUS)) {
>      dout(20) << __func__ << " not all upacting has SERVER_OCTOPUS" << dendl;
> @@ -833,6 +837,10 @@ bool PrimaryLogPG::check_laggy(OpRequestRef& op)
>
>  bool PrimaryLogPG::check_laggy_requeue(OpRequestRef& op)
>  {
> +  if (!cct->_conf->osd_read_lease_enabled) {
> +    // possibility to deactivate this feature.
> +    return true;
> +  }
>    if (!HAVE_FEATURE(recovery_state.get_min_upacting_features(),
>                     SERVER_OCTOPUS)) {
>      return true;
>
>
> ________________________________________
> From: Peter Lieven <pl@xxxxxxx>
> Sent: Wednesday, 10 November 2021 11:37
> To: Manuel Lausch; Sage Weil
> Cc: ceph-users@xxxxxxx
> Subject: Re:  Re: OSD spend too much time on "waiting for
> readable" -> slow ops -> laggy pg -> rgw stop -> worst case osd restart
>
> On 10.11.21 at 11:35, Manuel Lausch wrote:
> > oh shit,
> >
> > I patched in a switch to deactivate the read_lease feature. It is only a
> > hack for experimenting, but I accidentally had the switch enabled during
> > the last tests I reported in this mail thread.
> >
> > The bad news: setting require_osd_release doesn't fix the slow ops
> > problem; it only fixes the steadily increasing osdmap epochs.
> > Unfortunately, even reducing paxos_propose_interval doesn't change
> > anything. My last tests with it were wrong because of my hack :-(
>
>
> Would it be an option to make this hack a switch for all those who don't
> require the read lease feature and are happy with reading from just the
> primary?
>
>
> Peter
>
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


