Re: CephFS warning: clients laggy due to laggy OSDs


 



Hi Janek,

The PR Venky mentioned uses the OSD's laggy parameters (laggy_interval and
laggy_probability) to determine whether any OSD is laggy. These parameters
are reset to 0 only if the interval between the last modification of the
OSDMap and the timestamp at which the OSD was marked down exceeds the grace
interval, which is computed as `mon_osd_laggy_halflife * 48`.
mon_osd_laggy_halflife is configurable and defaults to 3600 seconds, so with
the default the laggy parameters reset only once that interval exceeds
172800 seconds (48 hours). I'd recommend checking your configured value with
`ceph config get osd mon_osd_laggy_halflife`.
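
To make the arithmetic concrete, here is a minimal Python sketch of the reset
condition as described above. It is not Ceph's actual code: the function and
argument names are made up for illustration, and the only values taken from
this thread are the factor 48 and the default halflife of 3600 seconds.

```python
# Illustrative sketch only; these names are hypothetical, not Ceph internals.

GRACE_MULTIPLIER = 48  # factor quoted in this thread


def grace_interval(halflife_seconds: int = 3600) -> int:
    """Grace interval in seconds: mon_osd_laggy_halflife * 48."""
    return halflife_seconds * GRACE_MULTIPLIER


def laggy_params_reset(osdmap_modified: int, down_stamp: int,
                       halflife_seconds: int = 3600) -> bool:
    """True if the time between the OSD being marked down and the last
    OSDMap modification exceeds the grace interval, i.e. the case in which
    laggy_interval and laggy_probability would go back to 0."""
    return (osdmap_modified - down_stamp) > grace_interval(halflife_seconds)


if __name__ == "__main__":
    # With the default halflife the threshold is 172800 seconds (48 hours).
    print(grace_interval())
    # An OSD marked down 24 hours before the last OSDMap change: no reset.
    print(laggy_params_reset(osdmap_modified=86400, down_stamp=0))
```

With a smaller configured halflife the threshold shrinks proportionally, for
example a halflife of 1800 gives a grace interval of 86400 seconds.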

There is also a "hack" to reset the parameters manually (*not recommended,
just for info*): set mon_osd_laggy_weight to 1 using
`ceph config set osd mon_osd_laggy_weight 1` and restart the OSD(s) that are
reported as laggy; the lagginess should then go away.


*Dhairya Parmar*

Associate Software Engineer, CephFS

Red Hat Inc. <https://www.redhat.com/>

dparmar@xxxxxxxxxx


On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar <vshankar@xxxxxxxxxx> wrote:

> Hey Janek,
>
> I took a closer look at various places where the MDS would consider a
> client as laggy and it seems like a wide variety of reasons are taken
> into consideration and not all of them might be a reason to defer client
> eviction, so the warning is a bit misleading. I'll post a PR for this. In
> the meantime, could you share the debug logs stated in my previous email?
>
> On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
>
> > Hi Janek,
> >
> > On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> > janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >
> >> Hi Venky,
> >>
> >> As I said: There are no laggy OSDs. The maximum ping I have for any OSD
> >> in ceph osd perf is around 60ms (just a handful, probably aging disks).
> >> The vast majority of OSDs have ping times of less than 1ms. Same for the
> >> host machines, yet I'm still seeing this message. It seems that the
> >> affected hosts are usually the same, but I have absolutely no clue why.
> >>
> >
> > It's possible that you are running into a bug which does not clear the
> > laggy clients list which the MDS sends to monitors via beacons. Could you
> > help us out with debug mds logs (by setting debug_mds=20) for the active
> > mds for around 15-20 seconds and share the logs please? Also reset the
> > log level once done since it can hurt performance.
> >
> > # ceph config set mds.<> debug_mds 20
> >
> > and reset via
> >
> > # ceph config rm mds.<> debug_mds
> >
> >
> >> Janek
> >>
> >>
> >> On 19/09/2023 12:36, Venky Shankar wrote:
> >>
> >> Hi Janek,
> >>
> >> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
> >> janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >>
> >>> Thanks! However, I still don't really understand why I am seeing this.
> >>>
> >>
> >> This is due to a change that was merged recently in pacific
> >>
> >>         https://github.com/ceph/ceph/pull/52270
> >>
> >> With this change, the MDS does not evict laggy clients if the OSDs report
> >> as laggy. Laggy OSDs can cause cephfs clients to not flush dirty data
> >> (during cap revokes by the MDS), so the clients show up as laggy and used
> >> to get evicted by the MDS. This behaviour was changed and therefore you
> >> get warnings that some clients are laggy but they are not evicted since
> >> the OSDs are laggy.
> >>
> >>
> >>> The first time I had this, one of the clients was a remote user
> >>> dialling in via VPN, which could indeed be laggy. But I am also seeing
> >>> it from neighbouring hosts that are on the same physical network with
> >>> reliable ping times way below 1ms. How is that considered laggy?
> >>>
> >> Are some of your OSDs reporting laggy? This can be checked via `perf dump`:
> >>
> >> > ceph tell mds.<> perf dump
> >> (search for op_laggy/osd_laggy)
> >>
> >>
> >>> On 18/09/2023 18:07, Laura Flores wrote:
> >>>
> >>> Hi Janek,
> >>>
> >>> There was some documentation added about it here:
> >>> https://docs.ceph.com/en/pacific/cephfs/health-messages/
> >>>
> >>> There is a description of what it means, and it's tied to an mds
> >>> configurable.
> >>>
> >>> On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <
> >>> janek.bevendorff@xxxxxxxxxxxxx> wrote:
> >>>
> >>>> Hey all,
> >>>>
> >>>> Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
> >>>>
> >>>> 10 client(s) laggy due to laggy OSDs
> >>>>
> >>>> ceph health detail shows it as:
> >>>>
> >>>> [WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
> >>>>      mds.***(mds.3): Client *** is laggy; not evicted because some
> >>>> OSD(s) is/are laggy
> >>>>      more of this...
> >>>>
> >>>> When I restart the client(s) or the affected MDS daemons, the message
> >>>> goes away and then comes back after a while. ceph osd perf does not
> >>>> list any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms),
> >>>> so I'm at a total loss as to what this even means.
> >>>>
> >>>> I have never seen this message before nor was I able to find anything
> >>>> about it. Do you have any idea what this message actually means and
> >>>> how I can get rid of it?
> >>>>
> >>>> Thanks
> >>>> Janek
> >>>>
> >>>> _______________________________________________
> >>>> ceph-users mailing list -- ceph-users@xxxxxxx
> >>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >>>>
> >>>
> >>>
> >>> --
> >>>
> >>> Laura Flores
> >>>
> >>> She/Her/Hers
> >>>
> >>> Software Engineer, Ceph Storage <https://ceph.io>
> >>>
> >>> Chicago, IL
> >>>
> >>> lflores@xxxxxxx | lflores@xxxxxxxxxx <lflores@xxxxxxxxxx>
> >>> M: +17087388804
> >>>
> >>>
> >>> --
> >>> Bauhaus-Universität Weimar
> >>> Bauhausstr. 9a, R308
> >>> 99423 Weimar, Germany
> >>>
> >>> Phone: +49 3643 58 3577
> >>> www.webis.de
> >>>
> >>>
> >>
> >>
> >> --
> >> Cheers,
> >> Venky
> >>
> >> --
> >> Bauhaus-Universität Weimar
> >> Bauhausstr. 9a, R308
> >> 99423 Weimar, Germany
> >>
> >> Phone: +49 3643 58 3577
> >> www.webis.de
> >>
> >>
> >
> > --
> > Cheers,
> > Venky
> >
>
>
> --
> Cheers,
> Venky
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



