Re: CephFS warning: clients laggy due to laggy OSDs

Hi Venky,

As I said: There are no laggy OSDs. The maximum ping I have for any OSD in ceph osd perf is around 60ms (just a handful, probably aging disks). The vast majority of OSDs have ping times of less than 1ms. Same for the host machines, yet I'm still seeing this message. It seems that the affected hosts are usually the same, but I have absolutely no clue why.

Janek


On 19/09/2023 12:36, Venky Shankar wrote:
Hi Janek,

On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:

Thanks! However, I still don't really understand why I am seeing this.


This is due to a change that was merged recently in Pacific:

        https://github.com/ceph/ceph/pull/52270

With this change, the MDS does not evict laggy clients if OSDs report as laggy. Laggy OSDs can prevent CephFS clients from flushing dirty data (during cap revokes by the MDS), so the clients show up as laggy and get evicted by the MDS. This behaviour was changed, and you therefore get warnings that some clients are laggy, but they are not evicted because the OSDs are laggy.
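
The behaviour described above can be sketched roughly as follows. This is an illustration only, not the actual MDS code: the function name, its parameters, and the returned strings are all hypothetical, and `defer_eviction` merely stands in for the MDS configurable mentioned elsewhere in this thread.

```python
def handle_laggy_client(client_is_laggy, any_osd_laggy, defer_eviction=True):
    """Return the action a hypothetical MDS would take for one client."""
    if not client_is_laggy:
        return "ok"
    if any_osd_laggy and defer_eviction:
        # The client may only be blocked on slow OSDs, so raise a
        # health warning instead of evicting it.
        return "warn"
    # No laggy OSDs to blame (or deferral disabled): evict as before.
    return "evict"
```

Under these assumptions, a laggy client is only evicted when no OSD is reported as laggy, which is why the warning appears in place of an eviction.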
 

The first time I had this, one of the clients was a remote user dialling in via VPN, which could indeed be laggy. But I am also seeing it from neighbouring hosts that are on the same physical network with reliable ping times way below 1ms. How is that considered laggy? 

Are some of your OSDs reporting as laggy? This can be checked via `perf dump`:

> ceph tell mds.<> perf dump 
(search for op_laggy/osd_laggy)
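
If the dump is large, the search could be automated. The following is a minimal sketch, assuming the JSON output of the command above has already been loaded; the function name `find_laggy_counters` and the sample data are hypothetical, and the section names and values on a real cluster will differ.

```python
import json


def find_laggy_counters(node, path=""):
    """Recursively collect (path, value) pairs whose key contains 'laggy'."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if "laggy" in key.lower() and not isinstance(value, (dict, list)):
                hits.append((child, value))
            hits.extend(find_laggy_counters(value, child))
    elif isinstance(node, list):
        for i, item in enumerate(node):
            hits.extend(find_laggy_counters(item, f"{path}[{i}]"))
    return hits


# Hypothetical sample standing in for real `perf dump` output.
sample = json.loads(
    '{"objecter": {"op_active": 5, "op_laggy": 0, "osd_laggy": 2}}'
)
for name, value in find_laggy_counters(sample):
    print(f"{name} = {value}")
```

Non-zero values for these counters would suggest that the MDS itself considers some OSD operations or sessions laggy, even if `ceph osd perf` looks healthy.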


On 18/09/2023 18:07, Laura Flores wrote:
Hi Janek,

There was some documentation added about it here: https://docs.ceph.com/en/pacific/cephfs/health-messages/

There is a description of what it means, and it's tied to an mds configurable.

On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <janek.bevendorff@xxxxxxxxxxxxx> wrote:
Hey all,

Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:

10 client(s) laggy due to laggy OSDs

ceph health detail shows it as:

[WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
     mds.***(mds.3): Client *** is laggy; not evicted because some OSD(s) is/are laggy
     more of this...

When I restart the client(s) or the affected MDS daemons, the message
goes away and then comes back after a while. ceph osd perf does not list
any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms), so
I'm at a total loss as to what this even means.

I have never seen this message before nor was I able to find anything
about it. Do you have any idea what this message actually means and how
I can get rid of it?

Thanks
Janek

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


--

Laura Flores

She/Her/Hers

Software Engineer, Ceph Storage

Chicago, IL

lflores@xxxxxxx | lflores@xxxxxxxxxx
M: +17087388804



-- 
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de


--
Cheers,
Venky