Re: CephFS warning: clients laggy due to laggy OSDs

Hi,

I took a snapshot of MDS.0's logs. We have five active MDS in total, each one reporting laggy OSDs/clients, but I cannot find anything related to that in the log snippet. Anyhow, I uploaded the log for your reference with ceph-post-file ID 79b5138b-61d7-4ba7-b0a9-c6f02f47b881.

This is what ceph status looks like after a couple of days. This is not normal:

HEALTH_WARN
55 client(s) laggy due to laggy OSDs
8 clients failing to respond to capability release
1 clients failing to advance oldest client/flush tid
5 MDSs report slow requests

(55 clients are actually "just" 11 unique client IDs, but each MDS makes its own report.)
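
In case it's useful, one quick way to deduplicate them; a rough one-liner that
assumes the "Client <id> is laggy ..." wording from ceph health detail (as in
the output quoted further down):

    ceph health detail | grep 'is laggy' | awk '{print $3}' | sort -u | wc -l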

mon_osd_laggy_halflife is not explicitly configured on our cluster, so it's at the default of 3600.


Janek


On 20/09/2023 13:17, Dhairya Parmar wrote:
Hi Janek, 

The PR Venky mentioned uses the OSDs' laggy parameters (laggy_interval and
laggy_probability) to determine whether an OSD is laggy. These parameters are
reset to 0 when the interval between the last modification to the OSDMap and
the time stamp at which the OSD was marked down exceeds the grace threshold,
which is `mon_osd_laggy_halflife * 48`. mon_osd_laggy_halflife is configurable
and defaults to 3600, so with the default the parameters are only reset once
that interval exceeds 172800 seconds. I'd recommend checking what your
configured value is (using: ceph config get osd mon_osd_laggy_halflife).
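
For example, with the default value this works out to (just spelling out the
arithmetic from above):

$ ceph config get osd mon_osd_laggy_halflife
3600
# reset threshold = 3600 * 48 = 172800 seconds, i.e. 48 hours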

There is also a "hack" to reset the parameters manually (not recommended, just
for your information): set mon_osd_laggy_weight to 1 using `ceph config set osd
mon_osd_laggy_weight 1` and restart the OSD(s) that are reported as laggy; the
lagginess should then go away.
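
Spelled out, that would look something like this (the restart command depends
on how your OSDs are deployed; with cephadm, for instance):

$ ceph config set osd mon_osd_laggy_weight 1
$ ceph orch daemon restart osd.<id>    # repeat for each OSD reported as laggy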


Dhairya Parmar

Associate Software Engineer, CephFS

Red Hat Inc.

dparmar@xxxxxxxxxx



On Wed, Sep 20, 2023 at 3:25 PM Venky Shankar <vshankar@xxxxxxxxxx> wrote:
Hey Janek,

I took a closer look at the various places where the MDS considers a client
laggy. A wide variety of reasons are taken into account, and not all of them
are necessarily a reason to defer client eviction, so the warning is a bit
misleading. I'll post a PR for this. In the meantime, could you share the
debug logs mentioned in my previous email?

On Wed, Sep 20, 2023 at 3:07 PM Venky Shankar <vshankar@xxxxxxxxxx> wrote:

> Hi Janek,
>
> On Tue, Sep 19, 2023 at 4:44 PM Janek Bevendorff <
> janek.bevendorff@xxxxxxxxxxxxx> wrote:
>
>> Hi Venky,
>>
>> As I said: There are no laggy OSDs. The maximum ping I have for any OSD
>> in ceph osd perf is around 60ms (just a handful, probably aging disks). The
>> vast majority of OSDs have ping times of less than 1ms. Same for the host
>> machines, yet I'm still seeing this message. It seems that the affected
>> hosts are usually the same, but I have absolutely no clue why.
>>
>
> It's possible that you are running into a bug where the laggy clients list
> which the MDS sends to the monitors via beacons is not cleared. Could you
> help us out by capturing debug MDS logs (by setting debug_mds=20) on the
> active MDS for around 15-20 seconds and sharing them? Please also reset the
> log level once done, since it can hurt performance.
>
> # ceph config set mds.<> debug_mds 20
>
> and reset via
>
> # ceph config rm mds.<> debug_mds
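>
> Putting it together, roughly (substitute the name of the active MDS in
> question):
>
> # ceph config set mds.<> debug_mds 20
> # sleep 20
> # ceph config rm mds.<> debug_mds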
>
>
>> Janek
>>
>>
>> On 19/09/2023 12:36, Venky Shankar wrote:
>>
>> Hi Janek,
>>
>> On Mon, Sep 18, 2023 at 9:52 PM Janek Bevendorff <
>> janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>
>>> Thanks! However, I still don't really understand why I am seeing this.
>>>
>>
>> This is due to a change that was merged recently in pacific:
>>
>>         https://github.com/ceph/ceph/pull/52270
>>
>> The MDS no longer evicts laggy clients if the OSDs are reported as laggy.
>> Laggy OSDs can cause CephFS clients to fail to flush dirty data (during cap
>> revokes by the MDS), which makes those clients appear laggy and previously
>> got them evicted by the MDS. This behaviour was changed, so you now get
>> warnings that some clients are laggy, but they are not evicted since the
>> OSDs are laggy.
>>
>>
>>> The first time I had this, one of the clients was a remote user dialling
>>> in via VPN, which could indeed be laggy. But I am also seeing it from
>>> neighbouring hosts that are on the same physical network with reliable ping
>>> times way below 1ms. How is that considered laggy?
>>>
>> Are some of your OSDs reporting as laggy? This can be checked via `perf dump`:
>>
>> > ceph tell mds.<> perf dump
>> (search for op_laggy/osd_laggy)
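>>
>> For example, to pull out just those two counters from the (JSON) dump:
>>
>> > ceph tell mds.<> perf dump | grep -E 'op_laggy|osd_laggy'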
>>
>>
>>> On 18/09/2023 18:07, Laura Flores wrote:
>>>
>>> Hi Janek,
>>>
>>> There was some documentation added about it here:
>>> https://docs.ceph.com/en/pacific/cephfs/health-messages/
>>>
>>> There is a description of what it means, and it's tied to an mds
>>> configurable.
>>>
>>> On Mon, Sep 18, 2023 at 10:51 AM Janek Bevendorff <
>>> janek.bevendorff@xxxxxxxxxxxxx> wrote:
>>>
>>>> Hey all,
>>>>
>>>> Since the upgrade to Ceph 16.2.14, I keep seeing the following warning:
>>>>
>>>> 10 client(s) laggy due to laggy OSDs
>>>>
>>>> ceph health detail shows it as:
>>>>
>>>> [WRN] MDS_CLIENTS_LAGGY: 10 client(s) laggy due to laggy OSDs
>>>>      mds.***(mds.3): Client *** is laggy; not evicted because some
>>>> OSD(s) is/are laggy
>>>>      more of this...
>>>>
>>>> When I restart the client(s) or the affected MDS daemons, the message
>>>> goes away and then comes back after a while. ceph osd perf does not
>>>> list
>>>> any laggy OSDs (a few with 10-60ms ping, but overwhelmingly < 1ms), so
>>>> I'm at a total loss as to what this even means.
>>>>
>>>> I have never seen this message before nor was I able to find anything
>>>> about it. Do you have any idea what this message actually means and how
>>>> I can get rid of it?
>>>>
>>>> Thanks
>>>> Janek
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Laura Flores
>>>
>>> She/Her/Hers
>>>
>>> Software Engineer, Ceph Storage <https://ceph.io>
>>>
>>> Chicago, IL
>>>
>>> lflores@xxxxxxx | lflores@xxxxxxxxxx <lflores@xxxxxxxxxx>
>>> M: +17087388804
>>>
>>>
>>> --
>>> Bauhaus-Universität Weimar
>>> Bauhausstr. 9a, R308
>>> 99423 Weimar, Germany
>>>
>>> Phone: +49 3643 58 3577
>>> www.webis.de
>>>
>>>
>>
>>
>> --
>> Cheers,
>> Venky
>>
>> --
>> Bauhaus-Universität Weimar
>> Bauhausstr. 9a, R308
>> 99423 Weimar, Germany
>>
>> Phone: +49 3643 58 3577
>> www.webis.de
>>
>>
>
> --
> Cheers,
> Venky
>


--
Cheers,
Venky
-- 
Bauhaus-Universität Weimar
Bauhausstr. 9a, R308
99423 Weimar, Germany

Phone: +49 3643 58 3577
www.webis.de

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
