Re: failing to respond to cache pressure

Hi Eugen,

Thanks for the update.

The message still appears in the logs these days. The client_oc_size
option in my cluster has been 100 MB from the start. I have set
mds_cache_memory_limit to 4 GB, and since then the messages have become
less frequent.
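
For reference, this is roughly how the two options look in my
ceph.conf (my own layout, values are in bytes):

[mds]
# 4 GB MDS cache
mds_cache_memory_limit = 4294967296

[client]
# 100 MB client object cache
client_oc_size = 104857600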

What I noticed is that the MDS process reserves 6 GB of memory (in top)
while "cache status" reports close to 4 GB in my environment.
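
To be precise, these are the two numbers I am comparing (mds.<MDS> is
the daemon name; I assume a single ceph-mds process on the host):

ceph daemon mds.<MDS> cache status
top -p $(pgrep ceph-mds)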

I'll keep on researching this.

Thanks

On Thu, Sep 6, 2018 at 11:01 PM Eugen Block <eblock@xxxxxx> wrote:
Hi,

I would like to update this thread for others struggling with cache pressure.

The last time we hit that message was more than three weeks ago
(the workload has not changed), so it seems our current configuration
fits our workload.
Reducing client_oc_size to 100 MB (from the default of 200 MB) seems
to be the trick here; just increasing the cache size was not enough,
at least not if you are limited in memory. We currently have
mds_cache_memory_limit set to 4 GB.
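
For completeness, the MDS cache limit can also be changed at runtime
via the admin socket, the same way as client_oc_size in my earlier
mail below (value in bytes, 4 GB here); to survive a restart it has
to go into ceph.conf as well:

ceph daemon mds.<MDS> config set mds_cache_memory_limit 4294967296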

Another note on MDS cache size:
I had configured mds_cache_memory_limit (4 GB) and client_oc_size
(100 MB) in version 12.2.5. Comparing the real usage with "ceph daemon
mds.<MDS> cache status" and the reserved memory with "top", I noticed a
huge difference: the reserved memory was almost 8 GB while "cache
status" was at nearly 4 GB.
After upgrading to 12.2.7, the reserved memory size in top is still
only about 5 GB after one week. Obviously there have been improvements
regarding the memory consumption of the MDS, which is nice. :-)

Regards,
Eugen


Zitat von Eugen Block <eblock@xxxxxx>:

> Hi,
>
>> I think it does have a positive effect on the messages, because I get
>> fewer messages than before.
>
> That's nice. I also receive definitely fewer cache pressure messages 
> than before.
> I also started to play around with the client side cache 
> configuration. I halved the client object cache size from 200 MB to 
> 100 MB:
>
> ceph@host1:~ $ ceph daemon mds.host1 config set client_oc_size 104857600
>
> Although I still encountered one pressure message recently the total 
> amount of these messages has decreased significantly.
>
> Regards,
> Eugen
>
>
> Zitat von Zhenshi Zhou <deaderzzs@xxxxxxxxx>:
>
>> Hi Eugen,
>> I think it does have a positive effect on the messages, because I get
>> fewer messages than before.
>>
>> On Mon, Aug 20, 2018 at 9:29 PM Eugen Block <eblock@xxxxxx> wrote:
>>
>>> Update: we are getting these messages again.
>>>
>>> So the search continues...
>>>
>>>
>>> Zitat von Eugen Block <eblock@xxxxxx>:
>>>
>>>> Hi,
>>>>
>>>> Depending on your kernel (memory leaks with CephFS), increasing the
>>>> mds_cache_memory_limit could help. What is your current setting?
>>>>
>>>> ceph:~ # ceph daemon mds.<MDS> config show | grep mds_cache_memory_limit
>>>>
>>>> We had these messages for months, almost every day.
>>>> It would occur when hourly backup jobs ran and the MDS had to serve
>>>> an additional client (searching the whole CephFS for changes)
>>>> besides the existing CephFS clients. First we updated all clients to
>>>> a more recent kernel version, but the warnings didn't stop. Then we
>>>> doubled the cache size from 2 GB to 4 GB last week and since then I
>>>> haven't seen this warning again (for now).
>>>>
>>>> Try playing with the cache size to find a setting fitting your
>>>> needs, but don't forget to monitor your MDS in case something goes
>>>> wrong.
>>>>
>>>> Regards,
>>>> Eugen
>>>>
>>>>
>>>> Zitat von Wido den Hollander <wido@xxxxxxxx>:
>>>>
>>>>> On 08/13/2018 01:22 PM, Zhenshi Zhou wrote:
>>>>>> Hi,
>>>>>> Recently the cluster has been running healthy, but I get warning
>>>>>> messages every day:
>>>>>>
>>>>>
>>>>> Which version of Ceph? Which version of clients?
>>>>>
>>>>> Can you post:
>>>>>
>>>>> $ ceph versions
>>>>> $ ceph features
>>>>> $ ceph fs status
>>>>>
>>>>> Wido
>>>>>
>>>>>> 2018-08-13 17:39:23.682213 [INF]  Cluster is now healthy
>>>>>> 2018-08-13 17:39:23.682144 [INF]  Health check cleared:
>>>>>> MDS_CLIENT_RECALL (was: 6 clients failing to respond to cache pressure)
>>>>>> 2018-08-13 17:39:23.052022 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker38:docker failing to respond to cache pressure
>>>>>> 2018-08-13 17:39:23.051979 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker73:docker failing to respond to cache pressure
>>>>>> 2018-08-13 17:39:23.051934 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker74:docker failing to respond to cache pressure
>>>>>> 2018-08-13 17:39:23.051853 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker75:docker failing to respond to cache pressure
>>>>>> 2018-08-13 17:39:23.051815 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker27:docker failing to respond to cache pressure
>>>>>> 2018-08-13 17:39:23.051753 [INF]  MDS health message cleared (mds.0):
>>>>>> Client docker27 failing to respond to cache pressure
>>>>>> 2018-08-13 17:38:11.100331 [WRN]  Health check update: 6 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:37:39.570014 [WRN]  Health check update: 5 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:37:31.099418 [WRN]  Health check update: 3 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:36:34.564345 [WRN]  Health check update: 1 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:36:27.121891 [WRN]  Health check update: 3 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:36:11.967531 [WRN]  Health check update: 5 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:35:59.870055 [WRN]  Health check update: 6 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:35:47.787323 [WRN]  Health check update: 3 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:34:59.435933 [WRN]  Health check failed: 1 clients failing to respond to cache pressure (MDS_CLIENT_RECALL)
>>>>>> 2018-08-13 17:34:59.045510 [WRN]  MDS health message (mds.0): Client
>>>>>> docker75:docker failing to respond to cache pressure
>>>>>>
>>>>>> How can I fix it?
>>>>>>
>>>>>>



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
