On Wed, May 8, 2019 at 4:10 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
>
> Hi folks,
>
> we are running a Luminous cluster and use CephFS for file services. We use
> Tivoli Storage Manager to back up all data in the Ceph filesystem to tape
> for disaster recovery. The backup runs on two dedicated servers, which
> mount CephFS via the kernel client. In order to complete the backup in
> time, we are using 60 backup threads per server. While the backup is
> running, ceph health often changes from “OK” to “2 clients failing to
> respond to cache pressure”. After investigating and doing research in the
> mailing list, I set the following parameters:
>
> mds_cache_memory_limit = 34359738368 (32 GB) on the MDS servers
>
> client_oc_size = 104857600 (100 MB, default is 200 MB) on the backup servers
>
> All servers run Ubuntu 18.04 with kernel 4.15.0-47 and Ceph 12.2.11. We
> have 3 MDS servers: 1 active, 2 standby. Changing to multiple active MDS
> servers is not an option, since we are planning to use snapshots. CephFS
> holds 78,815,975 files.
>
> Any advice on getting rid of the warning would be very much appreciated.
> On a side note: although the MDS cache memory limit is set to 32 GB, htop
> shows 60 GB memory usage for the ceph-mds process.

With clients doing backups, it's likely that they hold millions of caps.
This is not a good situation to be in. I recommend upgrading to 12.2.12,
as we recently backported a fix for the MDS to limit the number of caps
held by clients to 1M. Additionally, trimming the cache and recalling caps
is now throttled. This may help a lot for your workload.

Note that these fixes haven't been backported to Mimic yet.

--
Patrick Donnelly
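For anyone hitting the same warning, here is a minimal sketch of how the
settings and checks discussed above could be applied. The section placement
in ceph.conf and the MDS name "mds.a" are assumptions; adjust them for your
deployment, and note that mds_max_caps_per_client is only available in
builds that include the backported cap-limit fix.

    # ceph.conf on the MDS hosts (value from the thread: 32 GB)
    [mds]
    mds_cache_memory_limit = 34359738368

    # ceph.conf on the backup clients (value from the thread: 100 MB)
    [client]
    client_oc_size = 104857600

    # Inspect how many caps each client session currently holds
    # (run on the host with the active MDS; "mds.a" is a placeholder):
    ceph daemon mds.a session ls | grep num_caps

    # After upgrading to 12.2.12, check the per-client cap limit
    # mentioned above (default 1048576), if present in your build:
    ceph daemon mds.a config get mds_max_caps_per_client

Sessions reporting num_caps in the millions before the upgrade would be
consistent with the situation Patrick describes.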