Re: Clients failing to respond to cache pressure

Thanks for the info, Patrick. We are using the ceph packages from the Ubuntu main repo, so it will take some weeks until I can do the update. In the meantime, is there anything I can do manually to decrease the number of caps held by the backup nodes, like flushing the client cache or something like that? Is it possible to mount cephfs without caching on specific mounts?
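
For what it's worth, this is what I was planning to try on the backup nodes to make the kernel client give caps back (just a guess on my part, I have not verified that this is the recommended way):

    # flush dirty data, then ask the kernel to drop dentries/inodes so the
    # cephfs kernel client can release the corresponding caps
    sync
    echo 2 > /proc/sys/vm/drop_caches

Regarding caching: the only switch I have found so far is the object cache of the fuse client (client_oc = false in the [client] section of ceph.conf), but we use the kernel client and I don't know of an equivalent mount option there.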

I had a look at the mds sessions and both nodes had over 5 million num_caps...
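
In case it helps, this is how I checked the sessions (the mds name below is just a placeholder):

    ceph daemon mds.<name> session ls | grep num_caps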

Regards, Felix

-------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Registered office: Juelich
Registered in the commercial register of the Amtsgericht Dueren, No. HR B 3498
Chairman of the Supervisory Board: MinDir Dr. Karl Eugen Huthmacher
Management Board: Prof. Dr.-Ing. Wolfgang Marquardt (Chairman),
Karsten Beneke (Deputy Chairman), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
-------------------------------------------------------------------------------------
 

On 08.05.19, 18:33, "Patrick Donnelly" <pdonnell@xxxxxxxxxx> wrote:

    On Wed, May 8, 2019 at 4:10 AM Stolte, Felix <f.stolte@xxxxxxxxxxxxx> wrote:
    >
    > Hi folks,
    >
    > we are running a luminous cluster and using cephfs for file services. We use Tivoli Storage Manager to back up all data in the ceph filesystem to tape for disaster recovery. The backup runs on two dedicated servers, which mount cephfs via the kernel client. In order to complete the backup in time we are using 60 backup threads per server. While the backup is running, ceph health often changes from “OK” to “2 clients failing to respond to cache pressure”. After investigating and doing research in the mailing list I set the following parameters:
    >
    > mds_cache_memory_limit = 34359738368 (32 GB) on MDS Server
    >
    > client_oc_size = 104857600 (100 MB, default is 200 MB) on Backup Servers
    >
    > All servers run Ubuntu 18.04 with kernel 4.15.0-47 and ceph 12.2.11. We have 3 MDS servers: 1 active, 2 standby. Changing to multiple active MDS servers is not an option, since we are planning to use snapshots. The cephfs holds 78,815,975 files.
    >
    > Any advice on getting rid of the warning would be very much appreciated. On a side note: although the MDS cache memory limit is set to 32 GB, htop shows 60 GB memory usage for the ceph-mds process.
    
    With clients doing backup it's likely that they hold millions of caps.
    This is not a good situation to be in. I recommend upgrading to
    12.2.12 as we recently backported a fix for the MDS to limit the
    number of caps held by clients to 1M. Additionally, trimming the cache
    and recalling caps is now throttled. This may help a lot for your
    workload.
    
    Note that these fixes haven't been backported to Mimic yet.
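    
    (For reference, and please double-check me on the option name: the new
    limit should be exposed as mds_max_caps_per_client, default 1048576, so
    after the upgrade it can be adjusted with something like
    
        ceph tell mds.<name> injectargs '--mds_max_caps_per_client 1048576'
    
    or by setting it in ceph.conf on the MDS hosts.)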
    
    -- 
    Patrick Donnelly
    

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



