Re: CephFS: costly MDS cache misses?

On Thu, Nov 30, 2017 at 2:08 AM, Jens-U. Mozdzen <jmozdzen@xxxxxx> wrote:
> Hi *,
>
> while tracking down a different performance issue with CephFS (creating tar
> balls from CephFS-based directories takes several times as long as backing
> up the same data from local disks, i.e. 56 hours instead of 7), we had a
> look at CephFS performance in relation to the size of the MDS process.
>
> Our Ceph cluster (Luminous 12.2.1) is using file-based OSDs; CephFS data is
> on SAS HDDs, metadata is on SAS SSDs.
>
> It came to mind that MDS memory consumption might be causing the delays
> with "tar". But while the results below don't confirm this (they actually
> confirm that MDS memory size does not affect CephFS read speed when the
> cache is sufficiently warm), they do show an almost 30% performance drop if
> the cache is filled with the wrong entries.
>
> After a fresh process start, our MDS uses about 450k of virtual memory,
> with 56k resident. I then start a tar run over 36 GB of small files (which
> I had also run a few minutes before the MDS restart, to warm up the disk
> caches):
>
> --- cut here ---
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>    1233 ceph      20   0  446584  56000  15908 S  3.960 0.085   0:01.08
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 17:38:21 CET 2017
> 38245529600
> Wed Nov 29 17:44:27 CET 2017
> server01:~ #
>
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>   1233 ceph      20   0  485760 109156  16148 S  0.331 0.166   0:10.76
> ceph-mds
> --- cut here ---
>
> As you can see, there's only a little growth in the MDS's virtual size.
>
> The job took 366 seconds, that's an average of about 100 MB/s.
>
> I repeat that job a few minutes later, to get numbers with a previously
> active MDS (the MDS cache should be warmed up now):
>
> --- cut here ---
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>   1233 ceph      20   0  494976 118404  16148 S  2.961 0.180   0:16.21
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 17:53:09 CET 2017
> 38245529600
> Wed Nov 29 17:58:53 CET 2017
> server01:~ #
>
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>   1233 ceph      20   0  508288 131368  16148 S  1.980 0.200   0:25.45
> ceph-mds
> --- cut here ---
>
> The job took 344 seconds, that's an average of about 106 MB/s. With only a
> single run per situation, these numbers aren't more than a rough estimate,
> of course.
>
> At 18:00:00, a file-based incremental backup job kicks in, which reads
> through most of the files on the CephFS but only backs up those that have
> changed since the last run. This has nothing to do with our "tar" test and
> runs on a different node, where CephFS is kernel-mounted as well. That
> backup job makes the MDS cache grow drastically; the MDS is at more than
> 8 GB now.
>
> We then start another tar job (or rather two, to account for MDS caching),
> as before:
>
> --- cut here ---
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>   1233 ceph      20   0 8644776 7.750g  16184 S  0.990 12.39   6:45.24
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 18:13:20 CET 2017
> 38245529600
> Wed Nov 29 18:21:50 CET 2017
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 18:22:52 CET 2017
> 38245529600
> Wed Nov 29 18:28:28 CET 2017
> server01:~ #
>
>    PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+
> COMMAND
>   1233 ceph      20   0 8761512 7.642g  16184 S  3.300 12.22   7:03.52
> ceph-mds
> --- cut here ---
>
> The second run is even a bit quicker than the earlier "warmed-up" run,
> whose cache was only partially filled (336 seconds, that's 108.5 MB/s).
>
> But the run against the filled-up MDS cache, where most (if not all)
> entries are no match for our tar lookups, took 510 seconds - that's
> 71.5 MB/s, instead of the roughly 100 MB/s when the cache was empty.
>
> This is by no means a precise benchmark, of course, but it at least seems
> to indicate that MDS cache misses are costly. (During the tests, only a
> small amount of change to the CephFS contents was likely - especially
> compared to the amount of reads and metadata lookups.)
>
> Regards,
> Jens
>
> PS: Why so much memory for the MDS in the first place? Because during
> those (hourly) incremental backup runs, we got a large number of MDS
> warnings about clients failing to respond to cache pressure. Increasing
> the MDS cache size did help to avoid these.
>
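
For reference, the MB/s figures above are simply the byte count reported by
wc -c divided by the elapsed wall-clock time, which makes them MiB/s; e.g.
for the 510-second run (a quick check with bc, assuming it is installed):

# 38245529600 bytes in 510 seconds, converted to MiB/s
echo "38245529600 / 510 / 1048576" | bc -l

That gives roughly 71.5; the 366-second run works out to roughly 99.7 MiB/s,
i.e. the "about 100 MB/s" above.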

I just found a kernel client bug: the kernel can fail to trim as many
capabilities as the MDS asks it to. The MDS needs to keep the corresponding
inode in its cache for as long as a client holds a capability on it, so the
bug can explain the large MDS memory usage during the backup job. It could
also be the cause of the slowdown in your test:

https://github.com/ceph/ceph-client/commit/9031ea643edc79efc6a69f73b11752d88c49777b
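
If you want to check whether clients are holding on to a large number of
capabilities (and therefore pinning inodes in the MDS cache), the MDS admin
socket shows this; a rough sketch, assuming the active MDS daemon is called
mds.server01 and that jq is available for filtering:

# run on the host where the MDS daemon lives (admin socket access)
ceph daemon mds.server01 session ls | jq '.[] | {id: .id, num_caps: .num_caps}'

# overall inode and capability counters of the MDS
ceph daemon mds.server01 perf dump | jq '.mds | {inodes, caps}'

If num_caps for the backup client stays high long after its run has
finished, that would fit the trimming bug above.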

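Regarding the PS in the quoted mail: on Luminous the MDS cache size is
controlled by mds_cache_memory_limit (a value in bytes). A minimal sketch of
raising it, assuming the MDS is named mds.server01 and that an 8 GiB target
fits the available RAM (adjust both to your setup):

# runtime change, applied to the running MDS
ceph tell mds.server01 injectargs '--mds_cache_memory_limit 8589934592'

# persistent setting in ceph.conf on the MDS host
[mds]
    mds_cache_memory_limit = 8589934592

The limit is a soft target rather than a hard cap, so leave some headroom
above it for the rest of the MDS process.
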
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


