CephFS: costly MDS cache misses?

Hi *,

while tracking down a different performance issue with CephFS (creating tarballs from CephFS-based directories takes several times as long as backing up the same data from local disks, i.e. 56 hours instead of 7), we had a look at how CephFS performance relates to the size of the MDS process.

Our Ceph cluster (Luminous 12.2.1) is using file-based OSDs; CephFS data is on SAS HDDs, metadata on SAS SSDs.

It came to mind that MDS memory consumption might be causing the delays with "tar". The results below don't confirm this (they actually suggest that MDS memory size does not affect CephFS read speed when the cache is sufficiently warm), but they do show an almost 30% performance drop if the cache is filled with the wrong entries.

After a fresh process start, our MDS uses about 450 MB of virtual memory, with 56 MB resident. I then start a tar run over 36 GB of small files (which I had also run a few minutes before the MDS restart, to warm up the disk caches):

--- cut here ---
   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0  446584  56000  15908 S  3.960 0.085   0:01.08 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c; date
Wed Nov 29 17:38:21 CET 2017
38245529600
Wed Nov 29 17:44:27 CET 2017
server01:~ #

   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0  485760 109156  16148 S  0.331 0.166   0:10.76 ceph-mds
--- cut here ---

As you can see, there's only a small amount of growth in the MDS's virtual size.
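
Side note: to cross-check these "top" numbers from the Ceph side, the MDS admin socket should give the cache and memory counters directly. This is just a suggestion (not something I captured above) and assumes shell access on the MDS host; replace <name> with your MDS id:

--- cut here ---
# cached items and bytes as the MDS itself counts them
ceph daemon mds.<name> cache status
# full counter dump; the "mds_mem" section lists inodes, dentries, caps and rss
ceph daemon mds.<name> perf dump
--- cut here ---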

The job took 366 seconds, that's an average of about 100 MB/s.

I repeat that job a few minutes later, to get numbers with a previously active MDS (the MDS cache should be warmed up now):

--- cut here ---
   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0  494976 118404  16148 S  2.961 0.180   0:16.21 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c; date
Wed Nov 29 17:53:09 CET 2017
38245529600
Wed Nov 29 17:58:53 CET 2017
server01:~ #

   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0  508288 131368  16148 S  1.980 0.200   0:25.45 ceph-mds
--- cut here ---

The job took 344 seconds, that's an average of about 106 MB/s. With only a single run per situation, these numbers aren't more than a rough estimate, of course.
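
For reference, those averages are just the tar byte count divided by the wall-clock seconds (strictly speaking that makes them MiB/s):

--- cut here ---
# 38245529600 bytes / 366 s  -> about 100 MiB/s (run against the cold cache)
echo 'scale=1; 38245529600 / 366 / 1048576' | bc
# 38245529600 bytes / 344 s  -> about 106 MiB/s (run against the warmed-up cache)
echo 'scale=1; 38245529600 / 344 / 1048576' | bc
--- cut here ---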

At 18:00:00, a file-based incremental backup job kicks in, which reads through most of the files on the CephFS but only backs up those that were changed since the last run. This has nothing to do with our "tar" and runs on a different node, where CephFS is kernel-mounted as well. That backup job makes the MDS cache grow drastically; you can see the MDS at more than 8 GB now.
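
To watch that growth live while the backup runs, "ceph daemonperf" on the MDS host should do the trick (again just a suggestion, not something I captured here):

--- cut here ---
# live, per-second view of the MDS perf counters, including cache and memory columns
ceph daemonperf mds.<name>
--- cut here ---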

We then start another tar job (or rather two, to account for MDS caching), as before:

--- cut here ---
   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0 8644776 7.750g  16184 S  0.990 12.39   6:45.24 ceph-mds

server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c; date
Wed Nov 29 18:13:20 CET 2017
38245529600
Wed Nov 29 18:21:50 CET 2017
server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c; date
Wed Nov 29 18:22:52 CET 2017
38245529600
Wed Nov 29 18:28:28 CET 2017
server01:~ #

   PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
  1233 ceph      20   0 8761512 7.642g  16184 S  3.300 12.22   7:03.52 ceph-mds
--- cut here ---

The second of these runs is even a bit quicker than the earlier "warmed-up" run with the only partially filled cache (336 seconds, that's 108.5 MB/s).

But the run against the filled-up MDS cache, where most (if not all) entries are no match for our tar lookups, took 510 seconds - that's 71.5 MB/s, instead of the roughly 100 MB/s we got when the cache was empty.

This is by no means a precise benchmark, of course. But it at least seems to be an indicator that MDS cache misses are costly. (During the tests, only a small amount of change to CephFS content was likely - especially compared to the number of reads and metadata lookups.)
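
If someone wants to quantify the cache misses instead of inferring them from wall-clock time: if I remember correctly, the "mds" section of the perf counters has traverse / traverse_hit / traverse_dir_fetch, i.e. path traversals overall, those served from cache, and those that had to fetch directory fragments from the metadata pool. Sampling them before and after a tar run should give a hit rate - an untested sketch, assuming jq is installed and <name> is your MDS id:

--- cut here ---
# run once before and once after the tar job, then compare the deltas
ceph daemon mds.<name> perf dump | jq '.mds | {traverse, traverse_hit, traverse_dir_fetch}'
--- cut here ---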

Regards,
Jens

PS: Why so much memory for the MDS in the first place? Because during those (hourly) incremental backup runs, we got a large number of MDS warnings about clients failing to respond to cache pressure. Increasing the MDS cache size helped to avoid these.
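
For completeness: on Luminous the knob for this is typically mds_cache_memory_limit (in bytes); the older inode-count based mds_cache_size also still exists. A sketch of how one might raise it, not necessarily exactly what we did:

--- cut here ---
# persistent: ceph.conf on the MDS host, [mds] section, then restart the MDS
#   mds cache memory limit = 8589934592    # 8 GiB
# or at runtime via the admin socket:
ceph daemon mds.<name> config set mds_cache_memory_limit 8589934592
--- cut here ---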

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


