Re: Ceph FS - MDS problem

On Fri, Jul 3, 2015 at 10:34 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi,
>
> We're looking at similar issues here and I was composing a mail just
> as you sent this. I'm just a user -- hopefully a dev will correct me
> where I'm wrong.
>
> 1. A CephFS cap is a way to delegate permission for a client to do IO
> with a file, knowing that other clients are not also accessing that
> file. These caps need to be tracked so they can later be revoked when
> other clients need to access the file. (I didn't find a doc on CephFS
> caps, so this is a guess and probably wrong).

More or less, yep.
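
If you want to look at this from the MDS side, the admin socket can
dump the sessions along with their cap counts. Roughly (run it on the
host with the active MDS; the exact fields vary a bit by release):

  # list client sessions; look for a per-session num_caps-style field
  # if your build reports one
  ceph daemon mds.<name> session ls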

>
> 2. If you set debug_mds = 3 you can see memory usage and how many caps
> are delegated in total. Here's an example:
>
> mds.0.cache check_memory_usage total 7988108, rss 7018088, heap
> -457420, malloc -1747875 mmap 0, baseline -457420, buffers 0, max
> 1048576, 332739 / 332812 inodes have caps, 335839 caps, 1.0091 caps
> per inode
>
> It seems there is an int overflow for the heap and malloc measures on
> our server :(

Huh. We have some tickets in the tracker to improve our "MemoryModel";
I didn't know it was exposed at all. I'm sure the inode and cap counts
are good but I wouldn't rely on those memory values to mean much.
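
For anyone who wants to see those check_memory_usage lines themselves,
bumping the MDS debug level is enough; as a sketch (the ceph.conf form
takes effect on restart, the injectargs form is immediate):

  # in ceph.conf on the MDS host
  [mds]
      debug mds = 3

  # or injected at runtime
  ceph tell mds.0 injectargs '--debug_mds 3'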

>
> Anyway, once the MDS has delegated (I think) 90% of its max caps, it
> will start asking clients to give some back. If those clients don't
> release caps, or don't release them fast enough, you'll see...
>
> 3. "failing to respond to capability release" and "failing to respond
> to cache pressure" can be caused by two different things: an old
> client -- maybe 3.14 is too old, as Wido said -- or a busy client. We
> have a trivial bash script that creates many small files in a loop.
> This client is grabbing new caps faster than it can release them.

Yes to both of those.
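
For reference, the sort of loop Dan describes is presumably as simple
as this (a sketch; the mount point and file count are just examples):

  #!/bin/bash
  # create lots of small files; each create pins another inode/cap
  # on the client until the MDS asks for some back
  mkdir -p /mnt/cephfs/captest
  for i in $(seq 1 100000); do
      echo data > /mnt/cephfs/captest/file.$i
  done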

>
> 3.b BTW, our old friend updatedb seems to trigger the same problem,
> grabbing caps very quickly as it indexes CephFS. updatedb.conf is
> configured with PRUNEFS="... fuse ...", but CephFS has type
> fuse.ceph-fuse. We'll need to add "ceph" to that list too.
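
For anyone else hitting that, the resulting line in /etc/updatedb.conf
would look something like the one below (the full list varies by
distro, and whether the "fuse.ceph-fuse" type is also needed is my
assumption, worth verifying):

  # /etc/updatedb.conf (excerpt)
  PRUNEFS="... fuse ceph fuse.ceph-fuse ..."
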
>
> 4. "mds cache size = 5000000" is going to use a lot of memory! We have
> an MDS with just 8GB of RAM and it goes OOM after delegating  around 1
> million caps. (this is with mds cache size = 100000, btw)

Hmm. We do have some data for each client with a cap, but I think it's
pretty small in comparison to the size of each inode in memory. The
number of caps shouldn't impact memory usage very much, although the
number of inodes in cache definitely will.

>
> 4.b. "mds cache size" is used for more than one purpose .. it sets the
> size of the MDS LRU _and_ it sets the maximum number of client caps.
> Those seem like two completely different things... why is it the same
> config option?!!!

Sheer laziness/history. That limit on the max number of client caps is
new, and we thought we shouldn't let a single client use up the entire
cache. We've seen situations where clients can cache more dentries than
the MDS is supposed to and can then run it out of memory; this 80%
limit is one of the ways we're fighting that (alongside other bits like
the newer cap revocation warnings and more aggressive revokes). We
could perhaps allow the admin to set a different per-client limit
(below the max cache size) on the MDS; for those of you who control
your clients, there is also a "client_cache_size" config option you
can set.
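
As a rough sketch of the two knobs in play (the values are only
examples, not recommendations):

  # ceph.conf
  [mds]
      mds cache size = 100000     # inodes kept in the MDS LRU (and the base for the cap limit)

  [client]
      client cache size = 16384   # objects each ceph-fuse client keeps in its local cache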

>
>
> For me there are still a couple of things missing related to CephFS
> caps and memory usage:
>   - a hard limit on the number of caps per client (to prevent a
> busy/broken client from DOS'ing the MDS)
>   - an automatic way to forcibly revoke caps from a misbehaving
> client, e.g. revoke and put a client into RO or even no-IO mode

This isn't automatic yet, but there is the "session evict" admin socket
command, which you should use in conjunction with the osd blacklist
command. (We have it on our list to make all of that prettier.)
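
Roughly, and double-checking the exact argument syntax for your
release, that sequence looks like:

  # on the active MDS host: find the session, note the client id and address
  ceph daemon mds.<name> session ls

  # evict that session from the MDS
  ceph daemon mds.<name> session evict <client-id>

  # and blacklist the client's address so it can't keep talking to the OSDs
  ceph osd blacklist add <client-addr>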

>   - AFAICT, "mds mem max" has been unused since before argonaut -- we
> should remove that completely since it is confusing (PR incoming...)
>   - the MDS should eventually auto-tune the mds cache size to fit the
> amount of available memory.

Tickets already exist, but this hasn't bubbled anywhere near the top of
the priority list yet.
-Greg


