Re: Cluster always in WARN state, failing to respond to cache pressure

Thanks for the suggestions, Greg. One thing I forgot to mention: restarting the main MDS service fixes the problem temporarily.

Clearing inodes and dentries with "echo 2 | sudo tee /proc/sys/vm/drop_caches" on the two CephFS clients that were failing to respond to cache pressure cleared the warning. Additionally, I realized my Ceph clients were on 0.87.1 while the cluster is on 0.94.1-111. I've updated all the clients and remounted my CephFS shares, and will cross my fingers that it resolves the issue.
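
In case it helps anyone else, the steps were roughly the following. The mount point, monitor address and secret file are placeholders for my setup, and this assumes the kernel client (a ceph-fuse mount would use ceph-fuse instead):

    # Check which version of the ceph client packages is installed on each client
    ceph --version

    # Drop dentries and inodes from the kernel's cache (echo 3 would also drop the page cache)
    echo 2 | sudo tee /proc/sys/vm/drop_caches

    # After upgrading the client packages, remount the CephFS share
    sudo umount /mnt/cephfs
    sudo mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret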

Thanks for the help!

On Tue, May 12, 2015 at 12:55 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Tue, May 12, 2015 at 12:03 PM, Cullen King <cullen@xxxxxxxxxxxxxxx> wrote:
> I'm operating a fairly small ceph cluster, currently three nodes (with plans
> to expand to five in the next couple of months) with more than adequate
> hardware. Node specs:
>
> 2x Xeon E5-2630
> 64GB RAM
> 2x RAID1 SSDs for system
> 2x 256GB SSDs for journals
> 4x 4TB drives for OSDs
> 1GbE for frontend (shared with rest of my app servers, etc)
> 10GbE switch for cluster (only used for ceph storage nodes)
>
> I am using CephFS along with the object store (with RadosGW in front). My
> problems existed even when using only CephFS. I use CephFS as a shared datastore
> for two low-volume OSM map tile servers to share a tile cache. Usage
> isn't heavy; it's mostly reads. Here's a typical output from ceph status:
>
> https://gist.github.com/kingcu/499c3d9373726e5c7a95
>
> Here's my current ceph.conf:
>
> https://gist.github.com/kingcu/78ab0fe8669b7acb120c
>
> I've upped the mds cache size as recommended by some historical
> correspondence on the mailing list, which helped for a while. There doesn't
> seem to be any real problem with the cluster operating in this WARN state, as
> it has been in production for a couple of months now without incident. I'm
> starting to migrate other data into the Ceph cluster, and before making the
> final plunge with critical data, I wanted to get a handle on this issue.
> Suggestions are appreciated!

This warning has come up several times recently. It means that the MDS
has exceeded its specified cache size and has asked the clients to
return inodes/dentries, but they have not done so.
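
As a rough sketch (the daemon name "a" and the value below are illustrative, not recommendations): the limit in question is the "mds cache size" option, which is a count of inodes and defaults to 100000. It can be raised in ceph.conf on the MDS host:

    [mds]
        mds cache size = 500000

and the live usage can be compared against it via the MDS admin socket; the mds_mem section of the output reports the current inode and dentry counters:

    ceph daemon mds.a perf dump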

This either means that the clients are broken in some way (unlikely if
you're using the same Ceph release on both), or that the
kernel/applications are pinning so many dentries that the client can't
drop any of them from cache. You could try telling it to dump caches
and see if that releases stuff.
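
One way to see which client is holding things (the daemon name is again a placeholder, and the exact fields vary by release) is to list the client sessions on the MDS admin socket and look at how many caps each session holds:

    ceph daemon mds.a session ls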

There's a new PR to make ceph-fuse more aggressive about getting the
kernel to drop cached entries
(https://github.com/ceph/ceph/pull/4653), but it was just created, so
I'm not sure whether it will solve this problem or not.


Note that if your MDS isn't using too much memory, this is probably not
going to be an issue for you, despite the WARN state.
-Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
