Re: CephFS unresponsive at scale (2M files,

Raising mds cache size to 5 million seems to have helped significantly, but we’re still occasionally seeing issues on metadata reads while under load.  Settings above 5 million don’t seem to have any additional noticeable impact on this problem.  I’m starting the upgrade to Giant today.
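
For the archives: injectargs only changes the running daemon, so a value like this is worth persisting in ceph.conf before the MDS restart the upgrade will involve.  A minimal sketch, assuming the stock option name on this release:

    [mds]
    # keep ~5M inodes in the metadata cache (default is 100000)
    mds cache size = 5000000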
--
Kevin Sumner

On Nov 18, 2014, at 1:10 PM, Kevin Sumner <kevin@xxxxxxxxx> wrote:

Hi Thomas,

I looked over the mds config reference a bit yesterday, but mds cache size seems to be the most relevant tunable.

As suggested, I upped mds-cache-size to 1 million yesterday and started the load generator.  During load generation, we’re seeing similar behavior on the filesystem and the mds.  The mds process is running a little hotter now with higher CPU average and 11GB resident size (was just under 10GB iirc).  Enumerating files on the filesystem, e.g., with ls, is still hanging though.
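
In case it helps correlate, the admin socket can show how full the cache actually is while the generator runs.  A sketch (the socket path varies by deployment, and exact counter names may differ by release):

    ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok perf dump | grep -i inode

Comparing the reported inode count against mds cache size should show whether the cache is pinned at its limit under load.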

With load generation disabled, the behavior is the same as before, i.e., things work as expected.

I’ve got a lot of memory and CPU headroom on the box hosting the mds, so unless there’s good reason not to, I plan to continue increasing the mds cache iteratively in the hopes of finding a size that produces good behavior.  Right now, I’d expect us to hit around 2 million inodes each minute, so a cache of 1 million is still undersized.  If that doesn’t work, the cluster is currently running Firefly and I’ll upgrade it to Giant.
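
A rough back-of-envelope for the iteration, assuming on the order of a few KB of MDS memory per cached inode (an assumption; per-inode cost varies by release and workload):

    2,000,000 inodes x ~4 KB/inode ≈ 8 GB of cache RAM

so each step up is worth checking against the headroom on the MDS host.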
--
Kevin Sumner



On Nov 18, 2014, at 1:36 AM, Thomas Lemarchand <thomas.lemarchand@xxxxxxxxxxxxxxxxxx> wrote:

Hi Kevin,

Every MDS tunable is (I think) listed on this page, each with a short
description: http://ceph.com/docs/master/cephfs/mds-config-ref/

Can you tell us how your cluster behaves after the mds-cache-size
change? What is your MDS RAM consumption, before and after?

Thanks !
-- 
Thomas Lemarchand
Cloud Solutions SAS - Information Systems Manager



On Mon, 2014-11-17 at 16:06 -0800, Kevin Sumner wrote:
On Nov 17, 2014, at 15:52, Sage Weil <sage@xxxxxxxxxxxx> wrote:

On Mon, 17 Nov 2014, Kevin Sumner wrote:
I've got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.
All the OSDs also mount CephFS at /ceph.  I've got Graphite pointing
at a space under /ceph.  Over the weekend, I drove almost 2 million
metrics, each of which creates a ~3MB file in a hierarchical path,
each sending a datapoint into the metric file once a minute.  CephFS
seemed to handle the writes ok while I was driving load.  All files
containing each metric are at paths like this:
/ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
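
In case it's useful for reproducing, this is essentially the shape of
the load (a simplified sketch, not the actual generator; real files
are whisper-format, this just mimics the file-touch pattern):

    import os, time, random

    ROOT = "/ceph/whisper/sandbox"   # base of the layout above
    NUM_METRICS = 10000              # scaled down from ~2 million

    def metric_path(i):
        # hierarchical path per metric, loosely mirroring host/2/3/4/5.wsp
        return os.path.join(ROOT, "cephtest-osd%04d" % (i % 500),
                            str(i % 10), str(i % 7), "%d.wsp" % i)

    while True:
        for i in range(NUM_METRICS):
            path = metric_path(i)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "ab") as f:   # touch one file per metric
                f.write(("%f %f\n" % (time.time(), random.random())).encode())
        time.sleep(60)   # roughly one datapoint per minute per metric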

Today, however, with the load generator still running, reading
metadata of files (e.g. directory entries and stat(2) info) in the
filesystem (presumably MDS-managed data) seems nearly impossible,
especially deeper into the tree.  For example, in a shell, cd seems to
work but ls hangs, seemingly indefinitely.  After turning off the load
generator and allowing a while for things to settle down, everything
seems to behave better.

ceph status and ceph health both return good statuses the entire time.
During load generation, the ceph-mds process seems pegged at between
100% and 150% CPU, but with load generation turned off, the process
shows high variability from near-idle up to a similar 100-150% CPU.

Hopefully, I've missed something in the CephFS tuning.  However, I'm
looking for direction on figuring out whether it is, indeed, a tuning
problem or whether this behavior is a symptom of the "not ready for
production" banner in the documentation.

My first guess is that the MDS cache is just too small and it is
thrashing.  Try

ceph mds tell 0 injectargs '--mds-cache-size 1000000'

That's 10x bigger than the default, though be aware that it will eat
up 10x as much RAM too.
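
To double-check the injected value took effect, you can read it back
from the admin socket (path varies by deployment; I believe config get
is available on the mds socket in this release):

    ceph --admin-daemon /var/run/ceph/ceph-mds.<id>.asok config get mds_cache_size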

We've also seen the cache behave in a non-optimal way when evicting
things, making it thrash more often than it should.  I'm hoping we
can implement something like MQ instead of our two-level LRU, but it
isn't high on the priority list right now.
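
For readers unfamiliar with the scheme being discussed: a two-level
(segmented) LRU keeps a probationary list for entries seen once and a
protected list for entries that get re-used, which gives some scan
resistance.  A toy sketch of the idea, not Ceph's actual
implementation:

    from collections import OrderedDict

    class SegmentedLRU:
        # Toy two-level LRU: new keys start probationary; a second
        # access promotes them to the protected segment.
        def __init__(self, capacity):
            self.capacity = capacity
            self.probation = OrderedDict()   # seen once
            self.protected = OrderedDict()   # re-used

        def access(self, key):
            if key in self.protected:
                self.protected.move_to_end(key)   # refresh recency
            elif key in self.probation:
                del self.probation[key]
                self.protected[key] = True        # promote on re-use
            else:
                self.probation[key] = True        # first touch
            self._evict()

        def _evict(self):
            # Drop probationary entries first so a one-shot scan
            # cannot flush the frequently re-used set.
            while len(self.probation) + len(self.protected) > self.capacity:
                victim = self.probation if self.probation else self.protected
                victim.popitem(last=False)        # oldest entry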

sage


Thanks!  I’ll pursue mds cache size tuning.  Is there any guidance on
setting the cache and other mds tunables correctly, or is it an
adjust-and-test sort of thing?  Cursory searching doesn’t turn up any
relevant documentation on ceph.com.  I’m plowing through some other
list posts now.
--
Kevin Sumner
kevin@xxxxxxxxx







_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
