Re: CephFS unresponsive at scale (2M files,

On Nov 17, 2014, at 15:52, Sage Weil <sage@xxxxxxxxxxxx> wrote:

On Mon, 17 Nov 2014, Kevin Sumner wrote:
I've got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.  All
the OSDs also mount CephFS at /ceph.  I've got Graphite pointing at a space
under /ceph.  Over the weekend, I drove almost 2 million metrics, each of
which creates a ~3MB file in a hierarchical path and receives a datapoint
once a minute.  CephFS seemed to handle the writes OK while I was driving
load.  Each metric's file sits at a path like this:
/ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp

Today, however, with the load generator still running, reading file
metadata (e.g. directory entries and stat(2) info, presumably
MDS-managed data) is nearly impossible, especially deeper in the tree.
For example, in a shell, cd works but ls hangs, seemingly indefinitely.
After turning off the load generator and giving things a while to settle
down, everything behaves better.

ceph status and ceph health both return good statuses the entire time.
During load generation, the ceph-mds process seems pegged between 100%
and 150% CPU; with load generation turned off, it varies from near-idle
up to a similar 100-150%.

Hopefully, I've missed something in the CephFS tuning.  However, I'm
looking for direction on figuring out whether this is indeed a tuning
problem or a symptom of the "not ready for production" banner in the
documentation.

My first guess is that the MDS cache is just too small and it is
thrashing.  Try

ceph mds tell 0 injectargs '--mds-cache-size 1000000'

That's 10x bigger than the default (100000 inodes), though be aware that
it will eat up 10x as much RAM too.
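
(Note that injectargs only changes the running daemon.  To make it stick
across MDS restarts, you'd also set it in ceph.conf on the MDS host,
e.g.:

[mds]
mds cache size = 1000000

and restart the MDS.)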

We've also seen the cache behave in a non-optimal way when evicting
things, making it thrash more often than it should.  I'm hoping we can
implement something like MQ instead of our two-level LRU, but it isn't
high on the priority list right now.
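
(If you want to confirm it's cache pressure, the MDS admin socket
exposes perf counters.  The exact counter names vary a bit by version,
but running something like

ceph daemon mds.<id> perf dump

while the load generator is going, and watching the inode counts in the
mds section against your mds-cache-size, should show whether the cache
is churning.)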

sage

Thanks!  I'll pursue MDS cache size tuning.  Is there any guidance on setting the cache and other MDS tunables correctly, or is it an adjust-and-test sort of thing?  Cursory searching doesn't turn up any relevant documentation on ceph.com.  I'm plowing through some other list posts now.
--
Kevin Sumner

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
