Re: CephFS unresponsive at scale (2M files,

On Mon, 17 Nov 2014, Kevin Sumner wrote:
> I've got a test cluster together with ~500 OSDs, 5 MONs, and 1 MDS.  All
> the OSDs also mount CephFS at /ceph.  I've got Graphite pointing at a space
> under /ceph.  Over the weekend, I drove almost 2 million metrics, each of
> which creates a ~3MB file at a hierarchical path, with a datapoint written
> to each metric file once a minute.  CephFS seemed to handle the writes ok
> while I was driving load.  The file for each metric is at a path like this:
> /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
> 
> Today, however, with the load generator still running, reading metadata of
> files (e.g. directory entries and stat(2) info) in the filesystem
> (presumably MDS-managed data) seems nearly impossible, especially deeper
> into the tree.  For example, in a shell cd seems to work but ls hangs,
> seemingly indefinitely.  After turning off the load generator and allowing a
> while for things to settle down, everything seems to behave better.
> 
> ceph status and ceph health both return good statuses the entire time.
>  During load generation, the ceph-mds process seems pegged between 100%
> and 150% CPU, but with load generation turned off, its CPU usage varies
> widely, from near-idle up to a similar 100-150%.
> 
> Hopefully, I've missed something in the CephFS tuning.  However, I'm
> looking for direction on figuring out whether this is indeed a tuning
> problem or whether this behavior is a symptom of the "not ready for
> production" banner in the documentation.

My first guess is that the MDS cache is just too small and it is 
thrashing.  Try

 ceph mds tell 0 injectargs '--mds-cache-size 1000000'

That's 10x bigger than the default, though be aware that it will eat up 10x 
as much RAM too.
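
If that helps and you want the setting to survive an MDS restart, you can 
also put it in ceph.conf on the MDS host.  A minimal sketch (option name as 
I remember it; double-check against the docs for your version):

 [mds]
   mds cache size = 1000000

You can also get a rough idea of whether the cache is full by watching the 
inode counters on the MDS admin socket, e.g. (mds.a is just a placeholder 
for your daemon name):

 ceph daemon mds.a perf dump | grep inode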

We've also seen the cache behave in a non-optimal way when evicting 
things, making it thrash more often than it should.  I'm hoping we can 
implement something like MQ instead of our two-level LRU, but it isn't 
high on the priority list right now.

sage
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
