On Mon, Feb 18, 2019 at 7:20 AM Theofilos Mouratidis
<mtheofilos@xxxxxxxxx> wrote:
>
> Sorry, there is also a graph of the metadata that wasn't uploaded
> by mistake.
>
> On Mon, 18 Feb 2019 at 16:08, Theofilos Mouratidis
> <mtheofilos@xxxxxxxxx> wrote:
> >
> > Hello Cephers,
> >
> > Here at CERN we enabled the MDS balancing after v12.2.8, which
> > included some improvements to the balancing code. In our cephfs setup
> > we have 6 mds machines with multi-mds enabled (5 active + 1 standby).
> > (Prior to enabling mds balancing we used pinning to pin manila
> > directories to a random mds). Since we disabled the pinning the
> > performance has generally degraded, and we think we know why:
> >
> > 1. When a directory is exported from mds a to b, this drops the inode
> > caches for the directory from mds a.
> > 2. Because of this, the MDS's need significantly more metadata IO to
> > keep loading inodes from RADOS. (metadata is on hdd's in this cluster)
> > 3. Our workloads are quite dynamic and the balancer never manages to
> > find a stable set of exports; it is constantly churning the exports
> > between MDS's.
> >
> > I have attached 2 images that show the mds ram usage and the metadata
> > io. You can see that before, with mds pinning, the RAM usage is stable
> > (so the inode caches are effective). After disabling pinning, the RAM
> > usage is fluctuating and metadata IO is much more intensive than
> > before.
> >
> > We think this comes from the general goal of the MDS balancing -- it
> > tries to balance requests evenly across MDSs but has no concern for
> > the inode caches in the MDS's.

Your diagnosis makes sense to me. The balancing heuristics are designed
to try to match the amount of work done by each MDS, but that work is
measured by IOPS and metadata touched, not by cache sizes.

Do note that balancing includes cache migration, so the MDS ships off
all the metadata it has in memory that the peer will need -- but this
migration can also increase the amount of IO to the metadata pool. (I
don't have all the details of what it writes down in my head these days
-- it's definitely *not* everything transmitted, but it probably does
scale in some way with the amount of cache being moved, as I think
there are some cases which require being made durable?)

> > Do you think additional metrics such as cache hit rate or memory
> > usage should be added to the load metric? Or rather is this something
> > that can be fixed with some mds_bal_* parameter fine tuning?

I haven't been in recent discussions about the MDS balancing
algorithms, but my opinion for a while has been that we need a rethink
of the whole system. It's currently a pile of heuristics that I don't
think we can make universally applicable, so my hope is that we can
build a more robust system like the Mantle[1] experiments and let
admins switch between strategies to find the one that works in their
cluster.

For now, I suspect that if directory pinning works for you at all, that
is probably the route to take. :/
-Greg

[1]: http://docs.ceph.com/docs/master/cephfs/mantle/
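
(For reference, the pinning mechanism mentioned above is the
ceph.dir.pin virtual extended attribute: setting it on a directory to
an MDS rank pins that subtree to that rank, and setting it to -1
removes the pin, e.g. setfattr -n ceph.dir.pin -v 2 <dir>. Below is a
minimal Python sketch that just writes that xattr; the path
/cephfs/volumes/manila-share-x and rank 2 are made-up examples, not
values taken from this thread.)

    import os

    def pin_directory(path: str, rank: int) -> None:
        # Pin a CephFS directory subtree to a specific MDS rank by
        # writing the ceph.dir.pin virtual xattr (same effect as
        # "setfattr -n ceph.dir.pin -v <rank> <dir>").
        # A rank of -1 clears the pin.
        os.setxattr(path, "ceph.dir.pin", str(rank).encode())

    if __name__ == "__main__":
        # Hypothetical example values -- substitute your own mounted
        # CephFS directory and the MDS rank you want to pin it to.
        pin_directory("/cephfs/volumes/manila-share-x", 2)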