On Mon, Feb 18, 2019 at 7:20 AM Theofilos Mouratidis
<mtheofilos@xxxxxxxxx> wrote:
>
> Sorry, there is also a graph of the metadata that wasn't uploaded
> by mistake.
>
> On Mon, 18 Feb 2019 at 16:08, Theofilos Mouratidis
> <mtheofilos@xxxxxxxxx> wrote:
> >
> > Hello Cephers,
> >
> > Here at CERN we enabled the MDS balancing after v12.2.8, which
> > included some improvements to the balancing code. In our cephfs setup
> > we have 6 mds machines with multi-mds enabled (5 active + 1 standby).
> > (Prior to enabling mds balancing we used pinning to pin manila
> > directories to a random mds). Since we disabled the pinning the
> > performance has generally degraded, and we think we know why:
> >
> > 1. When a directory is exported from mds a to b, this drops the inode
> > caches for the directory from mds a.
> > 2. Because of this, the MDS's need significantly more metadata IO to
> > keep loading inodes from RADOS. (metadata is on hdd's in this cluster)
> > 3. Our workloads are quite dynamic and the balancer never manages to
> > find a stable set of exports; it is constantly churning the exports
> > between MDS's.
> >
> > I have attached 2 images that show the mds ram usage and the metadata
> > io. You can see that before, with mds pinning, the RAM usage is stable
> > (so the inode caches are effective). After disabling pinning, the RAM
> > usage is fluctuating and metadata IO is much more intensive than
> > before.
> >
> > We think this comes from the general goal of the MDS balancing -- it
> > tries to balance requests evenly across MDSs but has no concern for
> > the inode caches in the MDS's.

Your diagnosis makes sense to me. The balancing heuristics are designed
to try to match the amount of work done by each MDS, but that work is
measured by IOPS and metadata touched, not by cache sizes.

Do note that balancing includes cache migration, so the MDS ships off
all the metadata it has in memory that the peer will need -- but this
migration can also increase the amount of IO to the metadata pool. (I
don't have all the details of what it writes down in my head these days
-- it's definitely *not* everything transmitted, but it probably does
scale in some way with the amount of cache being moved, as I think
there are some cases which require being made durable?)

> > Do you think additional metrics such as cache hit rate or memory
> > usage should be added to the load metric? Or rather is this something
> > that can be fixed with some mds_bal_* parameter fine tuning?

I haven't been in recent discussions about the MDS balancing
algorithms, but my opinion for a while has been that we need a rethink
of the whole system. It's currently a pile of heuristics that I don't
think we can make universally applicable, so my hope is that we can
build a more robust system like the Mantle[1] experiments and let
admins switch between strategies to find the one that works in their
cluster.

For now, I suspect that if directory pinning works for you at all, that
is probably the route to take. :/
-Greg

[1]: http://docs.ceph.com/docs/master/cephfs/mantle/
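
(For reference, the pinning mechanism mentioned above is the
ceph.dir.pin virtual extended attribute: setting it on a directory to
an MDS rank pins that subtree to that rank, and setting it to -1
removes the pin, e.g. setfattr -n ceph.dir.pin -v 2 <dir>. Below is a
minimal Python sketch that just writes that xattr; the path
/cephfs/volumes/manila-share-x and rank 2 are made-up examples, not
values taken from this thread.)

    import os

    def pin_directory(path: str, rank: int) -> None:
        # Pin a CephFS directory subtree to a specific MDS rank by
        # writing the ceph.dir.pin virtual xattr (same effect as
        # "setfattr -n ceph.dir.pin -v <rank> <dir>").
        # A rank of -1 clears the pin.
        os.setxattr(path, "ceph.dir.pin", str(rank).encode())

    if __name__ == "__main__":
        # Hypothetical example values -- substitute your own mounted
        # CephFS directory and the MDS rank you want to pin it to.
        pin_directory("/cephfs/volumes/manila-share-x", 2)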