On Tue, 2022-08-16 at 10:52 +0000, Frank Schilder wrote:
> Hi Chris,
>
> I would strongly advise not to use multi-MDS with 5000 clients on
> luminous. I enabled it on mimic with ca. 1750 clients and it was
> extremely dependent on luck whether it converged to a stable
> distribution of dirfrags or ended up doing export_dir operations all
> the time, completely killing the FS performance. Also, even in mimic,
> where multi-MDS is no longer experimental, it still has a lot of
> bugs. You will need to monitor the cluster tightly and might be
> forced to intervene regularly, including going back and forth between
> single- and multi-MDS.
>

Hi Frank,

Thanks a lot for passing on your experience, that's really valuable
info for a CephFS n00b like me. I have been wary of enabling multi-MDS
as I figured I'd end up hitting a lot of issues on Luminous, plus I'd
be in even deeper over my head...

> My recommendation would be to upgrade to octopus as fast as possible.
> It's the first version that supports ephemeral pinning, which I would
> say is pretty much the most useful multi-MDS mode, because it uses a
> static dirfrag distribution over all MDSes, avoiding the painful
> export_dir operations.
>

OK yeah, I was just reading about ephemeral pinning, actually. Sounds
like the best plan is to move to Octopus and then also ensure we have
a solid upgrade plan going forward. I only inherited this a couple of
months ago and it's still the same original Luminous cluster.

> You are in the unlucky situation that you will need 2 upgrades. I
> think going L->M->O might be the least painful as it requires only 1
> OSD conversion. If you are a bit more adventurous, you could also aim
> for L->N->P. Nautilus will probably not solve your performance issue
> and any path including nautilus will have an extra OSD conversion.
> However, in case you are using filestore, you might want to go this
> route and change from filestore to bluestore with a re-deployment of
> OSDs when you are on pacific. You will get out of some performance
> issues with upgraded OSDs, and pacific has fixes for a boatload of FS
> snapshot issues.
>

I am wary of upgrading between releases in general; I've looked into
this a bit and have noticed a number of people hitting some strange
issues. I guess the fortunate thing is that most people have probably
experienced them already and solutions are probably relatively easy to
find - on the downside, I'm not sure many people will be able to help,
as this cluster is so old that people have probably forgotten or moved
on. But I guess I don't really have any other choice: it's either
upgrade, or perhaps build a brand new cluster and migrate the data.

Yeah, the cluster is also using filestore and it would be good to get
onto bluestore at some point. The cache is already on NVMe at least,
so that's helped.

> In the meantime, can you roll out something like ganglia on all
> client and storage nodes and collect network traffic stats? I found
> the packet report combined with bytes-in/out extremely useful to hunt
> down rogue FS clients. If you use snapshots, kworker CPU and wait-IO
> on the client node are also indicative of problems with this client.
>

That's a good idea, I'll look into that.

Thanks again for the input, it's really helpful!

Cheers,
-c

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
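
P.S. For when we do get to Octopus: from what I've read, the ephemeral
pinning setup would look roughly like the below. Untested on my part,
so treat the fs name (cephfs), the mount path and the chosen
directories as placeholders for whatever we actually have:

    # allow a second active MDS rank
    ceph fs set cephfs max_mds 2

    # ephemeral distributed pinning is off by default (in Octopus at least)
    ceph config set mds mds_export_ephemeral_distributed true

    # hash the immediate children of /home across the active MDS ranks
    setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home

    # or plain static pinning of a whole subtree to one rank
    setfattr -n ceph.dir.pin -v 0 /mnt/cephfs/projects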
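
P.P.S. Alongside ganglia, I was also planning to poke the MDS admin
socket to spot noisy clients; something like this, run on the active
MDS host (the daemon name is just a guess at our naming scheme):

    # list client sessions with their caps and mount info
    ceph daemon mds.$(hostname -s) session ls

    # see which requests the MDS is currently chewing on
    ceph daemon mds.$(hostname -s) dump_ops_in_flight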