Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

Hi Chris,

looks like your e-mail stampede is over :) I will cherry-pick some questions to answer; other things either follow from these or you will figure them out with the docs and some trial and error. The cluster set-up is actually not that bad.

1) Set osd_op_queue_cut_off = high at the global level. Even though it's prefixed with osd_, it seems to be used by more daemons than just the OSDs. After setting it only on the OSDs of my mimic cluster, the MDSes crashed until I set it at the global level.
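For reference, a minimal ceph.conf sketch of what I mean (assuming you still manage settings via the config file on luminous; daemons need a restart to pick it up):

    [global]
    osd op queue cut off = high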

2) I think the PG numbers on your pools are fine; you should aim for between 100 and 200 PGs per OSD. On the metadata pool you can increase pg_num if you convert your OSDs one by one to LVM (mimic and later) and deploy 2 or 4 OSDs per NVMe drive. They should easily cope with the load and you can then safely increase pg_num. Make sure you don't create OSDs smaller than 100GB; if your drives are not large enough, deploy fewer OSDs per drive. If your drives are much larger than 400G, you could consider deploying a fast FS data pool for special applications (your favourite users).
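As a rough sanity check (the numbers below are made-up placeholders, not your cluster's): PGs per OSD is roughly pg_num times the replica count divided by the number of OSDs serving that pool, and the real per-OSD count is the PGS column of ceph osd df:

    # e.g. 4096 PGs x 3 replicas / 60 OSDs ~ 205 PGs per OSD
    ceph osd df tree    # PGS column shows the actual per-OSD count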

3) The PG distribution per OSD is really bad. I was expecting at most +/- 30% difference from the average. Don't enable the upmap balancer too early; on your system with separate OSDs per pool you should be able to get away with simpler methods. I believe there was a CRUSH algorithm change from luminous to mimic that leads to a much better distribution after upgrading the crush tunables. I didn't do that upgrade myself, so I don't know how to proceed here. It is very well possible that a proper upgrade already re-balances the PGs.
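To see where you stand before and after the upgrade, something like this should do (inspection only; switching profiles with "ceph osd crush tunables optimal" triggers a lot of data movement, so read the release notes before even considering that):

    ceph osd crush show-tunables     # which tunables profile is active
    ceph osd df                      # compare the PGS and %USE columns across OSDs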

4) Following up on the PG distribution: with your values, equalizing the distribution could yield a factor 2 or more performance gain. You have mostly small files in your system, and these profit dramatically from a good IOPS distribution across disks. I saw the same on a luminous test cluster, where one OSD bottlenecked everything; after reducing the number of PGs on that OSD, performance improved by a factor of 2. Currently, your few fullest disks bottleneck overall performance. Before using any rebalancing method, please check whether upgrading from luminous already changes things and - if so - upgrade first.
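If you do end up rebalancing by hand, the classic method is reweight-by-utilization. The arguments below (overload threshold in %, max weight change, max OSDs touched per run) are just example values; start conservative and always do the test run first:

    ceph osd test-reweight-by-utilization 110 0.05 10   # dry run, shows the proposed changes
    ceph osd reweight-by-utilization 110 0.05 10        # apply in small steps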

5) Your utilisation is too high. In my experience, 80% full is a magic boundary for an OSD, and performance degrades once OSDs get fuller than that. Try to add more capacity. You will gain a bit by rebalancing, but still plan for a capacity increase and collect the money to buy what you need.
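To see how close you actually are to the warning thresholds (luminous keeps the ratios in the OSDMap):

    ceph df detail                   # pool-level usage
    ceph osd dump | grep -i ratio    # nearfull/backfillfull/full thresholds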

6) Co-locating the NVMes with the SATA drives is fine. It's more likely that the NVMes starve the SATA drives than the other way around. Make sure the connection order follows the recommendations for your controller and things will be fine.

7) Lots of iowait on a node with 60 huge OSDs is not surprising. If your users want more performance, they should allow you to buy 4TB or 6TB NL-SAS drives instead of the big SATA ones. You are also a bit low on memory: 256GB of RAM for 60 OSDs is not much. I operate 70+ OSD nodes with 512GB, and the OSDs sometimes use up to 80% of that. You might run into trouble on upgrade, so try to get money for a RAM upgrade before upgrading ceph.
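A quick way to see what the OSDs consume today on one node is just summing the RSS of the ceph-osd processes; keep in mind that bluestore in newer releases targets roughly 4 GiB per OSD by default, which is already ~240 GiB for 60 OSDs:

    ps -C ceph-osd -o rss= | awk '{sum+=$1} END {printf "%.1f GiB\n", sum/1024/1024}'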

8) I don't think you will find out much by following individual ops. My guess is that the general imbalance also leads to a very uneven distribution of primary OSDs, and the OSDs with many primaries are the ones with the slow ops. If it's the fullest OSDs being reported all the time, the first step is rebalancing.
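You can check the primary distribution quickly from a pg dump; counting the acting primary per PG like below should show whether a handful of OSDs carry most primaries (the acting primary is the last column of pgs_brief here, adjust the field if your release prints a different layout):

    ceph pg dump pgs_brief 2>/dev/null | awk '$1 ~ /^[0-9]+\./ {print $NF}' | sort | uniq -c | sort -rn | head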

Hope that helps.

Good luck with fixing stuff and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Smart <distroguy@xxxxxxxxx>
Sent: 17 August 2022 13:49:01
To: Eugen Block; ceph-users@xxxxxxx
Subject:  Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On Wed, 2022-08-17 at 21:43 +1000, Chris Smart wrote:
>
> Ohhhh, is "journal_and_reply" actually the very last event in a
> successful operation?...[1] No wonder so many are the last event...
> :facepalm:
>
> OK, well assuming that, then I can probably look out for ops which
> have
> a both a journal_and_reply event and took a large duration and see
> what
> they got stuck on... then maybe work out whatever that stuck event
> means.
>
> [1]
> https://github.com/ceph/ceph/blob/d54685879d59f2780035623e40e31115d80dabb1/src/mds/Server.cc#L1925
>
> -c

Oh no, actually, from dumping historic MDS ops I can see that
"journal_and_reply" is totally _not_ the last event at all; "done" is.
OK, that makes more sense, I was on the right path the first time.

Now to go through these ops and see if I can find what the bottlenecks
are.
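Something like this is what I'm planning to use to pull out the slow
ones, with <name> being the active MDS (I'm guessing at the field names
from the JSON the admin socket returns, so they may need adjusting):

    ceph daemon mds.<name> dump_historic_ops > ops.json
    jq '.ops[] | select(.duration > 1) | {description, duration, flag_point: .type_data.flag_point}' ops.json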

-c
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx