Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On Fri, 2022-08-19 at 13:04 +0000, Frank Schilder wrote:
> Hi Chris,
> 
> looks like your e-mail stampede is over :) I will cherry-pick some
> questions to answer, other things either follow or you will figure it
> out with the docs and trial-and-error. The cluster set-up is actually
> not that bad.
> 

Hi Frank, Eugene, et al,

Hehe. I have been trying to learn more before sending any more emails
;-)

Alongside general cluster improvements, I have been looking at the MDS
operations to try to understand CephFS better and hopefully shed some
light on the specific causes of the bottlenecks (as per your
suggestions).

It appears that the performance issues are caused by delays in
obtaining locks (like rdlock, xlock).

For example, here's an op where obtaining an rdlock took a few
minutes:

        {
            "description": "client_request(client.1989107546:151927 lookup #0x101645b1253/joblog.log 2022-08-17 23:58:49.504425 caller_uid=0, caller_gid=0{})",
            "initiated_at": "2022-08-17 23:58:49.504934",
            "age": 454.123112,
            "duration": 187.355917,
            "type_data": {
                "flag_point": "done",
                "reqid": "client.1989107546:151927",
                "op_type": "client_request",
                "client_info": {
                    "client": "client.1989107546",
                    "tid": 151927
                },
                "events": [
                    {
                        "time": "2022-08-17 23:58:49.504934",
                        "event": "initiated"
                    },
                    {
                        "time": "2022-08-18 00:01:14.341882",
                        "event": "failed to rdlock, waiting"
                    },
                    {
                        "time": "2022-08-18 00:01:56.860807",
                        "event": "acquired locks"
                    },
                    {
                        "time": "2022-08-18 00:01:56.860814",
                        "event": "replying"
                    },
                    {
                        "time": "2022-08-18 00:01:56.860829",
                        "event": "finishing request"
                    },
                    {
                        "time": "2022-08-18 00:01:56.860850",
                        "event": "cleaned up request"
                    },
                    {
                        "time": "2022-08-18 00:01:56.860851",
                        "event": "done"
                    }
                ]
            }
        }

Now I'm trying to understand the cause of delay in obtaining locks.

First, I'd like to confirm: when I see MDS ops waiting to get locks,
are those locks on objects in the data pool? That is, data that a
client somewhere holds a lock on and is potentially still working with?

My assumption is that the MDS has to ask the client to finish its work
and release the cap so that the MDS can perform the metadata operation
(without the file changing again in the meantime), and the delay is
just the MDS waiting for that to happen. (Or maybe the op is also stuck
in an OSD ops queue somewhere.)

Is that right?
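
In case it's useful context, the way I've been trying to tie a slow op
back to a specific client is roughly the following (a sketch only; I'm
assuming the admin socket commands behave the same on other versions):

    # ops currently blocked, same layout as dump_historic_ops
    $ sudo ceph daemon mds.$(hostname) dump_ops_in_flight

    # list sessions, then match the "client.1989107546" id from the op
    $ sudo ceph daemon mds.$(hostname) session ls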

Interestingly, when durations get long and performance gets bad,
rdlocks seem to cause the most trouble (maybe similar to
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-April/026056.html
?); other lock types less so.

$ sudo ceph daemon mds.$(hostname) dump_historic_ops | grep -B4 '"acquired locks"' \
    | grep '"event":' | grep -v '"acquired locks"'
                        "event": "failed to rdlock, waiting"
                        "event": "failed to rdlock, waiting"
                        "event": "failed to rdlock, waiting"
                        "event": "initiated"
                        "event": "initiated"
                        "event": "initiated"
                        "event": "failed to wrlock, waiting"
                        "event": "failed to wrlock, waiting"
                        "event": "initiated"
                        "event": "failed to wrlock, waiting"
                        "event": "failed to rdlock, waiting"
                        "event": "failed to rdlock, waiting"
                        "event": "failed to rdlock, waiting"
                        "event": "failed to rdlock, waiting"


If the MDS is waiting for client operations to complete, I'd like to
try and understand why performance is good and locks are acquired
quickly after a fresh restart of the MDS, but it gets worse over time.

When the MDS is freshly restarted, ops take less than a second for an
hour or so and throughput to the metadata pool is nice and consistent,
hovering between 15 and 25 MB/sec.

Then ops start to get slower, up to about 15 seconds, and the
throughput gets much choppier - between 5 and 75 MB/sec.

Then ops start to take 30-50 seconds and the throughput swings between
2 and 100 MB/sec for an hour or so (we usually see significant client
impact here).

Then ops start to take 2 minutes and the throughput jumps between
0 and 125 MB/sec. The cluster becomes mostly unusable at this point.
Sometimes restarting the MDS here helps, but sometimes it doesn't and
we just have to wait a few days.

I have put a graph that kind of shows these metadata pool throughput
spikes on my blog here, if anyone's interested:

https://blog.christophersmart.com/wp-content/uploads/2022/08/metadata_throughput.png
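
For anyone wanting to watch the slow creep without a full monitoring
stack, I've been meaning to sample the worst historic op duration once
a minute, roughly like this (again assuming jq and the "ops" array):

    $ while true; do
          echo -n "$(date -Is) "
          sudo ceph daemon mds.$(hostname) dump_historic_ops \
              | jq '[.ops[].duration] | max'
          sleep 60
      done | tee mds_op_durations.log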

So I'm wondering:

Is this getting progressively worse because clients accumulate more and
more cached state since the MDS was last restarted, so the delay in
obtaining a lock is really the MDS waiting for the client to finish
flushing data to the data pool (or whatever operation it was doing)?

Because it's fast after a fresh MDS restart, I guess this would imply
that when the MDS is restarted, clients are forced to flush all their
data and start with less state once the new MDS comes back (and it then
takes time for their caches to build up again). It doesn't seem to be
related only to the number of caps a client holds.

I do see spikes in data pool throughput that line up with MDS restarts
(I don't know whether this is clients flushing data or the MDS
flushing):

https://blog.christophersmart.com/wp-content/uploads/2022/08/ceph_data_flush.png

Or does it get slower over time because the MDS cache is getting too
large, with too many caps and inodes to track and therefore too much
work per operation? If that's the case, I guess the delay would be some
internal MDS work that isn't keeping up, so the wait isn't really on a
client releasing the lock at all, but on the MDS getting back to the
request quickly enough.
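
One thing I plan to watch to test this theory is the cache and cap
counts over time, e.g. via perf dump (I'm assuming mds_mem is the right
section to look at and that jq is installed):

    $ sudo ceph daemon mds.$(hostname) perf dump | jq '.mds_mem | {ino, dn, cap}'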

Any thoughts?

Might there be any mds or client specific ceph options that could help
with this? (e.g. mds_cache_memory_limit, mds_cache_trim_threshold,
mds_cache_trim_decay_rate, mds_recall_max_caps,
mds_recall_max_decay_rate).

Current settings:

$ sudo ceph --admin-daemon /var/run/ceph/ceph-mds.$(hostname).asok config show \
    | grep -E "mds_cache_trim_threshold|mds_cache_trim_decay_rate|mds_cache_memory_limit|mds_recall_max_caps|mds_recall_max_decay_rate"

    "mds_cache_memory_limit": "51539607552",
    "mds_cache_trim_decay_rate": "1.000000",
    "mds_cache_trim_threshold": "65536",
    "mds_recall_max_caps": "5000",
    "mds_recall_max_decay_rate": "2.500000",

And similarly, are there any client options people know of that can
help? (We're using a mix of kernel and FUSE mounts.) Something to keep
less cache, or to restrict the number of caps, perhaps?

Or is this just a case of "make your data pool OSDs faster and your
problems will be alleviated"?

Next in my journey to understand CephFS better: looking into debugfs to
see what the clients are doing, and interpreting the output of perf
dump...
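
For the kernel clients I'm assuming something along these lines, run on
the client itself (the directory name includes the fsid and client id,
and the exact files may differ between kernel versions):

    # pending MDS requests and caps currently held by this mount
    $ sudo cat /sys/kernel/debug/ceph/*/mdsc
    $ sudo cat /sys/kernel/debug/ceph/*/caps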


> 1) Set osd_op_queue_cut_off = high on global level. Even though it's
> prefixed with osd_, it seems to actually be used by more daemons.
> After setting it just on the OSDs on my mimic-cluster, the MDSes
> crashed until I set it on global level.
> 
> 

Ahh, good tip, thanks!
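
I assume that means an entry like this under [global] in ceph.conf on
all nodes, followed by daemon restarts (please correct me if there's a
better way):

    [global]
    osd_op_queue_cut_off = high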


> 2) I think your PG nums on the pools are fine, you should aim for
> between 100-200 PGs per OSD. On the meta-data pool you can increase
> it if you convert your OSDs one-by-one to LVM (mimic and later) and
> deploy 2 or 4 OSDs per NVMe drive. They should easily cope with the
> load and you can then safely increase pgnum. Make sure that you don't
> create OSDs smaller than 100GB. If your drives are not large enough,
> deploy fewer. If your drives are much larger than 400G, you could
> consider deploying a fast FS data pool for special applications (your
> favourite users).
> 

That makes sense, I have done this before on data pool OSDs. The NVMe
drives in the metadata pool are mostly 800GB (a few are 1.4TB), so this
is possible.
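
When we get to splitting them, I'm assuming it would be something along
these lines, once we're on a release whose ceph-volume supports it
(device names below are just examples):

    # two OSDs per 800GB NVMe keeps each one well above 100GB
    $ sudo ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1 /dev/nvme1n1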


> 3) The PG distribution per OSD is really bad. I was expecting max +/-
> 30% difference from average. Don't enable upmap balancer too early.
> On your system with separate OSDs per pool you should be able
> to get away with simpler methods. I believe there was a crush
> algorithm change from luminous to mimic, leading to a much better
> distribution after upgrading the crush tunables. I didn't upgrade, so
> I don't know how to proceed here. It is very well possible that a
> proper upgrade already re-balances PGs.
> 

Agreed, it's pretty bad :-)
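
For reference, this is roughly how I've been eyeballing the spread
(PGS is the last column of ceph osd df output here, so the awk may
need adjusting on other versions):

    # lowest and highest PG counts per OSD
    $ sudo ceph osd df | awk '/^ *[0-9]/ {print $NF, "osd."$1}' | sort -n | head -5
    $ sudo ceph osd df | awk '/^ *[0-9]/ {print $NF, "osd."$1}' | sort -n | tail -5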


> 4) Following up on the PG distribution. With your values, equalizing
> the distribution could lead to a factor 2 or even more performance
> gain. You have mostly small files in your system and these profit
> dramatically from good IOPs distribution across disks. I had the same
> on a luminous test cluster, 1 OSD bottlenecked everything. After
> reducing the number of PGs on this OSD, performance improved by a
> factor of 2. Currently, your few fullest disks bottleneck overall
> performance. Before using any rebalancing method, please check
> whether upgrading from luminous already changes things and - if so -
> upgrade first.
> 

Thanks, will do.

> 5) Your utilisation is too high. In my experience, an 80% full OSD is
> a magic boundary and performance degrades if OSDs get fuller than
> that. Try to add more capacity. You should gain a bit by rebalancing.
> Still, plan for capacity increase and collect money to buy what you
> need.


Yep, thanks. I have also proposed this, so I'm hoping we'll be able to
add more nodes and spread the data around.

> 
> 6) Co-locating the NVMes with SATA is fine. It's more likely that the
> NVMes starve the SATA drives than the other way around. Make sure
> that the connection order is according to recommendations for your
> controller. Then things will be fine.
> 

OK, thanks.

> 7) Lots of iowait on a node with 60 huge OSDs is not surprising. If your
> users want more performance, they should allow you to buy 4TB or 6TB
> NLSAS drives instead of the big SATA ones. You are also a bit low on
> memory. 256GB of RAM for 60 OSDs is not much. I operate 70+ OSD-nodes
> with 512GB and the OSDs sometimes use up to 80% of that. You might
> run into trouble on upgrade. Try to get money for a RAM upgrade
> before upgrading ceph.
> 

I think 256GB across 60 drives currently matches the 4GB default memory
target per OSD fairly closely. If I can get more memory into these
nodes, then I assume I should look at raising that setting, too.
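
If that happens, I assume bumping it would just be something like the
below, persisted in ceph.conf afterwards (assuming osd_memory_target is
available on our release - otherwise I guess it's the bluestore cache
size options - so treat this as a sketch only):

    # ~6GB per OSD instead of the 4GB default, once the nodes have headroom
    $ sudo ceph tell osd.* injectargs '--osd_memory_target 6442450944'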

> 8) I don't think you will find out much by following ops. My guess is
> that the general imbalance also leads to a very uneven distribution
> of primary OSDs and the OSDs with many primaries have the slow ops.
> If it's the full OSDs being reported all the time, the first step is
> rebalancing.


Thanks, I'll keep working on deleting data and rebalancing as best I
can, with a longer-term view of upgrading and expanding the cluster...

> 
> Hope that helps.
> 
> Good luck with fixing stuff and best regards,

It really does, thanks very much!

Cheers
-c

> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Chris Smart <distroguy@xxxxxxxxx>
> Sent: 17 August 2022 13:49:01
> To: Eugen Block; ceph-users@xxxxxxx
> Subject:  Re: What is client request_load_avg?
> Troubleshooting MDS issues on Luminous
> 
> On Wed, 2022-08-17 at 21:43 +1000, Chris Smart wrote:
> > 
> > Ohhhh, is "journal_and_reply" actually the very last event in a
> > successful operation?...[1] No wonder so many are the last event...
> > :facepalm:
> > 
> > OK, well assuming that, then I can probably look out for ops which
> > have both a journal_and_reply event and a large duration, and see
> > what they got stuck on... then maybe work out whatever that stuck
> > event means.
> > 
> > [1]
> > https://github.com/ceph/ceph/blob/d54685879d59f2780035623e40e31115d80dabb1/src/mds/Server.cc#L1925
> > 
> > -c
> 
> Oh, no actually from dumping historic mds ops I can see that
> "journal_and_reply" is totally _not_ the last event at all, "done"
> is.
> OK that makes more sense, I was on the right path the first time.
> 
> Now to go through these ops and see if I can find what the
> bottlenecks
> are.
> 
> -c

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



