On Wed, 2022-08-17 at 07:31 +0000, Frank Schilder wrote:
> Hi Chris,
> 
> setting cut-off to high is recommended, but unlikely to address your
> performance issues. It will remove instabilities though.
> 

Hi Frank,

OK thanks, I will look into setting that on the OSDs. I'll test it in staging and then roll it out slowly to prod and see what happens. Even if it doesn't solve my problems, it seems like a reasonable setting to me from the description.

> For the meta data pool your throughput is normal. An MDS writes many
> small pieces and 150MB/s is actually a lot given that the amount of
> data in the meta-data pool is usually not more than about 500MB-1G
> per active MDS. You are probably facing a performance bottle neck
> with your data pools.

Ahh OK, yeah I suspect that the data pool OSDs are contributing, but I wanted to try and understand the direct link. For example, I can imagine that the MDS needs to make sure that certain files are deleted from the data pool before it removes metadata, and maybe that's what is holding things up.

> There is meta-data that needs to go to the data pool, its these guys:
> MDS_SLOW_METADATA_IO.

Oh, metadata also goes into the data pool? Hmm, I had thought it only went to the metadata pool, but if not then that might explain a lot.

> Can you remind us of how the pools are configured and what devices
> they are on and how many OSDs and PGs they have?

I don't think the design of the cluster is particularly great, but here goes:

3 x mon/mgr nodes
3 x mds nodes
10 x osd nodes

The OSD nodes have 72 CPU threads, 256GB RAM, a few NVMe drives and 60x (yes, sixty) SATA drives. Of the SATA drives, ~200 are 10TB and ~400 are 16TB (previous custodians replaced drives to try and increase free space in the cluster). The nodes are very busy with lots of I/O wait (potentially also impacting the metadata pool, as its OSDs are co-located on the same nodes).

The cephfs_data pool is on the SATA drives across the 10 nodes and uses a couple of NVMe drives on each node for journals. This pool has ~600 OSDs, 8PB total and 1.7PB used (84%), with almost 3 billion objects and a pg_num of 32,768 (probably needs doubling, but I haven't wanted to do that with these performance issues still going on).

The cephfs_metadata pool is running on the remaining NVMe drives on the same nodes, which is a total of 12 NVMe drives across the 10 OSD nodes. It has 115 million objects and a pg_num of 512 (maybe this should be increased).

$ sudo ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    8.01PiB     2.94PiB     5.06PiB      63.22
POOLS:
    NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
    cephfs_data         1      1.67PiB     84.78     308TiB        2692685430
    cephfs_metadata     2      2.16GiB     0.16      1.33TiB       115441446

The cluster is very unbalanced. When I inherited it, ~30% of the OSDs were over 90% full and ~20% were over 92% full. With some reweighting and data deletion the most full is now under 85%, but it's still very unbalanced. I would like to turn upmap on and use the balancer module, but some (not many) of the clients are showing up as jewel and I need to confirm whether they really are jewel before I can set the min client version to luminous... so it's another thing on the backlog.

The data OSDs range between 106 and 190 PGs for the 10TB drives (average 132) and between 144 and 243 for the 16TB drives (average 185). The metadata NVMe OSDs are also unbalanced, between 57 and 251 PGs per OSD, with an average of 118. Looking at the ops for these NVMe metadata OSDs, I can see that those with more PGs have higher ops (~1000, from the ceph_osd_op metric), which makes sense.
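For reference, the steps I have in mind for that look roughly like the below. I haven't run the balancer part on this cluster yet, so it's just a sketch of what I believe the commands are, not a tested procedure:

# check per-OSD utilisation and PG counts
$ sudo ceph osd df tree

# summarise the releases/feature bits of connected clients, to help
# confirm whether those "jewel" sessions really are jewel
$ sudo ceph features

# only once no jewel clients remain (may need the balancer mgr module
# enabled first with 'ceph mgr module enable balancer'):
$ sudo ceph osd set-require-min-compat-client luminous
$ sudo ceph balancer mode upmap
$ sudo ceph balancer on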
So maybe I should try and balance these out more; I had been ignoring them as I was focusing on the data OSDs and on not filling up the cluster. Although I'm not sure whether 1000 ops is a lot for an NVMe in a metadata pool, or whether it will make any difference if the bottleneck is elsewhere.

So if slow metadata ops is, or can be, caused by slow data pool OSDs (which I think makes sense and is also what you're saying), then I'll also focus on the data OSDs. But it does seem like the NVMe metadata pool could use some love too. The fact that its OSDs are co-located with such busy data OSD nodes probably isn't helping...

> What is the network link speed?

The cluster runs with only a public network (no separate cluster network) and traffic goes over 2 x 40Gb/s LACP bonded links to spine and leaf switches.

> What replication method?

All 3x replication, no erasure coding.

> Personally, I think your cluster is probably doing normal given that
> the main storage is SATA HDD drives. Scale-out doesn't mean you can
> just scale one side up (client side) and forget about the other side.
> With 5000 clients you would probably want a cluster with 3 times the
> nodes and flash cache devices per OSD. My guess is that aggregated
> IOP/s on your SATA drives are the bottleneck. You might be able to
> improve this a bit by balancing the PGs across OSDs (use ceph osd
> reweight method) if there is a large imbalance, but don't expect
> wonders.

Makes sense, thanks. IMO we need to not only rebalance the cluster, but also add more, less-dense nodes to spread the load out... (and upgrade the cluster, get multi-MDS, etc).

> One big performance hit could be if you are having the primary FS
> data pool on HDD, specifically when its erasure-coded. There is a new
> recommendation for a 3-pool layout and there were many long ceph-user
> threads about it. It is de-facto so important to get the primary data
> pool on a replicated flash pool that I decided to migrate a 1PB file
> system over to the new format. This reduces the meta data IO load on
> the HDD pool significantly and even speeds up some operations that
> only operate on meta-data.

Fortunately, it's just replication, so I think I avoid this issue at least :-)

Many thanks, again.
-c

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Chris Smart <distroguy@xxxxxxxxx>
> Sent: 17 August 2022 09:10:30
> To: Eugen Block; ceph-users@xxxxxxx
> Subject: Re: What is client request_load_avg?
> Troubleshooting MDS issues on Luminous
> 
> On Mon, 2022-08-15 at 08:33 +0000, Eugen Block wrote:
> > Hi,
> > 
> > do you see high disk utilization on the OSD nodes? How is the load
> > on the active MDS? How much RAM is configured for the MDS
> > (mds_cache_memory_limit)?
> > You can list all MDS sessions with 'ceph daemon mds.<MDS> session
> > ls' to identify all your clients and 'ceph daemon mds.<MDS>
> > dump_blocked_ops' to show blocked requests. But simply killing
> > sessions isn't a solution, so first you need to find out where the
> > bottleneck is. Do you see hung requests or something? Anything in
> > 'dmesg' on the client side?
> > 
> 
> Looking at the MDS ops in flight, the majority are journal_and_reply:
> 
> $ sudo ceph daemon mds.$(hostname) dump_ops_in_flight |grep 'flag_point' |sort |uniq -c
>      28    "flag_point": "failed to rdlock, waiting",
>       2    "flag_point": "failed to wrlock, waiting",
>      18    "flag_point": "failed to xlock, waiting",
>     418    "flag_point": "submit entry: journal_and_reply",
> 
> Does anyone know where I can find more info as to what
> journal_and_reply means? Is it solely about reading and writing to the
> metadata pool, or is it waiting for OSDs to perform some action (like
> ensuring a file is gone, so that it can then write to the metadata
> pool, perhaps)?
> 
> If it is related to OSDs in some way, then I can go and focus on
> improving them (not that I shouldn't be doing that anyway, but just
> trying to work out where to focus).
> 
> For example, maybe setting osd_op_queue_cut_off to high [1] might
> help? (osd_op_queue is already set to wpq.)
> 
> I notice that when performance tanks, the throughput on the metadata
> pool goes very spiky (including down to zero). We're not talking huge
> numbers though, the range is between 0 and 150MB/sec, so almost
> nothing... which makes me think it is related to the OSDs also.
> 
> [1]
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off
> 
> Many thanks!
> -c
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx