Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

Hi Chris,

Setting the cut-off to high is recommended, but it is unlikely to address your performance issues. It will remove instabilities, though.
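In case it helps, this is roughly what the change looks like (a sketch only; as far as I know the option is read at OSD start, so it goes into ceph.conf and the OSDs need a restart):

# /etc/ceph/ceph.conf on the OSD hosts
[osd]
    osd op queue = wpq
    osd op queue cut off = high

# then restart the OSDs host by host, e.g.
$ sudo systemctl restart ceph-osd.target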

For the metadata pool your throughput is normal. An MDS writes many small pieces, and 150MB/s is actually a lot given that the amount of data in the metadata pool is usually not more than about 500MB-1GB per active MDS. You are probably facing a performance bottleneck with your data pools. There is metadata that has to go to the data pool as well, and that is what MDS_SLOW_METADATA_IO is complaining about. Can you remind us how the pools are configured, what devices they are on, and how many OSDs and PGs they have? What is the network link speed? What replication method?
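Something like the following would give us the full picture:

$ ceph fs ls                    # which data/metadata pools back the file system
$ ceph osd pool ls detail       # size, min_size, pg_num and crush_rule per pool
$ ceph osd crush rule dump      # replicated vs. EC rules and device classes
$ ceph osd df tree              # number of OSDs, device sizes, PG distribution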

Personally, I think your cluster is probably performing as expected given that the main storage is SATA HDDs. Scale-out doesn't mean you can just scale up one side (the client side) and forget about the other. With 5000 clients you would probably want a cluster with three times the nodes and flash cache devices per OSD. My guess is that the aggregate IOPS of your SATA drives is the bottleneck. You might be able to improve this a bit by balancing the PGs across OSDs (using the ceph osd reweight mechanism) if there is a large imbalance, but don't expect wonders.
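If you want to try it, something like reweight-by-utilization is the simplest route; a rough sketch (the threshold is just an example, look at the dry run first):

$ ceph osd df                                  # check the VAR / %USE spread
$ ceph osd test-reweight-by-utilization 115    # dry run, shows the proposed changes
$ ceph osd reweight-by-utilization 115         # apply if the proposal looks sane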

One big performance hit could be having the primary FS data pool on HDD, especially when it is erasure-coded. There is a new recommendation for a 3-pool layout, and there have been many long ceph-users threads about it. It is in fact so important to get the primary data pool onto a replicated flash pool that I decided to migrate a 1PB file system over to the new format. This reduces the metadata IO load on the HDD pool significantly and even speeds up some operations that only touch metadata.
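For a new file system the layout looks roughly like this (a sketch; pool, rule and profile names as well as PG counts are placeholders, and an existing file system cannot swap its primary data pool in place, which is why I had to migrate):

$ ceph osd crush rule create-replicated rep-ssd default host ssd
$ ceph osd pool create cephfs_metadata 64 64 replicated rep-ssd
$ ceph osd pool create cephfs_data 64 64 replicated rep-ssd          # small replicated primary data pool on flash
$ ceph osd pool create cephfs_data_ec 512 512 erasure <ec-profile>   # bulk data pool on HDD
$ ceph osd pool set cephfs_data_ec allow_ec_overwrites true
$ ceph fs new cephfs cephfs_metadata cephfs_data
$ ceph fs add_data_pool cephfs cephfs_data_ec
# point the directory tree at the EC pool via a file layout:
$ setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs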

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Chris Smart <distroguy@xxxxxxxxx>
Sent: 17 August 2022 09:10:30
To: Eugen Block; ceph-users@xxxxxxx
Subject:  Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

On Mon, 2022-08-15 at 08:33 +0000, Eugen Block wrote:
> Hi,
>
> do you see high disk utilization on the OSD nodes? How is the load on
> the active MDS? How much RAM is configured for the MDS
> (mds_cache_memory_limit)?
> You can list all MDS sessions with 'ceph daemon mds.<MDS> session ls'
> to identify all your clients and 'ceph daemon mds.<MDS>
> dump_blocked_ops' to show blocked requests. But simply killing
> sessions isn't a solution, so first you need to find out where the
> bottleneck is. Do you see hung requests or something? Anything in
> 'dmesg' on the client side?
>

Looking at the MDS ops in flight, the majority are journal_and_reply:

$ sudo ceph daemon mds.$(hostname) dump_ops_in_flight | grep 'flag_point' | sort | uniq -c
     28                 "flag_point": "failed to rdlock, waiting",
      2                 "flag_point": "failed to wrlock, waiting",
     18                 "flag_point": "failed to xlock, waiting",
    418                 "flag_point": "submit entry: journal_and_reply",

Does anyone know where I can find more info on what journal_and_reply
means? Is it solely about reading and writing to the metadata pool, or
is it waiting for OSDs to perform some action (like ensuring a file is
gone, so that it can then write to the metadata pool, perhaps)?
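For completeness, the same dump also contains a per-op event timeline,
so counting the events shows where the ops are sitting (a rough sketch,
assuming the usual OpTracker JSON fields):

$ sudo ceph daemon mds.$(hostname) dump_ops_in_flight | grep '"event"' | sort | uniq -c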

If it is related to OSDs in some way, then I can go and focus on
improving them (not that I shouldn't be doing that anyway, but just
trying to work out where to focus).
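In the meantime I'll do the obvious OSD-side checks, something along
these lines:

$ ceph osd perf         # per-OSD commit/apply latency
$ ceph health detail    # any REQUEST_SLOW / slow request warnings
$ iostat -x 5           # on the OSD hosts: %util and await of the HDDs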

For example, maybe setting osd_op_queue_cut_off to high [1] might
help? (osd_op_queue is already set to wpq.)
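For the record, this is how I checked the current values via the admin
socket on one of the OSD hosts (osd.0 is just whichever OSD happens to
live there):

$ sudo ceph daemon osd.0 config get osd_op_queue
$ sudo ceph daemon osd.0 config get osd_op_queue_cut_off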

I notice that when performance tanks, the throughput on the metadata
pool becomes very spiky (including drops to zero). We're not talking
huge numbers though; the range is between 0 and 150MB/sec, so almost
nothing... which makes me think it is related to the OSDs as well.
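The spikes are easy to see with something like the following (the pool
name is a placeholder for whatever the metadata pool is called):

$ watch -n 5 'ceph osd pool stats cephfs_metadata'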

[1]
https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off

Many thanks!
-c
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx