On Wed, 2022-08-17 at 07:31 +0000, Frank Schilder wrote:
> Hi Chris,
> 
> setting cut-off to high is recommended, but unlikely to address your
> performance issues. It will remove instabilities though.
> 

Hi Frank,

OK thanks, I will look into setting that on the OSDs. I'll test it in staging and then roll it out slowly to prod and see what happens. Even if it doesn't solve my problems, it seems like a reasonable setting to me from the description.

> For the meta data pool your throughput is normal. An MDS writes many
> small pieces and 150MB/s is actually a lot given that the amount of
> data in the meta-data pool is usually not more than about 500MB-1G
> per active MDS. You are probably facing a performance bottle neck
> with your data pools.

Ahh OK, yeah I suspect that the data pool OSDs are contributing, but I wanted to try and understand the direct link. For example, I can imagine that the MDS needs to make sure that certain files are deleted from the data pool before it removes metadata, and maybe that's what is holding things up.

> There is meta-data that needs to go to the data pool, its these guys:
> MDS_SLOW_METADATA_IO.

Oh, metadata also goes into the data pool? Hmm, I had thought it only went to the metadata pool, but if not then that might explain a lot.

> Can you remind us of how the pools are configured and what devices
> they are on and how many OSDs and PGs they have?

I don't think the design of the cluster is particularly great, but here goes:

3 x mon/mgr nodes
3 x mds nodes
10 x osd nodes

The OSD nodes have 72 CPU threads, 256GB RAM, a few NVMe drives and 60x (yes, sixty) SATA drives. Of the SATA drives, ~200 are 10TB and ~400 are 16TB (previous custodians replaced drives to try and increase free space in the cluster). The nodes are very busy with lots of I/O wait (potentially also impacting the metadata pool, as its OSDs are co-located on the same nodes).

The cephfs_data pool is on the SATA drives across the 10 nodes and uses a couple of NVMe drives on each node for journals. This pool has ~600 OSDs, 8PB total and 1.7PB used (84%), with almost 3 billion objects and a pg_num of 32,768 (probably needs doubling, but I haven't wanted to do that with these performance issues still going on).

The cephfs_metadata pool is running on the remaining NVMe drives on the same nodes, which is a total of 12 NVMe drives across the 10 OSD nodes. It has 115 million objects and a pg_num of 512 (maybe this should be increased).

$ sudo ceph df
GLOBAL:
    SIZE        AVAIL       RAW USED     %RAW USED
    8.01PiB     2.94PiB     5.06PiB      63.22
POOLS:
    NAME                ID     USED        %USED     MAX AVAIL     OBJECTS
    cephfs_data         1      1.67PiB     84.78     308TiB        2692685430
    cephfs_metadata     2      2.16GiB     0.16      1.33TiB       115441446

The cluster is very unbalanced. When I inherited it, ~30% of the OSDs were over 90% full and ~20% were over 92% full. With some reweighting and data deletion the most full is now under 85%, but it's still very unbalanced. I would like to turn upmap on and use the balancer module, but some (not many) of the clients are showing up as jewel and I need to confirm whether they really are jewel before I can set the min client version to luminous... so it's another thing on the backlog.

The data OSDs range between 106 and 190 PGs for the 10TB drives (average 132) and between 144 and 243 for the 16TB drives (average 185). The metadata NVMe OSDs are also unbalanced, between 57 and 251 PGs per OSD, with an average of 118. Looking at the ops for these NVMe metadata OSDs, I can see that those with more PGs have higher ops (~1000, from the ceph_osd_op metric), which makes sense.
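For reference, the steps I have in mind for that look roughly like the below. I haven't run the balancer part on this cluster yet, so it's just a sketch of what I believe the commands are, not a tested procedure:

# check per-OSD utilisation and PG counts
$ sudo ceph osd df tree

# summarise the releases/feature bits of connected clients, to help
# confirm whether those "jewel" sessions really are jewel
$ sudo ceph features

# only once no jewel clients remain (may need the balancer mgr module
# enabled first with 'ceph mgr module enable balancer'):
$ sudo ceph osd set-require-min-compat-client luminous
$ sudo ceph balancer mode upmap
$ sudo ceph balancer on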
So maybe I should try and balance these out more; I had been ignoring them as I was focusing on the data OSDs and on not filling up the cluster. Although I'm not sure whether 1000 ops is a lot for an NVMe in a metadata pool, or whether it will make any difference if the bottleneck is elsewhere.

So if slow metadata ops is, or can be, caused by slow data pool OSDs (which I think makes sense and is also what you're saying), then I'll also focus on the data OSDs. But it does seem like the NVMe metadata pool could use some love too. The fact that its OSDs are co-located with such busy data OSD nodes probably isn't helping...

> What is the network link speed?

The cluster runs with only a public network (no separate cluster network) and traffic goes over 2 x 40Gb/s LACP bonded links to spine and leaf switches.

> What replication method?

All 3x replication, no erasure coding.

> Personally, I think your cluster is probably doing normal given that
> the main storage is SATA HDD drives. Scale-out doesn't mean you can
> just scale one side up (client side) and forget about the other side.
> With 5000 clients you would probably want a cluster with 3 times the
> nodes and flash cache devices per OSD. My guess is that aggregated
> IOP/s on your SATA drives are the bottleneck. You might be able to
> improve this a bit by balancing the PGs across OSDs (use ceph osd
> reweight method) if there is a large imbalance, but don't expect
> wonders.

Makes sense, thanks. IMO we need to not only rebalance the cluster, but also add more, less-dense nodes to spread the load out... (and upgrade the cluster, get multi-MDS, etc).

> One big performance hit could be if you are having the primary FS
> data pool on HDD, specifically when its erasure-coded. There is a new
> recommendation for a 3-pool layout and there were many long ceph-user
> threads about it. It is de-facto so important to get the primary data
> pool on a replicated flash pool that I decided to migrate a 1PB file
> system over to the new format. This reduces the meta data IO load on
> the HDD pool significantly and even speeds up some operations that
> only operate on meta-data.

Fortunately, it's just replication, so I think I avoid this issue at least :-)

Many thanks, again.
-c

> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> 
> ________________________________________
> From: Chris Smart <distroguy@xxxxxxxxx>
> Sent: 17 August 2022 09:10:30
> To: Eugen Block; ceph-users@xxxxxxx
> Subject: Re: What is client request_load_avg?
> Troubleshooting MDS issues on Luminous
> 
> On Mon, 2022-08-15 at 08:33 +0000, Eugen Block wrote:
> > Hi,
> > 
> > do you see high disk utilization on the OSD nodes? How is the load
> > on the active MDS? How much RAM is configured for the MDS
> > (mds_cache_memory_limit)?
> > You can list all MDS sessions with 'ceph daemon mds.<MDS> session
> > ls' to identify all your clients and 'ceph daemon mds.<MDS>
> > dump_blocked_ops' to show blocked requests. But simply killing
> > sessions isn't a solution, so first you need to find out where the
> > bottleneck is. Do you see hung requests or something? Anything in
> > 'dmesg' on the client side?
> > 
> 
> Looking at the MDS ops in flight, the majority are journal_and_reply:
> 
> $ sudo ceph daemon mds.$(hostname) dump_ops_in_flight |grep 'flag_point' |sort |uniq -c
>      28    "flag_point": "failed to rdlock, waiting",
>       2    "flag_point": "failed to wrlock, waiting",
>      18    "flag_point": "failed to xlock, waiting",
>     418    "flag_point": "submit entry: journal_and_reply",
> 
> Does anyone know where I can find more info as to what
> journal_and_reply means? Is it solely about reading and writing to the
> metadata pool, or is it waiting for OSDs to perform some action (like
> ensuring a file is gone, so that it can then write to the metadata
> pool, perhaps)?
> 
> If it is related to OSDs in some way, then I can go and focus on
> improving them (not that I shouldn't be doing that anyway, but just
> trying to work out where to focus).
> 
> For example, maybe setting osd_op_queue_cut_off to high [1] might
> help? (osd_op_queue is already set to wpq.)
> 
> I notice that when performance tanks, the throughput on the metadata
> pool goes very spiky (including down to zero). We're not talking huge
> numbers though, the range is between 0 and 150MB/sec, so almost
> nothing... which makes me think it is related to the OSDs also.
> 
> [1]
> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/#confval-osd_op_queue_cut_off
> 
> Many thanks!
> -c
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx