Re: What is client request_load_avg? Troubleshooting MDS issues on Luminous

Chris Smart <distroguy@xxxxxxxxx> · Tue, 16 Aug 2022 14:17:23 +1000

On Tue, 2022-08-16 at 13:21 +1000, distroguy@xxxxxxxxx wrote:
> 
> I'm not quite sure of the relationship of operations between MDS and
> OSD data. MDS metadata gets written to the NVMe pool and clients access
> data directly on OSD nodes, but do MDS operations also need to wait for
> OSDs to perform operations? I think it makes sense that they do (for
> example, to unlink a file the MDS needs to check if there are any other
> hardlinks to it, and if not, then the data can be deleted from OSDs
> and the metadata updated to remove the file)?
> 
> So to that end, would slow performing OSDs also impact MDS
> performance? Maybe it's stuck waiting for the OSDs to do their thing,
> and they aren't fast enough... but then wouldn't I see much more %wa?
> 

Related datapoints I forgot to mention:

We get lots of "MDS health slow requests are blocked" error messages,
every couple of minutes. Looking at the logs for August 13th, we had 911
log lines about the clearing of these slow requests.

The largest single message reported 11,193 slow requests cleared; the
average across those messages is 472.
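
For anyone wanting to pull the same kind of numbers from their own logs,
here is a rough Python sketch of the counting. The log line shape in the
comment is an assumption about what the cluster log looks like, not
copied from our logs, and the function name summarize_cleared is just
illustrative, so adjust the regex to match your own messages.

```python
import re
from statistics import mean

# Assumed line shape (an illustration, not taken from the real logs):
#   "... MDS health message cleared (mds.0): 472 slow requests are blocked > 30 sec"
CLEARED = re.compile(
    r"MDS health message cleared.*?(\d+) slow requests are blocked"
)


def summarize_cleared(lines):
    """Count matching log lines and report the max and mean request counts."""
    counts = []
    for line in lines:
        match = CLEARED.search(line)
        if match:
            counts.append(int(match.group(1)))
    if not counts:
        # No matching lines: return zeros rather than failing on max()/mean().
        return {"lines": 0, "max": 0, "mean": 0.0}
    return {"lines": len(counts), "max": max(counts), "mean": mean(counts)}
```

Passing an open file object for one day's cluster log as the lines
argument gives the line count, the largest single value and the average
in one go.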

I know we also have some OSD disks in the cluster with SMART errors,
which I'm looking to replace. However, we do not see the same number of
slow OSD requests: "only" 13 lines about blocked requests due to OSD
messages. I do plan to chase those down though, and see if I can work
out whether it's an unhealthy disk or intermittent network/host issues.

However, my point is that if the MDS were bottlenecked due to slow OSDs,
I feel like I should see more corresponding blocked request OSD
messages?...

Cheers,
-c

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx