You’re right, in my case it was clear where it came from. But if there’s no visible spike, it’s probably going to be difficult to get to the bottom of it. Did you notice any actual issues, or did you just see that value being that high without any connection to an incident?
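By the way, if I'm not mistaken that throttler is sized by ms_dispatch_throttle_bytes: its default is 100 MiB, which matches the "max": 104857600 in your output. As a rough sketch (please double-check against the docs for your release), you could verify the effective value and, if the mgr really keeps bumping into it, raise it for the mgr daemons:

# show the value currently in effect on the active mgr
ceph config show mgr.mon1 ms_dispatch_throttle_bytes

# double the dispatch throttle for mgr daemons (200 MiB); a mgr
# failover/restart may be needed for the messenger to pick it up
ceph config set mgr ms_dispatch_throttle_bytes 209715200

That only treats the symptom, of course, it doesn't answer what is filling the throttle in the first place.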
Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:
Hi Eugen,
Yes, I remember. But in that case it was clear where the problem came from. In this case it is completely unclear to me what caused the throttling; I only have guesses. There was no sudden spike in load or significant change in cluster size. I think it slowly approached the limit. It remains to be seen what exactly that limit is.
The visible impact is on the Prometheus module: it does not manage to prepare the data within the 15-second scrape interval.
This val is an 'in flight' value: in one second it may be zero, in the next it may be close to max. The idea that came to me now is to look at the msgr debug, but I'm not sure that will help given the number of messages
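If I do go that route, something like this (just a sketch) to keep the window short, since debug_ms 1 already logs roughly one line per message:

# temporarily raise messenger debug on the active mgr via the admin socket
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 1
# ... wait for one or two Prometheus scrapes ...
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 0

Then grep the mgr log around the moments when get_or_fail_fail jumps.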
k
Sent from my iPhone
On 8 Sep 2024, at 14:09, Eugen Block <eblock@xxxxxx> wrote:
Hi,
I don't have an answer, but it reminds me of the issue we had this year on a customer cluster. I had created this tracker issue [0], where you were the only one to comment so far. Those observations might not be related, but do you see any impact on the cluster?
Also, in your output "val" is still smaller than "max":
"val": 104856554,
"max": 104857600,
So it probably doesn't have any visible impact, does it? But the values are not that far apart; maybe they burst at times, causing the get_or_fail_fail counter to increase? Do you have that monitored?
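If not, a quick-and-dirty way to watch it (just a sketch, assuming the same admin socket path as in your mail; the 0x... suffix of the throttler name can change after a mgr restart, hence the prefix match):

while sleep 1; do
  echo -n "$(date +%T) "
  ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump \
    | jq -c 'to_entries[]
        | select(.key | startswith("throttle-msgr_dispatch_throttler-mgr"))
        | {val: .value.val, fails: .value.get_or_fail_fail}'
done

If "fails" keeps growing while "val" sits near max, the throttle really is saturated rather than just bursting briefly.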
Thanks,
Eugen
[0] https://tracker.ceph.com/issues/66310
Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:
Hi, it seems something in the mgr is being throttled due to val > max. Am I right?
root@mon1# ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump
| jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"'
{
  "val": 104856554,
  "max": 104857600,
  "get_started": 0,
  "get": 9700833,
  "get_sum": 654452218418,
  "get_or_fail_fail": 1323887918,
  "get_or_fail_success": 9700833,
  "take": 0,
  "take_sum": 0,
  "put": 9698716,
  "put_sum": 654347361864,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
The question is: how to determine what exactly is being throttled? All the other fail_fail perf counters are zero. The mgr is not running in a container and has enough resources to work.
Thanks,
k
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx