You’re right, in my case it was clear where it came from. But if there’s no visible spike, it’s probably going to be difficult to get to the bottom of it. Did you notice any actual issues, or did you just see that value being that high without any connection to an incident?
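By the way, if I'm not mistaken that throttler is sized by ms_dispatch_throttle_bytes: its default is 100 MiB, which matches the "max": 104857600 in your output. As a rough sketch (please double-check against the docs for your release), you could verify the effective value and, if the mgr really keeps bumping into it, raise it for the mgr daemons:

# show the value currently in effect on the active mgr
ceph config show mgr.mon1 ms_dispatch_throttle_bytes

# double the dispatch throttle for mgr daemons (200 MiB); a mgr
# failover/restart may be needed for the messenger to pick it up
ceph config set mgr ms_dispatch_throttle_bytes 209715200

That only treats the symptom, of course, it doesn't answer what is filling the throttle in the first place.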
Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:
Hi Eugen,
Yes, I remember. But in that case it was clear where the problem came from. In this case it is completely unclear to me what caused the throttling; I only have guesses. There was no sudden spike in load or significant change in cluster size. I think it slowly approached the limit. It remains to be seen what exactly that limit is.
The visible impact is on the Prometheus module: it does not manage to prepare the data within the 15-second scrape interval.
This val is an 'in flight' value: in one second it may be zero, in the next it may be close to max. The idea that came to me now is to look at the msgr debug, but I'm not sure that will help given the number of messages
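If I do go that route, something like this (just a sketch) to keep the window short, since debug_ms 1 already logs roughly one line per message:

# temporarily raise messenger debug on the active mgr via the admin socket
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 1
# ... wait for one or two Prometheus scrapes ...
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 0

Then grep the mgr log around the moments when get_or_fail_fail jumps.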
k
Sent from my iPhone
On 8 Sep 2024, at 14:09, Eugen Block <eblock@xxxxxx> wrote:
Hi,
I don't have an answer, but it reminds me of the issue we had this year on a customer cluster. I had created this tracker issue [0], where you were the only one to comment so far. Those observations might not be related, but do you see any impact on the cluster?
Also, in your output "val" is still smaller than "max":
"val": 104856554,
"max": 104857600,
So it probably doesn't have any visible impact, does it? But the values are not that far apart; maybe they burst at times, causing the get_or_fail_fail counter to increase? Do you have that monitored?
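If not, a quick-and-dirty way to watch it (just a sketch, assuming the same admin socket path as in your mail; the 0x... suffix of the throttler name can change after a mgr restart, hence the prefix match):

while sleep 1; do
  echo -n "$(date +%T) "
  ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump \
    | jq -c 'to_entries[]
        | select(.key | startswith("throttle-msgr_dispatch_throttler-mgr"))
        | {val: .value.val, fails: .value.get_or_fail_fail}'
done

If "fails" keeps growing while "val" sits near max, the throttle really is saturated rather than just bursting briefly.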
Thanks,
Eugen
[0] https://tracker.ceph.com/issues/66310
Quoting Konstantin Shalygin <k0ste@xxxxxxxx>:
Hi, it seems something in the mgr is being throttled due to val > max. Am I right?
root@mon1# ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump
| jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"'
{
  "val": 104856554,
  "max": 104857600,
  "get_started": 0,
  "get": 9700833,
  "get_sum": 654452218418,
  "get_or_fail_fail": 1323887918,
  "get_or_fail_success": 9700833,
  "take": 0,
  "take_sum": 0,
  "put": 9698716,
  "put_sum": 654347361864,
  "wait": {
    "avgcount": 0,
    "sum": 0,
    "avgtime": 0
  }
}
The question is: how to determine what exactly is being throttled? All the other fail_fail perf counters are zero. The mgr is not running in a container and has enough resources to work.
Thanks,
k
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx