Re: ceph-mgr perf throttle-msgr - what causes the fails?


 



Hi Eugen,

Yes, I remember. But in that case it was clear where the problem came from. In this case it is completely unclear to me what caused the throttling; I only have guesses. There was no sudden spike in load or a significant change in cluster size. I think the value slowly approached the limit. It remains to be seen what the limit actually is.
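
If I read the max right, 104857600 bytes is just the default ms_dispatch_throttle_bytes (100 MiB), so the limit itself can at least be inspected and, as a purely hypothetical mitigation, raised. A rough sketch, assuming the mgr honours this option like the other daemons, and I'm not sure whether a restart is needed for it to take effect:

# current dispatch throttle limit for the mgr daemons (default 100 MiB = 104857600)
ceph config get mgr ms_dispatch_throttle_bytes

# hypothetical mitigation: double it to 200 MiB, then watch val/get_or_fail_fail again
ceph config set mgr ms_dispatch_throttle_bytes 209715200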

The visible impact is on the Prometheus module: it does not manage to prepare the data within the 15-second scrape interval.
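
Just to illustrate what I'd check first on the module side, a small sketch; the 30s value is only an example and assumes the Prometheus server's scrape_interval/scrape_timeout are adjusted to match:

# what the mgr prometheus module thinks the scrape interval is (default 15s)
ceph config get mgr mgr/prometheus/scrape_interval

# hypothetical workaround: give the module more headroom per scrape
ceph config set mgr mgr/prometheus/scrape_interval 30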

This 'val' is in-flight data: within one second it may be zero, the next moment it may be near max. The idea that came to me now is to look at the msgr debug output, but I'm not sure that will help given the number of messages.
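
Something like this is what I have in mind, just a rough sketch; note the throttler name contains a pointer address, so it changes on every mgr restart, and the debug_ms output will be very noisy:

# sample the throttler every second to catch the bursts
while true; do
  ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump \
    | jq -r '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"
             | "\(now|floor) val=\(.val) fail=\(.get_or_fail_fail)"'
  sleep 1
done

# and/or raise messenger debug on the mgr for a short window via the admin socket
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 1
# ... wait a bit, then grep the mgr log for lines mentioning the dispatch throttler ...
ceph daemon /var/run/ceph/ceph-mgr.mon1.asok config set debug_ms 0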


k
Sent from my iPhone

> On 8 Sep 2024, at 14:09, Eugen Block <eblock@xxxxxx> wrote:
> 
> Hi,
> 
> I don't have an answer, but it reminds me of the issue we had this year on a customer cluster. I had created this tracker issue [0] where you were so far the only one to comment. Those observations might not be related, but do you see any impact on the cluster?
> Also, in your output "val" is still smaller than "max":
> 
>>  "val": 104856554,
>>  "max": 104857600,
> 
> So it probably doesn't have any visible impact, does it? But the values are not that far apart; maybe they burst at some point, causing the get_or_fail_fail counter to increase? Do you have that monitored?
> 
> Thanks,
> Eugen
> 
> [0] https://tracker.ceph.com/issues/66310
> 
> Zitat von Konstantin Shalygin <k0ste@xxxxxxxx>:
> 
>> Hi, it seems something in the mgr is being throttled because val > max. Am I right?
>> 
>> root@mon1# ceph daemon /var/run/ceph/ceph-mgr.mon1.asok perf dump | jq '."throttle-msgr_dispatch_throttler-mgr-0x55930f4aed20"'
>> {
>>  "val": 104856554,
>>  "max": 104857600,
>>  "get_started": 0,
>>  "get": 9700833,
>>  "get_sum": 654452218418,
>>  "get_or_fail_fail": 1323887918,
>>  "get_or_fail_success": 9700833,
>>  "take": 0,
>>  "take_sum": 0,
>>  "put": 9698716,
>>  "put_sum": 654347361864,
>>  "wait": {
>>    "avgcount": 0,
>>    "sum": 0,
>>    "avgtime": 0
>>  }
>> }
>> 
>> The question is: how to determine what exactly is being throttled? The other fail_fail counters in the perf dump are zero. The mgr is not running in a container and has enough resources to work.
>> 
>> 
>> Thanks,
>> k
> 
> 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



