RE: missing counter values in sysfs

Hi Holger,

> -----Original Message-----
> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-owner@xxxxxxxxxxxxxxx>
> On Behalf Of Holger Hoffstätte
> Sent: Saturday, September 1, 2018 9:10 AM
> To: linux-rdma@xxxxxxxxxxxxxxx
> Subject: rxe: missing counter values in sysfs
> 
> Hello!
> 
> After playing around with libfabric for a while I've decided to get my hands dirty
> and play with rxe for real. Easy enough - configured kernel (4.18.5), ported
> rdma-core-v19 to Gentoo (which only had a very old and completely outdated
> OFED package), and voila: ibv_*_bw/lat work, perftest hums along as well. Nice!
> 
> Nevertheless I've noticed something odd. I monitor my systems with
> Prometheus and suddenly got complaints from its host metrics collector like the
> following:
> 
> ..
> time="2018-08-30T17:19:28+02:00" level=error msg="ERROR: infiniband
> collector
>    failed after 0.000130s: strconv.ParseUint: parsing \"N/A (no PMA)\":
>    invalid syntax" source="collector.go:132"
> ..
> 
> Indeed all rxe device counters in sysfs show "N/A (no PMA)", which I traced into
> drivers/infiniband/core/sysfs.c's show_pma_counter() calling get_perf_mad() on
> the IB device, failing and returning the error message as counter value.
> 
> After digging through the source I've found that rxe does expose counters, but
> apparently the sysfs "binding" is never made because rxe's "IB device" doesn't
> set up the device->process_mad callback in order to redirect to its own counter
> processing.
> 
> Is this:
> 
> - correct so far?
> - a bug/oversight/simply not yet implemented in rxe?
> - should sysfs counter values ever end up having strings in them?
> 
> I'm reluctant to send patches to Prometheus before figuring out what is going
> on here, considering that I only started all this two days ago and may well have
> messed up something.
> 
> Any advice how to address this situation?
> 
The rxe driver needs to implement the alloc_hw_stats() and get_hw_stats() callbacks; once it does, those counters will be exposed in sysfs at
/sys/class/infiniband/rxeX/ports/<Port>/counters.

There is also a library written in Go, used by some orchestration tools, available at [1].
If your application is written in Go, you can use the library directly.

If it is missing something, let me know and I can enhance it further, or feel free to send a GitHub PR with enhancements.
It has been tested with mlx5 ConnectX-4/5 IB/RoCE single-port and dual-port devices.
Once the rxe driver implements statistics, they will automatically be available through the library.
It also reports connection management statistics of RDMA devices.

[1] https://github.com/Mellanox/rdmamap




