Hi Holger,

> -----Original Message-----
> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-owner@xxxxxxxxxxxxxxx>
> On Behalf Of Holger Hoffstätte
> Sent: Saturday, September 1, 2018 9:10 AM
> To: linux-rdma@xxxxxxxxxxxxxxx
> Subject: rxe: missing counter values in sysfs
>
> Hello!
>
> After playing around with libfabric for a while I've decided to get my hands dirty
> and play with rxe for real. Easy enough - configured kernel (4.18.5), ported
> rdma-core-v19 to Gentoo (which only had a very old and completely outdated
> OFED package), and voila: ibv_*_bw/lat work, perftest hums along as well. Nice!
>
> Nevertheless I've noticed something odd. I monitor my systems with
> Prometheus and suddenly got complaints from its host metrics collector like the
> following:
>
> ..
> time="2018-08-30T17:19:28+02:00" level=error msg="ERROR: infiniband
> collector
> failed after 0.000130s: strconv.ParseUint: parsing \"N/A (no PMA)\":
> invalid syntax" source="collector.go:132"
> ..
>
> Indeed all rxe device counters in sysfs show "N/A (no PMA)", which I traced into
> drivers/infiniband/core/sysfs.c's show_pma_counter() calling get_perf_mad() on
> the IB device, failing and returning the error message as counter value.
>
> After digging through the source I've found that rxe does expose counters, but
> apparently the sysfs "binding" is never made because rxe's "IB device" doesn't
> set up the device->process_mad callback in order to redirect to its own counter
> processing.
>
> Is this:
>
> - correct so far?
> - a bug/oversight/simply not yet implemented in rxe?
> - should sysfs counter values ever end up having strings in them?
>
> I'm reluctant to send patches to Prometheus before figuring out what is going
> on here, considering that I only started all this two days ago and may well have
> messed up something.
>
> Any advice how to address this situation?
>

The rxe driver needs to implement the alloc_hw_stats() and get_hw_stats()
callbacks. Once it does, those counters will be exposed in sysfs at
/sys/class/infiniband/rxeX/ports/<Port>/counters.

There is a library written in Go, also used in some orchestration tools,
available at [1]. If your application is in Go, you can use the library
directly. If it is missing something, let me know and I can enhance it
further, or feel free to send a GitHub PR with enhancements.
It has been tested with mlx5 ConnectX-4/5 IB/RoCE single-port and dual-port
devices. Once the rxe driver implements statistics, they will be available
there automatically. The library also reports connection management statistics
of RDMA devices.

[1] https://github.com/Mellanox/rdmamap
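
Until rxe provides those callbacks, a collector on the reading side could skip
counter values that are not plain unsigned integers instead of failing on them.
Below is a minimal Go sketch of that idea. The helper name parseCounter is
hypothetical; it is not taken from the Prometheus collector or from the library
at [1], and the sample strings stand in for what would be read from the sysfs
counter files.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter is a hypothetical helper: it converts the raw contents of a
// sysfs counter file into a number. ok is false when the contents are not an
// unsigned decimal integer, for example the "N/A (no PMA)" placeholder.
func parseCounter(raw string) (value uint64, ok bool) {
	v, err := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	// Sample inputs standing in for file contents read from sysfs.
	for _, raw := range []string{"1234\n", "N/A (no PMA)\n"} {
		if v, ok := parseCounter(raw); ok {
			fmt.Printf("%q: value %d\n", raw, v)
		} else {
			fmt.Printf("%q: skipped, not numeric\n", raw)
		}
	}
}
```

Whether to skip such a value silently or report it as a missing metric is a
policy choice for the collector; the sketch only shows how to tell the two
cases apart.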