rxe: missing counter values in sysfs

Hello!

After playing around with libfabric for a while I've decided to get my
hands dirty and play with rxe for real. Easy enough: I configured the
kernel (4.18.5), ported rdma-core-v19 to Gentoo (which only had a badly
outdated OFED package), and voilà: ibv_*_bw/lat work, perftest hums
along as well. Nice!

Nevertheless I've noticed something odd. I monitor my systems with
Prometheus and suddenly got complaints from its host metrics collector
like the following:

..
time="2018-08-30T17:19:28+02:00" level=error msg="ERROR: infiniband collector
  failed after 0.000130s: strconv.ParseUint: parsing \"N/A (no PMA)\":
  invalid syntax" source="collector.go:132"
..

Indeed, all rxe device counters in sysfs show "N/A (no PMA)". I traced
this to show_pma_counter() in drivers/infiniband/core/sysfs.c, which
calls get_perf_mad() on the IB device; that call fails, and the error
message is returned in place of the counter value.

After digging through the source I've found that rxe does expose
counters, but apparently the sysfs "binding" is never made because
rxe's "IB device" doesn't set up the device->process_mad callback
in order to redirect to its own counter processing.

My questions:

- Is my analysis correct so far?
- Is this a bug, an oversight, or simply not yet implemented in rxe?
- Should sysfs counter values ever contain strings?

I'm reluctant to send patches to Prometheus before figuring out what
is going on here, considering that I only started all this two days
ago and may well have messed up something.
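
If the string does turn out to be intended behaviour, a collector-side
workaround might look roughly like the Go sketch below: treat a value
that does not parse as "counter unavailable" rather than as a fatal
error. This is only an illustration of the idea, not the actual
collector code; parseCounter is a hypothetical name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCounter (hypothetical helper) converts the raw text of a sysfs
// counter file to a number. ok is false when the text is not an unsigned
// decimal integer, as with the "N/A (no PMA)" string seen on rxe devices.
func parseCounter(raw string) (value uint64, ok bool) {
	v, err := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true
}

func main() {
	for _, raw := range []string{"12345\n", "N/A (no PMA)\n"} {
		if v, ok := parseCounter(raw); ok {
			fmt.Printf("%q: value %d\n", raw, v)
		} else {
			fmt.Printf("%q: not numeric, skipped\n", raw)
		}
	}
}
```

Whether skipping silently is the right policy is a separate question; a
collector may prefer to log the skipped counter once per device.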

Any advice on how to address this situation?

Thanks!
Holger


