Re: [PATCH 1/3] ib core: Make device counter infrastructure dynamic

Doug Ledford <dledford@xxxxxxxxxx> · Tue, 17 May 2016 12:00:39 -0400

On 05/17/2016 10:19 AM, Christoph Lameter wrote:
> 
> On Mon, 16 May 2016, Doug Ledford wrote:
>>
>> Thanks, this looks good now.  When the other two patches come through
> 
> The patch can stand on its own and there has been the expectation
> expressed by Mellanox that they want to see this merged first. Guess this
> is to reduce the amount of rewrite they would have to do if things change.
> Then also the team from Mellanox can directly merge the driver changes
> without my involvement.
> 

OK.  There are comments from Jason outstanding, and I found one thing
that I missed in my earlier reviews.  I think we need to refactor how we
pull out the stats, or at least consider doing so.  In particular, look
at how many stats the cxgb3 driver fills in:

+	stats->dirname = "iw_stats";
+	stats->name = names;
+
+	stats->value[IPINRECEIVES] = ((u64)m.ipInReceive_hi << 32) + m.ipInReceive_lo;
+	stats->value[IPINHDRERRORS] = ((u64)m.ipInHdrErrors_hi << 32) + m.ipInHdrErrors_lo;
+	stats->value[IPINADDRERRORS] = ((u64)m.ipInAddrErrors_hi << 32) + m.ipInAddrErrors_lo;
+	stats->value[IPINUNKNOWNPROTOS] = ((u64)m.ipInUnknownProtos_hi << 32) + m.ipInUnknownProtos_lo;
+	stats->value[IPINDISCARDS] = ((u64)m.ipInDiscards_hi << 32) + m.ipInDiscards_lo;
+	stats->value[IPINDELIVERS] = ((u64)m.ipInDelivers_hi << 32) + m.ipInDelivers_lo;
+	stats->value[IPOUTREQUESTS] = ((u64)m.ipOutRequests_hi << 32) + m.ipOutRequests_lo;
+	stats->value[IPOUTDISCARDS] = ((u64)m.ipOutDiscards_hi << 32) + m.ipOutDiscards_lo;
+	stats->value[IPOUTNOROUTES] = ((u64)m.ipOutNoRoutes_hi << 32) + m.ipOutNoRoutes_lo;
+	stats->value[IPREASMTIMEOUT] = 	m.ipReasmTimeout;
+	stats->value[IPREASMREQDS] = m.ipReasmReqds;
+	stats->value[IPREASMOKS] = m.ipReasmOKs;
+	stats->value[IPREASMFAILS] = m.ipReasmFails;
+	stats->value[TCPACTIVEOPENS] =	m.tcpActiveOpens;
+	stats->value[TCPPASSIVEOPENS] =	m.tcpPassiveOpens;
+	stats->value[TCPATTEMPTFAILS] = m.tcpAttemptFails;
+	stats->value[TCPESTABRESETS] = m.tcpEstabResets;
+	stats->value[TCPCURRESTAB] = m.tcpOutRsts;
+	stats->value[TCPINSEGS] = m.tcpCurrEstab;
+	stats->value[TCPOUTSEGS] = ((u64)m.tcpInSegs_hi << 32) + m.tcpInSegs_lo;
+	stats->value[TCPRETRANSSEGS] = ((u64)m.tcpOutSegs_hi << 32) + m.tcpOutSegs_lo;
+	stats->value[TCPINERRS] = ((u64)m.tcpRetransSeg_hi << 32) + m.tcpRetransSeg_lo,
+	stats->value[TCPOUTRSTS] = ((u64)m.tcpInErrs_hi << 32) + m.tcpInErrs_lo;
+	stats->value[TCPRTOMIN] = m.tcpRtoMin;
+	stats->value[TCPRTOMAX] = m.tcpRtoMax;

That's a lot of copies, and shifts, and everything else.  Then look at
what it does to get them:

 	ret = dev->rdev.t3cdev_p->ctl(dev->rdev.t3cdev_p, RDMA_GET_MIB, &m);

I didn't dig too deep, but that looks suspiciously like it might be an
actual mailbox command to the card.  That can be rather expensive.

Then look at how we get the stats to print them to user space:

+static ssize_t show_protocol_stats(struct ib_device *dev, int index,
+				   u8 port, char *buf)
+{
+	struct rdma_protocol_stats stats = {0};
+	ssize_t ret;
+
+	ret = dev->get_protocol_stats(dev, &stats, port);
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%llu\n", stats.value[index]);
+}

In a nutshell, we go through the effort of a suspected mailbox command,
then we fill in all of the stats including all of the copies and shifts
and everything else, then we print out precisely one and only one stat
before we throw the rest of them away.  If someone goes into the stats
directory for a card and does cat *, or for i in *; do echo -ne "$i:\t";
cat $i; done, then we will issue 25 mailbox commands and fill in all
25 stats values 25 times, just to print out one complete set of stats.
For cxgb4 this isn't so bad, it only has 4 items.  But the longer the
list gets, the worse this is, because reading the full set costs O(n^2)
work.  Since we can't break out mailbox commands to only
provide part of the data, I think we need to consider using a cached
struct for each device.  If the cached data is less than a certain age
on subsequent reads, we use the cached data.  If it's too old, we
discard it and get new data.
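
To make the caching idea concrete, here is a rough userspace sketch, not
a proposed implementation.  Everything in it is hypothetical: the struct
stats_cache layout, the CACHE_LIFESPAN_MS value, and fetch_all_stats(),
which merely stands in for the expensive driver call.  Time is passed in
as a plain millisecond count so the behaviour is easy to follow; a kernel
version would presumably compare against jiffies and would also need
locking around the refresh.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STATS 25            /* number of counters cxgb3 fills in above */
#define CACHE_LIFESPAN_MS 10    /* hypothetical maximum age of cached data */

/* Hypothetical per-device cache; none of these names come from the patch. */
struct stats_cache {
	uint64_t value[NUM_STATS];
	uint64_t timestamp_ms;  /* when value[] was last refreshed */
	int valid;              /* 0 until the first successful refresh */
	unsigned int fetches;   /* counts expensive fetches, for illustration */
};

/* Stand-in for the expensive driver call (e.g. a mailbox command). */
static int fetch_all_stats(struct stats_cache *c)
{
	size_t i;

	c->fetches++;
	for (i = 0; i < NUM_STATS; i++)
		c->value[i] = 1000 + i;   /* dummy counter values */
	return 0;
}

/* Return one counter; refresh the whole set only if it is missing or stale. */
static int read_stat(struct stats_cache *c, size_t index,
		     uint64_t now_ms, uint64_t *out)
{
	int ret;

	assert(index < NUM_STATS);

	if (!c->valid || now_ms - c->timestamp_ms > CACHE_LIFESPAN_MS) {
		ret = fetch_all_stats(c);
		if (ret)
			return ret;
		c->timestamp_ms = now_ms;
		c->valid = 1;
	}

	*out = c->value[index];
	return 0;
}

/* Simulate "cat *": read every counter once, return how many fetches it cost. */
static unsigned int fetches_for_full_read(struct stats_cache *c,
					  uint64_t now_ms)
{
	unsigned int before = c->fetches;
	uint64_t v;
	size_t i;

	for (i = 0; i < NUM_STATS; i++)
		read_stat(c, i, now_ms, &v);
	return c->fetches - before;
}
```

With something along these lines, reading all 25 files back to back
costs one fetch instead of 25, and a second pass within the lifespan
costs none.  The trade-off is that a reader may see counters that are up
to CACHE_LIFESPAN_MS old, so the right lifespan is a judgment call.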

-- 
Doug Ledford <dledford@xxxxxxxxxx>
              GPG KeyID: 0E572FDD
