On 10/13/2017 02:29 PM, Lukas Wunner wrote:
On Fri, Oct 13, 2017 at 09:53:06AM -0700, David Daney wrote:
On 10/13/2017 04:11 AM, Lukas Wunner wrote:
The drivers
gpio-104-dio-48e.c, gpio-74x164.c, gpio-gpio-mm.c,
gpio-pca953x.c, gpio-thunderx.c, gpio-ws16c48.c,
gpio-xilinx.c
currently use an inefficient algorithm for their ->set_multiple
callback: They iterate over every chip (or bank or gpio pin),
check if any bits are set in the mask for this particular chip,
and if so, update the affected GPIOs. If the mask is sparsely
populated, you'll waste CPU time checking chips even though
they're not affected by the operation at all.
Iterating over the chips is backwards, it is more efficient to
iterate over the bits set in the mask, identify the corresponding
chip and update its affected GPIOs. The gpio-max3191x.c driver
I posted yesterday contains an example for such an algorithm and
you may want to improve your ->set_mutiple implementation accordingly:
Do you have any profiling results that demonstrate significant system-wide
performance improvements by making such a change?
Sorry, no.
For the gpio-thunderx.c driver, the words in the bits/mask exactly match the
banks, so the number of iterations doesn't change with the approach you
suggest.
No, I was specifically referring to a sparsely populated mask above.
Let's say the mask consists of several 64-bit words and only the last
word has set bits. With the existing algorithm you have as many
iterations as you have banks:
for (bank = 0; bank <= chip->ngpio / 64; bank++) {
set_bits = bits[bank] & mask[bank];
clear_bits = ~bits[bank] & mask[bank];
writeq(set_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_SET);
writeq(clear_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_CLR);
}
Whereas with the algorithm I'm proposing you only have a single iteration:
int bit = 0;
while ((bit = find_next_bit(mask, chip->ngpio, bit)) != chip->ngpio) {
unsigned int bank = bit / 64;
set_bits = bits[bank] & mask[bank];
clear_bits = ~bits[bank] & mask[bank];
writeq(set_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_SET);
writeq(clear_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_CLR);
bit = (bank + 1) * 64; /* continue search from next bank */
}
find_next_bit() uses the clz instruction on ARM, similar instructions
exist on most arches, so the search is very fast.
There are a maximum of two banks.
For some thunderx SoCs there is only a single bank, so my code has
exactly one iteration. For the SoCs with two banks, my code has exactly
two iterations. There are no function calls in the loop.
Your suggested code does a function call to the find_next_bit() function
for each iteration. I assert that your method will be slower.
Unless you can show an instruction trace where your code results in
fewer instructions being executed, I would NAK the patch.
I can't believe I have spent so much time on this issue, for which no
matter what you do, will not impact system performance in any measurable
way.
David.
--
To unsubscribe from this list: send the line "unsubscribe linux-gpio" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html