On Fri, Oct 13, 2017 at 09:53:06AM -0700, David Daney wrote: > On 10/13/2017 04:11 AM, Lukas Wunner wrote: > > The drivers > > > > gpio-104-dio-48e.c, gpio-74x164.c, gpio-gpio-mm.c, > > gpio-pca953x.c, gpio-thunderx.c, gpio-ws16c48.c, > > gpio-xilinx.c > > > > currently use an inefficient algorithm for their ->set_multiple > > callback: They iterate over every chip (or bank or gpio pin), > > check if any bits are set in the mask for this particular chip, > > and if so, update the affected GPIOs. If the mask is sparsely > > populated, you'll waste CPU time checking chips even though > > they're not affected by the operation at all. > > > > Iterating over the chips is backwards, it is more efficient to > > iterate over the bits set in the mask, identify the corresponding > > chip and update its affected GPIOs. The gpio-max3191x.c driver > > I posted yesterday contains an example for such an algorithm and > > you may want to improve your ->set_mutiple implementation accordingly: > > Do you have any profiling results that demonstrate significant system-wide > performance improvements by making such a change? Sorry, no. > For the gpio-thunderx.c driver, the words in the bits/mask exactly match the > banks, so the number of iterations doesn't change with the approach you > suggest. No, I was specifically referring to a sparsely populated mask above. Let's say the mask consists of several 64-bit words and only the last word has set bits. With the existing algorithm you have as many iterations as you have banks: for (bank = 0; bank <= chip->ngpio / 64; bank++) { set_bits = bits[bank] & mask[bank]; clear_bits = ~bits[bank] & mask[bank]; writeq(set_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_SET); writeq(clear_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_CLR); } Whereas with the algorithm I'm proposing you only have a single iteration: int bit = 0; while ((bit = find_next_bit(mask, chip->ngpio, bit)) != chip->ngpio) { unsigned int bank = bit / 64; set_bits = bits[bank] & mask[bank]; clear_bits = ~bits[bank] & mask[bank]; writeq(set_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_SET); writeq(clear_bits, txgpio->register_base + (bank * GPIO_2ND_BANK) + GPIO_TX_CLR); bit = (bank + 1) * 64; /* continue search from next bank */ } find_next_bit() uses the clz instruction on ARM, similar instructions exist on most arches, so the search is very fast. Or did you mean that find_next_bit() iterates over the words to find the first one with set bits? BTW, why isn't there a check in your code whether mask[bank] != 0, n which case the mmio writes can be skipped? > In fact, an argument could be made in the other direction. The increased > ICache pressure from code bloat and branch prediction misses resulting from > extra testing of the mask bits could easily make the system slower using > your suggestion. I don't quite follow, where should the branch prediction misses come from, the number of branch instructions is the same with both algorithms, and fewer with the proposed algorithm on sparsely populated masks. Also, 2 additional LoC is hardly code bloat. Are you referring to the call to find_next_bit() specifically? Thanks, Lukas -- To unsubscribe from this list: send the line "unsubscribe linux-gpio" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html