Greetings: Welcome to v6. Minor changes from v5 [1], please see changelog below. There were no explicit comments from reviewers on the call outs in my v5, so I'm retaining them from my previous cover letter just in case :) A few important call outs for reviewers: 1. This revision seems to work (see below for a full walk through). I think this is the behavior we talked about, but please let me know if a use case is missing. 2. Re a previous point made by Stanislav regarding "taking over a NAPI ID" when the channel count changes: mlx5 seems to call napi_disable followed by netif_napi_del for the old queues and then calls napi_enable for the new ones. In this RFC, the NAPI ID generation is deferred to napi_enable. This means we won't end up with two of the same NAPI IDs added to the hash at the same time. Can we assume all drivers will napi_disable the old queues before napi_enable the new ones? - If yes: we might not need to worry about a NAPI ID takeover function. - If no: I'll need to make a change so that the NAPI ID generation is deferred only for drivers which have opted into the config space via calls to netif_napi_add_config 3. I made the decision to remove the WARN_ON_ONCE that (I think?) Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs != rxqs);) because this was triggering on every kernel boot with my mlx5 NIC. 4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite thinking this is a bit strange. I think it's strange that we might be short some number of NAPI configs, but it seems like most people are in favor of this approach, so I've left it. I'd appreciate thoughts from reviewers on the above items, if at all possible. Now, on to the implementation. Firstly, this implementation moves certain settings to napi_struct so that they are "per-NAPI", while taking care to respect existing sysfs parameters which are interface wide and affect all NAPIs: - NAPI ID - gro_flush_timeout - defer_hard_irqs Furthermore: - NAPI ID generation and addition to the hash is now deferred to napi_enable, instead of during netif_napi_add - NAPIs are removed from the hash during napi_disable, instead of netif_napi_del. - An array of "struct napi_config" is allocated in net_device. IMPORTANT: The above changes affect all network drivers. Optionally, drivers may opt-in to using their config space by calling netif_napi_add_config instead of netif_napi_add. If a driver does this, the NAPI being added is linked with an allocated "struct napi_config" and the per-NAPI settings (including NAPI ID) are persisted even as hardware queues are destroyed and recreated. To help illustrate how this would end up working, I've added patches for 3 drivers, of which I have access to only 1: - mlx5 which is the basis of the examples below - mlx4 which has TX only NAPIs, just to highlight that case. I have only compile tested this patch; I don't have this hardware. - bnxt which I have only compiled tested. I don't have this hardware. NOTE: I only tested this on mlx5; I have no access to the other hardware for which I provided patches. Hopefully other folks can help test :) Here's how it works when I test it on my mlx5 system: # start with 2 queues $ ethtool -l eth4 | grep Combined | tail -1 Combined: 2 First, output the current NAPI settings: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 0, 'gro-flush-timeout': 0, 'id': 344, 'ifindex': 7, 'irq': 327}] Now, set the global sysfs parameters: $ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout' $ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs' Output current NAPI settings again: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now set NAPI ID 345, via its NAPI ID to specific values: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set \ --json='{"id": 345, "defer-hard-irqs": 111, "gro-flush-timeout": 11111}' None Now output current NAPI settings again to ensure only NAPI ID 345 changed: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 11111, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now, increase gro-flush-timeout only: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set --json='{"id": 345, "gro-flush-timeout": 44444}' None Now output the current NAPI settings once more: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 44444, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now set NAPI ID 345 to have gro_flush_timeout of 0: $ sudo ./tools/net/ynl/cli.py \ --spec Documentation/netlink/specs/netdev.yaml \ --do napi-set --json='{"id": 345, "gro-flush-timeout": 0}' None Check that NAPI ID 345 has a value of 0: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Change the queue count, ensuring that NAPI ID 345 retains its settings: $ sudo ethtool -L eth4 combined 4 Check that the new queues have the system wide settings but that NAPI ID 345 remains unchanged: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 347, 'ifindex': 7, 'irq': 529}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Now reduce the queue count below where NAPI ID 345 is indexed: $ sudo ethtool -L eth4 combined 1 Check the output: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same values: $ sudo ethtool -L eth4 combined 2 $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [{'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Create new queues to ensure the sysfs globals are used for the new NAPIs but that NAPI ID 345 is unchanged: $ sudo ethtool -L eth4 combined 8 $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [...] {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 111, 'gro-flush-timeout': 0, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 100, 'gro-flush-timeout': 20000, 'id': 344, 'ifindex': 7, 'irq': 327}] Last, but not least, let's try writing the sysfs parameters to ensure all NAPIs are rewritten: $ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout' $ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs' Check that worked: $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \ --dump napi-get --json='{"ifindex": 7}' [...] {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 346, 'ifindex': 7, 'irq': 528}, {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 345, 'ifindex': 7, 'irq': 527}, {'defer-hard-irqs': 222, 'gro-flush-timeout': 33333, 'id': 344, 'ifindex': 7, 'irq': 327}] Thanks, Joe [1]: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@xxxxxxxxxx/ v6: - Added Jakub's Reviewed-by to all commit messages - Added Eric's Reviewed-by to the commits for which he provided his review (i.e. all patches except 4, 5, and 6) - Updated the netlink doc for gro_flush_timeout in patch 4 - Added WARN_ON_ONCE to napi_hash_add_with_id as suggested by Eric Dumazet to patch 5 v5: https://lore.kernel.org/netdev/20241009005525.13651-1-jdamato@xxxxxxxxxx/ - Converted from an RFC to a PATCH - Moved defer_hard_irqs above control-fields comment in napi_struct in patch 1 - Moved gro_flush_timeout above control-fields comment in napi_struct in patch 3 - Removed unnecessary idpf changes from patch 3 - Removed unnecessary kdoc in patch 5 for a parameter removed in an earlier revision - Removed unnecessary NULL check in patch 5 - Used tooling to autogenerate code for patch 6, which fixed the type and range of NETDEV_A_NAPI_DEFER_HARD_IRQ. rfcv4: https://lore.kernel.org/lkml/20241001235302.57609-1-jdamato@xxxxxxxxxx/ - Updated commit messages of most patches - Renamed netif_napi_add_storage to netif_napi_add_config in patch 5 - Added a NULL check in netdev_set_defer_hard_irqs and netdev_set_gro_flush_timeout for netdev->napi_config in patch 5 - Removed the WARN_ON_ONCE suggested in an earlier revision in alloc_netdev_mqs from patch 5; it triggers every time on my mlx5 machine at boot and needlessly spams the log - Added a locking adjustment suggested by Stanislav to patch 6 to protect napi_id in patch 5 - Removed napi_hash_del from netif_napi_del in patch 5. netif_napi_del calls __netif_napi_del which itself calls napi_hash_del. The original code thus resulted in two napi_hash_del calls, which is incorrect. - Removed the napi_hash_add from netif_napi_add_weight in patch 5. NAPIs are added to the hash when napi_enable is called, instead. - Moved the napi_restore_config to the top of napi_enable in patch 5. - Simplified the logic in __netif_napi_del and removed napi_hash_del. NAPIs are removed in napi_disable. - Fixed merge conflicts in patch 6 so it applies cleanly rfcv3: https://lore.kernel.org/netdev/20240912100738.16567-8-jdamato@xxxxxxxxxx/T/ - Renamed napi_storage to napi_config - Reordered patches - Added defer_hard_irqs and gro_flush_timeout to napi_struct - Attempt to save and restore settings on napi_disable/napi_enable - Removed weight as a parameter to netif_napi_add_storage - Updated driver patches to no longer pass in weight rfcv2: https://lore.kernel.org/netdev/20240908160702.56618-1-jdamato@xxxxxxxxxx/ - Almost total rewrite from v1 Joe Damato (9): net: napi: Make napi_defer_hard_irqs per-NAPI netdev-genl: Dump napi_defer_hard_irqs net: napi: Make gro_flush_timeout per-NAPI netdev-genl: Dump gro_flush_timeout net: napi: Add napi_config netdev-genl: Support setting per-NAPI config values bnxt: Add support for persistent NAPI config mlx5: Add support for persistent NAPI config mlx4: Add support for persistent NAPI config to RX CQs Documentation/netlink/specs/netdev.yaml | 28 ++++++ .../networking/net_cachelines/net_device.rst | 3 + drivers/net/ethernet/broadcom/bnxt/bnxt.c | 3 +- drivers/net/ethernet/mellanox/mlx4/en_cq.c | 3 +- .../net/ethernet/mellanox/mlx5/core/en_main.c | 2 +- include/linux/netdevice.h | 42 +++++++- include/uapi/linux/netdev.h | 3 + net/core/dev.c | 96 ++++++++++++++++--- net/core/dev.h | 88 +++++++++++++++++ net/core/net-sysfs.c | 4 +- net/core/netdev-genl-gen.c | 18 ++++ net/core/netdev-genl-gen.h | 1 + net/core/netdev-genl.c | 57 +++++++++++ tools/include/uapi/linux/netdev.h | 3 + 14 files changed, 326 insertions(+), 25 deletions(-) base-commit: d677aebd663ddc287f2b2bda098474694a0ca875 -- 2.25.1