On 3/25/2010 10:32 AM, Eric Dumazet wrote:
> On Wednesday, March 24, 2010 at 17:22 +0100, Eric Dumazet wrote:
>
>> Sure this helps a lot!
>>
>> You might try RPS by doing:
>>
>> echo f >/sys/class/net/eth3/queues/rx-0/rps_cpus
>>
>> (But you'll also need a new xt_hashlimit module to make it more
>> scalable; I can work on this this week if necessary)
>>
> Here is a patch I cooked for xt_hashlimit (on top of net-next-2.6) to make
> it use RCU and scale better in your case (allowing several concurrent
> CPUs once RPS is activated), but also in more general cases.
>
> [PATCH] xt_hashlimit: RCU conversion
>
> xt_hashlimit uses a central lock per hash table and suffers from
> contention on some workloads.
>
> After RCU conversion, the central lock is only used when a writer wants to
> add or delete an entry. 'Readers' that only update an existing entry take
> an individual lock on that entry instead.
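A minimal sketch of the locking pattern described above, for readers following the thread. This is purely illustrative code, not Eric's patch: the identifiers (flow_ent, flow_table, flow_find_lock, the 256-bucket table) are invented here. The point is that lookups walk a bucket under RCU and take only the matched entry's spinlock, so the table-wide lock is needed only to insert or delete entries.

	#include <asm/byteorder.h>
	#include <linux/list.h>
	#include <linux/rculist.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>
	#include <linux/spinlock.h>
	#include <linux/types.h>

	struct flow_ent {
		struct list_head node;
		__be32 dst;
		spinlock_t lock;	/* serializes updates to this entry only */
		unsigned long tokens;	/* token-bucket state */
		struct rcu_head rcu;	/* for deferred free on delete (not shown) */
	};

	struct flow_table {
		spinlock_t lock;		/* taken only to add/delete entries */
		struct list_head hash[256];
	};

	/*
	 * Read-mostly fast path, called with rcu_read_lock() held: walk the
	 * bucket under RCU and take only the entry's own lock, so CPUs
	 * updating different flows never contend on the table-wide lock.
	 */
	static struct flow_ent *flow_find_lock(struct flow_table *tab, __be32 dst)
	{
		struct flow_ent *ent;

		list_for_each_entry_rcu(ent, &tab->hash[ntohl(dst) & 255], node) {
			if (ent->dst == dst) {
				spin_lock(&ent->lock);
				return ent;	/* caller updates tokens, then spin_unlock()s */
			}
		}
		return NULL;
	}

	/* Slow path: the table-wide lock is needed only to insert a new flow. */
	static struct flow_ent *flow_add(struct flow_table *tab, __be32 dst)
	{
		struct flow_ent *ent = kzalloc(sizeof(*ent), GFP_ATOMIC);

		if (!ent)
			return NULL;
		ent->dst = dst;
		spin_lock_init(&ent->lock);
		spin_lock(&tab->lock);
		list_add_rcu(&ent->node, &tab->hash[ntohl(dst) & 255]);
		spin_unlock(&tab->lock);
		return ent;
	}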
Eric,

Awesome work, thanks for the effort! I've tried the patch and got some results. The drop rate was reduced dramatically after I activated RPS.

I did the same test I did before, namely I rebooted and immediately afterwards started flooding the machine with 300 kpps. After 5 minutes, perf top looked like this:

-------------------------------------------------------------------------------
   PerfTop:    1962 irqs/sec  kernel:99.3% [1000Hz cycles],  (all, 4 CPUs)
-------------------------------------------------------------------------------

  samples  pcnt  function                  DSO
  _______  ____  ________________________  _____________________________________________________________________

  4501.00  14.0  __ticket_spin_lock        /lib/modules/2.6.34-rc1-net-next/build/vmlinux
  2985.00   9.3  dsthash_find              /lib/modules/2.6.34-rc1-net-next/kernel/net/netfilter/xt_hashlimit.ko
  2346.00   7.3  __ticket_spin_unlock      /lib/modules/2.6.34-rc1-net-next/build/vmlinux
  1354.00   4.2  e1000_xmit_frame          /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko
  1070.00   3.3  __slab_free               /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   997.00   3.1  memcpy                    /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   809.00   2.5  dev_queue_xmit            /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   791.00   2.5  nf_iterate                /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   705.00   2.2  e1000_clean_tx_irq        /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000e/e1000e.ko
   634.00   2.0  nf_hook_slow              /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   624.00   1.9  skb_release_head_state    /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   584.00   1.8  e1000_intr                /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko
   536.00   1.7  br_nf_pre_routing_finish  /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko
   528.00   1.6  nommu_map_page            /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   499.00   1.6  kfree                     /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   494.00   1.5  __netif_receive_skb       /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   472.00   1.5  __alloc_skb               /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   448.00   1.4  br_fdb_update             /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko
   437.00   1.4  __slab_alloc              /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   428.00   1.3  ipt_do_table              [ip_tables]
   403.00   1.3  memset                    /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   402.00   1.3  br_handle_frame           /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko
   389.00   1.2  e1000_clean_rx_irq        /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko
   388.00   1.2  e1000_clean               /lib/modules/2.6.34-rc1-net-next/kernel/drivers/net/e1000/e1000.ko
   381.00   1.2  uhci_irq                  /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   366.00   1.1  get_rps_cpu               /lib/modules/2.6.34-rc1-net-next/build/vmlinux
   365.00   1.1  br_nf_pre_routing         /lib/modules/2.6.34-rc1-net-next/kernel/net/bridge/bridge.ko
   349.00   1.1  dst_release               /lib/modules/2.6.34-rc1-net-next/build/vmlinux

And iptables-save -c produced this:

# Generated by iptables-save v1.4.4 on Fri Mar 26 11:24:59 2010
*filter
:INPUT ACCEPT [1043:60514]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [942:282723]
[99563191:3783420610] -A FORWARD -m hashlimit --hashlimit-upto 10000/sec --hashlimit-burst 100 --hashlimit-mode dstip --hashlimit-name hashtable --hashlimit-htable-max 131072 --hashlimit-htable-expire 1000 -j ACCEPT
[0:0] -A FORWARD -m limit --limit 5/sec -j LOG --log-prefix "HASHLIMITED -- "
[0:0] -A FORWARD -j DROP
COMMIT
# Completed on Fri Mar 26 11:24:59 2010

And /proc/interrupts looked like this:

            CPU0       CPU1       CPU2       CPU3
   0:         47          0          1          0   IO-APIC-edge      timer
   1:          0          1          0          1   IO-APIC-edge      i8042
   6:          1          1          0          0   IO-APIC-edge      floppy
   8:          1          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          0          1          1          2   IO-APIC-edge      i8042
  14:         21         22         22         21   IO-APIC-edge      ata_piix
  15:          0          0          0          0   IO-APIC-edge      ata_piix
  16:        492        464        463        474   IO-APIC-fasteoi   arcmsr
  17:          0          0          0          0   IO-APIC-fasteoi   ehci_hcd:usb1
  18:     971171     971391     948171     948663   IO-APIC-fasteoi   uhci_hcd:usb3, uhci_hcd:usb7, eth3
  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
  21:          0          0          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb4
  23:          1          0          1          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb5
  27:    1003145    1002952    1026174    1025671   PCI-MSI-edge      eth4
 NMI:     202553     185135     134999     185071   Non-maskable interrupts
 LOC:      20270      19227      17387      23282   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:     202553     185135     134999     185071   Performance monitoring interrupts
 PND:     201464     183939     134067     184098   Performance pending work
 RES:       2216       2449       1212       1432   Rescheduling interrupts
 CAL:    2223380    2226493    2233481    2228957   Function call interrupts
 TLB:        606        584       1274       1216   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          2          2          2          2   Machine check polls
 ERR:          3
 MIS:          0

ifconfig reported only 2 drops after these 5 minutes.

I'm thinking about removing or changing the hashing algorithm to make dsthash_find faster; all I really need is a match against a destination IP address (see the sketch appended at the end of this mail). Also, I'd like the 10 kpps limit to be a bit higher. I'll see if I can work on that during the weekend.

Thanks again for everything!

Regards,

Jorrit Kronjee
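Appended: a minimal sketch of the destination-only lookup mentioned above. The identifiers (dst_ent, dst_find, dst_hash_bucket) and the multiplicative hash are invented for illustration and are not taken from xt_hashlimit; in the kernel, jhash_1word() would be the conventional choice for hashing a single 32-bit key. The point is that with only a __be32 destination as the key, bucket selection is one multiply and the match is a single compare rather than a memcmp() over a larger key structure.

	#include <linux/list.h>
	#include <linux/types.h>

	#define DST_HASH_BITS	10
	#define DST_HASH_SIZE	(1U << DST_HASH_BITS)

	struct dst_ent {
		struct list_head node;
		__be32 addr;		/* the only key we match on */
		unsigned long tokens;	/* rate-limit state */
	};

	static struct list_head dst_hash[DST_HASH_SIZE];

	/* Knuth-style multiplicative hash: cheap and adequate for IPv4 keys. */
	static inline u32 dst_hash_bucket(__be32 addr)
	{
		return ((__force u32)addr * 0x9e370001UL) >> (32 - DST_HASH_BITS);
	}

	/* Walk one bucket and compare a single 32-bit key per entry. */
	static struct dst_ent *dst_find(__be32 addr)
	{
		struct dst_ent *ent;

		list_for_each_entry(ent, &dst_hash[dst_hash_bucket(addr)], node)
			if (ent->addr == addr)
				return ent;
		return NULL;
	}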