I'm playing with UDP a bit and have a few questions about how my program could be made faster. Obviously I'd prefer userspace changes, but I'm running low on possibilities there: my program spends 70% of its time in the kernel.

Background: SMP x86 with two processors, running a 2.4.4 kernel (yes, old, but the UDP code and its performance aren't particularly changed in 2.4.10). My test app consists of three processes on one machine: one server and two clients. The clients send ~100-byte packets to the server using sendmsg(), and the server picks them up with recvmsg() and responds immediately with same-sized packets. Each client keeps 100 requests outstanding by sending an initial flurry and then a new request each time an answer shows up.

The userspace profiles of these programs (with mcount etc. computed out) all show sendmsg() at 45% and recvmsg() at 20%. They occasionally poll; poll() is at 2%. The sockets are not connected (there are ~500 peers, and polling on a mess of sockets would be poor). The sockets have checksums turned off (local switched ethernet and loopback only).

In actual numbers, a dual P3-850 does 76000 packets per second in this scenario, i.e. 76000 each of sendmsg() and recvmsg() calls. I'm told that this is atrociously slow, but I've been unable to find comparable numbers for other systems, so I'm just pondering ways to make it faster.

Below is the top part of the kernel profile for a loopback run, converted from ticks to percents. The big mystery to me is why udp_recvmsg() is so busy. The only functions it calls are skb_free_datagram() (a real function elsewhere in the profile), skb_copy_datagram_iovec() (real), skb_recv_datagram() (real), and sock_recv_timestamp() (an inline "if (blah) { assignment }"). There's a little memset() of the zeroed part of the addr, and a few ifs and assignments. Am I missing something big in udp_recvmsg()? I'm calling it nonblocking, and there's always something to read; none of the exception cases should be happening.
It's called the same number of times as everything else here; there aren't a million EAGAINs happening or anything.

I've demonstrated a noticeable speedup using a multiframe syscall in another protocol, but I don't really want to go to all that trouble. Even if I do all the fiddles mentioned below, the net gain would be maybe 5% wall time if I'm lucky. Gratuitous rearranging and inlining might do me another 5%, but even given this unpleasantly flat profile I don't really want to go there. What can I do?

 6.15%  udp_recvmsg               # !?
 5.38%  ip_build_xmit             # inlined skb fiddling, iph csum, etc.
 4.57%  __generic_copy_to_user    # life
 4.45%  udp_rcv                   # no "most recent socket" cache
 4.32%  udp_queue_rcv_skb         # spin_lock_irqsave/restore + trivial list op
 4.31%  ip_rcv                    # iph checksum check on loopback
 3.97%  ip_route_output_key       # could cache route for unconnected sockets
 3.43%  sock_alloc_send_skb
 3.38%  dev_queue_xmit
 3.35%  udp_sendmsg
 3.13%  ip_output
 2.87%  net_rx_action
 2.76%  __generic_copy_from_user  # life
 2.62%  skb_release_data
 2.61%  __kfree_skb
 2.53%  do_gettimeofday
 2.45%  sock_def_write_space      # locks, waits, etc.
 1.94%  skb_recv_datagram
 1.91%  sock_def_readable         # locks, waits, etc.
 1.76%  system_call
 1.70%  kmalloc                   # no stack of free skb elements?
 1.67%  fget                      # is the last-used fd cached?
 1.59%  skb_copy_datagram_iovec
 1.42%  loopback_xmit
 1.30%  kfree
 1.28%  netif_rx
 1.25%  udp_v4_lookup_longway
 1.18%  sys_recvmsg
 1.05%  verify_iovec
 0.99%  alloc_skb                 # not inline?

--
Grant Taylor - http://www.picante.com/~gtaylor/