[ Line-by-line profile results below! ]

>>>>> kuznet@ms2.inr.ac.ru writes:
>> Am I missing something big in udp_recvmsg?

> What's about data copy? :-)

No, that's not in udp_recvmsg; that's in copy_*_user and the various
skb funcs.  I think all of the copies show up in the profiles I
posted; I truncated at 1%, since the <1% entries are just noise.

>> even given this unpleasantly flat profile

> Well, I would say it is pleasantly :-) flat and shows that it is
> pretty useless to search for bottlenecks.

A flat profile is not pleasant - it means that whatever you're doing
isn't handled simply and in one place, but rather by scattered code
that quite probably does not fit together to provide an efficient
path.  Furthermore, function calls alone are slightly expensive, even
without the design inefficiencies suggested by the flatness.

In the case of the Linux kernel code I've examined for my problem,
most of the itty-bitty functions that should be inlined are.  Several
things which might ordinarily be inlined are not, because they are
referenced as function pointers in a registration interface; there's
not much to do about that short of eliminating the interface, and
things like sockets are pretty much here to stay ;)  I'm not a stack
implementation guru, so I can't usefully comment on the overall
design of the UDP execution paths in Linux.

>> 4.32% udp_queue_rcv_skb  # spin_lock_irqsave/restore + trivial list op

> Inter cpu pingpong on queue? Try to do the same on UP, this function
> should disappear from top-ten.

Indeed, a few scenarios are much faster on a slower single processor
than on my 2-processor box.  However, since my actual target device
is a one-chip SMP thing, going UP isn't too practical.  And more
realistic scenarios (like my four-process test case) do show a
significant advantage for SMP.

Anywho, here are my detailed profile results for our mystery
function.  I mangled readprofile to print all the samples using the
same offsets as an objdump of this function, and thus found the
interesting spots (the offset arithmetic is sketched below).

In my test, 1404 samples landed in udp_recvmsg.  The test ran for
about two minutes and made 12 million passes through this code.  The
numbers considered below account for 1211 of the samples.  Most bins
were 0; there was also a smattering of under-20 values, and then
several over-20 buckets, which I have highlighted below with raw
sample counts against the left margin.

Around 650 samples are in the inline sin_zero memset, which seems odd
since the pad size declaration works out to 0 in in.h.  Is there
another pad size definition elsewhere or something?  Did I read the
pad sizing math wrong?
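For reference, here is the sockaddr_in declaration as I believe it
appears in 2.4-era include/linux/in.h - quoted from memory, so treat
the details as an assumption rather than a citation:

	#define __SOCK_SIZE__	16	/* sizeof(struct sockaddr) */

	struct sockaddr_in {
		sa_family_t	sin_family;	/* 2 bytes */
		unsigned short	sin_port;	/* 2 bytes */
		struct in_addr	sin_addr;	/* 4 bytes */

		/* Pad to the size of struct sockaddr:
		 * 16 - 2 - 2 - 4 = 8 bytes, not 0. */
		unsigned char	__pad[__SOCK_SIZE__ - sizeof(short int)
			- sizeof(unsigned short int)
			- sizeof(struct in_addr)];
	};
	#define sin_zero	__pad	/* for BSD UNIX comp. */

If that is the right declaration, the pad works out to 8 bytes, which
matches what the listing below actually does: add $0x8,%edi to point
past sin_addr, then mov $0x2,%ecx and rep stosl to clear two dwords.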
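As for the readprofile mangling itself, the interesting part is just
address arithmetic.  A minimal sketch of the idea (a hypothetical
helper, not the actual patch; it assumes the profile buffer has
already been decoded into (kernel address, hit count) pairs and that
udp_recvmsg's load address came from System.map):

	#include <stdio.h>

	struct hit {
		unsigned long addr;	/* kernel address of the bucket */
		unsigned long count;	/* samples in that bucket       */
	};

	/* Rebase raw profile hits so they print at the same offsets
	 * that objdump -d uses for the function's instructions. */
	static void print_objdump_offsets(const struct hit *h, int n,
			unsigned long sym_addr,	/* from System.map */
			unsigned long obj_base)	/* 0xb70 below     */
	{
		int i;

		for (i = 0; i < n; i++)
			if (h[i].count)
				printf("%5lu  %lx:\n", h[i].count,
				       h[i].addr - sym_addr + obj_base);
	}

	int main(void)
	{
		/* Made-up addresses, purely for illustration. */
		struct hit h[] = { { 0xc0234bd5, 375 },
				   { 0xc0234cb9, 335 } };

		print_objdump_offsets(h, 2, 0xc0234b70, 0xb70);
		return 0;
	}

The left-margin counts in the listing that follows are exactly this
kind of rebased number.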
00000b70 <udp_recvmsg>:
udp_recvmsg():
     b70:  83 ec 0c              sub    $0xc,%esp
     b73:  55                    push   %ebp
     b74:  57                    push   %edi
     b75:  56                    push   %esi
     b76:  53                    push   %ebx

static __inline__ int __udp_checksum_complete(struct sk_buff *skb)
{
	return (unsigned short)csum_fold(skb_checksum(skb, 0, skb->len,
						      skb->csum));
}

static __inline__ int udp_checksum_complete(struct sk_buff *skb)
{
	return skb->ip_summed != CHECKSUM_UNNECESSARY &&
		__udp_checksum_complete(skb);
}

/*
 * 	This should be easy, if there is something there we
 * 	return it, otherwise we block.
 */

int udp_recvmsg(struct sock *sk, struct msghdr *msg, int len,
		int noblock, int flags, int *addr_len)
{
     b77:  8b 6c 24 20           mov    0x20(%esp,1),%ebp
     b7b:  8b 74 24 24           mov    0x24(%esp,1),%esi
     b7f:  8b 7c 24 28           mov    0x28(%esp,1),%edi
     b83:  8b 44 24 34           mov    0x34(%esp,1),%eax
	struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
     b87:  8b 16                 mov    (%esi),%edx
     b89:  89 54 24 14           mov    %edx,0x14(%esp,1)
	struct sk_buff *skb;
	int copied, err;

	/*
	 *	Check any passed addresses
	 */
	if (addr_len)
     b8d:  85 c0                 test   %eax,%eax
     b8f:  74 06                 je     b97 <udp_recvmsg+0x27>
		*addr_len=sizeof(*sin);
     b91:  c7 00 10 00 00 00     movl   $0x10,(%eax)

	if (flags & MSG_ERRQUEUE)
     b97:  8b 44 24 30           mov    0x30(%esp,1),%eax
     b9b:  f6 c4 20              test   $0x20,%ah
     b9e:  74 10                 je     bb0 <udp_recvmsg+0x40>
		return ip_recv_error(sk, msg, len);
     ba0:  57                    push   %edi
     ba1:  56                    push   %esi
     ba2:  55                    push   %ebp
     ba3:  e8 fc ff ff ff        call   ba4 <udp_recvmsg+0x34>
     ba8:  83 c4 0c              add    $0xc,%esp
     bab:  e9 e8 01 00 00        jmp    d98 <udp_recvmsg+0x228>

	skb = skb_recv_datagram(sk, flags, noblock, &err);
     bb0:  8d 44 24 18           lea    0x18(%esp,1),%eax
     bb4:  50                    push   %eax
     bb5:  8b 44 24 30           mov    0x30(%esp,1),%eax
     bb9:  50                    push   %eax
     bba:  8b 54 24 38           mov    0x38(%esp,1),%edx
     bbe:  52                    push   %edx
     bbf:  55                    push   %ebp
     bc0:  e8 fc ff ff ff        call   bc1 <udp_recvmsg+0x51>
     bc5:  89 c3                 mov    %eax,%ebx
	if (!skb)
     bc7:  83 c4 10              add    $0x10,%esp
     bca:  85 db                 test   %ebx,%ebx
     bcc:  0f 84 1e 01 00 00     je     cf0 <udp_recvmsg+0x180>
		goto out;

	copied = skb->len - sizeof(struct udphdr);
     bd2:  8b 43 5c              mov    0x5c(%ebx),%eax
375  bd5:  83 c0 f8              add    $0xfffffff8,%eax
     bd8:  89 44 24 10           mov    %eax,0x10(%esp,1)
	if (copied > len) {
     bdc:  39 f8                 cmp    %edi,%eax
     bde:  7e 08                 jle    be8 <udp_recvmsg+0x78>
		copied = len;
     be0:  89 7c 24 10           mov    %edi,0x10(%esp,1)
		msg->msg_flags |= MSG_TRUNC;
     be4:  80 4e 18 20           orb    $0x20,0x18(%esi)
	}

	if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
     be8:  80 7b 6b 02           cmpb   $0x2,0x6b(%ebx)
     bec:  75 07                 jne    bf5 <udp_recvmsg+0x85>
		err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
167  bee:  8b 54 24 10           mov    0x10(%esp,1),%edx
     bf2:  52                    push   %edx
					      copied);
	} else if (msg->msg_flags&MSG_TRUNC) {
     bf3:  eb 3d                 jmp    c32 <udp_recvmsg+0xc2>
     bf5:  f6 46 18 20           testb  $0x20,0x18(%esi)
     bf9:  74 4c                 je     c47 <udp_recvmsg+0xd7>
     bfb:  8b 43 64              mov    0x64(%ebx),%eax
     bfe:  50                    push   %eax
     bff:  8b 43 5c              mov    0x5c(%ebx),%eax
     c02:  50                    push   %eax
     c03:  6a 00                 push   $0x0
     c05:  53                    push   %ebx
     c06:  e8 fc ff ff ff        call   c07 <udp_recvmsg+0x97>
     c0b:  89 c2                 mov    %eax,%edx
 */
static inline unsigned int csum_fold(unsigned int sum)
{
	__asm__("
     c0d:  25 00 00 ff ff        and    $0xffff0000,%eax
     c12:  c1 e2 10              shl    $0x10,%edx
     c15:  01 d0                 add    %edx,%eax
     c17:  15 ff ff 00 00        adc    $0xffff,%eax
     c1c:  83 c4 10              add    $0x10,%esp
     c1f:  c1 e8 10              shr    $0x10,%eax
		if (__udp_checksum_complete(skb))
     c22:  3d ff ff 00 00        cmp    $0xffff,%eax
     c27:  0f 85 d3 00 00 00     jne    d00 <udp_recvmsg+0x190>
			goto csum_copy_err;
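	/*
	 * [Aside, added for readability; my paraphrase, not kernel
	 * source.]  The csum_fold asm above folds the 32-bit partial
	 * checksum down to 16 bits with end-around carry; a packet
	 * whose ones'-complement sum folds to 0xffff is good, hence
	 * the cmp $0xffff / jne to csum_copy_err above.  In plain C,
	 * roughly:
	 *
	 *	unsigned short fold16(unsigned int sum)
	 *	{
	 *		sum = (sum >> 16) + (sum & 0xffff);
	 *		sum += sum >> 16;
	 *		return (unsigned short)sum;	(0xffff == good)
	 *	}
	 */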
		err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
     c2d:  8b 44 24 10           mov    0x10(%esp,1),%eax
     c31:  50                    push   %eax
     c32:  8b 46 08              mov    0x8(%esi),%eax
     c35:  50                    push   %eax
     c36:  6a 08                 push   $0x8
     c38:  53                    push   %ebx
     c39:  e8 fc ff ff ff        call   c3a <udp_recvmsg+0xca>
     c3e:  89 44 24 28           mov    %eax,0x28(%esp,1)
					      copied);
	} else {
     c42:  83 c4 10              add    $0x10,%esp
     c45:  eb 1c                 jmp    c63 <udp_recvmsg+0xf3>
		err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr),
						       msg->msg_iov);
     c47:  8b 46 08              mov    0x8(%esi),%eax
     c4a:  50                    push   %eax
     c4b:  6a 08                 push   $0x8
     c4d:  53                    push   %ebx
     c4e:  e8 fc ff ff ff        call   c4f <udp_recvmsg+0xdf>
     c53:  89 44 24 24           mov    %eax,0x24(%esp,1)

		if (err == -EINVAL)
     c57:  83 c4 0c              add    $0xc,%esp
     c5a:  83 f8 ea              cmp    $0xffffffea,%eax
     c5d:  0f 84 9d 00 00 00     je     d00 <udp_recvmsg+0x190>
			goto csum_copy_err;
	}

	if (err)
     c63:  83 7c 24 18 00        cmpl   $0x0,0x18(%esp,1)
     c68:  75 7c                 jne    ce6 <udp_recvmsg+0x176>

static __inline__ void sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
	if (sk->rcvtstamp)
     c6a:  80 bd a2 00 00 00 00  cmpb   $0x0,0xa2(%ebp)
     c71:  74 15                 je     c88 <udp_recvmsg+0x118>
		put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP, sizeof(skb->stamp), &skb->stamp);
     c73:  8d 43 10              lea    0x10(%ebx),%eax
     c76:  50                    push   %eax
     c77:  6a 08                 push   $0x8
     c79:  6a 1d                 push   $0x1d
     c7b:  6a 01                 push   $0x1
     c7d:  56                    push   %esi
     c7e:  e8 fc ff ff ff        call   c7f <udp_recvmsg+0x10f>
     c83:  83 c4 14              add    $0x14,%esp
     c86:  eb 12                 jmp    c9a <udp_recvmsg+0x12a>
	else
		sk->stamp = skb->stamp;
     c88:  8b 43 10              mov    0x10(%ebx),%eax
     c8b:  8b 53 14              mov    0x14(%ebx),%edx
     c8e:  89 85 fc 02 00 00     mov    %eax,0x2fc(%ebp)
     c94:  89 95 00 03 00 00     mov    %edx,0x300(%ebp)
		goto out_free;

	sock_recv_timestamp(msg, sk, skb);

	/* Copy the address. */
	if (sin)
     c9a:  83 7c 24 14 00        cmpl   $0x0,0x14(%esp,1)
     c9f:  74 2a                 je     ccb <udp_recvmsg+0x15b>
	{
		sin->sin_family = AF_INET;
     ca1:  8b 54 24 14           mov    0x14(%esp,1),%edx
     ca5:  66 c7 02 02 00        movw   $0x2,(%edx)
		sin->sin_port = skb->h.uh->source;
     caa:  8b 43 1c              mov    0x1c(%ebx),%eax

/*
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
21   cad:  89 d7                 mov    %edx,%edi
     caf:  0f b7 00              movzwl (%eax),%eax
     cb2:  66 89 42 02           mov    %ax,0x2(%edx)
		sin->sin_addr.s_addr = skb->nh.iph->saddr;
     cb6:  8b 43 20              mov    0x20(%ebx),%eax

/*
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
335  cb9:  83 c7 08              add    $0x8,%edi
	switch (count) {
		case 0:
			return s;
		case 1:
			*(unsigned char *)s = pattern;
			return s;
		case 2:
			*(unsigned short *)s = pattern;
			return s;
		case 3:
			*(unsigned short *)s = pattern;
			*(2+(unsigned char *)s) = pattern;
			return s;
		case 4:
			*(unsigned long *)s = pattern;
			return s;
	}
#define COMMON(x) \
__asm__ __volatile__( \
	"rep ; stosl" \
	x \
	: "=&c" (d0), "=&D" (d1) \
	: "a" (pattern),"0" (count/4),"1" ((long) s) \
	: "memory")
{
	int d0, d1;
	switch (count % 4) {
		case 0: COMMON(""); return s;
     cbc:  b9 02 00 00 00        mov    $0x2,%ecx
     cc1:  8b 40 0c              mov    0xc(%eax),%eax
313  cc4:  89 42 04              mov    %eax,0x4(%edx)

/*
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
     cc7:  31 c0                 xor    %eax,%eax
	switch (count) {
		case 0:
			return s;
		case 1:
			*(unsigned char *)s = pattern;
			return s;
		case 2:
			*(unsigned short *)s = pattern;
			return s;
		case 3:
			*(unsigned short *)s = pattern;
			*(2+(unsigned char *)s) = pattern;
			return s;
		case 4:
			*(unsigned long *)s = pattern;
			return s;
	}
#define COMMON(x) \
__asm__ __volatile__( \
	"rep ; stosl" \
	x \
	: "=&c" (d0), "=&D" (d1) \
	: "a" (pattern),"0" (count/4),"1" ((long) s) \
	: "memory")
{
	int d0, d1;
	switch (count % 4) {
		case 0: COMMON(""); return s;
     cc9:  f3 ab                 repz stos %eax,%es:(%edi)
		memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
	}
	if (sk->protinfo.af_inet.cmsg_flags)
     ccb:  83 bd b0 02 00 00 00  cmpl   $0x0,0x2b0(%ebp)
     cd2:  74 0a                 je     cde <udp_recvmsg+0x16e>
		ip_cmsg_recv(msg, skb);
     cd4:  53                    push   %ebx
     cd5:  56                    push   %esi
     cd6:  e8 fc ff ff ff        call   cd7 <udp_recvmsg+0x167>
     cdb:  83 c4 08              add    $0x8,%esp

	err = copied;
     cde:  8b 44 24 10           mov    0x10(%esp,1),%eax
     ce2:  89 44 24 18           mov    %eax,0x18(%esp,1)

out_free:
	skb_free_datagram(sk, skb);

	.
	.
	.

-- 
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185