Re: UDP performance questions

[ Line-by-line profile results below! ]

>>>>> kuznet@ms2.inr.ac.ru writes:

>> Am I missing something big in udp_recvmsg?
> What's about data copy? :-)

No, that's not in udp_recvmsg; that's in copy_*_user and the various
skb funcs.  I think all of the copies show up in the profiles I posted;
I truncated the list at 1%, since the sub-1% entries are just noise.

>> even given this unpleasantly flat profile
> Well, I would say it is pleasantly :-) flat and shows that it is
> pretty useless to search for bottlenecks.

A flat profile is not pleasant - it means that whatever you're doing
isn't handled simply and in one place, but rather by scattered code
that quite probably does not fit together to provide an efficient
path.  Furthermore, the function calls alone are slightly expensive,
even before you count the design inefficiencies the flatness suggests.

In the case of the Linux kernel code I've examined for my problem,
most of the itty-bitty functions that should be inlined are.  Several
things which might ordinarily be inlined are not because they are
referenced as function pointers in a registration interface; there's
not much for that but to eliminate the interface, and things like
sockets are pretty much here to stay ;)
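
To make the point concrete, here's a tiny userspace stand-in (all names
invented for the example, not the real kernel structures) for the kind
of registration interface I mean; once the handler is only reachable
through a function pointer in a table, the compiler has to emit a real
indirect call, and the handler stays its own frame and its own line in
the profile:

/* gcc -O2 -o regdemo regdemo.c -- schematic only, not kernel code. */
#include <stdio.h>

struct handler_ops {
        int (*recvmsg)(int len);        /* handler registered by address */
};

static int udp_recvmsg_example(int len)
{
        return len > 8 ? len - 8 : 0;   /* strip a pretend 8-byte header */
}

static int raw_recvmsg_example(int len)
{
        return len;                     /* nothing to strip */
}

/* The registration table, keyed by "protocol" number. */
static struct handler_ops proto_table[] = {
        { udp_recvmsg_example },
        { raw_recvmsg_example },
};

static int sock_recvmsg_example(int proto, int len)
{
        /* The handler is only known through the table, so this is an
         * indirect call; its body can't be inlined into this caller the
         * way an ordinary static function could be. */
        return proto_table[proto].recvmsg(len);
}

int main(int argc, char **argv)
{
        (void)argv;
        printf("%d\n", sock_recvmsg_example(argc > 1, 100));
        return 0;
}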

I'm not a stack implementation guru, so I can't usefully comment on
the overall design of the UDP execution paths in Linux.

>>   4.32% udp_queue_rcv_skb        # spin_lock_irqsave/restore + trivial list op
> 
> Inter cpu pingpong on queue? Try to do the same on UP, this function
> should disappear from top-ten.

Indeed, a few scenarios are much faster on a slower single processor
than on my 2-processor box.  However, since my actual target device is
a one-chip SMP thing, this isn't too practical.  And more realistic
scenarios (like my four-process test case) do show a significant
advantage for SMP.
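
If anyone wants to watch the ping-pong effect without a kernel profile,
here's a userspace cartoon of it (invented names; a pthread mutex
standing in for the receive-queue spinlock): one thread plays the
softirq filling the queue, the other plays recvmsg draining it, and
both hammer the same lock word.  Time it pinned to one CPU and then
spread across two.

/* gcc -O2 -pthread -o pingpong pingpong.c -- cartoon only, not kernel code. */
#include <pthread.h>
#include <stdio.h>

#define N_PACKETS 5000000L

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static long queue_depth;                /* stands in for sk->receive_queue */

static void *producer(void *arg)        /* the "softirq" side */
{
        long i;
        for (i = 0; i < N_PACKETS; i++) {
                pthread_mutex_lock(&queue_lock);
                queue_depth++;
                pthread_mutex_unlock(&queue_lock);
        }
        return arg;
}

static void *consumer(void *arg)        /* the "recvmsg" side */
{
        long got = 0;
        while (got < N_PACKETS) {
                pthread_mutex_lock(&queue_lock);
                if (queue_depth > 0) {
                        queue_depth--;
                        got++;
                }
                pthread_mutex_unlock(&queue_lock);
        }
        return arg;
}

int main(void)
{
        pthread_t p, c;

        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("final queue depth: %ld\n", queue_depth);
        return 0;
}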


Anywho, here are my detailed profile results for our mystery function.
I mangled readprofile to print all the samples at the same offsets an
objdump of this function uses, which made the interesting spots easy
to find.
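
The idea is nothing fancy: dump the raw per-bucket counts with their
absolute addresses and subtract the function's start address (from
System.map), so the numbers line up with the offset column of an
objdump of the function.  A rough sketch of just that arithmetic (not
my actual readprofile hack):

/* gcc -O2 -o profoff profoff.c
 * Reads "hex_address count" pairs on stdin and prints each non-empty
 * bucket as an offset relative to the function start given on the
 * command line, so it lines up with objdump's left-hand column. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
        unsigned long base, addr, count;

        if (argc != 2) {
                fprintf(stderr, "usage: %s <function-start-address>\n",
                        argv[0]);
                return 1;
        }
        base = strtoul(argv[1], NULL, 16);  /* e.g. udp_recvmsg from System.map */

        while (scanf("%lx %lu", &addr, &count) == 2) {
                if (count)                  /* most buckets are zero */
                        printf("%6lx: %lu\n", addr - base, count);
        }
        return 0;
}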

In my test, 1404 samples landed in udp_recvmsg.  The test ran for
about two minutes and made roughly 12 million passes through this
code.  The numbers considered below account for 1211 of those samples.
Most bins were 0; there was also a smattering of values under 20, and
then several buckets over 20, which I have highlighted below with the
raw sample counts against the left margin.

Around 650 samples are in the inline sin_zero memset, which seems odd
since I read the pad size declaration in in.h as working out to 0.  Is
there another pad size definition somewhere else, or did I get the pad
sizing math wrong?
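
As a quick sanity check (glibc's netinet/in.h should agree with the
kernel about this layout, since struct sockaddr_in is ABI), here's a
throwaway program to print what sin_zero's size actually works out to:

/* gcc -o sinzero sinzero.c -- just prints the structure and pad sizes. */
#include <stdio.h>
#include <netinet/in.h>

int main(void)
{
        struct sockaddr_in sin;

        printf("sizeof(struct sockaddr_in) = %lu\n",
               (unsigned long) sizeof(sin));
        printf("sizeof(sin.sin_zero)       = %lu\n",
               (unsigned long) sizeof(sin.sin_zero));
        return 0;
}

For what it's worth, the rep stosl at offset cc9 in the dump below runs
with %ecx loaded with 2, i.e. it clears 8 bytes starting at sin + 8.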


00000b70 <udp_recvmsg>:
udp_recvmsg():
     b70:       83 ec 0c                sub    $0xc,%esp
     b73:       55                      push   %ebp
     b74:       57                      push   %edi
     b75:       56                      push   %esi
     b76:       53                      push   %ebx

static __inline__ int __udp_checksum_complete(struct sk_buff *skb)
{
        return (unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum));
}

static __inline__ int udp_checksum_complete(struct sk_buff *skb)
{
        return skb->ip_summed != CHECKSUM_UNNECESSARY &&
                __udp_checksum_complete(skb);
}

/*
 *      This should be easy, if there is something there we
 *      return it, otherwise we block.
 */

int udp_recvmsg(struct sock *sk, struct msghdr *msg, int len,
                int noblock, int flags, int *addr_len)
{
     b77:       8b 6c 24 20             mov    0x20(%esp,1),%ebp
     b7b:       8b 74 24 24             mov    0x24(%esp,1),%esi
     b7f:       8b 7c 24 28             mov    0x28(%esp,1),%edi
     b83:       8b 44 24 34             mov    0x34(%esp,1),%eax
        struct sockaddr_in *sin = (struct sockaddr_in *)msg->msg_name;
     b87:       8b 16                   mov    (%esi),%edx
     b89:       89 54 24 14             mov    %edx,0x14(%esp,1)
        struct sk_buff *skb;
        int copied, err;

        /*
         *      Check any passed addresses
         */
        if (addr_len)
     b8d:       85 c0                   test   %eax,%eax
     b8f:       74 06                   je     b97 <udp_recvmsg+0x27>
                *addr_len=sizeof(*sin);
     b91:       c7 00 10 00 00 00       movl   $0x10,(%eax)

        if (flags & MSG_ERRQUEUE)
     b97:       8b 44 24 30             mov    0x30(%esp,1),%eax
     b9b:       f6 c4 20                test   $0x20,%ah
     b9e:       74 10                   je     bb0 <udp_recvmsg+0x40>
                return ip_recv_error(sk, msg, len);
     ba0:       57                      push   %edi
     ba1:       56                      push   %esi
     ba2:       55                      push   %ebp
     ba3:       e8 fc ff ff ff          call   ba4 <udp_recvmsg+0x34>
     ba8:       83 c4 0c                add    $0xc,%esp
     bab:       e9 e8 01 00 00          jmp    d98 <udp_recvmsg+0x228>

        skb = skb_recv_datagram(sk, flags, noblock, &err);
     bb0:       8d 44 24 18             lea    0x18(%esp,1),%eax
     bb4:       50                      push   %eax
     bb5:       8b 44 24 30             mov    0x30(%esp,1),%eax
     bb9:       50                      push   %eax
     bba:       8b 54 24 38             mov    0x38(%esp,1),%edx
     bbe:       52                      push   %edx
     bbf:       55                      push   %ebp
     bc0:       e8 fc ff ff ff          call   bc1 <udp_recvmsg+0x51>
     bc5:       89 c3                   mov    %eax,%ebx
        if (!skb)
     bc7:       83 c4 10                add    $0x10,%esp
     bca:       85 db                   test   %ebx,%ebx
     bcc:       0f 84 1e 01 00 00       je     cf0 <udp_recvmsg+0x180>
                goto out;
  
        copied = skb->len - sizeof(struct udphdr);
     bd2:       8b 43 5c                mov    0x5c(%ebx),%eax
375  bd5:       83 c0 f8                add    $0xfffffff8,%eax
     bd8:       89 44 24 10             mov    %eax,0x10(%esp,1)
        if (copied > len) {
     bdc:       39 f8                   cmp    %edi,%eax
     bde:       7e 08                   jle    be8 <udp_recvmsg+0x78>
                copied = len;
     be0:       89 7c 24 10             mov    %edi,0x10(%esp,1)
                msg->msg_flags |= MSG_TRUNC;
     be4:       80 4e 18 20             orb    $0x20,0x18(%esi)
        }

        if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
     be8:       80 7b 6b 02             cmpb   $0x2,0x6b(%ebx)
     bec:       75 07                   jne    bf5 <udp_recvmsg+0x85>
                err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
167  bee:       8b 54 24 10             mov    0x10(%esp,1),%edx
     bf2:       52                      push   %edx
                                              copied);
        } else if (msg->msg_flags&MSG_TRUNC) {
     bf3:       eb 3d                   jmp    c32 <udp_recvmsg+0xc2>
     bf5:       f6 46 18 20             testb  $0x20,0x18(%esi)
     bf9:       74 4c                   je     c47 <udp_recvmsg+0xd7>
     bfb:       8b 43 64                mov    0x64(%ebx),%eax
     bfe:       50                      push   %eax
     bff:       8b 43 5c                mov    0x5c(%ebx),%eax
     c02:       50                      push   %eax
     c03:       6a 00                   push   $0x0
     c05:       53                      push   %ebx
     c06:       e8 fc ff ff ff          call   c07 <udp_recvmsg+0x97>
     c0b:       89 c2                   mov    %eax,%edx
 */

static inline unsigned int csum_fold(unsigned int sum)
{
        __asm__("
     c0d:       25 00 00 ff ff          and    $0xffff0000,%eax
     c12:       c1 e2 10                shl    $0x10,%edx
     c15:       01 d0                   add    %edx,%eax
     c17:       15 ff ff 00 00          adc    $0xffff,%eax
     c1c:       83 c4 10                add    $0x10,%esp
     c1f:       c1 e8 10                shr    $0x10,%eax
                if (__udp_checksum_complete(skb))
     c22:       3d ff ff 00 00          cmp    $0xffff,%eax
     c27:       0f 85 d3 00 00 00       jne    d00 <udp_recvmsg+0x190>
                        goto csum_copy_err;
                err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
     c2d:       8b 44 24 10             mov    0x10(%esp,1),%eax
     c31:       50                      push   %eax
     c32:       8b 46 08                mov    0x8(%esi),%eax
     c35:       50                      push   %eax
     c36:       6a 08                   push   $0x8
     c38:       53                      push   %ebx
     c39:       e8 fc ff ff ff          call   c3a <udp_recvmsg+0xca>
     c3e:       89 44 24 28             mov    %eax,0x28(%esp,1)
                                              copied);
        } else {
     c42:       83 c4 10                add    $0x10,%esp
     c45:       eb 1c                   jmp    c63 <udp_recvmsg+0xf3>
                err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);
     c47:       8b 46 08                mov    0x8(%esi),%eax
     c4a:       50                      push   %eax
     c4b:       6a 08                   push   $0x8
     c4d:       53                      push   %ebx
     c4e:       e8 fc ff ff ff          call   c4f <udp_recvmsg+0xdf>
     c53:       89 44 24 24             mov    %eax,0x24(%esp,1)

                if (err == -EINVAL)
     c57:       83 c4 0c                add    $0xc,%esp
     c5a:       83 f8 ea                cmp    $0xffffffea,%eax
     c5d:       0f 84 9d 00 00 00       je     d00 <udp_recvmsg+0x190>
                        goto csum_copy_err;
        }

        if (err)
     c63:       83 7c 24 18 00          cmpl   $0x0,0x18(%esp,1)
     c68:       75 7c                   jne    ce6 <udp_recvmsg+0x176>

static __inline__ void
sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
        if (sk->rcvtstamp)
     c6a:       80 bd a2 00 00 00 00    cmpb   $0x0,0xa2(%ebp)
     c71:       74 15                   je     c88 <udp_recvmsg+0x118>
                put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP, sizeof(skb->stamp), &skb->stamp);
     c73:       8d 43 10                lea    0x10(%ebx),%eax
     c76:       50                      push   %eax
     c77:       6a 08                   push   $0x8
     c79:       6a 1d                   push   $0x1d
     c7b:       6a 01                   push   $0x1
     c7d:       56                      push   %esi
     c7e:       e8 fc ff ff ff          call   c7f <udp_recvmsg+0x10f>
     c83:       83 c4 14                add    $0x14,%esp
     c86:       eb 12                   jmp    c9a <udp_recvmsg+0x12a>
        else
                sk->stamp = skb->stamp;
     c88:       8b 43 10                mov    0x10(%ebx),%eax
     c8b:       8b 53 14                mov    0x14(%ebx),%edx
     c8e:       89 85 fc 02 00 00       mov    %eax,0x2fc(%ebp)
     c94:       89 95 00 03 00 00       mov    %edx,0x300(%ebp)
                goto out_free;

        sock_recv_timestamp(msg, sk, skb);

        /* Copy the address. */
        if (sin)
     c9a:       83 7c 24 14 00          cmpl   $0x0,0x14(%esp,1)
     c9f:       74 2a                   je     ccb <udp_recvmsg+0x15b>
        {
                sin->sin_family = AF_INET;
     ca1:       8b 54 24 14             mov    0x14(%esp,1),%edx
     ca5:       66 c7 02 02 00          movw   $0x2,(%edx)
                sin->sin_port = skb->h.uh->source;
     caa:       8b 43 1c                mov    0x1c(%ebx),%eax
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
21   cad:       89 d7                   mov    %edx,%edi
     caf:       0f b7 00                movzwl (%eax),%eax
     cb2:       66 89 42 02             mov    %ax,0x2(%edx)
                sin->sin_addr.s_addr = skb->nh.iph->saddr;
     cb6:       8b 43 20                mov    0x20(%ebx),%eax
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
335  cb9:       83 c7 08                add    $0x8,%edi
        switch (count) {
                case 0:
                        return s;
                case 1:
                        *(unsigned char *)s = pattern;
                        return s;
                case 2:
                        *(unsigned short *)s = pattern;
                        return s;
                case 3:
                        *(unsigned short *)s = pattern;
                        *(2+(unsigned char *)s) = pattern;
                        return s;
                case 4:
                        *(unsigned long *)s = pattern;
                        return s;
        }
#define COMMON(x) \
__asm__  __volatile__( \
        "rep ; stosl" \
        x \
        : "=&c" (d0), "=&D" (d1) \
        : "a" (pattern),"0" (count/4),"1" ((long) s) \
        : "memory")
{
        int d0, d1;
        switch (count % 4) {
                case 0: COMMON(""); return s;
     cbc:       b9 02 00 00 00          mov    $0x2,%ecx
     cc1:       8b 40 0c                mov    0xc(%eax),%eax
313  cc4:       89 42 04                mov    %eax,0x4(%edx)
 * This looks horribly ugly, but the compiler can optimize it totally,
 * as we by now know that both pattern and count is constant..
 */
static inline void * __constant_c_and_count_memset(void * s, unsigned long pattern, size_t count)
{
     cc7:       31 c0                   xor    %eax,%eax
        switch (count) {
                case 0:
                        return s;
                case 1:
                        *(unsigned char *)s = pattern;
                        return s;
                case 2:
                        *(unsigned short *)s = pattern;
                        return s;
                case 3:
                        *(unsigned short *)s = pattern;
                        *(2+(unsigned char *)s) = pattern;
                        return s;
                case 4:
                        *(unsigned long *)s = pattern;
                        return s;
        }
#define COMMON(x) \
__asm__  __volatile__( \
        "rep ; stosl" \
        x \
        : "=&c" (d0), "=&D" (d1) \
        : "a" (pattern),"0" (count/4),"1" ((long) s) \
        : "memory")
{
        int d0, d1;
        switch (count % 4) {
                case 0: COMMON(""); return s;
     cc9:       f3 ab                   repz stos %eax,%es:(%edi)
                memset(sin->sin_zero, 0, sizeof(sin->sin_zero));
        }
        if (sk->protinfo.af_inet.cmsg_flags)
     ccb:       83 bd b0 02 00 00 00    cmpl   $0x0,0x2b0(%ebp)
     cd2:       74 0a                   je     cde <udp_recvmsg+0x16e>
                ip_cmsg_recv(msg, skb);
     cd4:       53                      push   %ebx
     cd5:       56                      push   %esi
     cd6:       e8 fc ff ff ff          call   cd7 <udp_recvmsg+0x167>
     cdb:       83 c4 08                add    $0x8,%esp
        err = copied;
     cde:       8b 44 24 10             mov    0x10(%esp,1),%eax
     ce2:       89 44 24 18             mov    %eax,0x18(%esp,1)
  
out_free:
        skb_free_datagram(sk, skb);
.
.
.

--
Grant Taylor - x285 - http://pasta/~gtaylor/
Starent Networks - +1.978.851.1185
-
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

