I am comparing AF-XDP sockets against generic Linux sockets in terms of how many packets they can process without packet loss (packet loss being defined as: the RTP sequence number of the current packet does not equal the RTP sequence number of the previous packet `+ 1`; see the sequence-check sketch at the end). I noticed that my AF-XDP socket program (I can't tell whether the problem lies in the kernel program or the user-space program) loses around `~25` packets per second at around `390,000` packets per second, whereas an equivalent program using generic Linux sockets loses none.

I implemented a so-called `distributor` program which loads the XDP kernel program once, sets up a generic Linux socket, and calls `setsockopt(IP_ADD_MEMBERSHIP)` on that socket for every multicast address passed to the program via the command line. After this, the `distributor` obtains the file descriptor of a `BPF_MAP_TYPE_HASH` placed in the XDP kernel program and inserts routes for the traffic, in case a single AF-XDP socket needs to share its UMEM later on (see the map-update sketch below).

The XDP kernel program then checks for each IPv4/UDP packet whether there is an entry in that hash map. This basically looks like this:

```c
const struct pckt_idntfy_raw raw = {
    .src_ip = 0, /* not used at the moment */
    .dst_ip = iph->daddr,
    .dst_port = udh->dest,
    .pad = 0
};

const int *idx = bpf_map_lookup_elem(&xdp_packet_mapping, &raw);
if (idx != NULL) {
    if (bpf_map_lookup_elem(&xsks_map, idx)) {
        bpf_printk("Found socket @ index: %d!\n", *idx);
        return bpf_redirect_map(&xsks_map, *idx, 0);
    } else {
        bpf_printk("Didn't find connected socket for index %d!\n", *idx);
    }
}
```

If `idx` exists, there is a socket sitting behind that index in the `BPF_MAP_TYPE_XSKMAP`.

After doing all that, the `distributor` spawns a new process via `fork()`, passing all multicast addresses (including the destination port) which should be processed by that process (one process handles one RX queue). If there are not enough RX queues, some processes receive multiple multicast addresses, which means they will use `SHARED UMEM`.

I basically modeled my AF-XDP user-space program on this example code: https://github.com/torvalds/linux/blob/master/samples/bpf/xdpsock_user.c - I am using the same `xsk_configure_umem`, `xsk_populate_fill_ring` and `xsk_configure_socket` functions.

Because I figured this application doesn't need minimal latency, the process sleeps for a specified time (around `1 - 2ms`), after which it loops through every AF-XDP socket (most of the time it is only one socket) and processes every packet received on that socket, verifying that no packets have been missed:

```c
while (!global_exit) {
    nanosleep(&spec, &remaining);

    for (int i = 0; i < cfg.ip_addrs_len; i++) {
        struct xsk_socket_info *socket = xsk_sockets[i];
        if (atomic_exchange(&socket->stats_sync.lock, 1) == 0) {
            handle_receive_packets(socket);
            atomic_fetch_xor(&socket->stats_sync.lock, 1); /* release socket-lock */
        }
    }
}
```

In my opinion there is nothing too fancy about this, but somehow I lose `~25` packets per second at around `390,000` pps, even though my UMEM is close to 1 GB of RAM.
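For reference, the route insertion in the `distributor` boils down to a `bpf_map_update_elem()` call from user space. A minimal sketch of what that presumably looks like - the map name and key layout mirror the kernel snippet above, but the way the map fd is obtained (`bpf_object__find_map_fd_by_name()`, assuming the program was loaded through libbpf; a pinned map via `bpf_obj_get()` would work the same way), the exact field types of `pckt_idntfy_raw`, and the queue-index choice are assumptions:

```c
#include <stdint.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Assumed layout; must match the struct the kernel program hashes on. */
struct pckt_idntfy_raw {
    __u32 src_ip;
    __u32 dst_ip;
    __u16 dst_port;
    __u16 pad;
};

static int publish_route(struct bpf_object *obj, const char *dst_ip,
                         uint16_t dst_port, int xsk_map_idx)
{
    int map_fd = bpf_object__find_map_fd_by_name(obj, "xdp_packet_mapping");
    if (map_fd < 0)
        return map_fd;

    struct pckt_idntfy_raw key = {
        .src_ip = 0,                 /* not used, same as in the kernel program */
        .dst_port = htons(dst_port), /* udh->dest is compared in network byte order */
        .pad = 0
    };
    if (inet_pton(AF_INET, dst_ip, &key.dst_ip) != 1)
        return -1;

    /* value = index into the BPF_MAP_TYPE_XSKMAP that bpf_redirect_map() uses */
    return bpf_map_update_elem(map_fd, &key, &xsk_map_idx, BPF_ANY);
}
```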
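`handle_receive_packets()` follows the RX path of `xdpsock_user.c`: drain the RX ring, then hand the frames back via the fill ring. A minimal sketch of that pattern under the same assumptions as the sample (the `xsk_socket_info`/`xsk_umem_info` fields below are trimmed copies of the sample's structs, the batch size is an assumption, and the real sample additionally handles unaligned-chunk address helpers and `need_wakeup` kicks):

```c
#include <stdint.h>
#include <bpf/xsk.h>   /* <xdp/xsk.h> with newer libxdp */

#define RX_BATCH_SIZE 64   /* assumption; the sample uses a configurable batch size */

struct xsk_umem_info {     /* as in xdpsock_user.c */
    struct xsk_ring_prod fq;
    struct xsk_ring_cons cq;
    struct xsk_umem *umem;
    void *buffer;
};

struct xsk_socket_info {   /* trimmed to the fields used here */
    struct xsk_ring_cons rx;
    struct xsk_ring_prod tx;
    struct xsk_umem_info *umem;
    struct xsk_socket *xsk;
};

/* hypothetical stand-in for my RTP verification */
extern void check_rtp_sequence(const uint8_t *pkt, uint32_t len);

static void handle_receive_packets(struct xsk_socket_info *xsk)
{
    uint32_t idx_rx = 0, idx_fq = 0;

    /* take whatever is currently sitting in the RX ring */
    unsigned int rcvd = xsk_ring_cons__peek(&xsk->rx, RX_BATCH_SIZE, &idx_rx);
    if (!rcvd)
        return;

    /* reserve the same number of fill-queue slots so the frames go back to
     * the NIC; if this falls behind, the hardware drops packets no matter
     * how large the UMEM is */
    while (xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq) != rcvd)
        ; /* busy-wait; the sample also kicks the kernel here if need_wakeup is set */

    for (unsigned int i = 0; i < rcvd; i++) {
        const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++);
        uint8_t *pkt = xsk_umem__get_data(xsk->umem->buffer, desc->addr);

        check_rtp_sequence(pkt, desc->len);

        /* recycle the frame by putting its address back on the fill queue */
        *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = desc->addr;
    }

    xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
    xsk_ring_cons__release(&xsk->rx, rcvd);
}
```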
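For scale: at `390,000` pps, my `1 - 2ms` sleep means roughly `400 - 800` packets pile up per iteration, so it is the RX ring and fill ring (not the UMEM size) that have to absorb each burst.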
In comparison, my generic Linux socket program looks like this (in short):

```c
int fd = socket(AF_INET, SOCK_RAW, IPPROTO_UDP);
/* setting some socket options (SO_TIMESTAMP has to be among them
 * for the SCM_TIMESTAMP cmsg read below) */

struct sockaddr_in sin;
memset(&sin, 0, sizeof(struct sockaddr_in));
sin.sin_family = AF_INET;
sin.sin_port = cfg->ip_addrs[0]->pckt.dst_port;
inet_aton(cfg->ip_addrs[0]->pckt.dst_ip, &sin.sin_addr);

if (bind(fd, (struct sockaddr*)&sin, sizeof(struct sockaddr)) < 0) {
    fprintf(stderr, "Error on binding socket: %s\n", strerror(errno));
    return -1;
}

ioctl(fd, SIOCGIFADDR, &intf);
```

The `distributor` program creates a new process for every given multicast IP when generic Linux sockets are used (since there is no mechanism like SHARED UMEM for generic sockets, I don't bother with multiple multicast streams per process). Later on I of course join the multicast membership:

```c
struct ip_mreqn mreq;
memset(&mreq, 0, sizeof(struct ip_mreqn));

const char *multicast_ip = cfg->ip_addrs[0]->pckt.dst_ip;
if (inet_pton(AF_INET, multicast_ip, &mreq.imr_multiaddr.s_addr)) {
    /* Local interface address */
    memcpy(&mreq.imr_address, &cfg->ifaddr, sizeof(struct in_addr));
    mreq.imr_ifindex = cfg->ifindex;

    if (setsockopt(igmp_socket_fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(struct ip_mreqn)) < 0) {
        fprintf(stderr, "Failed to set `IP_ADD_MEMBERSHIP`: %s\n", strerror(errno));
        return;
    } else {
        printf("Successfully added Membership for IP: %s\n", multicast_ip);
    }
}
```

and start processing packets (not sleeping, but in a busy-loop fashion):

```c
void read_packets_recvmsg_with_latency(struct config *cfg, struct statistic *st, void *buff,
                                       const int igmp_socket_fd) {
    char ctrl[CMSG_SPACE(sizeof(struct timeval))];
    struct msghdr msg;
    struct iovec iov;

    msg.msg_control = (char*)ctrl;
    msg.msg_controllen = sizeof(ctrl);
    msg.msg_name = &cfg->ifaddr;
    msg.msg_namelen = sizeof(cfg->ifaddr);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    iov.iov_base = buff;
    iov.iov_len = BUFFER_SIZE;

    struct timeval time_user, time_kernel;
    struct cmsghdr *cmsg = (struct cmsghdr*)&ctrl;

    const int64_t read_bytes = recvmsg(igmp_socket_fd, &msg, 0);
    if (read_bytes == -1) {
        return;
    }
    gettimeofday(&time_user, NULL);

    if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_TIMESTAMP) {
        memcpy(&time_kernel, CMSG_DATA(cmsg), sizeof(struct timeval));
    }

    if (verify_rtp(cfg, st, read_bytes, buff)) {
        const double timediff = (time_user.tv_sec - time_kernel.tv_sec) * 1000000
                              + (time_user.tv_usec - time_kernel.tv_usec);
        if (timediff > st->stats.latency_us) {
            st->stats.latency_us = timediff;
        }
    }
}

int main(...) {
    ....
    while (!is_global_exit) {
        read_packets_recvmsg_with_latency(&cfg, &st, buffer, igmp_socket_fd);
    }
}
```

That's pretty much it. Please note that in the described use case where I start to lose packets I don't use `SHARED UMEM`; it's just a single RX queue receiving a single multicast stream. If I process a smaller multicast stream of around `150,000` pps, the AF-XDP solution doesn't lose any packets. But it also goes the other way: at around `520,000` pps on the same RX queue (using `SHARED UMEM`) I lose about `12,000` packets per second.

Any ideas what I am missing?
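For completeness, the sequence check both programs share (`verify_rtp` above) boils down to comparing consecutive RTP sequence numbers with 16-bit wrap-around. A minimal sketch with an illustrative header struct rather than my actual code; note that `rtp_payload` must already point past the IP/UDP headers for the raw socket (which delivers packets starting at the IP header), and past Ethernet/IP/UDP for AF-XDP:

```c
#include <stdbool.h>
#include <stdint.h>
#include <arpa/inet.h>

/* Illustrative RTP header layout (RFC 3550, fixed part). */
struct rtp_hdr {
    uint8_t  vpxcc;   /* version, padding, extension, CSRC count */
    uint8_t  mpt;     /* marker, payload type */
    uint16_t seq;     /* sequence number, big-endian on the wire */
    uint32_t ts;
    uint32_t ssrc;
};

static bool rtp_seq_ok(uint16_t *last_seq, const uint8_t *rtp_payload)
{
    const struct rtp_hdr *rtp = (const struct rtp_hdr *)rtp_payload;
    const uint16_t seq = ntohs(rtp->seq);

    /* 16-bit arithmetic makes the 65535 -> 0 wrap come out right */
    const bool in_order = (seq == (uint16_t)(*last_seq + 1));

    *last_seq = seq;
    return in_order;
}
```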