On Thu, Mar 28, 2013 at 6:01 AM, Michael Kerrisk (man-pages) <mtk.manpages@xxxxxxxxx> wrote:
> Willem,
>
> Thanks for sending this patch. This all looks good and authoritative.
> Could I ask you to make a few small clean-ups and resubmit? See below.

Thanks for reviewing the patch, Michael. I will send the revised version following this email.

> On Mon, Mar 18, 2013 at 6:13 PM, Willem de Bruijn <willemb@xxxxxxxxxx> wrote:
>> The packet socket manual page does not list all socket options.
>>
>> This patch adds descriptions of the common packet socket options
>> PACKET_AUXDATA, PACKET_FANOUT, PACKET_RX_RING, PACKET_STATISTICS,
>> PACKET_TX_RING
>>
>> and the ring-specific options
>> PACKET_LOSS, PACKET_RESERVE, PACKET_TIMESTAMP, PACKET_VERSION
>>
>> It does not yet add descriptions for
>> PACKET_COPY_THRESH, PACKET_HDRLEN, PACKET_ORIGDEV,
>> PACKET_TX_HAS_OFF, PACKET_TX_TIMESTAMP, PACKET_VNET_HDR
>>
>> It tries to balance being informative with exposing kernel detail
>> that is unlikely to be used by most readers or that may change
>> frequently. For implementation details, the manpage points to the
>> documentation in kernel Documentation/networking. Let me know if
>> options should be added or removed.
>
> For the commit log message, could you just add a few lines for each of
> the options stating how you determined the information. Also, if there
> are specific individuals who could Ack the patch, please CC them and
> ask them if they might Ack the patch.

I will cc: the developers of the commits referenced in the man page.

>
>> Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx>
>> ---
>> man7/packet.7 | 183 +++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 175 insertions(+), 8 deletions(-)
>>
>> diff --git a/man7/packet.7 b/man7/packet.7
>> index 006f2ac..a9cc168 100644
>> --- a/man7/packet.7
>> +++ b/man7/packet.7
>> @@ -177,17 +177,21 @@ and
>> .I sll_ifindex
>> are used.
>> .SS Socket options
>> +Packet socket options are configured by calling
>> +. BR setsockopt (2)
>> +with level SOL_PACKET.
>
> +with level
> +.BR SOL_PACKET .
>
>> +.TP
>> +.BR PACKET_ADD_MEMBERSHIP
>> +.PD 0
>> +.TP
>> +.BR PACKET_DROP_MEMBERSHIP
>> +.PD
>> Packet sockets can be used to configure physical layer multicasting
>> and promiscuous mode.
>> -It works by calling
>> -.BR setsockopt (2)
>> -on a packet socket for
>> -.B SOL_PACKET
>> -and one of the options
>> .B PACKET_ADD_MEMBERSHIP
>> -to add a binding or
>> +adds a binding and
>> .B PACKET_DROP_MEMBERSHIP
>> -to drop it.
>> +drops it.
>> They both expect a
>> .B packet_mreq
>> structure as argument:
>> @@ -227,6 +231,169 @@ In addition the traditional ioctls
>> .BR SIOCADDMULTI ,
>> .B SIOCDELMULTI
>> can be used for the same purpose.
>> +.TP
>> +.BR PACKET_AUXDATA " (since Linux 2.6.21)"
>> +.\" commit 8dc419447
>
> It's great that you include these commit IDs, but I strongly prefer to
> have the full 40-char ID. Potentially useful one day for scripting,
> etc. Same comment for the instances below.
>
>> +If this binary option is enabled, the packet socket passes a metadata
>> +structure along with each packet in the
>> +.BR recvmsg (2)
>> +control field. The
>
> Please start new sentences on new source lines (see man-pages(7)).
> Same comment at numerous places below.
>
>
>
>> +structure can be read with
>> +.BR cmsg (3). It is defined as
>
> Formatting broken there. Start new line after the period.
>
>> +
>> +.in +4n
>> +.nf
>> +struct tpacket_auxdata {
>> + __u32 tp_status;
>> + __u32 tp_len; /* packet length */
>> + __u32 tp_snaplen; /* captured length */
>> + __u16 tp_mac;
>> + __u16 tp_net;
>> + __u16 tp_vlan_tci;
>> + __u16 tp_padding;
>> +};
>> +.fi
>> +.in
>> +
>> +.B tp_net
>
> .I tp_net
>
>> +stores the offset to the network layer. If the packet socket is of type
>> +.BR SOCK_DGRAM ,
>> +then
>> +.B tp_mac
>> +is the same. If it is of type
>> +.B SOCK_RAW ,
>
> .BR SOCK_RAW ,
>
>> +then that stores the offset to the link layer frame.
>> +.TP
>> +.BR PACKET_FANOUT " (since Linux 3.1)"
>> +.\" commit dc99f6006
>> +To scale processing across threads, packet sockets can form a fanout
>> +group. In this mode, each matching packet is enqueued onto only one
>> +socket in the group. A socket joins a fanout group by calling
>> +.B setsockopt(2)
>> +with level SOL_PACKET and option PACKET_FANOUT.
>
> .B SOL_PACKET
> .BR PACKET_FANOUT .
>
>> +Each network namespace can have up to 65536 independent groups. A
>> +socket selects a group by encoding the ID in the first 16 bits of
>> +the integer option value. The first packet socket to join a group
>> +implicitly creates it. To successfully join an existing group,
>> +subsequent packet sockets must have the same
>> +protocol, device settings and fanout mode and flags (see below).
>> +Packet sockets can leave a fanout group only by closing the socket.
>> +The group is deleted when the last socket is closed.
>> +
>> +Fanout supports multiple algorithms to spread traffic between sockets.
>> +The default mode,
>> +. BR PACKET_FANOUT_HASH ,
>> +sends packets from the same flow to the same socket to maintain per-flow
>> +ordering. For each packet, it chooses a socket by taking the packet
>> +flow hash modulo the number of sockets in the group, where a flow hash
>> +is a hash over network layer address and optional transport layer port
>> +fields. The load balance mode
>> +. BR PACKET_FANOUT_LB
>> +implements a round robin algorithm.
>
> round-robin
>
>> +. BR PACKET_FANOUT_CPU
>> +selects the socket based on the cpu that the packet arrived on.
>
> CPU
>
>> +
>> +Fanout modes can take additional options. IP fragmentation causes packets
>> +from the same flow to have different flow hashes. The flag
>> +.BR PACKET_FANOUT_FLAG_DEFRAG ,
>> +if set, causes packet to be defragmented before fanout is applied, to
>> +preserve order even in this case. Fanout mode and options are communicated
>> +in the second 16 bits of the integer option value.
>> +.TP
>> +.BR PACKET_LOSS " (with PACKET_TX_RING)"
>> +If set, do not silently drop on transmission errors, but return the
>> +packet with status set to
>> +.BR TP_STATUS_WRONG_FORMAT
>> +.TP
>> +.BR PACKET_RESERVE " (with PACKET_RX_RING)"
>> +By default, a packet receive ring writes packets immediately following the
>> +metadata structure and alignment padding. This integer option reserves
>> +additional headroom.
>> +.TP
>> +.BR PACKET_RX_RING
>> +Create a memory mapped ring buffer for asynchronous packet reception.
>> +The packet socket reserves a contiguous region of application address
>> +space, lays it out into an array of packet slots and copies packets
>> +(up to snaplen)
>
> .IR tp_snaplen )
>
>> into subsequent slots. Each packet is preceded by a
>> +metadata structure similar to
>> +.B tpacket_auxdata.
>
> .IR tpacket_auxdata .
>
>> +Packet socket and application communicate the head and tail of the ring
>> +through the
>> +.B tp_status
>
> .I
>
>> +field. The packet socket owns all slots with status
>> +.BR TP_STATUS_KERNEL .
>> +After filling a slot, it changes the status of the slot to transfer
>> +ownership to the application. During normal operation, the new status is
>> +.BR TP_STATUS_USER ,
>> +to signal that a correctly received packet has been stored. When the
>> +application has finished processing a packet, it transfers ownership of
>> +the slot back to the socket by setting the status to
>> +.BR TP_STATUS_KERNEL .
>> +Packet sockets implement multiple
>> +variants of the packet ring. The implementation details are described in
>> +.IR Documentation/networking/packet_mmap.txt
>> +in the Linux kernel source tree.
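To make the PACKET_FANOUT value encoding and the PACKET_RX_RING ownership handshake concrete, here is a minimal C sketch. It assumes the default TPACKET_V1 frame layout (struct tpacket_hdr from <linux/if_packet.h>), an arbitrary geometry of 64 page-sized blocks split into 2048-byte frames, and the GCC/Clang __atomic builtins for accessing tp_status. The names fanout_arg, join_fanout, ring_geometry, setup_rx_ring, drain_rx_ring and struct rx_ring are hypothetical helpers, not part of any API.

```c
#include <linux/if_packet.h>  /* PACKET_*, TP_STATUS_*, tpacket_req, tpacket_hdr */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>         /* mmap */
#include <sys/socket.h>       /* setsockopt, SOL_PACKET */
#include <unistd.h>           /* sysconf */

/* PACKET_FANOUT option value: group ID in the low 16 bits,
 * mode (PACKET_FANOUT_HASH/LB/CPU) plus flags in the high 16 bits. */
static int fanout_arg(uint16_t group_id, uint16_t mode_and_flags)
{
    return (int)((uint32_t)group_id | ((uint32_t)mode_and_flags << 16));
}

static int join_fanout(int fd, uint16_t group_id, uint16_t mode_and_flags)
{
    int val = fanout_arg(group_id, mode_and_flags);
    return setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &val, sizeof(val));
}

/* Ring geometry; block_size is assumed to be a multiple of frame_size,
 * so frame i starts at byte offset i * frame_size in the mapping. */
static struct tpacket_req ring_geometry(unsigned block_size,
                                        unsigned frame_size, unsigned block_nr)
{
    struct tpacket_req req = {
        .tp_block_size = block_size,
        .tp_block_nr   = block_nr,
        .tp_frame_size = frame_size,
        .tp_frame_nr   = block_nr * (block_size / frame_size),
    };
    return req;
}

struct rx_ring {
    uint8_t *map;          /* start of the mmap()ed ring */
    size_t size;           /* bytes mapped */
    unsigned frame_size;
    unsigned frame_nr;
    unsigned next;         /* next slot to inspect */
};

/* Create a TPACKET_V1 receive ring of 64 page-sized blocks (arbitrary). */
static int setup_rx_ring(int fd, struct rx_ring *r)
{
    struct tpacket_req req =
        ring_geometry((unsigned)sysconf(_SC_PAGESIZE), 2048, 64);

    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req)) < 0)
        return -1;
    r->size = (size_t)req.tp_block_size * req.tp_block_nr;
    r->map = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (r->map == MAP_FAILED)
        return -1;
    r->frame_size = req.tp_frame_size;
    r->frame_nr = req.tp_frame_nr;
    r->next = 0;
    return 0;
}

/* Process every slot currently owned by user space (TP_STATUS_USER), then
 * hand it back to the kernel (TP_STATUS_KERNEL). Returns the slot count. */
static unsigned drain_rx_ring(struct rx_ring *r,
                              void (*handle)(const uint8_t *pkt, unsigned len))
{
    unsigned done = 0;

    for (;;) {
        struct tpacket_hdr *hdr =
            (struct tpacket_hdr *)(r->map + (size_t)r->next * r->frame_size);

        if (!(__atomic_load_n(&hdr->tp_status, __ATOMIC_ACQUIRE) & TP_STATUS_USER))
            break;                        /* slot still belongs to the kernel */
        if (handle != NULL)
            handle((const uint8_t *)hdr + hdr->tp_mac, hdr->tp_snaplen);
        __atomic_store_n(&hdr->tp_status, TP_STATUS_KERNEL, __ATOMIC_RELEASE);
        r->next = (r->next + 1) % r->frame_nr;
        done++;
    }
    return done;
}
```

A caller would open the socket with socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)), which requires CAP_NET_RAW, call join_fanout() and setup_rx_ring(), and then poll(2) for POLLIN before each drain_rx_ring() pass. Reading tp_status with acquire semantics and writing it back with release semantics keeps the packet data accesses ordered against the ownership transfer.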
>> +.TP
>> +.BR PACKET_STATISTICS
>> +Retrieve packet socket statistics in the form of a structure
>> +
>> +.in +4n
>> +.nf
>> +struct tpacket_stats {
>> + __u32 tp_packets; /* total packet count */
>> + __u32 tp_drops; /* dropped packet count */
>> +};
>> +.fi
>> +.in
>> +
>> +Receiving statistics resets the internal counters. The exact statistics
>> +structure differs when using a ring of variant
>> +.BR TPACKET_V3 .
>> +.TP
>> +.BR PACKET_TIMESTAMP " (with PACKET_RX_RING)"
>> +The packet receive ring always stores a timestamp in the metadata header.
>> +By default, this is a software generated timestamp generated when the
>> +packet is copied into the ring. This integer option selects the type of
>> +timestamp. Besides the default, it support the two hardware formats
>> +described in
>> +.IR Documentation/networking/timestamping.txt
>> +in the Linux kernel source tree.
>> +.TP
>> +.BR PACKET_TX_RING " (since Linux 2.6.31)"
>> +.\" commit 69e3c75f4
>> +Create a memory mapped ring buffer for packet transmission. This option
>> +is similar to
>> +.BR PACKET_RX_RING
>> +and takes the same arguments. The application writes packets into slots
>> +with status
>> +.BR TP_STATUS_AVAILABLE
>> +and schedules them for transmission by changing the status to
>> +.BR TP_STATUS_SEND_REQUEST .
>> +When packets are ready to be transmitted, the application calls
>> +.BR send (2)
>> +Or a variant thereof. The
>
> s/Or/or/
>
>> +.B buf
>
> .I buf
>
>> +and
>> +.B len
>
> .I len
>
>> +fields of this call are ignored. If an address is passed using
>> +.BR sendto (2)
>> +or
>> +.BR sendmsg (2) ,
>> +then that overrides the socket default. On successful transmission, the
>> +socket resets the slot to
>> +.BR TP_STATUS_AVAILABLE .
>> +It discards packets silently on error unless
>> +.BR PACKET_LOSS
>> +is set.
>> +.TP
>> +.BR PACKET_VERSION " (with PACKET_RX_RING)"
>> +By default,
>> +.BR PACKET_RX_RING
>> +creates a packet receive ring of variant
>> +.BR TPACKET_V1 .
>> +To create another variant, configure the desired variant by setting this
>> +integer option before creating the ring.
>> +
>> .SS Ioctls
>> .B SIOCGSTAMP
>> can be used to receive the timestamp of the last received packet.
>> @@ -318,7 +485,7 @@ header to get a fully conforming packet.
>> Incoming 802.3 packets are not multiplexed on the DSAP/SSAP protocol
>> fields; instead they are supplied to the user as protocol
>> .B ETH_P_802_2
>> -with the LLC header prepended.
>> +with the LLC header prefixed.
>> It is thus not possible to bind to
>> .BR ETH_P_802_3 ;
>> bind to
>
> Thanks,
>
> Michael
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Author of "The Linux Programming Interface"; http://man7.org/tlpi/
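For the PACKET_AUXDATA, PACKET_STATISTICS and PACKET_VERSION entries in the patch, a minimal C sketch. It assumes a packet socket that the caller has already opened; enable_auxdata, find_auxdata, recv_with_auxdata, read_stats, set_ring_version and union auxbuf are hypothetical helper names. find_auxdata walks the recvmsg(2) control data with the CMSG_* macros from cmsg(3) and matches level SOL_PACKET, type PACKET_AUXDATA. read_stats assumes no TPACKET_V3 ring is in use, since that variant returns the larger struct tpacket_stats_v3.

```c
#include <linux/if_packet.h>  /* PACKET_*, struct tpacket_auxdata, struct tpacket_stats */
#include <string.h>           /* memcpy */
#include <sys/socket.h>       /* setsockopt, getsockopt, recvmsg, CMSG_* */
#include <sys/types.h>        /* ssize_t */
#include <sys/uio.h>          /* struct iovec */

/* Control buffer sized and aligned for one tpacket_auxdata record. */
union auxbuf {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(struct tpacket_auxdata))];
};

/* PACKET_AUXDATA is a binary (on/off) integer option. */
static int enable_auxdata(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_PACKET, PACKET_AUXDATA, &one, sizeof(one));
}

/* Copy the first SOL_PACKET/PACKET_AUXDATA record in msg into *aux.
 * Returns 0 if one was found, -1 otherwise. */
static int find_auxdata(struct msghdr *msg, struct tpacket_auxdata *aux)
{
    struct cmsghdr *c;

    for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_PACKET && c->cmsg_type == PACKET_AUXDATA &&
            c->cmsg_len >= CMSG_LEN(sizeof(*aux))) {
            memcpy(aux, CMSG_DATA(c), sizeof(*aux));
            return 0;
        }
    }
    return -1;
}

/* Receive one packet plus its metadata; *have_aux reports whether
 * the control data carried a tpacket_auxdata record. */
static ssize_t recv_with_auxdata(int fd, void *pkt, size_t len,
                                 struct tpacket_auxdata *aux, int *have_aux)
{
    union auxbuf cbuf;
    struct iovec iov = { .iov_base = pkt, .iov_len = len };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf.buf, .msg_controllen = sizeof(cbuf.buf),
    };
    ssize_t n = recvmsg(fd, &msg, 0);

    if (n >= 0)
        *have_aux = (find_auxdata(&msg, aux) == 0);
    return n;
}

/* Read the counters; each read also resets them in the kernel.
 * Assumes no TPACKET_V3 ring (that variant uses struct tpacket_stats_v3). */
static int read_stats(int fd, struct tpacket_stats *st)
{
    socklen_t len = sizeof(*st);
    return getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, st, &len);
}

/* Select a ring variant (e.g. TPACKET_V2); must precede PACKET_RX_RING. */
static int set_ring_version(int fd, int version)
{
    return setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
}
```

With set_ring_version(fd, TPACKET_V2) called before PACKET_RX_RING, each ring frame starts with struct tpacket2_hdr instead of struct tpacket_hdr, so code walking the ring has to use the matching header type.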