On Wed, 14 Dec 2016 14:45:00 -0800 Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:

> On Wed, Dec 14, 2016 at 1:29 PM, Jesper Dangaard Brouer
> <brouer@xxxxxxxxxx> wrote:
> > On Wed, 14 Dec 2016 08:45:08 -0800
> > Alexander Duyck <alexander.duyck@xxxxxxxxx> wrote:
> >
> >> I agree.  This is a no-go from the performance perspective as well.
> >> At a minimum you would have to be zeroing out the page between uses to
> >> avoid leaking data, and that assumes that the program we are sending
> >> the pages to is slightly well behaved.  If we think zeroing out an
> >> sk_buff is expensive, wait until we are trying to do an entire 4K page.
> >
> > Again, yes, the page will be zeroed out, but only when entering the
> > page_pool.  Because they are recycled, they are not cleared on every
> > use.  Thus, performance does not suffer.
>
> So you are talking about recycling, but not clearing the page when it
> is recycled.  That right there is my problem with this.  It is fine if
> you assume the pages are used by the application only, but you are
> talking about using them for both the application and for the regular
> network path.  You can't do that.  If you are recycling you will have
> to clear the page every time you put it back onto the Rx ring,
> otherwise you can leak the recycled memory into user space and end up
> with a user space program being able to snoop data out of the skb.
>
> > Besides, clearing a large memory area is not as bad as clearing a small
> > one.  Clearing an entire page does cost something, as mentioned before
> > 143 cycles, which is 28 bytes-per-cycle (4096/143).  And clearing 256
> > bytes costs 36 cycles, which is only 7 bytes-per-cycle (256/36).
>
> What I am saying is that you are going to be clearing the 4K blocks
> each time they are recycled.  You can't have the pages shared between
> user-space and the network stack unless you have true isolation.  If
> you are allowing network stack pages to be recycled back into the
> user-space application you open up all sorts of leaks where the
> application can snoop into data it shouldn't have access to.

See later, the "Read-only packet page" mode should provide a mode where
the netstack doesn't write into the page, and thus cannot leak kernel
data.  (CAP_NET_ADMIN already gives it access to other applications'
data.)

> >> I think we are stuck with having to use a HW filter to split off
> >> application traffic to a specific ring, and then having to share the
> >> memory between the application and the kernel on that ring only.  Any
> >> other approach just opens us up to all sorts of security concerns
> >> since it would be possible for the application to try to read and
> >> possibly write any data it wants into the buffers.
> >
> > This is why I wrote a document[1], trying to outline how this is possible,
> > going through all the combinations, and asking the community to find
> > faults in my idea.  Inlining it again, as nobody really replied to the
> > content of the doc.
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >
> > [1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
> >
> > ===========================
> > Memory Model for Networking
> > ===========================
> >
> > This design describes how the page_pool changes the memory model for
> > networking in the NIC (Network Interface Card) drivers.
> >
> > .. Note:: The catch for driver developers is that, once an application
> >           requests zero-copy RX, then the driver must use a specific
> >           SKB allocation mode and might have to reconfigure the
> >           RX-ring.
> >
> >
> > Design target
> > =============
> >
> > Allow the NIC to function as a normal Linux NIC and be shared in a
> > safe manner, between the kernel network stack and an accelerated
> > userspace application using RX zero-copy delivery.
> >
> > The target is to provide the basis for building RX zero-copy solutions
> > in a memory-safe manner.  An efficient communication channel for
> > userspace delivery is out of scope for this document, but OOM
> > considerations are discussed below (`Userspace delivery and OOM`_).
> >
> > Background
> > ==========
> >
> > The SKB or ``struct sk_buff`` is the fundamental meta-data structure
> > for network packets in the Linux Kernel network stack.  It is a fairly
> > complex object and can be constructed in several ways.
> >
> > From a memory perspective there are two ways, depending on the
> > RX-buffer/page state:
> >
> > 1) Writable packet page
> > 2) Read-only packet page
> >
> > To take full advantage of the page_pool, drivers must actually support
> > handling both options, depending on the configuration state of the
> > page_pool.
> >
> > Writable packet page
> > --------------------
> >
> > When the RX packet page is writable, the SKB setup is fairly
> > straightforward.  The skb->data (and skb->head) can point directly to
> > the page data, adjusting the offset according to the driver's headroom
> > (for adding headers) and setting the length according to the DMA
> > descriptor info.
> >
> > The page/data need to be writable, because the network stack needs to
> > adjust headers (like TimeToLive and checksum) or even add or remove
> > headers for encapsulation purposes.
> >
> > A subtle catch, which also requires a writable page, is that the SKB
> > also has an accompanying "shared info" data structure, ``struct
> > skb_shared_info``.  This "skb_shared_info" is written into the
> > skb->data memory area at the end (skb->end) of the (header) data.  The
> > skb_shared_info contains semi-sensitive information, like kernel
> > memory pointers to other pages (which might be pointers to more packet
> > data).  It would be bad from a zero-copy point of view to leak this
> > kind of information.
>
> This should be the default once we get things moved over to using the
> DMA_ATTR_SKIP_CPU_SYNC DMA attribute.  It will be a little while more
> before it gets fully into Linus's tree.  It looks like the swiotlb
> bits have been accepted, just waiting on the ability to map a page w/
> attributes and the remainder of the patches that are floating around
> in mmotm and linux-next.

I'm very happy that you are working on this.

> BTW, any ETA on when we might expect to start seeing code related to
> the page_pool?  It is much easier to review code versus these kinds of
> blueprints.

I've implemented a proof-of-concept of the page_pool, but only the first
stage, which is the ability to replace driver-specific page caches.  It
works, but it is not upstream ready, as e.g. it assumes it can get a page
flag, and cleanup-on-driver-unload code is missing.  Mel Gorman has
reviewed it, but with the changes he requested I lost quite some
performance; I'm still trying to figure out a way to regain that lost
performance.  The zero-copy part is not implemented.
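To make the "Writable packet page" mode above a bit more concrete: it is
essentially what build_skb() based drivers already do today, just with
the page coming from the page_pool.  A rough sketch only; the function
and parameter names are invented and error handling is trimmed:

/* Sketch: writable-page mode, one packet per page, names invented.
 * The page would come from the RX-ring / page_pool.
 */
#include <linux/skbuff.h>
#include <linux/mm.h>

static struct sk_buff *rx_writable_page_to_skb(struct page *page,
                                               unsigned int headroom,
                                               unsigned int pkt_len)
{
        void *va = page_address(page);
        struct sk_buff *skb;

        /* build_skb() places skb_shared_info inside the page, at the
         * end; headroom + pkt_len must leave room for it.  This is why
         * the mode requires a writable page.
         */
        skb = build_skb(va, PAGE_SIZE);
        if (unlikely(!skb))
                return NULL;

        skb_reserve(skb, headroom);     /* driver headroom for adding headers */
        skb_put(skb, pkt_len);          /* length from the DMA descriptor */
        return skb;
}

The important detail is that the skb_shared_info ends up inside the page
itself (at skb->end), which is exactly why this mode cannot be combined
with handing the same page read-only to userspace.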
> > Read-only packet page
> > ---------------------
> >
> > When the RX packet page is read-only, the construction of the SKB is
> > significantly more complicated and even involves one more memory
> > allocation.
> >
> > 1) Allocate a new separate writable memory area, and point skb->data
> >    here.  This is needed due to the (above described) skb_shared_info.
> >
> > 2) Memcpy packet headers into this (skb->data) area.
> >
> > 3) Clear part of the skb_shared_info struct in the writable area.
> >
> > 4) Setup a pointer to the packet data in the page (in
> >    skb_shared_info->frags) and adjust the page_offset to be past the
> >    headers just copied.
> >
> > It is useful (later) that the network stack has this notion that part
> > of the packet and a page can be read-only.  This implies that the
> > kernel will not "pollute" this memory with any sensitive information.
> > This is good from a zero-copy point of view, but bad from a
> > performance perspective.
>
> This will hopefully become a legacy approach.

Hopefully, but this mode will have to be supported forever, and it is the
current default.

> > NIC RX Zero-Copy
> > ================
> >
> > Doing NIC RX zero-copy involves mapping RX pages into userspace.  This
> > involves costly mapping and unmapping operations in the address space
> > of the userspace process.  Plus, for doing this safely, the page memory
> > needs to be cleared before using it, to avoid leaking kernel
> > information to userspace, also a costly operation.  The page_pool base
> > "class" of optimization is moving these kinds of operations out of the
> > fastpath, by recycling and lifetime control.
> >
> > Once a NIC RX-queue's page_pool has been configured for zero-copy
> > into userspace, can packets still be allowed to travel the normal
> > stack?
> >
> > Yes, this should be possible, because the driver can use the
> > SKB-read-only mode, which avoids polluting the page data with
> > kernel-side sensitive data.  This implies that when a driver RX-queue
> > switches the page_pool to RX-zero-copy mode, it MUST also switch to
> > SKB-read-only mode (for normal stack delivery for this RXq).
>
> This is the part that is wrong.  Once userspace has access to the
> pages in an Rx ring that ring cannot be used for regular kernel-side
> networking.  If it is, then sensitive kernel data may be leaked
> because the application has full access to any page on the ring so it
> could read the data at any time regardless of where the data is meant
> to be delivered.

Are you sure?  Can you give me an example of kernel code that writes into
the page when it is attached as a read-only page to the SKB?  That would
violate how we/drivers use the DMA API today (calling DMA unmap when
packets are in-flight).

> > XDP can be used for controlling which pages get RX zero-copied to
> > userspace.  The page is still writable for the XDP program, but
> > read-only for normal stack delivery.
>
> Making the page read-only doesn't get you anything.  You still have a
> conflict since user-space can read any packet directly out of the
> page.

Giving the application CAP_NET_ADMIN already gave it "tcpdump" read
access to all other applications' packet content from that NIC.
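For reference, the read-only construction steps (1-4 in the doc above)
map to the header-copy plus page-frag trick several drivers already use.
Again only a sketch, with invented names and a fixed header copy length:

/* Sketch: read-only page mode.  Headers are copied into a separate
 * writable area; the (read-only) page is only attached as a frag.
 */
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/skbuff.h>
#include <linux/mm.h>

#define RX_COPY_HDR 128         /* example; a driver might use eth_get_headlen() */

static struct sk_buff *rx_readonly_page_to_skb(struct napi_struct *napi,
                                               struct page *page,
                                               unsigned int offset,
                                               unsigned int pkt_len)
{
        void *va = page_address(page) + offset;
        unsigned int hlen = min_t(unsigned int, RX_COPY_HDR, pkt_len);
        struct sk_buff *skb;

        /* 1) separate writable area for skb->data; the skb allocation
         *    also takes care of 3) clearing the skb_shared_info there
         */
        skb = napi_alloc_skb(napi, RX_COPY_HDR);
        if (unlikely(!skb))
                return NULL;

        /* 2) copy packet headers into the writable area */
        memcpy(skb_put(skb, hlen), va, hlen);

        /* 4) attach the page read-only, page_offset past the copied headers */
        if (pkt_len > hlen)
                skb_add_rx_frag(skb, 0, page, offset + hlen,
                                pkt_len - hlen, PAGE_SIZE);
        return skb;
}

Because the headers live in the separate writable area, nothing the
netstack later writes or validates sits in the page that userspace can
see.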
> > Kernel safety
> > -------------
> >
> > For the paranoid, how do we protect the kernel from a malicious
> > userspace program?  Sure, there will be a communication interface
> > between kernel and userspace that synchronizes ownership of pages.
> > But a userspace program can violate this interface; given the pages
> > are kept VMA mapped, the program can in principle access all the
> > memory pages in the given page_pool.  This opens up for a malicious
> > (or defective) program modifying memory pages concurrently with the
> > kernel and DMA engine using them.
> >
> > An easy way to get around userspace modifying page data contents is
> > simply to map the pages read-only into userspace.
> >
> > .. Note:: The first implementation target is read-only zero-copy RX
> >           pages to userspace, and it requires the driver to use the
> >           SKB-read-only mode.
>
> This allows for Rx but what do we do about Tx?

True, I've not covered Tx.  But I believe Tx is easier from a sharing
PoV, as we don't have the early-demux sharing problem, because an
application/socket will be the starting point, and can simply have an
associated page_pool for TX, solving the VMA mapping overhead.

Using the skb-read-only-page mode, this would in principle allow normal
socket zero-copy TX and packet steering.

For performance reasons, when you already know what NIC you want to TX
on, you could extend this to allocate a separate queue for TX.  Which
makes it look a lot like RDMA.

> It sounds like Christoph's RDMA approach might be the way to go.

I'm getting more and more fond of Christoph's RDMA approach.  I do think
we will end up with something close to that approach.  I just wanted to
get review on my idea first.

IMHO the major blocker for the RDMA approach is not the HW filters
themselves, but a common API that applications can call to register what
goes into the HW queues in the driver.  I suspect it will be a long
project agreeing between vendors.  And agreeing on semantics.
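On the "map pages read-only into userspace" point under Kernel safety
above: the userspace side is nothing more than a PROT_READ mapping.  In
the sketch below everything except the mmap() protection flag is
hypothetical (the device name and offset convention don't exist yet):

/* Userspace sketch: the device name and mapping convention are made up.
 * The point is only the PROT_READ mapping: a buggy or malicious program
 * gets a fault instead of modifying pages shared with the kernel and
 * the DMA engine.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        size_t pool_size = 1024 * 4096;         /* e.g. 1024 RX pages */
        void *pool;
        int fd;

        fd = open("/dev/page_pool_rxq0", O_RDONLY);     /* hypothetical */
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Read-only, shared mapping of the RX page_pool pages */
        pool = mmap(NULL, pool_size, PROT_READ, MAP_SHARED, fd, 0);
        if (pool == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... consume packets via the (out-of-scope) userspace queue ... */

        munmap(pool, pool_size);
        close(fd);
        return 0;
}

Flipping such a mapping to PROT_WRITE per transfer is what the
"Advanced" section below argues would likely kill performance.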
> > Advanced: Allowing userspace write access?
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > What if userspace needs write access?  Flipping the page permissions
> > per transfer will likely kill performance (as this likely affects the
> > TLB-cache).
> >
> > I will argue that giving userspace write access is still possible,
> > without risking a kernel crash.  This is related to the SKB-read-only
> > mode that copies the packet headers (into another memory area,
> > inaccessible to userspace).  The attack angle is to modify packet
> > headers after they have passed some kernel network stack validation
> > step (as once headers are copied they are out of "reach").
> >
> > Situation classes where the memory page can be modified concurrently:
> >
> > 1) When the DMA engine owns the page.  Not a problem, as the DMA
> >    engine will simply overwrite the data.
> >
> > 2) Just after the DMA engine finishes writing.  Not a problem, the
> >    packet will go through netstack validation and be rejected.
> >
> > 3) While XDP reads the data.  This can lead to the XDP/eBPF program
> >    going into a wrong code branch, but the eBPF virtual machine should
> >    not be able to crash the kernel.  The worst outcome is a wrong or
> >    invalid XDP return code.
> >
> > 4) Before the SKB with the read-only page is constructed.  Not a
> >    problem, the packet will go through netstack validation and be
> >    rejected.
> >
> > 5) After the SKB with the read-only page has been constructed.
> >    Remember the packet headers were copied into a separate memory
> >    area, and the page data is pointed to with an offset past the
> >    copied headers.  Thus, userspace cannot modify the headers used
> >    for netstack validation.  It can only modify packet data contents,
> >    which is less critical as it cannot crash the kernel, and
> >    eventually this will be caught by packet checksum validation.
> >
> > 6) After the netstack delivered the packet to another userspace
> >    process.  Not a problem, as it cannot crash the kernel.  It might
> >    corrupt packet data being read by another userspace process, which
> >    is one argument for requiring elevated privileges to get write
> >    access (like CAP_NET_ADMIN).
>
> If userspace has access to a ring we shouldn't be using SKBs on it
> really anyway.  We should probably expect XDP to be handling all the
> packaging so items 4-6 can probably be dropped.

> > Userspace delivery and OOM
> > --------------------------
> >
> > These RX pages are likely mapped to userspace via mmap(), so far so
> > good.  It is key to performance to get an efficient way of signaling
> > between kernel and userspace, e.g. what pages are ready for
> > consumption, and when userspace is done with the page.
> >
> > It is outside the scope of the page_pool to provide such a queuing
> > structure, but the page_pool can offer some means of protecting the
> > system resource usage.  It is a classical problem that resources
> > (e.g. the page) must be returned in a timely manner, else the system,
> > in this case, will run out of memory.  Any system/design with
> > unbounded memory allocation can lead to Out-Of-Memory (OOM)
> > situations.
> >
> > Communication between kernel and userspace is likely going to be some
> > kind of queue, given that transferring packets individually will have
> > too much scheduling overhead.  A queue can implicitly function as a
> > bulking interface, and offers a natural way to split the workload
> > across CPU cores.
> >
> > This essentially boils down to a two-queue system, with the RX-ring
> > queue and the userspace delivery queue.
> >
> > Two bad situations exist for the userspace queue:
> >
> > 1) Userspace is not consuming objects fast enough.  This should simply
> >    result in packets getting dropped when enqueueing to a full
> >    userspace queue (as the queue *must* implement some limit).  An
> >    open question is whether this should be reported or communicated
> >    to userspace.
> >
> > 2) Userspace is consuming objects fast, but not returning them in a
> >    timely manner.  This is a bad situation, because it threatens the
> >    system stability as it can lead to OOM.
> >
> > The page_pool should somehow protect the system in case 2.  The
> > page_pool can detect the situation, as it is able to track the number
> > of outstanding pages due to the recycle feedback loop.  Thus, the
> > page_pool can have some configurable limit of allowed outstanding
> > pages, which can protect the system against OOM.
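To illustrate the "configurable limit of allowed outstanding pages": the
accounting inside the page_pool could be as simple as the sketch below
(not the actual API; names and fields are invented):

/* Sketch: OOM protection via a limit on pages handed to userspace. */
#include <linux/atomic.h>
#include <linux/types.h>

struct pp_sketch {
        atomic_t        outstanding;    /* pages currently owned by userspace */
        int             limit;          /* configurable max outstanding pages */
};

/* Called when a page is handed to the userspace delivery queue */
static bool pp_sketch_hold_page(struct pp_sketch *pool)
{
        if (atomic_inc_return(&pool->outstanding) > pool->limit) {
                atomic_dec(&pool->outstanding);
                return false;   /* deny: refill stops, HW drops packets */
        }
        return true;
}

/* Called from the recycle feedback loop when userspace returns a page */
static void pp_sketch_return_page(struct pp_sketch *pool)
{
        atomic_dec(&pool->outstanding);
}

Once the limit is hit, allocations are denied, which is exactly the
"Effect of blocking allocation" case discussed below.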
> > Note, the `Fbufs paper`_ proposes to solve case 2 by allowing these
> > pages to be "pageable", i.e. swappable, but that is not an option for
> > the page_pool as these pages are DMA mapped.
> >
> > .. _`Fbufs paper`:
> >    http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.52.9688
> >
> > Effect of blocking allocation
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > The effect of the page_pool denying more allocations in case 2 is
> > essentially that the RX-ring queue cannot be refilled and HW starts
> > dropping packets due to "out-of-buffers".  For NICs with several HW
> > RX-queues, this can be limited to a subset of queues (and the admin
> > can control which RX queues with HW filters).
> >
> > The question is if the page_pool can do something smarter in this
> > case, to signal the consumers of these pages before the maximum limit
> > of allowed outstanding pages is hit.  The MM-subsystem already has a
> > concept of emergency PFMEMALLOC reserves and an associated page-flag
> > (e.g. page_is_pfmemalloc), and the network stack already handles and
> > reacts to this.  Could the same PFMEMALLOC system be used for marking
> > pages when the limit is close?
> >
> > This requires further analysis.  One can imagine that this could be
> > used at RX by XDP to mitigate the situation by dropping less-important
> > frames.  Given that XDP chooses which pages are being sent to
> > userspace, it might have appropriate knowledge of what is relevant to
> > drop(?).
> >
> > .. Note:: An alternative idea is using a data structure that blocks
> >           userspace from getting new pages before returning some.
> >           (out of scope for the page_pool)

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer