Re: [patch v3] splice: fix race with page invalidation

Jamie Lokier <jamie@xxxxxxxxxxxxx> · Thu, 31 Jul 2008 07:12:01 +0100

Linus Torvalds wrote:
> > Jamie Lokier wrote:
> > > not being able to tell when a sendfile() has finished with the pages
> > > its sending.
> > 
> > (Except by the socket fully closing or a handshake from the other end,
> > obviously.)
> 
> Well, people should realize that this is pretty fundamental to zero-copy 
> scemes. It's why zero-copy is often much less useful than doing a copy in 
> the first place. How do you know how far in a splice buffer some random 
> 'struct page' has gotten? Especially with splicing to spicing to tee to 
> splice...

Having implemented an equivalent zero-copy thing in userspace, I can
confidently say it's not fundamental at all.

What is fundamental is that you either (a) treat sendfile as an async
operation, and get a notification when it's finished with the data,
just like any other async operation, or (b) while sendfile claims those
pages, they are marked COW.

(b) is *much* more useful for the things you actually want to use
sendfile for, namely a faster copy-file-to-socket with no weird
complications.  Since you're sending files which you don't *expect* to
change (but want to behave sensibly if they do), and the pages
probably aren't mapped into any process, COW would not cost anything.

Right now, sendfile is used by servers of all kinds: http, ftp, file
servers, you name it.  They all want to believe it's purely a
performance optimisation, equivalent to write.  On many operations
systems, it is.  (I count sendfile equivalents on: Windows NT, SCO
Unixware, Solaris, FreeBSD, Dragonfly, HP-UX, Tru64, AIX and S/390 in
addition to Linux :-)

> You'd have to have some kind of barrier model (which would be really 
> complex), or perhaps a "wait for this page to no longer be shared" (which 
> has issues all its own).
> 
> IOW, splice() is very closely related to a magic kind of "mmap()+write()" 
> in another thread. That's literally what it does internally (except the 
> "mmap" is just a small magic kernel buffer rather than virtual address 
> space), and exactly as with mmap, if you modify the file, the other thread 
> will see if, even though it did it long ago.

That's fine.  But if you use a thread, the thread can tell you when
it's done.  Then you know what you're sending not an infinite time in
the future :-)

> Personally, I think the right approach is to just realize that splice() is 
> _not_ a write() system call, and never will be. If you need synchronous 
> writing, you simply shouldn't use splice().

People want zero-copy, and no weirdness like sending blocks of zeros
which the file never contained, and (if you lock the file) knowing
when to release locks for someone else to edit the file.

Sync or async doesn't matter so much; that's API stuff.

The obvious mechanism for completion notifications is the AIO event
interface.  I.e. aio_sendfile that reports completion when it's safe
to modify data it was using.  aio_splice would be logical for similar
reasons.  Note it doesn't mean when the data has reached a particular
place, it means when the pages it's holding are released.  Pity AIO
still sucks ;-)

Btw, Windows had this since forever, it's called overlapped
TransmitFile with an I/O completion event.  Don't know if it's any
good though ;-)

-- Jamie
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html