On Sat, Jan 18, 2014 at 08:10:31PM +0000, Al Viro wrote:

> Ouch...  No, I hadn't meant that kind of insanity, but I'd missed the
> problem with scarcity of mappings completely...

OK, that pretty much kills this approach.  Pity...

Folks, what do you think about the following:

    * a new data structure:
        struct io_source {
            enum {IO_IOVEC, IO_PVEC} type;
            union {
                struct iovec *iov;
                struct pvec {
                    struct page *page;
                    unsigned offset;
                    unsigned size;
                } *pvec;
            };
        };
    * a new method that would look like aio_write, but take struct
      io_source instead of iovec.
    * store the type in iov_iter (normally - IO_UIOVEC) and teach the
      code dealing with it to do the right thing depending on type.
      I.e. instead of __copy_from_user_inatomic() do
      kmap_atomic()/memcpy()/kunmap_atomic() if it's an IO_PAGEVEC.
    * generic_file_aio_write() analog for the new method, converging
      with generic_file_aio_write() almost immediately (basically, as
      soon as iov_iter has been initialized).
    * new_aio_write() consisting of
        {
            struct io_source source = {.type = IO_UIOVEC, .user = iov};
            return file->f_op-><new_method>(iocb, &source, nr_segs, pos);
        }
    * new_sync_write(), doing what do_sync_write() does for files that
      have new_aio_write() as ->aio_write().
    * new_splice_write() usable for files that provide that method - it
      would collect pipe_buffers, put together a struct pvec array and
      pass it to that method.

All mapping of the pages would happen one-by-one and only around the
actual copying of the data.  And, of course, the locking would be
identical to what we do for write()/writev()/aio write.

Then filesystems can switch to that new method, flipping their
aio_write() instances to the new type and replacing ->aio_write with
default_aio_write, ->write with new_write and ->splice_write with
new_splice_write.

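The type-dependent copy described above could be modelled in userspace roughly as follows. This is only a sketch under stated assumptions, not kernel code: struct page here is a plain 4096-byte buffer standing in for the real thing, a bare memcpy() stands in both for __copy_from_user_inatomic() and for the kmap_atomic()/memcpy()/kunmap_atomic() sequence, and io_source_copy() is a hypothetical helper name. The enum values follow the struct definition (IO_IOVEC, IO_PVEC) rather than the IO_UIOVEC/IO_PAGEVEC spellings used elsewhere in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>            /* struct iovec */

/* Userspace stand-in for the kernel's struct page (assumption). */
struct page {
    unsigned char data[4096];
};

struct pvec {
    struct page *page;
    unsigned offset;
    unsigned size;
};

enum io_type { IO_IOVEC, IO_PVEC };

struct io_source {
    enum io_type type;
    union {
        struct iovec *iov;
        struct pvec *pvec;
    };
};

/*
 * Hypothetical helper: copy at most len bytes from the first nr_segs
 * segments of src into dst, picking the source address according to
 * src->type.  Returns the number of bytes copied.
 */
size_t io_source_copy(const struct io_source *src, unsigned long nr_segs,
                      void *dst, size_t len)
{
    unsigned char *out = dst;
    size_t copied = 0;

    assert(src != NULL);
    for (unsigned long i = 0; i < nr_segs && copied < len; i++) {
        const void *from;
        size_t n;

        if (src->type == IO_PVEC) {
            /* kernel: kmap_atomic() the page around the copy */
            const struct pvec *pv = &src->pvec[i];
            from = pv->page->data + pv->offset;
            n = pv->size;
        } else {
            /* kernel: __copy_from_user_inatomic() */
            from = src->iov[i].iov_base;
            n = src->iov[i].iov_len;
        }
        if (n > len - copied)
            n = len - copied;   /* do not overrun the destination */
        memcpy(out + copied, from, n);
        copied += n;
    }
    return copied;
}
```

The point of the model is that everything after the type check is shared between the two cases, which is what would let the new method converge with generic_file_aio_write() once iov_iter has been set up.
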
Actually, there's a possibility that it could be used for *all*
instances of ->splice_write() - we'd need to store something like a
pointer to a "call this to try and steal this page" function in pvec
and allow the method to do the actual stealing.  Note that pipe_buffer
->steal() only uses the page argument - they all ignore which pipe it's
in (and there's nothing they could usefully do even if they knew which
pipe it had been in in the first place).

This is very preliminary, of course, and I might easily miss something -
the previous idea was unworkable, after all.  Comments would be very
welcome...
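
One way the page-stealing idea above might look, again as a userspace model rather than kernel code. Everything here is an assumption made for illustration: the steal member and its "0 means the caller now owns the page" convention, the refcount field in the stand-in struct page, and the helper names pvec_take() and steal_if_unshared().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for struct page; refcount is illustrative only. */
struct page {
    int refcount;
};

struct pvec {
    struct page *page;
    unsigned offset;
    unsigned size;
    /* Hypothetical hook: returns 0 if the caller may take the page. */
    int (*steal)(struct page *page);
};

enum take_result { TOOK_PAGE, MUST_COPY };

/*
 * Hypothetical helper for the write method: try to steal the page,
 * fall back to copying when there is no hook or the hook refuses.
 */
enum take_result pvec_take(const struct pvec *pv)
{
    assert(pv != NULL);
    if (pv->steal && pv->steal(pv->page) == 0)
        return TOOK_PAGE;
    return MUST_COPY;
}

/* Example hook: only a page with a single reference can be stolen. */
int steal_if_unshared(struct page *page)
{
    return page->refcount == 1 ? 0 : 1;
}
```

Since the hook takes only the page, nothing about the originating pipe has to be carried in the pvec entry.
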