On Tue, Jan 14, 2014 at 05:20:33PM +0000, Al Viro wrote:
> On Tue, Jan 14, 2014 at 05:22:07AM -0800, Christoph Hellwig wrote:
> > On Mon, Jan 13, 2014 at 11:56:46PM +0000, Al Viro wrote:
> > > On Mon, Jan 13, 2014 at 06:14:16AM -0800, Christoph Hellwig wrote:
> > > > ping?  Would be nice to get this into 3.14
> > > >
> > > Umm... The reason for pipe_lock outside of ->i_mutex is this:
> > > default_file_splice_write() calls splice_from_pipe() with
> > > write_pipe_buf for callback.  splice_from_pipe() calls that
> > > callback under pipe_lock(pipe).  And write_pipe_buf() calls
> > > __kernel_write(), which certainly might want to take ->i_mutex.
> > >
> > > Now, this codepath isn't taken for files that have non-NULL
> > > ->splice_write(), so that's not an issue for XFS and OCFS2,
> > > but having pipe_lock nest between the ->i_mutex for filesystems
> > > that do and do not have ->splice_write()... Ouch...
> >
> > What would be the alternative?  Duplicating the code in even more
> > filesystems to enforce a non-natural locking order for filesystems
> > actually implementing splice?  There don't actually seem to be a whole
> > lot of real filesystems not implementing splice_write, the prime use
> > would be for device drivers or synthetic ones.  I'm not even sure
> > how much that fallback gets used in practice.

Hmm... In principle, the following would be no worse than what
generic_file_splice_write() is doing: confirm and map the pages, build
an iovec and use ->aio_write() to write it out, then unmap the suckers,
release ones entirely written to file and adjust the partially written
one.  All under pipe_lock().

Hell, if we introduce kernel_writev() (either by calling vfs_writev() or
taking do_readv_writev() sans copying iovec and using that under
set_fs()), we could switch default_file_splice_write() to that and get
rid of ->splice_write() for the majority of filesystems, if not all of
them.
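For illustration only, the bookkeeping described above (build an iovec
over the pending buffers, write it out in one go, release the buffers
that were written in full, adjust the one that was written in part) can
be sketched in user space.  This is not kernel code: struct fake_pipe,
struct fake_pipe_buf, NBUFS and the function names are made up for the
example, plain writev(2) stands in for the proposed kernel_writev(), and
neither pipe_lock() nor page confirm/map/unmap is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define NBUFS 16 /* hypothetical cap on pending buffers */

/* Hypothetical stand-in for a pipe buffer: data plus an offset/len window. */
struct fake_pipe_buf {
	const char *data;
	size_t offset;
	size_t len;
};

struct fake_pipe {
	struct fake_pipe_buf bufs[NBUFS];
	size_t nrbufs;
};

/* Describe every pending buffer with one iovec entry; returns the count. */
static size_t fake_pipe_fill_iov(const struct fake_pipe *pipe, struct iovec *iov)
{
	size_t i;

	for (i = 0; i < pipe->nrbufs; i++) {
		const struct fake_pipe_buf *buf = &pipe->bufs[i];

		/* writev() only reads the data, so dropping const is safe here */
		iov[i].iov_base = (void *)(buf->data + buf->offset);
		iov[i].iov_len = buf->len;
	}
	return pipe->nrbufs;
}

/* Release buffers written in full, then adjust the partially written one. */
static void fake_pipe_consume(struct fake_pipe *pipe, size_t nbytes)
{
	size_t done = 0;

	while (done < pipe->nrbufs && nbytes >= pipe->bufs[done].len) {
		nbytes -= pipe->bufs[done].len;
		done++;
	}
	if (done < pipe->nrbufs) {
		pipe->bufs[done].offset += nbytes;
		pipe->bufs[done].len -= nbytes;
	}
	memmove(pipe->bufs, pipe->bufs + done,
		(pipe->nrbufs - done) * sizeof(pipe->bufs[0]));
	pipe->nrbufs -= done;
}

/* One writev() over everything pending; returns bytes written or -1. */
static ssize_t fake_splice_write(struct fake_pipe *pipe, int fd)
{
	struct iovec iov[NBUFS];
	size_t n = fake_pipe_fill_iov(pipe, iov);
	ssize_t ret;

	if (n == 0)
		return 0;
	ret = writev(fd, iov, (int)n);
	if (ret > 0)
		fake_pipe_consume(pipe, (size_t)ret);
	return ret;
}
```

A short write is handled by the same path as a full one: whatever
writev() reports gets passed to fake_pipe_consume(), which leaves the
unwritten tail in place for the next call.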
Sure, it means copying from pipe buffers to pagecache, but we have
generic_file_splice_write() do that copy anyway - conditional memcpy()
in pipe_to_file() is actually unconditional; that if (page != buf->page)
in there had just been forgotten by Nick back in 2007 ("1/2 splice: dont
steal").

Objections, comments?

The problem Christoph was talking about is that
generic_file_splice_write() plays with ->i_mutex and both gets/drops it
for each page of IO *and* causes PITA for any fs that wants some locks
of its own taken in addition to ->i_mutex on the write paths.  What
->splice_write() without page stealing is doing is pretty much a
writev() from array of pages in kernel space; so it looks like we might
as well just reuse writev() guts for that...