[Jens and Nick Cc'd]

On Fri, Sep 09, 2016 at 07:06:29PM -0700, Linus Torvalds wrote:
> I think you'd be better off with just a really small on-stack case
> (like maybe 2-3 entries), and just allocate anything bigger
> dynamically. Or you could even see how bad it is if you just
> force-limit it to max 4 entries or something like that and just do
> partial writes.

Umm...  Right now it tries to allocate as much as the output pipe could
possibly hold.  With the default being 16 buffers, you'll end up doing
dynamic allocation in all cases (it doesn't even look at the amount of
data we want to transfer).

The situation with splice_pipe_desc looks very odd:

	* all but one instance are on stack frames of some ->splice_read()
	  or something called by it (the exception is in vmsplice).
	* all but one instance (a different one - see below) go through the
	  splice_grow_spd / splice_to_pipe / splice_shrink_spd sequence and
	  nothing else sees them.  The exception is skb_splice_bits(); there
	  we have MAX_SKB_FRAGS for size, don't bother with grow/shrink, and
	  the only thing done to that spd is splice_to_pipe() (from the
	  callback passed to skb_splice_bits()).
	* only one ->splice_read() instance does _not_ create a
	  splice_pipe_desc.  It's fuse_dev_splice_read(), and it pays for
	  that by open-coding splice_to_pipe().  The only reason for
	  open-coding is that we don't have a "stronger SPLICE_F_NONBLOCK"
	  that would fail if the data wouldn't fit.  SPLICE_F_NONBLOCK stuffs
	  as much as possible and buggers off without waiting;
	  fuse_dev_splice_read() wants all or nothing (and no waiting).
	* incidentally, we *can't* add new flags - splice(2)/tee(2)/vmsplice(2)
	  quietly ignore all bits they do not recognize.  In fact, splice(2)
	  ends up passing them (unsanitized) to ->splice_read and
	  ->splice_write instances.
	* for splice(2) the IO size is limited by the nominal capacity of the
	  output pipe.  That looks fairly arbitrary (the limit is the same
	  whether the pipe is full or empty), but I wouldn't be surprised if
	  userland programmers got unhappy if they had to take more iterations
	  through their loops.
	* the other caller of ->splice_read() is splice_direct_to_actor(),
	  and that can be called on a fairly deep stack.  However, there we
	  loop ourselves and a smaller chunk size is not a problem.
	* in case of skb_splice_bits(), we probably want a grow/shrink pair
	  as well, with a default well below MAX_SKB_FRAGS - what's the
	  typical number of fragments per skb?

> So feel free to try maxing out using only a small handful of
> pipe_buffer entries. Returning partial IO from splice() is fine.

Are you sure that nobody's growing the output pipe buffer before doing
splice() into it as a way to reduce the number of iterations?

FWIW, I would love to replace this array of page * plus array of
<offset,len,private> triples with an array of pipe_buffer; for one
thing, this ridiculous ->spd_release() goes away (we simply call
->ops->release() on all unwanted buffers), which gets rid of wonders like

static void buffer_spd_release(struct splice_pipe_desc *spd, unsigned int i)
{
	struct buffer_ref *ref =
		(struct buffer_ref *)spd->partial[i].private;

	if (--ref->ref)
		return;

	ring_buffer_free_read_page(ref->buffer, ref->page);
	kfree(ref);
	spd->partial[i].private = 0;
}

static void buffer_pipe_buf_release(struct pipe_inode_info *pipe,
				    struct pipe_buffer *buf)
{
	struct buffer_ref *ref = (struct buffer_ref *)buf->private;

	if (--ref->ref)
		return;

	ring_buffer_free_read_page(ref->buffer, ref->page);
	kfree(ref);
	buf->private = 0;
}

pairs that need to be kept in sync, etc.
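With an array of pipe_buffer, the teardown after splice_to_pipe() would
become completely generic; a rough sketch of what I mean (the helper name
is made up, and I'm assuming the spd simply carries a ->bufs array):

/*
 * Sketch, not real code: drop whatever splice_to_pipe() did not manage
 * to shove into the pipe.  No per-caller ->spd_release() - each buffer
 * knows how to free itself via its ops.
 */
static void release_unused_bufs(struct pipe_inode_info *pipe,
				struct pipe_buffer *bufs,
				unsigned int from, unsigned int to)
{
	while (from < to) {
		struct pipe_buffer *buf = bufs + from++;

		buf->ops->release(pipe, buf);
	}
}

and buffer_pipe_buf_release() above would be the only copy of the
trace-specific logic left.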
One inconvenience created by that is stuff like

	spd.nr_pages = find_get_pages_contig(mapping, index, nr_pages,
					     spd.pages);

in there; granted, this one will go away with __generic_file_splice_read(),
but e.g. get_iovec_page_array() is using get_user_pages_fast(), which wants
to put pages next to each other.  That one is from the guts of
vmsplice_to_pipe(), and I've no idea what the normal use patterns are.
OTOH, how much overhead would we get from repeated calls of
get_user_pages_fast() for e.g. 16 pages or so, compared to larger chunks?
It is on a shallow stack, so it's not as if we couldn't afford a 16-element
array of struct page * in there...
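To make the question concrete, the chunked variant would look roughly like
this (pure sketch, name invented; offset/partial-page bookkeeping and
feeding each chunk to the pipe are elided, and 'base' is assumed to be
page-aligned):

/* Pin a user range at most 16 pages at a time, on-stack array only. */
static int pin_user_range_chunked(unsigned long base, size_t len)
{
	struct page *pages[16];		/* fine on a shallow stack */
	int pinned = 0;

	while (len) {
		int nr = min_t(size_t, DIV_ROUND_UP(len, PAGE_SIZE),
			       ARRAY_SIZE(pages));
		int got = get_user_pages_fast(base, nr, 0, pages);

		if (got <= 0)
			break;
		pinned += got;
		/* ... turn pages[0..got) into pipe buffers here ... */
		base += (unsigned long)got << PAGE_SHIFT;
		len -= min_t(size_t, len, (size_t)got << PAGE_SHIFT);
		if (got < nr)
			break;		/* ran into an unmapped area */
	}
	return pinned;
}

Whether the extra calls are measurable next to the page table walks
themselves is exactly the question above.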