On Mon, Jan 20, 2014 at 12:32:01PM -0800, Linus Torvalds wrote: > On Mon, Jan 20, 2014 at 5:55 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: > > On Sun, Jan 19, 2014 at 05:13:35AM +0000, Al Viro wrote: > >> Folks, what do you think about the following: > > > > That's very much what Shaggy did in the aio-direct tree, except that > > it kept using a single set of methods. Linus really didn't like it > > unfortunately. > > Umm. That wasn't what I objected to. > > I objected to the incredibly ugly implementation, the crazy new flags > arguments for core helper functions, ugly naming etc etc. I even > outlined what might fix it. > > In other words, I thought the code was shit and ugly. Not that > iterators per se would be wrong. Just doing them badly is wrong. Gyahhh... OK, I should've known better than go looking into that thing after such warning. Some relatively printable notes (i.e. about 10% of the comments I really had about that) follow: * WTF bother passing 'pos' separately? It's the same mistake that was made with ->aio_read/->aio_write and just as with those, *all* callers provably have pos == iocb->ki_pos. * I'm not sure that iov_iter is a good fit here. OTOH, it probably could be reused (it has damn few users right now and they are on the codepaths involved into that thing). * We *definitely* want a variant structure with tag - unsigned long thing was just plain insane. I see at least two variants - array of iovecs and array of (at least) triples <page, offset, length>. Quite possibly - quadruples, with "here's how to try to steal this page" thrown in, if we want that as replacement for ->splice_write() as well (it looks like the few instances that do steal on pipe-to-file splices could be dealt with the same way as the dumb ones, provided that ->write_iter or whatever we end up calling it is allowed to try and steal pages). Possibly more variants on the read side of things... FWIW, I'm not sure that bio_vec makes a lot of sense here. * direction (i.e. source vs. destination) also should be a part of that tag. Which, BTW, turns ->direct_IO() into iocb x iov_iter -> int; the situation with pos is identical to aio_read/aio_write. While we are at it, KERNEL_WRITE thing (used only for __swap_writepage()) looks very much like a special case of "array of <page,offset,size>" we want for splice_write... * having looked through the ->read and ->write instances in drivers, I'd say that surprisingly few helpers would go a _long_ way towards converting those guys to the same methods. And we need such helpers anyway - there's a whole lot of (badly) open-coded "copy the whole user buffer into kmalloc'ed array and NUL-terminate the sucker" logics in ->write() instances, for example. Even more "copy up to that much into given array and NUL-terminate it", with rather amusing bugs in there - e.g. NUL going into the end of array, regardless of the actual amount of data to copy; junk is left in the middle, complete with printk of the entire thing if it doesn't make sense. IOW, spewing random piece of kernel stack into dmesg. Off-by-ones galore, too... BTW, speaking of bogosities - grep for generic_write_checks(). Exactly one caller (in __generic_file_aio_write()) has any business looking at S_ISBLK(inode->i_mode) - it can be called by blkdev_aio_write(). All other callers have copied that, even though it makes absolutely no sense for them... * file_readable/file_writable looks really wrong; if nothing else, I would rather check that after open() and set a couple of FMODE bits, then check those. Possibly even "knock FMODE_READ/FMODE_WRITE out if there's no method"... * pipe_buffer_operations ->map()/->unmap() should die; let the caller do k{,un}map{,_atomic}(). All instances have the same method there and there's no point to make it different. PIPE_BUF_FLAG_ATOMIC should also go. * WTF do iov_iter_copy_from_user_atomic() callers keep doing pagefault_disable and pagefault_enable around it? The sucker starts with kmap_atomic() and ends with kunmap_atomic(); all instances of those guys have pagefaul disabling/enabling (and I suspect that it might make sense to lift it out of arch-specific variants - rename them to e.g. __kmap_atomic(), rip pagefault_disable() out of those and put kmap_atomic() into highmem.h outside of ifdef, with !HIGHMEM side of ifdef having #define __kmap_atomic page_address; move pagefault_enable() from __kunmap_atomic() to kunmap_atomic() while we are at it). Note that e.g. pipe.c relies on kmap_atomic() disabling pagefaults - we have __copy_from_user_inatomic() done there under kmap_atomic(), and we really don't want to block in such conditions. * pipe_iov_copy_from_user() ought to return how much has it managed to bring in, not 0 or -EFAULT as it does now. Then it won't need the non-atomic side, AFAICS. Moreover, we'll just be able to use iov_iter_copy_from_user_atomic() (which badly needs a more palatable name, BTW). * sure, we want to call do_generic_file_read() once, passing the entire iovec to file_read_actor(). But what the hell does that have to do with introduction of new methods? It's a change that makes sense on its own. Moreover, it's a damn good preparation to adding those - we get generic_file_aio_read() into "set iov_iter up, then do <this>", with <this> using iov_iter instead of iovec. Once we get to introducing those methods, it's just a matter of taking the rest of generic_file_aio_read() into a separate function and making that function an instance of new method... * Unrelated to this patchset, but... may I politely inquire about the reasons why ->is_partially_uptodate() gets read_descriptor_t? The whole point of read_descriptor_t (if it has any) is that its interpretation is up to whoever's doing the reading, _not_ what they are reading from. So desc->arg is off-limits for any instance of ->is_partially_uptodate(). desc->written is obviously pointless for them; the need (or lack thereof) to do something to the page doesn't depend on how much have we already read from the file. Moreover, reporting an error is rather dubious in such method; if there's something fishy with the page, we'll just return "no" and let ->readpage() complain. Which leaves only desc->count, which, unsurprisingly, is the only thing looked at by (the only) instance of that method. And "tell me if the part of this page that long starting from that offset is up to date" is much more natural that "is what this read_descriptor_t would have had us read from this offset in this page up to date?"... * do we really need separate non-atomic variant of iov_iter_copy_from_user()? We have only two users right now (cifs and ceph) and both can use the fault-in / atomic copy loop without much pain... * in addition to having too many methods, I'm not convinced that we want them to be methods. Let's start with explicit checks in the primitives and see where it goes from there; if we start to grow too many variants, we can always introduce some methods, but then we'll be in better position to decide what is and what is not a good method... * on the read side, I really don't believe that exposing atomic and non-atomic variants is a good idea. Look at the existing users of __copy_to_user_inatomic(); leaving aside the i915_gem weirdness, all of them are used to implement the exact same thing: given a page, offset and length, feed its contents to iovec/buffer/whatnot. Unlike the write side of things, there's nothing between prefaulting pages and actual attempts to copy. So let's make _that_ an exposed primitive and let it deal with kmap/kmap_atomic/etc. Variants that don't have to worry about blocking (vector of <page,offset,length>, etc.) would simply not bother with non-atomic kmap, etc. Sure, it should take iov_iter as destination. And deal with mapping the damn thing internally... * ntfs_file_buffered_write() should switch to iov_iter as well. It's open-coding a lot of iov_iter stuff. It's not entirely trivial and I'd really like to hear from ntfs folks on that, though, and the current code looks deadlock-prone. We prefault everything, then lock the pages to which we'll be copying, then attempt to do __copy_from_user_inatomic(), falling back to __copy_from_user() if that fails. Fine, but what happens if the source of write() is mmaped from the same file and we lose CPU while prefaulting the last page, have memory pressure evict the first one, then have __ntfs_grab_cache_pages() allocate and lock (nonuptodate) pages we'll be copying to and have __copy_from_user() try to copy *from* those same pages? We are doing it while holding these pages locked, so pagefault will have really fun time with them... Anton? BTW, Linus, when did you comment on that patchset? I've found an iteration of that patchset circa last October (v9, apparently the latest posted), but it looks like your comments had either got lost or had been on the earlier iteration of that thing... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html