On Thu, Jan 18, 2018 at 08:49:31AM -0800, Linus Torvalds wrote: > On Thu, Jan 18, 2018 at 8:38 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: > > > > > But there are about ~100 set_fs() calls in generic code, and some of > > > those really are pretty fundamental. Doing things like "kernel_read()" > > > without set_fs() is basically impossible. > > > > Not if we move to iov_iter or iov_iter-like behavior for all reads > > and writes. > > Not going to happen. Really. We have how many tens of thousands of > drivers again, all doing "copy_to_user()". The real PITA is not even that (we could provide helpers making conversion from ->read() to ->read_iter() easy for char devices, etc.). It's the semantics of readv(2). Consider e.g. readv() from /dev/rtc, with iovec array consisting of 10 segments, each int-sized. Right now we'll get rtc_dev_read() called in a loop, once for each segment. Single read() into 40-byte buffer will fill one long and bugger off. Converting it to ->read_iter() will mean more than just "use copy_to_iter() instead of put_user()" - that would be trivial. But to preserve the current behaviour we would need something like total = 0; while (iov_iter_count(to)) { count = iov_iter_single_seg_count(to); /* current body of rtc_dev_read(), with * put_user() replaced with copy_to_iter() */ .... if (res < 0) { if (!total) total = res; break; } total += res; if (res != count) break; } return total; in that thing. And similar boilerplates would be needed in a whole lot of drivers. Sure, they are individually trivial, but they would add up to shitloads of code to get wrong. These are basically all ->read() instances that ignore *ppos and, unlike pipes, do not attempt to fill as much of the buffer as possible. We do have quite a few of such. Some ->read() instances can be easily converted to ->read_iter() and will, in fact, be better off that way. We had patches of that sort and I'm certain that we still have such places left. Ditto for ->write() and ->write_iter(). But those are not even close to being the majority. Sorry. We could, in principle, do something like dev_rtc_read_iter(iocb, to) { return loop_read_iter(iocb, to, modified_dev_rtc_read); } with modified_dev_rtc_read() being the result of minimal conversion (put_user() and copy_to_user() replaced with used of copy_to_iter()). It would be less boilerplate that way, but I really don't see big benefits from doing that. On the write side the things are just as unpleasant - we have a lot of ->write() instances that parse the beginning of the buffer, ignore the rest and report that everything got written. writev() on those will parse each iovec segment, ignoring the junk in the end of each. Again, that loop needs to go somewhere. And we do have a bunch of "parse the buffer and do some action once" ->write() instances - in char devices, debugfs, etc.