On Mon, Jan 16, 2023 at 11:10:11PM +0000, David Howells wrote: > Convert the generic direct-I/O code to use iov_iter_extract_pages() instead > of iov_iter_get_pages(). This will pin pages or leave them unaltered > rather than getting a ref on them as appropriate to the iterator. > > The pages need to be pinned for DIO-read rather than having refs taken on > them to prevent VM copy-on-write from malfunctioning during a concurrent > fork() (the result of the I/O would otherwise end up only visible to the > child process and not the parent). Several observations: 1) fs/direct-io.c is ancient, grotty and has few remaining users. The case of block devices got split off first; these days it's in block/fops.c. Then iomap-using filesystems went to fs/iomap/direct-io.c, leaving this sucker used only by affs, ext2, fat, exfat, hfs, hfsplus, jfs, nilfs2, ntfs3, reiserfs, udf and ocfs2. And frankly, the sooner it dies the better off we are. IOW, you've picked an uninteresting part and left the important ones untouched. 2) if you look at the "should_dirty" logics in either of those (including fs/direct-io.c itself) you'll see a funny thing. First of all, dio->should_dirty (or its counterparts) gets set iff we have a user-backed iter and operation is a read. I.e. precisely the case when you get bio marked with BIO_PAGE_PINNED. And that's not an accident - look at the places where we check that predicate: dio_bio_submit() calls bio_set_pages_dirty() if that predicate is true before submitting the sucker and dio_bio_complete() uses it to choose between bio_check_pages_dirty() and bio_release_pages() + bio_put(). Look at bio_check_pages_dirty() - it checks if any of the pages we were reading into had managed to lose the dirty bit; if none had it does bio_release_pages(bio, false) + bio_put(bio) and returns. If some had, it shoves bio into bio_dirty_list and arranges for bio_release_pages(bio, true) + bio_put(bio) called from upper half (via schedule_work()). The effect of the second argument of bio_release_pages() is to (re)dirty the pages; it can't be done from interrupt, so we have to defer it to process context. Now, do we need to redirty anything there? Recall that page pinning had been introduced precisely to prevent writeback while the page is getting DMA into it. Who is going to mark it clean before we unpin it? Unless I misunderstand something fundamental about the whole thing, this crap should become useless with that conversion. And it's not just the ->should_dirty and its equivalents - bio_check_pages_dirty() and the stuff around it should also be gone once block/fops.c and fs/iomap/direct-io.c are switched to your iov_iter_extract_pages. Moreover, that way the only places legitimately passing true to bio_release_pages() are blk_rq_unmap_user() (on completion of bypass read request mapping a user page) and __blkdev_direct_IO_simple() (on completion of short sync O_DIRECT read from block device). Both could just as well call bio_set_pages_dirty(bio) + bio_release_pages(bio, false), killing the "dirty on release" logics and losing the 'mark_dirty' argument. BTW, where do we dirty the pages on IO_URING_OP_READ_FIXED with O_DIRECT file? AFAICS, bio_set_pages_dirty() won't be called (ITER_BVEC iter) and neither will bio_release_pages() do anything (BIO_NO_PAGE_REF set on the bio by bio_iov_bvec_set() called due to the same ITER_BVEC iter). Am I missing something trivial here? Jens?