On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote: > > On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote: > > Well, blkdev_direct_IO() gets references for all pages, and on READ > > operations it only sets them dirty _later_. > > > > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for > > direct IO read from block devices, and page reclaim happens during > > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() > > returns, but BEFORE the pages are set dirty, the situation happens. > > > > The direct IO read eventually completes. Now, when userspace reads > > the buffers, the PTE is no longer there and the page fault handler > > do_anonymous_page() services that with the zero-page, NOT the data! > > So why not just set the pages dirty early like the other direct I/O > implementations? Or if this is fine with the patch should we remove > the early dirtying elsewhere? In general, since this particular problem is specific to MADV_FREE, it seemed about right to go for a more contained/particular solution (than changes with broader impact/risk to things that aren't broken). This isn't to say either approach shouldn't be pursued, but just that the larger changes aren't strictly needed to actually fix _this_ issue (and might complicate landing the fix into the stable/distro kernels.) Now, specifically on the 2 suggestions you mentioned, I'm not very familiar with other implementations, thus I can't speak to that, sorry. However, on the 1st suggestion (set pages dirty early), John noted [1] there might be issues with that and advised not going there. > > > Reproducer: > > ========== > > > > @ test.c (simplified, but works) > > Can you add this to blktests or some other regularly run regression > test suite? Sure. The test also needs the kernel-side change (to trigger memory reclaim), which can probably be wired for blktests with a fault-injection capability. Does that sound good? Maybe there's a better way to do it. > > > + smp_rmb(); > > + > > + /* > > + * The only page refs must be from the isolation > > + * plus one or more rmap's (dropped by discard:). > > Overly long line. Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100 chars. Ok; v4. > > > + */ > > + if ((ref_count == 1 + map_count) && > > No need for the inner braces. > Ok; v4. I'll wait a bit in case more changes are needed, and send v4 w/ the above. Thanks! [1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@xxxxxxxxxx/ -- Mauricio Faria de Oliveira