Re: [PATCH v3] mm: fix race between MADV_FREE reclaim and blkdev direct IO read

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Feb 2, 2022 at 11:03 AM Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 31, 2022 at 08:02:55PM -0300, Mauricio Faria de Oliveira wrote:
> > Well, blkdev_direct_IO() gets references for all pages, and on READ
> > operations it only sets them dirty _later_.
> >
> > So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for
> > direct IO read from block devices, and page reclaim happens during
> > __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
> > returns, but BEFORE the pages are set dirty, the situation happens.
> >
> > The direct IO read eventually completes. Now, when userspace reads
> > the buffers, the PTE is no longer there and the page fault handler
> > do_anonymous_page() services that with the zero-page, NOT the data!
>
> So why not just set the pages dirty early like the other direct I/O
> implementations?  Or if this is fine with the patch should we remove
> the early dirtying elsewhere?

In general, since this particular problem is specific to MADV_FREE,
it seemed about right to go for a more contained/particular solution
(than changes with broader impact/risk to things that aren't broken).

This isn't to say either approach shouldn't be pursued, but just that
the larger changes aren't strictly needed to actually fix _this_ issue
(and might complicate landing the fix into the stable/distro kernels.)

Now, specifically on the 2 suggestions you mentioned, I'm not very
familiar with other implementations, thus I can't speak to that, sorry.

However, on the 1st suggestion (set pages dirty early), John noted
[1] there might be issues with that and advised not going there.

>
> > Reproducer:
> > ==========
> >
> > @ test.c (simplified, but works)
>
> Can you add this to blktests or some other regularly run regression
> test suite?

Sure.

The test also needs the kernel-side change (to trigger memory reclaim),
which can probably be wired for blktests with a fault-injection capability.

Does that sound good? Maybe there's a better way to do it.

>
> > +                             smp_rmb();
> > +
> > +                             /*
> > +                              * The only page refs must be from the isolation
> > +                              * plus one or more rmap's (dropped by discard:).
>
> Overly long line.

Hmm, checkpatch.pl didn't complain about it. Ah, it checks for 100 chars.
Ok; v4.

>
> > +                              */
> > +                             if ((ref_count == 1 + map_count) &&
>
> No need for the inner braces.
>

Ok; v4.

I'll wait a bit in case more changes are needed, and send v4 w/ the above.

Thanks!

[1] https://lore.kernel.org/linux-mm/7094dbd6-de0c-9909-e657-e358e14dc6c3@xxxxxxxxxx/

-- 
Mauricio Faria de Oliveira



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux