Re: [patch] mm: fix PageUptodate memory ordering bug

Nick Piggin <npiggin@xxxxxxx> · Sun, 23 Dec 2007 07:54:46 +0100

On Sat, Dec 22, 2007 at 12:14:42PM +0000, Hugh Dickins wrote:
> On Sat, 22 Dec 2007, Andrew Morton wrote:
> > On Tue, 18 Dec 2007 02:26:32 +0100 Nick Piggin <npiggin@xxxxxxx> wrote:
> > 
> > > After running SetPageUptodate, preceeding stores to the page contents to
> > > actually bring it uptodate may not be ordered with the store to set the page
> > > uptodate.
> > > 
> > > Therefore, another CPU which checks PageUptodate is true, then reads the
> > > page contents can get stale data.
> > > 
> > > Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
> > > PageUptodate.
> > > 
> > > Many places that test PageUptodate, do so with the page locked, and this
> > > would be enough to ensure memory ordering in those places if SetPageUptodate
> > > were only called while the page is locked. Unfortunately that is not always
> > > the case for some filesystems, but it could be an idea for the future.
> > > 
> > > One thing I like about it is that it brings the handling of anonymous page
> > > uptodateness in line with that of file backed page management, by marking anon
> > > pages as uptodate when they _are_ uptodate, rather than when our implementation
> > > requires that they be marked as such.
> 
> Nick, you're welcome to make that a separate, less controversial patch,
> to send in ahead.  Though I think the last time this came around, I hit
> one of your BUGs in testing shmem.c swapout or swapin or swapoff:
> something missing there that I've lost the record of - please do
> try testing that, maybe it's already fixed this time around.

I've given it some hours in your patented swapping kbuild-on-ext2-on-loop-on-tmpfs
stress testing (including swapoff). Haven't seen a problem as yet (except the tmpfs
swapin deadlock, which I've been patching out).

But if you see anything, please let me know...

> > >  #ifdef CONFIG_S390
> > > +	page_clear_dirty(page);
> > > +#endif
> > > +}
> 
> That's an odd little extract, since page_clear_dirty only does anything
> on s390.

Ah yeah, we could just get rid of the ifdef. Although I don't mind it too much,
as it kind of helps the reader match the other ifdef there...

> > For an overall 0.5% increase in the i386 size of several core mm files.  If
> > you don't blow us up on the spot, you'll slowly bleed us to death.
> > 
> > Can it be improved?
> 
> I do wish it could be.
> 
> I never find the time to give it the thought it needs; and any criticism
> I make is probably unjust, probably patiently answered by Nick on a
> previous round.
> 
> I'm never convinced that SetPageUptodate is the right place for
> this: what's wrong with doing it in those page copying functions?
> Or flush_dcache_page?

There are various places we _could_ do it, but I think PG_uptodate macros
are logically the best, without being too intrusive.

Let me explain. Normally I think the convention would be to open-code the
barriers in the callees (ie. between memset(); SetPageUptodate();, and
if (PageUptodate()) { read from page }).

However I think that would require going through quite a bit of code (including
filesystems) to audit. So I think having them in these macros is pretty
reasonable, and amounts to less thinking required by others.

Why don't I like doing it in page copying functions? Just because there are more
and more varied uses. I can't think of any reasons to rather do it in the page
copying functions, and some reasons against.

flush_dcache_page? Well this bug really is a problem ordering stores to the
page with store to page flags against loads from the same; nothing to do with
cache aliasing. So putting the smp_wmb in flush_dcache_page leaves you without
a natural complement to put the smp_rmb. Although it could be done, I think it
makes it more tangled than having the ordering done in the macros.  We also
only need to order the *initial* stores which bring the page uptodate, rather
than for each store, in the case of flush_dcache_page.

>  Don't we need different kinds of barrier
> according to how the data got into the page (by DMA or not)?

I had thought of that (my previous patch had an XXX: help...) for this
very issue. Without actually knowing what the underlying architecture does,
I "concluded" that it should be done somewhere down at the block layer. I
think it would be silly for the block layer to signal completion if the
results are still incoherent with the CPU cache... but if the experts have
a different opinion, then this needs to be solved with another call anyway
(not in the page uptodate macros and it's not exactly a memory ordering issue).
eg. direct IO reads would have the same DMA cache synchronisation before it
completes to userspace, and this is completely independent of PG_uptodate...

> Doesn't that enter territory discussed down the years between
> James Bottomley and Russell King?  Worth CC'ing them the original?

... but since you bring this up again, I think that would be worthwhile. In
the interest of maintaining this thread I'll just link the original:

http://marc.info/?l=linux-mm&m=119794127303483&w=2

The question is this:

Must read from net/disk/etc into page P.
Device DMAs into P, signals completion

CPU0: handles completion, store to ram to mark P uptodate

CPU0/1: load from ram sees P uptodate, load from P must only see uptodate data

Are we guaranteed to get uptodate data from above the block layer, or do we
need to do anything special?
-
To unsubscribe from this list: send the line "unsubscribe linux-arch" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html