On 23.11.22 06:14, Hugh Dickins wrote:
> On Wed, 23 Nov 2022, Gavin Shan wrote:
>> The issue is reported when removing memory through a virtio_mem device.
>> A transparent huge page that has experienced a copy-on-write fault is
>> wrongly regarded as pinned, so it escapes isolation in
>> isolate_migratepages_block(). The transparent huge page can't be
>> migrated and the corresponding memory block can't be put into the
>> offline state.
>>
>> Fix it by replacing page_mapcount() with total_mapcount(). With this,
>> the transparent huge page can be isolated and migrated, and the memory
>> block can be put into the offline state.
>> Fixes: 3917c80280c9 ("thp: change CoW semantics for anon-THP")
>> Cc: stable@xxxxxxxxxxxxxxx # v5.8+
>> Reported-by: Zhenyu Zhang <zhenyzha@xxxxxxxxxx>
>> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
>> Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
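
For context, the "admittedly racy check" in question sits in
isolate_migratepages_block(); roughly, and from memory, so the exact code
may differ:

	/*
	 * Migration will fail if an anonymous page is pinned in memory,
	 * so avoid taking lru_lock and isolating it unnecessarily in an
	 * admittedly racy check.
	 */
	mapping = page_mapping(page);
	if (!mapping && page_count(page) > page_mapcount(page))
		goto isolate_fail;

The patch compares against total_mapcount(page) instead: as I understand
it, page_count() of a compound page also includes the references taken for
PTE mappings of its subpages, which page_mapcount() of the head page does
not cover, so a PTE-mapped THP looks "pinned" here.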
> Interesting, good catch, looked right to me: except for the Fixes line
> and mention of v5.8.  That CoW change may have added a case which easily
> demonstrates the problem, but it would have been the wrong test on a THP
> long before then - but only in v5.7 were compound pages allowed through
> at all to reach that test, so I think it should be
>
> Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations")
> Cc: stable@xxxxxxxxxxxxxxx # v5.7+
> Oh, no, stop: this is not so easy, even in the latest tree.
>
> Because at the time of that "admittedly racy check", we have no hold
> at all on the page in question: and if it's PageLRU or PageCompound
> at one instant, it may be different the next instant.  Which leaves it
> vulnerable to whatever BUG_ON()s there may be in the total_mapcount()
> path - needs research.  *Perhaps* there are no more BUG_ON()s in the
> total_mapcount() path than in the existing page_mapcount() path.
>
> I suspect that for this to be safe (before your patch and more so after),
> it will be necessary to shift the "admittedly racy check" down after the
> get_page_unless_zero() (and check the sequence of operations when a
> compound page is initialized).
Grabbing a reference first sounds like the right approach to me.
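
Something like this, maybe (rough sketch only, untested; the label names
are just what I recall from isolate_migratepages_block()):

	/* Take a speculative reference first ... */
	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	/*
	 * ... and only then do the racy "does anybody else hold a
	 * reference?" check, while our reference keeps the page from
	 * being freed and reused under us.  Note the extra "+ 1" to
	 * account for the reference we now hold ourselves.
	 */
	mapping = page_mapping(page);
	if (!mapping && page_count(page) > 1 + total_mapcount(page))
		goto isolate_fail_put;

where isolate_fail_put drops the reference again.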
> The races I'm talking about are much much rarer than the condition you
> are trying to avoid, so it's frustrating; but such races are real,
> and increasing stable's exposure to them is not so good.
Such checks are always racy and the code has to be able to deal with
false negatives/positives (we're not even holding the page lock); as you
state, we just don't want to trigger undefined behavior or a BUG().
I'm also curious how that migration code handles a THP that's in the
swapcache. It should handle such pages correctly, for example by
removing them from the swapcache first; otherwise that could block
migration.
For example, in mm/ksm.c:write_protect_page() we have

"page_mapcount(page) + 1 + swapped != page_count(page)"

page_mapcount() and "swapped == 0/1" make sense to me, because KSM only
cares about order-0 pages, so there is no need for THP games.
But we do have an even better helper in place already:
mm/huge_memory.c:can_split_folio()
It cares about
a) swapcache for THPs: each subpage could be in the swapcache
b) requiring the caller to hold one reference to be safe
But I am a bit confused about the "extra_pins" for !anon. Where do the
folio_nr_pages() references come from?
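
For reference, the helper currently looks roughly like this (quoting from
memory, so double-check against mm/huge_memory.c):

	/* Racy check whether the huge page can be split */
	bool can_split_folio(struct folio *folio, int *pextra_pins)
	{
		int extra_pins;

		/* Additional pins from page cache */
		if (folio_test_anon(folio))
			extra_pins = folio_test_swapcache(folio) ?
					folio_nr_pages(folio) : 0;
		else
			extra_pins = folio_nr_pages(folio);
		if (pextra_pins)
			*pextra_pins = extra_pins;
		return folio_mapcount(folio) ==
		       folio_ref_count(folio) - extra_pins - 1;
	}

The "- 1" is the reference the caller is required to hold (b) above).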
So *maybe* it makes sense to factor out can_split_folio() and call it
something like: "folio_maybe_additionally_referenced" [to clearly
distinguish it from "folio_maybe_dma_pinned" that cares about actual
page pinning (read/write page content)].
Such a function could return false positives/negatives due to races and
the caller would have to hold one reference and be able to deal with the
semantics.
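
Just to sketch what I mean (completely untested; the name and the exact
accounting are only a proposal, derived from can_split_folio() above):

	/*
	 * Does the folio *maybe* have references beyond its mappings, the
	 * swapcache/pagecache, and the single reference the caller holds?
	 * Racy by design: false positives and false negatives are possible,
	 * callers must be able to cope with both.
	 */
	static inline bool folio_maybe_additionally_referenced(struct folio *folio)
	{
		int expected_refs = folio_mapcount(folio) + 1;	/* mappings + caller */

		if (folio_test_anon(folio))
			expected_refs += folio_test_swapcache(folio) ?
						folio_nr_pages(folio) : 0;
		else
			expected_refs += folio_nr_pages(folio);	/* pagecache */

		return folio_ref_count(folio) > expected_refs;
	}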
--
Thanks,
David / dhildenb