On Thu, Jun 27, 2024 at 1:25 AM Muchun Song <muchun.song@xxxxxxxxx> wrote: > > > > On 2024/6/27 12:43, Yu Zhao wrote: > > While investigating HVO for THPs [1], it turns out that speculative > > PFN walkers like compaction can race with vmemmap modifications, e.g., > > > > CPU 1 (vmemmap modifier) CPU 2 (speculative PFN walker) > > ------------------------------- ------------------------------ > > Allocates an LRU folio page1 > > Sees page1 > > Frees page1 > > > > Allocates a hugeTLB folio page2 > > (page1 being a tail of page2) > > > > Updates vmemmap mapping page1 > > get_page_unless_zero(page1) > > > > Even though page1->_refcount is zero after HVO, get_page_unless_zero() > > can still try to modify this read-only field, resulting in a crash. > > > > An independent report [2] confirmed this race. > > > > There are two discussed approaches to fix this race: > > 1. Make RO vmemmap RW so that get_page_unless_zero() can fail without > > triggering a PF. > > 2. Use RCU to make sure get_page_unless_zero() either sees zero > > page->_refcount through the old vmemmap or non-zero page->_refcount > > through the new one. > > > > The second approach is preferred here because: > > 1. It can prevent illegal modifications to struct page[] that has been > > HVO'ed; > > 2. It can be generalized, in a way similar to ZERO_PAGE(), to fix > > similar races in other places, e.g., arch_remove_memory() on x86 > > [3], which frees vmemmap mapping offlined struct page[]. > > > > While adding synchronize_rcu(), the goal is to be surgical, rather > > than optimized. Specifically, calls to synchronize_rcu() on the error > > handling paths can be coalesced, but it is not done for the sake of > > Simplicity: noticeably, this fix removes ~50% more lines than it adds. > > I suggest adding some user-visible effect here like for use > case of nr_overcommit_hugepages, synchronize_rcu() will make > this use case worse. > > > > > [1] https://lore.kernel.org/20240229183436.4110845-4-yuzhao@xxxxxxxxxx/ > > [2] https://lore.kernel.org/917FFC7F-0615-44DD-90EE-9F85F8EA9974@xxxxxxxxx/ > > [3] https://lore.kernel.org/be130a96-a27e-4240-ad78-776802f57cad@xxxxxxxxxx/ > > > > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx> > > Acked-by: Muchun Song <muchun.song@xxxxxxxxx> > > A nit below. Thanks for reviewing! I've addressed all your suggestions in v2.