On 01.09.22 20:35, Yang Shi wrote: > On Thu, Sep 1, 2022 at 11:07 AM Peter Xu <peterx@xxxxxxxxxx> wrote: >> >> On Thu, Sep 01, 2022 at 10:50:48AM -0700, Yang Shi wrote: >>> Yeah, because THP collapse does copy the data before clearing pte. If >>> we want to remove pmdp_collapse_flush() by just clearing pmd, we >>> should clear *AND* flush pte before copying the data IIRC. >> >> Yes tlb flush is still needed. IIUC the generic pmdp_collapse_flush() will >> still be working (with the pte level flushing there) but it should just >> start to work for all archs, so potentially we could drop the arch-specific >> pmdp_collapse_flush()s, mostly the ppc impl. > > I'm don't know why powperpc needs to have its specific > pmdp_collapse_flush() in the first place, not only the mandatory IPI > broadcast, but also the specific implementation of pmd tlb flush. But > anyway the IPI broadcast could be removed at least IMO. > pmdp_collapse_flush() is overwritten on book3s only. It either translates to radix__pmdp_collapse_flush() or hash__pmdp_collapse_flush(). radix__pmdp_collapse_flush() has a comment explaining the situation: + /* + * pmdp collapse_flush need to ensure that there are no parallel gup + * walk after this call. This is needed so that we can have stable + * page ref count when collapsing a page. We don't allow a collapse page + * if we have gup taken on the page. We can ensure that by sending IPI + * because gup walk happens with IRQ disabled. + */ The comment for hash__pmdp_collapse_flush() is a bit more involved: /* * Wait for all pending hash_page to finish. This is needed * in case of subpage collapse. When we collapse normal pages * to hugepage, we first clear the pmd, then invalidate all * the PTE entries. The assumption here is that any low level * page fault will see a none pmd and take the slow path that * will wait on mmap_lock. But we could very well be in a * hash_page with local ptep pointer value. Such a hash page * can result in adding new HPTE entries for normal subpages. * That means we could be modifying the page content as we * copy them to a huge page. So wait for parallel hash_page * to finish before invalidating HPTE entries. We can do this * by sending an IPI to all the cpus and executing a dummy * function there. */ I'm not sure if that implies that the IPI is needed for some other hash-magic. Maybe Aneesh can clarify. >> >> This also reminded me that the s390 version of pmdp_collapse_flush() is a >> bit weird, since it doesn't even have the tlb flush there. I feel like >> it's broken but I can't really tell whether something I've overlooked. >> Worth an eye on. > > I don't know why. But if s390 doesn't flush tlb in > pmdp_collapse_flush(), then there may be data integrity problem since > the page is still writable when copying the data because pte is > cleared after data copying. Or s390 hardware does flush tlb > automatically? s390x does a pmdp_huge_get_and_clear(). pmdp_huge_get_and_clear() does an pmdp_xchg_direct(). pmdp_xchg_direct() does an pmdp_flush_direct(). pmdp_flush_direct() issues an IDTE, which is a TLB flush. Note that this matches ptep_get_and_clear() behavior on s390x. Quoting the comment in there: /* * This is hard to understand. ptep_get_and_clear and ptep_clear_flush * both clear the TLB for the unmapped pte. The reason is that * ptep_get_and_clear is used in common code (e.g. change_pte_range) * to modify an active pte. The sequence is * 1) ptep_get_and_clear * 2) set_pte_at * 3) flush_tlb_range * On s390 the tlb needs to get flushed with the modification of the pte * if the pte is active. The only way how this can be implemented is to * have ptep_get_and_clear do the tlb flush. In exchange flush_tlb_range * is a nop. */ -- Thanks, David / dhildenb