Hi Phil, On Thu, Sep 12, 2024 at 7:03 AM Phil Elwell <phil@xxxxxxxxxxxxxxx> wrote: > > Hi, > > I've spent many hours recently trying to diagnose a problem that > manifests as a CPU spin, under load and memory pressure, that can last > for many seconds. The problem can be seen on our downstream kernels > from 6.5 onwards, when built for ARCH=arm, running on a Pi 3B (BCM2837 > - quad A53). I've not tested a pure Linux 6.5, but this is not a bug > report. > > Pi 3B has limited RAM (1GB), and it was discovered that restricting > this further to 512MB made the spins more frequent, as did adding > other processes. Running an ARM64 kernel in the same configuration > leads to normal OOM behaviour. > > I traced the spin to a loop in __copy_to_user_memcpy where > pin_page_for_write fails repeatedly, sometimes for hundreds of > thousands of times. The pin is failing because the user page in > question is marked as being old (L_PTE_YOUNG is unset). When this > happens, the code tries to freshen the page using __put_user, but in > this case it is not triggering the required page fault. Digging > deeper, it can be seen that the PTE in the ARM's shadow hardware PTE > is 0 as expected, but clearly the MMU is not seeing this otherwise it > would be faulting; a TLB flush for that PTE fixes it. > > The TLB non-coherency for that PTE can be attributed to a call to > ptep_test_and_clear_young from lru_gen_look_around, which clears the > L_PTE_YOUNG bit in the Linux PTE Yes, it does that. > and zeroes the hardware PTE I don't see how it can happen, or why it's needed. Could you explain? > but doesn't call flush_tlb_cache. Correct, and this is because that arch-specific API currently doesn't require TLB flushes, from the MM's POV. None of the current callers does, I doubt they were used on arm (32 bit) at all, except MGLRU. > Two possible "fixes" are: > > a. Replace ptep_test_and_clear_young with ptep_clear_flush_young, > which includes the TLB flush. > b. After the loop over the page range from "start" to "end", include a > call to flush_tlb_range from "start" to "end" if the "young" count is > non-zero. > > My questions are: > > 1. Which bit of code is meant to take care of TLB coherency where > lru_gen_look_around has made changes? None, since the API doesn't explicitly require it (or at least the MM assumes), as I mentioned above. > 2. Between the two patches a) and b), which is preferable? b) would > seem better if IPIs are needed to broadcast the TLB flushes, but it > seems that BCM2837 has new enough CPU cores not to require such > broadcasts. Could this be fixed within arm? If not, we would have to update the requirement of that arch-specific API. This would affect other archs that don't require TLB flushes, assuming they exist. And we would need to fix all callers of ptep_test_and_clear_young() in MM. > 3. walk_pte_range has a similar loop, but it seems it doesn't need to > be patched to fix my spin, possibly because it isn't called. Correct. > If a > patch to lru_gen_look_around is needed, might one be needed here as > well? No, because that code is disabled, unless hardware can set A-bit, e.g., arm64 v8.2. Thanks.