On Thu, 21 Dec 2023 13:40:11 +0800 Jiajun Xie <jiajun.xie.sh@xxxxxxxxx> wrote:

> > (obviously bad, but it's good to spell it out) and under what
> > circumstances it occurs?
>
> Thanks for the quick reply.
>
> The issue happens in heterogeneous computing, where the
> device (e.g. GPU) and host share the same virtual address space.
>
> A simple workflow pattern which hits the issue is:
>         /* host */
>     1. userspace first mmaps a file-backed VA range with a specified offset.
>            e.g. (offset=0x800..., mmap return: va_a)
>     2. write some data to the corresponding sys page
>            e.g. (va_a = 0xAABB)
>         /* device */
>     3. gpu workload touches VA, triggers a gpu fault and notifies the host.
>         /* host */
>     4. received gpu fault notification, then it will:
>         4.1 unmap host pages and also take care of the cpu tlb
>             (use unmap_mapping_range with offset=0x800...)
>         4.2 migrate sys page to device
>         4.3 set up device page table and resolve device fault.
>         /* device */
>     5. gpu workload continued, it accessed va_a and got 0xAABB.
>     6. gpu workload continued, it wrote 0xBBCC to va_a.
>         /* host */
>     7. userspace accesses va_a, as expected, it will:
>         7.1 trigger cpu vm fault.
>         7.2 driver handles the fault to migrate the gpu local page to host.
>     8. userspace then could correctly get 0xBBCC from va_a
>     9. done
>
> But in step 4.1, if we hit the bug this patch mentioned, then user space
> would never trigger the cpu fault, and would still get the old value: 0xAABB.

Thanks.  Based on the above, I added cc:stable to the changelog so the
fix will be backported into earlier kernels (it looks like that's 20+
years' worth!).

And I pasted the above text into that changelog.
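
To illustrate how step 4.1 could silently miss the mapping, here is a minimal userspace sketch. The quoted text does not describe the bug itself, so this assumes it is a sign extension: a signed 64-bit byte offset with its top bit set is shifted right to form a page index. The function names, SKETCH_PAGE_SHIFT and the 4 KiB page size are hypothetical and are not taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical page shift: assumes 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/*
 * Shift the signed byte offset directly.  Right-shifting a negative
 * signed value is implementation-defined in C; GCC and Clang perform an
 * arithmetic shift, which copies the sign bit into the upper bits.
 */
uint64_t page_index_signed_shift(int64_t byte_offset)
{
	return (uint64_t)(byte_offset >> SKETCH_PAGE_SHIFT);
}

/*
 * Convert to an unsigned type before shifting, so zeros are shifted in
 * and the upper bits of the page index stay clear.
 */
uint64_t page_index_unsigned_shift(int64_t byte_offset)
{
	return (uint64_t)byte_offset >> SKETCH_PAGE_SHIFT;
}
```

With a made-up offset such as (int64_t)0x8000000000001000ULL (a stand-in only; the elided offset=0x800... in the message is not reproduced here), the unsigned variant yields page index 0x0008000000000001, while the signed variant on GCC or Clang yields an index with the upper bits set. A range lookup keyed on that second index would not find the pages that were mapped at the intended offset, so the host mapping would stay in place and userspace would keep reading 0xAABB. For offsets with the top bit clear, both variants agree.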