On Thu, Mar 9, 2017 at 12:11 AM, Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 8 Mar 2017 16:15:32 +0100 Dmitry Vyukov <dvyukov@xxxxxxxxxx> wrote:
>
>> quarantine_remove_cache() frees all pending objects that belong to the
>> cache before we destroy the cache itself. However, there are currently
>> two ways it can fail to do so.
>>
>> First, another thread can hold some of the objects from the cache on
>> its temp list in quarantine_put(). quarantine_put() has a window of
>> enabled interrupts, and on_each_cpu() in quarantine_remove_cache() can
>> finish right in that window. These objects will later be freed into the
>> destroyed cache.
>>
>> Second, quarantine_reduce() has the same problem. It grabs a batch of
>> objects from the global quarantine, then unlocks quarantine_lock and
>> then frees the batch. quarantine_remove_cache() can finish while some
>> objects from the cache are still on the local to_free list in
>> quarantine_reduce().
>>
>> Fix the race with quarantine_put() by disabling interrupts for the
>> whole duration of quarantine_put(). In combination with on_each_cpu()
>> in quarantine_remove_cache(), this ensures that quarantine_remove_cache()
>> either sees the objects in the per-cpu list or in the global list.
>>
>> Fix the race with quarantine_reduce() by protecting quarantine_reduce()
>> with an srcu critical section and then doing synchronize_srcu() at the
>> end of quarantine_remove_cache().
>>
>> ...
>>
>> I suspect that these races are the root cause of some GPFs that
>> I episodically hit. Previously I did not have any explanation for them.
>
> The changelog doesn't convey a sense of how serious this bug is, so I'm
> not in a good position to decide whether this fix should be backported.
> The patch looks fairly intrusive, so I tentatively decided that it
> needn't be backported. Perhaps that was wrong.
>
> Please be more careful in describing the end-user visible impact of
> bugs when fixing them.

Will try to do better next time. Thanks for the feedback.

I am not sure myself about backporting. The bug is quite hard to
trigger; I've seen it a few times during our massive continuous testing
(however, it could be the cause of some other episodic stray crashes,
as it leads to memory corruption). If it is triggered, the consequences
are very bad -- almost certain memory corruption. The fix is
non-trivial and has a chance of introducing new bugs. I am also not
sure how actively people use KASAN on older releases.

Can we flag it for backporting later if/when we see a real need?
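
For reference, the first fix boils down to the following pattern. This
is a rough sketch rather than the exact patch; the qlist helpers,
cpu_quarantine, QUARANTINE_PERCPU_SIZE and the elided global-list
splice only approximate what mm/kasan/quarantine.c actually does:

/*
 * Sketch of the quarantine_put() fix: interrupts stay disabled across
 * the whole operation, so the on_each_cpu() IPI sent by
 * quarantine_remove_cache() runs either before objects leave the
 * per-cpu list or after they reach the global quarantine. It can never
 * observe the intermediate state where objects sit only on the local
 * temp list.
 */
void quarantine_put(struct qlist_node *qlink, struct kmem_cache *cache)
{
	unsigned long flags;
	struct qlist_head *q;
	struct qlist_head temp = QLIST_INIT;

	local_irq_save(flags);

	q = this_cpu_ptr(&cpu_quarantine);
	qlist_put(q, qlink, cache->size);
	if (unlikely(q->bytes > QUARANTINE_PERCPU_SIZE)) {
		qlist_move_all(q, &temp);

		/* irqs are already off, a plain spin_lock suffices. */
		spin_lock(&quarantine_lock);
		/* ... splice temp onto the global quarantine ... */
		spin_unlock(&quarantine_lock);
	}

	local_irq_restore(flags);
}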
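And a sketch of the second fix, the srcu pattern. Again the names
(remove_cache_srcu, qlist_free_all, per_cpu_remove_cache) approximate
the real ones and the list manipulation is elided:

/* Sketch of the srcu-based fix for the quarantine_reduce() race. */
DEFINE_STATIC_SRCU(remove_cache_srcu);

void quarantine_reduce(void)
{
	unsigned long flags;
	int srcu_idx;
	struct qlist_head to_free = QLIST_INIT;

	/*
	 * The read-side critical section covers both detaching a batch
	 * from the global quarantine and freeing it, so objects on the
	 * local to_free list are never invisible to a synchronize_srcu()
	 * caller.
	 */
	srcu_idx = srcu_read_lock(&remove_cache_srcu);

	spin_lock_irqsave(&quarantine_lock, flags);
	/* ... move a batch from the global quarantine to to_free ... */
	spin_unlock_irqrestore(&quarantine_lock, flags);

	qlist_free_all(&to_free, NULL);

	srcu_read_unlock(&remove_cache_srcu, srcu_idx);
}

void quarantine_remove_cache(struct kmem_cache *cache)
{
	unsigned long flags;
	struct qlist_head to_free = QLIST_INIT;

	/*
	 * Flush per-cpu lists; pairs with the irq-disabled region in
	 * quarantine_put() above.
	 */
	on_each_cpu(per_cpu_remove_cache, cache, 1);

	spin_lock_irqsave(&quarantine_lock, flags);
	/* ... select all objects of @cache from the global quarantine
	 * into to_free ... */
	spin_unlock_irqrestore(&quarantine_lock, flags);

	qlist_free_all(&to_free, cache);

	/*
	 * Wait for any in-flight quarantine_reduce() that may still hold
	 * objects of @cache on its local to_free list.
	 */
	synchronize_srcu(&remove_cache_srcu);
}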