On Wed, 9 Jul 2014, Hugh Dickins wrote:
> On Wed, 9 Jul 2014, Vlastimil Babka wrote:
> > On 07/09/2014 06:03 PM, Sasha Levin wrote:
> > >
> > > We can see that it's not blocked since it's in the middle of a spinlock
> > > unlock call, and we can guess it's been in that function for a while
> > > because of the hung task timer, and other processes waiting on that
> > > i_mmap_mutex:
> >
> > Hm, zap_pte_range has potentially an endless loop due to the 'goto again'
> > path. Could it be a somewhat similar situation to the fallocate problem, but
> > where parallel faulters on shared memory are preventing a process from
> > exiting? Although they don't fault the pages into the same address space,
> > they could maybe somehow interact through the TLB flushing code? And only
> > after fixing the original problem we can observe this one?
>
> That's a good thought. It ought to make forward progress nonetheless,
> but I believe (please check, I'm rushing) that there's an off-by-one in
> that path which could leave us hanging - but only when __tlb_remove_page()
> repeatedly fails, which would only happen if exceptionally low on memory??
>
> Does this patch look good, and does it make any difference to the hang?

I should add that I think that this patch is correct in itself, but
won't actually make any difference to anything. I'm still looking
through Sasha's log for clues (but shall have to give up soon).

Hugh

>
> --- mmotm/mm/memory.c	2014-07-02 15:32:22.212311544 -0700
> +++ linux/mm/memory.c	2014-07-09 09:56:33.724159443 -0700
> @@ -1145,6 +1145,7 @@ again:
>  			if (unlikely(page_mapcount(page) < 0))
>  				print_bad_pte(vma, addr, ptent, page);
>  			if (unlikely(!__tlb_remove_page(tlb, page))) {
> +				addr += PAGE_SIZE;
>  				force_flush = 1;
>  				break;
>  			}
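A minimal model of the off-by-one being discussed, to show what the one-line patch changes. This is a simplified C sketch under stated assumptions, not the mm/memory.c code: batch_page(), forced_flush_end() and BATCH_MAX are hypothetical names, and the batch is shrunk to two entries so it fills quickly. It assumes, as in zap_pte_range, that the loop advances addr in its do/while condition, so a `break` out of the body skips that increment and leaves addr pointing at the page that was just batched.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL
#define BATCH_MAX 2 /* hypothetical tiny batch so it fills on the 2nd page */

/* Hypothetical stand-in for __tlb_remove_page(): the page is always
 * added to the batch; false means the batch is now full and the caller
 * must break out and flush. */
static bool batch_page(int *nr)
{
    (*nr)++;
    return *nr < BATCH_MAX;
}

/* Simplified model of the zap loop (not kernel code). Walks pages in
 * [addr, end) and returns the address held in addr when the loop breaks
 * out for the first forced flush, i.e. the value the caller would use
 * as the exclusive end of the range to flush and as the restart point
 * for "goto again". Returns 0 if the batch never fills. */
static unsigned long forced_flush_end(unsigned long addr, unsigned long end,
                                      bool with_fix)
{
    int nr = 0;

    assert(addr < end);
    do {
        /* The real loop clears the PTE for addr before batching the page. */
        if (!batch_page(&nr)) {
            if (with_fix)
                addr += PAGE_SIZE; /* the one-line fix from the patch */
            return addr;           /* models "break": increment below skipped */
        }
    } while (addr += PAGE_SIZE, addr != end);
    return 0;
}
```

For example, with a four-page range starting at 0, forced_flush_end(0, 4 * PAGE_SIZE, false) returns PAGE_SIZE even though the page at PAGE_SIZE has already been added to the batch, so a range ending there excludes it; with the fix it returns 2 * PAGE_SIZE, just past that page.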