Re: [PATCH] mm: vmscan: unlock_page page when forcing reclaim

On 07/18/2014 12:38 PM, Johannes Weiner wrote:
> I don't really understand how the scenario you describe can happen.
> 
> Successfully reclaiming a page means that __remove_mapping() was able
> to freeze a page count of 2 (page cache and LRU isolation), but
> filemap_fault() increases the refcount on the page before trying to
> lock the page.  If __remove_mapping() wins, find_get_page() does not
> work and the fault does not lock the page.  If find_get_page() wins,
> __remove_mapping() does not work and the reclaimer aborts and does a
> regular unlock_page().
> 
> page_check_references() is purely about reclaim strategy, it should
> not be essential for correctness.
> 

You are right that something else happened here. I had not spotted
the cmpxchg being done in __remove_mapping(). If, while looking into
this, I spot something that could be what went wrong, I will propose a
new fix to the list for review. Thanks for your time.
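
For anyone following along, here is a minimal user-space sketch of the
race Johannes describes, written with C11 atomics. It is a toy model,
not kernel code: struct toy_page and the toy_* helpers are made-up
names, and I am assuming the freeze behaves like a compare-and-exchange
of the expected count to zero while the lookup side behaves like an
increment-unless-zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only the refcount is modeled. */
struct toy_page {
    atomic_int refcount;
};

/*
 * Reclaim side: succeeds only if the count is exactly `expected`
 * (2 in the quoted scenario: page cache + LRU isolation), and in
 * that case drops it to 0 in one atomic step.
 */
static bool toy_ref_freeze(struct toy_page *p, int expected)
{
    return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/* Lookup side: take a reference unless the count is already 0 (frozen). */
static bool toy_get_unless_zero(struct toy_page *p)
{
    int old = atomic_load(&p->refcount);

    while (old != 0) {
        if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
            return true;
        /* A failed compare-exchange reloaded `old`; retry with it. */
    }
    return false;
}

/* Drop a reference previously taken with toy_get_unless_zero(). */
static void toy_put(struct toy_page *p)
{
    int old = atomic_fetch_sub(&p->refcount, 1);

    assert(old > 0);
}
```

Whichever side reaches the counter first wins. After a successful
toy_ref_freeze(p, 2), toy_get_unless_zero(p) fails, so the fault path
never gets as far as locking the page. After a successful
toy_get_unless_zero(p) the count is 3, so toy_ref_freeze(p, 2) fails
until the extra reference is dropped with toy_put(p).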

P.S. The system had ECC RAM, so this was not a bit flip. My current
method for debugging this involves using cscope to construct possible
call paths under a couple of assumptions:

1. Something set PG_locked and then never called unlock_page().
2. The only ways I see in the code for #1 to happen are calling
__clear_page_locked() instead or failing to clear the bit at all. I do
not believe that a patch doing the latter was accepted, so I assume the
former.

I have root access to the system, so each time I do a lookup using
cscope, I go through the resulting list and logically eliminate
possibilities by inspecting the system where the problem occurred. When
I cannot eliminate a possibility, I recurse into its callers. This is
prone to false positives should I miss a subtle piece of code that
prevents a problem, and it is very tedious, but I do not see a better
way of debugging with what I have at my disposal. If anyone has any
suggestions, I would appreciate them.
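
To make the elimination procedure concrete, here is a rough sketch of
it as a recursive walk over a hand-built reverse call graph. Everything
in it is hypothetical: struct fn_node, the ruled_out flag, and
surviving_paths() are names invented for illustration, and the graph
would have to be filled in by hand from the cscope results. It also
assumes the graph has no cycles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CALLERS 4

/* One function in the reverse call graph; all names are placeholders. */
struct fn_node {
    const char *name;
    bool ruled_out;   /* eliminated by inspecting the affected system */
    size_t ncallers;
    const struct fn_node *callers[MAX_CALLERS];
};

/*
 * Count the call paths ending at `fn` that have not been eliminated.
 * A function with no recorded callers is treated as an entry point
 * and terminates one path. Assumes an acyclic graph.
 */
static size_t surviving_paths(const struct fn_node *fn)
{
    size_t total = 0;

    assert(fn != NULL);
    if (fn->ruled_out)
        return 0;
    if (fn->ncallers == 0)
        return 1;
    for (size_t i = 0; i < fn->ncallers; i++)
        total += surviving_paths(fn->callers[i]);
    return total;
}
```

Ruling out a function prunes every path that runs through it, which is
why a single wrong elimination, or a missed one, skews the result.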

P.P.S. I *really* wish that I had used kdump when this issue happened,
but sadly, the system is not set up for kdump.


