On 07/18/2014 12:38 PM, Johannes Weiner wrote:
> I don't really understand how the scenario you describe can happen.
>
> Successfully reclaiming a page means that __remove_mapping() was able
> to freeze a page count of 2 (page cache and LRU isolation), but
> filemap_fault() increases the refcount on the page before trying to
> lock the page. If __remove_mapping() wins, find_get_page() does not
> work and the fault does not lock the page. If find_get_page() wins,
> __remove_mapping() does not work and the reclaimer aborts and does a
> regular unlock_page().
>
> page_check_references() is purely about reclaim strategy, it should
> not be essential for correctness.
>

You are right that something else happened here. I had not spotted the
cmpxchg being done in __remove_mapping(). If I spot something that looks
like it could be what went wrong while doing this, I will propose a new
fix to the list for review. Thanks for your time.

P.S. The system had ECC RAM, so this was not a bit flip. My current
method for debugging this involves using cscope to construct possible
call paths under a couple of assumptions:

1. Something set PG_locked without calling unlock_page().
2. The only ways of doing #1 that I see in the code are calling
   __clear_page_locked() or failing to clear the bit. I do not believe
   that a patch was accepted that did the latter, so I assume the former.

I have root access to the system, so each time I do a lookup using
cscope, I go through the list to logically eliminate possibilities by
inspecting the system where the problem occurred. When I cannot
eliminate a possibility, I recurse. This is prone to false positives
should I miss a subtle piece of code that prevents a problem, and it is
very tedious, but I do not see a better way of debugging based on what I
have at my disposal. If anyone has any suggestions, I would appreciate
them.

P.P.S. I *really* wish that I had used kdump when this issue happened,
but sadly, the system is not set up for kdump.