On 2022/2/15 20:51, Oscar Salvador wrote:
> On Sat, Feb 12, 2022 at 09:37:40PM -0500, Rik van Riel wrote:
>> Sometimes the page offlining code can leave behind a hwpoisoned clean
>> page cache page. This can lead to programs being killed over and over
>> and over again as they fault in the hwpoisoned page, get killed, and
>> then get re-spawned by whatever wanted to run them.
>
> Hi Rik,
>
> Do you know how that exactly happens? We should not be really leaving
> anything behind, and soft-offline (not hard) code works with the premise
> of only poisoning a page in case it was contained, so I am wondering
> what is going on here.
>
> In-use pagecache pages are migrated away, and the actual page is
> contained, and for clean ones, we already do the invalidate_inode_page()
> and then contain it in case we succeed.
>

IIUC, this cannot happen when soft-offlining a pagecache page: such pages
are either invalidated or migrated away, and only then do we set
PageHWPoison.

I think this may happen on a clean pagecache page while it is isolated,
i.e. when it is !PageLRU. In that case identify_page_state() treats it as
me_unknown, because the page is not reserved, slab, swapcache, and so on
(see error_states for details). Or am I missing anything?

Thanks.

> One scenario I can imagine this can happen is if by the time we call
> page_handle_poison(), someone has taken another refcount on the page,
> and the put_page() does not really free it, but I am not sure that
> can happen.
>
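
To illustrate the me_unknown fallthrough described above, here is a rough
user-space sketch of a first-match lookup over an error_states-style table.
It is a simplified model, not the kernel code: the F_* flag bits are
hypothetical stand-ins for the real page flags, the table is abbreviated,
and the handlers are represented only by their names as strings.

```c
/* Hypothetical flag bits for this sketch; not the kernel's PG_* values. */
enum {
	F_RESERVED  = 1u << 0,
	F_SLAB      = 1u << 1,
	F_SWAPCACHE = 1u << 2,
	F_DIRTY     = 1u << 3,
	F_LRU       = 1u << 4,
};

struct page_state {
	unsigned long mask;	/* which flag bits to examine */
	unsigned long res;	/* required value of those bits */
	const char *handler;	/* name of the handler that would run */
};

/*
 * Abbreviated, ordered table: the first matching entry wins.
 * The final { 0, 0 } entry matches any flags, so it is the fallback.
 */
static const struct page_state error_states[] = {
	{ F_RESERVED,            F_RESERVED,            "me_kernel" },
	{ F_SLAB,                F_SLAB,                "me_kernel" },
	{ F_SWAPCACHE | F_DIRTY, F_SWAPCACHE | F_DIRTY, "me_swapcache_dirty" },
	{ F_SWAPCACHE | F_DIRTY, F_SWAPCACHE,           "me_swapcache_clean" },
	{ F_LRU | F_DIRTY,       F_LRU | F_DIRTY,       "me_pagecache_dirty" },
	{ F_LRU | F_DIRTY,       F_LRU,                 "me_pagecache_clean" },
	{ 0,                     0,                     "me_unknown" },
};

/* Return the handler name of the first entry whose masked flags match. */
static const char *identify_page_state(unsigned long page_flags)
{
	const struct page_state *ps;

	for (ps = error_states; ; ps++)
		if ((page_flags & ps->mask) == ps->res)
			break;
	return ps->handler;
}
```

In this model a clean page that is still on the LRU (F_LRU set, F_DIRTY
clear) matches the me_pagecache_clean entry, while the same clean page with
F_LRU cleared, as after isolation, matches nothing until the catch-all
me_unknown entry.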