On Tue 15-07-14 03:33:02, Hugh Dickins wrote:
> shmem_fault() is the actual culprit in trinity's hole-punch starvation,
> and the most significant cause of such problems: since a page faulted is
> one that then appears page_mapped(), needing unmap_mapping_range() and
> i_mmap_mutex to be unmapped again.
>
> But it is not the only way in which a page can be brought into a hole in
> the radix_tree while that hole is being punched; and Vlastimil's testing
> implies that if enough other processors are busy filling in the hole,
> then shmem_undo_range() can be kept from completing indefinitely.
>
> shmem_file_splice_read() is the main other user of SGP_CACHE, which
> can instantiate shmem pagecache pages in the read-only case (without
> holding i_mutex, so perhaps concurrently with a hole-punch). Probably
> it's silly not to use SGP_READ already (using the ZERO_PAGE for holes):
> which ought to be safe, but might bring surprises - not a change to be
> rushed.
>
> shmem_read_mapping_page_gfp() is an internal interface used by
> drivers/gpu/drm GEM (and next by uprobes): it should be okay. And
> shmem_file_read_iter() uses the SGP_DIRTY variant of SGP_CACHE, when
> called internally by the kernel (perhaps for a stacking filesystem,
> which might rely on holes to be reserved): it's unclear whether it
> could be provoked to keep hole-punch busy or not.
>
> We could apply the same umbrella as now used in shmem_fault() to
> shmem_file_splice_read() and the others; but it looks ugly, and use
> over a range raises questions - should it actually be per page? can
> these get starved themselves?
>
> The origin of this part of the problem is my v3.1 commit d0823576bf4b
> ("mm: pincer in truncate_inode_pages_range"), once it was duplicated
> into shmem.c. It seemed like a nice idea at the time, to ensure
> (barring RCU lookup fuzziness) that there's an instant when the entire
> hole is empty; but the indefinitely repeated scans to ensure that make
> it vulnerable.
>
> Revert that "enhancement" to hole-punch from shmem_undo_range(), but
> retain the unproblematic rescanning when it's truncating; add a couple
> of comments there.
>
> Remove the "indices[0] >= end" test: that is now handled satisfactorily
> by the inner loop, and mem_cgroup_uncharge_start()/end() are too light
> to be worth avoiding here.
>
> But if we do not always loop indefinitely, we do need to handle the case
> of swap swizzled back to page before shmem_free_swap() gets it: add a
> retry for that case, as suggested by Konstantin Khlebnikov; and for the
> case of page swizzled back to swap, as suggested by Johannes Weiner.
>
> Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> Reported-by: Sasha Levin <sasha.levin@xxxxxxxxxx>
> Suggested-by: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Konstantin Khlebnikov <koct9i@xxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Lukas Czerner <lczerner@xxxxxxxxxx>
> Cc: Dave Jones <davej@xxxxxxxxxx>
> Cc: <stable@xxxxxxxxxxxxxxx>	[3.1+]

Reviewed-by: Michal Hocko <mhocko@xxxxxxx>

> ---
> Please replace mmotm's
> shmem-fix-faulting-into-a-hole-while-its-punched-take-2.patch
> by this patch: which is substantially the same as that, but with
> updated commit comment, and a page retry as indicated by Hannes.
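As a side note, the race is easy to picture without trinity. Below is a
minimal sketch of the kind of load Vlastimil's testing implies (my own
illustration, not part of the patch; the /dev/shm path, 64MB size, thread
count and iteration count are all arbitrary choices): a punching thread
repeatedly hole-punches a tmpfs file while other threads keep faulting
pages back into the punched range, so each rescan of shmem_undo_range()
can find the hole refilled. Build with gcc -O2 -pthread.

/*
 * Hedged reproducer sketch (not from the patch): hole-punch racing
 * against threads that keep refilling the hole on tmpfs.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE    (64UL << 20)    /* 64MB tmpfs file: arbitrary */
#define NREFILL 4               /* refilling threads: arbitrary */

static char *map;
static volatile int done;

/* Keep faulting pages back into the punched range, so the punching
 * side finds pages reinstantiated on each rescan. */
static void *refill(void *arg)
{
        while (!done)
                for (size_t off = 0; off < SIZE; off += 4096)
                        map[off] = 1;   /* write fault fills the hole */
        return NULL;
}

int main(void)
{
        pthread_t th[NREFILL];
        int fd = open("/dev/shm/punchtest", O_RDWR | O_CREAT | O_TRUNC, 0600);
        int i;

        if (fd < 0 || ftruncate(fd, SIZE))
                return 1;
        map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        for (i = 0; i < NREFILL; i++)
                pthread_create(&th[i], NULL, refill, NULL);

        /* With the pincer rescans, each punch may loop far longer than
         * expected while the refill threads race it; time these calls. */
        for (i = 0; i < 100; i++)
                if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                              0, SIZE))
                        perror("fallocate");

        done = 1;
        for (i = 0; i < NREFILL; i++)
                pthread_join(th[i], NULL);
        unlink("/dev/shm/punchtest");
        return 0;
}

With the revert below, each fallocate() makes one pass over whatever it
finds in the range and returns, instead of rescanning until it observes
the hole entirely empty at one instant.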
>
>  mm/shmem.c | 24 +++++++++++++++---------
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> --- 3.16-rc5+/mm/shmem.c	2014-07-14 20:34:28.196153828 -0700
> +++ 3.16-rc5++/mm/shmem.c	2014-07-14 20:35:14.156154916 -0700
> @@ -468,23 +468,20 @@ static void shmem_undo_range(struct inod
>  		return;
>
>  	index = start;
> -	for ( ; ; ) {
> +	while (index < end) {
>  		cond_resched();
>
>  		pvec.nr = find_get_entries(mapping, index,
>  				min(end - index, (pgoff_t)PAGEVEC_SIZE),
>  				pvec.pages, indices);
>  		if (!pvec.nr) {
> -			if (index == start || unfalloc)
> +			/* If all gone or hole-punch or unfalloc, we're done */
> +			if (index == start || end != -1)
>  				break;
> +			/* But if truncating, restart to make sure all gone */
>  			index = start;
>  			continue;
>  		}
> -		if ((index == start || unfalloc) && indices[0] >= end) {
> -			pagevec_remove_exceptionals(&pvec);
> -			pagevec_release(&pvec);
> -			break;
> -		}
>  		mem_cgroup_uncharge_start();
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
> @@ -496,8 +493,12 @@ static void shmem_undo_range(struct inod
>  			if (radix_tree_exceptional_entry(page)) {
>  				if (unfalloc)
>  					continue;
> -				nr_swaps_freed += !shmem_free_swap(mapping,
> -							index, page);
> +				if (shmem_free_swap(mapping, index, page)) {
> +					/* Swap was replaced by page: retry */
> +					index--;
> +					break;
> +				}
> +				nr_swaps_freed++;
>  				continue;
>  			}
>
> @@ -506,6 +507,11 @@ static void shmem_undo_range(struct inod
>  			if (page->mapping == mapping) {
>  				VM_BUG_ON_PAGE(PageWriteback(page), page);
>  				truncate_inode_page(mapping, page);
> +			} else {
> +				/* Page was replaced by swap: retry */
> +				unlock_page(page);
> +				index--;
> +				break;
>  			}
>  		}
>  		unlock_page(page);

--
Michal Hocko
SUSE Labs