On Wed, Oct 12, 2016 at 08:43:20AM +0200, Jan Kara wrote:
> On Wed 12-10-16 00:53:49, Kirill A. Shutemov wrote:
> > On Tue, Oct 11, 2016 at 05:58:15PM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:55, Kirill A. Shutemov wrote:
> > > > invalidate_inode_page() has expectation about page_count() of the page
> > > > -- if it's not 2 (one to caller, one to radix-tree), it will not be
> > > > dropped. That condition almost never met for THPs -- tail pages are
> > > > pinned to the pagevec.
> > > >
> > > > Let's drop them, before calling invalidate_inode_page().
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  mm/truncate.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > > index a01cce450a26..ce904e4b1708 100644
> > > > --- a/mm/truncate.c
> > > > +++ b/mm/truncate.c
> > > > @@ -504,10 +504,21 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > >  			/* 'end' is in the middle of THP */
> > > >  			if (index == round_down(end, HPAGE_PMD_NR))
> > > >  				continue;
> > > > +			/*
> > > > +			 * invalidate_inode_page() expects
> > > > +			 * page_count(page) == 2 to drop page from page
> > > > +			 * cache -- drop tail pages references.
> > > > +			 */
> > > > +			get_page(page);
> > > > +			pagevec_release(&pvec);
> > >
> > > I'm not quite sure why this is needed. When you have multiorder entry in
> > > the radix tree for your huge page, then you should not get more entries in
> > > the pagevec for your huge page. What do I miss?
> >
> > For compatibility reason find_get_entries() (which is called by
> > pagevec_lookup_entries()) collects all subpages of huge page in the range
> > (head/tails). See patch [07/41]
> >
> > So huge page, which is fully in the range it will be pinned up to
> > PAGEVEC_SIZE times.
>
> Yeah, I see. But then won't it be cleaner to provide iteration method that
> would add to pagevec each radix tree entry (regardless of its order) only
> once and then use it in places where we care? Instead of strange dances
> like you do here?

Maybe. But it would require doubling the number of find_get_* helpers, or
adding a flag to each -- and we have too many already. And the multi-order
entry interface for the radix tree has not yet settled. I would rather
defer such a rework until the interface is fully shaped. Let's come back
to this later.

> Ultimately we could convert all the places to use these new iteration
> methods but I don't see that as immediately necessary and maybe there are
> places where getting all the subpages in the pagevec actually makes life
> simpler for us (please point me if you know about such place).

I did it the way I did so that I wouldn't have to evaluate each use of
find_get_*() one by one. I guessed that most callers of find_get_page()
would be confused by getting the head page instead of the relevant subpage.
Maybe I was wrong and it would have been easier to make the callers work
with that. I don't know...

> On a somewhat unrelated note: I've noticed that you don't invalidate
> a huge page when only part of it should be invalidated. That actually
> breaks some assumptions filesystems make. In particular direct IO code
> assumes that if you do
>
> 	filemap_write_and_wait_range(inode, start, end);
> 	invalidate_inode_pages2_range(inode, start, end);
>
> all the page cache covering start-end *will* be invalidated. Your skipping
> of partial pages breaks this assumption and thus can bring consistency
> issues (e.g. write done using direct IO won't be seen by following buffered
> read).
Actually, invalidate_inode_pages2_range() does invalidate the whole page if
any part of it falls in the range. I caught this problem during testing.

-- 
 Kirill A. Shutemov
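
P.S. [Editor's note] For readers following the thread, here is a minimal
sketch of the refcount expectation under discussion. This is a
simplification for illustration, not the actual mm/truncate.c logic: the
helper name below is made up; only page_count() is the real kernel API.

	/*
	 * Hypothetical helper: invalidate_inode_page() only drops a page
	 * from the page cache when exactly two references exist -- one
	 * held by the caller and one held by the radix tree.  A THP whose
	 * subpages are still pinned in a pagevec carries extra references
	 * on the head page (up to PAGEVEC_SIZE of them), so this check
	 * fails; hence the patch does get_page() + pagevec_release()
	 * before invalidating.
	 */
	static int page_cache_ref_is_exclusive(struct page *page)
	{
		return page_count(page) == 2;
	}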