On Fri, Sep 06, 2019 at 03:59:28PM +0300, Kirill A. Shutemov wrote: > > @@ -248,6 +248,15 @@ pgoff_t page_cache_prev_miss(struct address_space *mapping, > > #define FGP_NOFS 0x00000010 > > #define FGP_NOWAIT 0x00000020 > > #define FGP_FOR_MMAP 0x00000040 > > +/* > > + * If you add more flags, increment FGP_ORDER_SHIFT (no further than 25). > > Maybe some BUILD_BUG_ON()s to ensure FGP_ORDER_SHIFT is sane? Yeah, probably a good idea. > > +/** > > + * __find_get_page - Find and get a page cache entry. > > + * @mapping: The address_space to search. > > + * @offset: The page cache index. > > + * @order: The minimum order of the entry to return. > > + * > > + * Looks up the page cache entries at @mapping between @offset and > > + * @offset + 2^@order. If there is a page cache page, it is returned with > > Off by one? :P Hah! I thought it reasonable to be ambiguous in the English description ... it's not entirely uncommon to describe something being 'between A and B' when meaning ">= A and < B". > > +static struct page *__find_get_page(struct address_space *mapping, > > + unsigned long offset, unsigned int order) > > +{ > > + XA_STATE(xas, &mapping->i_pages, offset); > > + struct page *page; > > + > > + rcu_read_lock(); > > +repeat: > > + xas_reset(&xas); > > + page = xas_find(&xas, offset | ((1UL << order) - 1)); > > Hm. '|' is confusing. What is expectation about offset? > Is round_down(offset, 1UL << order) expected to be equal offset? > If yes, please use '+' instead of '|'. Might make sense to put in ... VM_BUG_ON(offset & ((1UL << order) - 1)); > > + if (xas_retry(&xas, page)) > > + goto repeat; > > + /* > > + * A shadow entry of a recently evicted page, or a swap entry from > > + * shmem/tmpfs. Skip it; keep looking for pages. > > + */ > > + if (xa_is_value(page)) > > + goto repeat; > > + if (!page) > > + goto out; > > + if (compound_order(page) < order) { > > + page = XA_RETRY_ENTRY; > > + goto out; > > + } > > compound_order() is not stable if you don't have pin on the page. > Check it after page_cache_get_speculative(). Maybe check both before and after? If we check it before, we don't bother to bump the refcount on a page which is too small. > > @@ -1632,6 +1696,10 @@ EXPORT_SYMBOL(find_lock_entry); > > * - FGP_FOR_MMAP: Similar to FGP_CREAT, only we want to allow the caller to do > > * its own locking dance if the page is already in cache, or unlock the page > > * before returning if we had to add the page to pagecache. > > + * - FGP_PMD: We're only interested in pages at PMD granularity. If there > > + * is no page here (and FGP_CREATE is set), we'll create one large enough. > > + * If there is a smaller page in the cache that overlaps the PMD page, we > > + * return %NULL and do not attempt to create a page. > > Is it really the best inteface? > > Maybe allow user to ask bitmask of allowed orders? For THP order-0 is fine > if order-9 has failed. That's the semantics that filemap_huge_fault() wants. If the page isn't available at order-9, it needs to return VM_FAULT_FALLBACK (and the VM will call into filemap_fault() to handle the regular sized fault). Now, maybe there are other users who want to specify "create a page of this size if you can, but if there's already something there smaller, return that". We can add another FGP flag when those show up ;-) Thanks for the review.