On Fri, Dec 20, 2019 at 03:54:55PM -0800, John Hubbard wrote:
> On 12/20/19 10:29 AM, Leon Romanovsky wrote:
> ...
> >> $ ./build.sh
> >> $ build/bin/run_tests.py
> >>
> >> If you get things that far I think Leon can get a reproduction for you
> >
> > I'm not so optimistic about that.
> >
>
> OK, I'm going to proceed for now on the assumption that I've got an overflow
> problem that happens when huge pages are pinned. If I can get more information,
> great, otherwise it's probably enough.
>
> One thing: for your repro, if you know the huge page size, and the system
> page size for that case, that would really help. Also the number of pins per
> page, more or less, that you'd expect. Because Jason says that only 2M huge
> pages are used...
>
> Because the other possibility is that the refcount really is going negative,
> likely due to a mismatched pin/unpin somehow.
>
> If there's not an obvious repro case available, but you do have one (is it easy
> to repro, though?), then *if* you have the time, I could point you to a github
> branch that reduces GUP_PIN_COUNTING_BIAS by, say, 4x, by applying this:

I'll see what I can do this Sunday.

>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bb44c4d2ada7..8526fd03b978 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1077,7 +1077,7 @@ static inline void put_page(struct page *page)
>   * get_user_pages and page_mkclean and other calls that race to set up page
>   * table entries.
>   */
> -#define GUP_PIN_COUNTING_BIAS (1U << 10)
> +#define GUP_PIN_COUNTING_BIAS (1U << 8)
>
>  void unpin_user_page(struct page *page);
>  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>
> If that fails to repro, then we would be zeroing in on the root cause.
>
> The branch is here (I just tested it and it seems healthy):
>
> git@xxxxxxxxxx:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags
>
>
>
> thanks,
> --
> John Hubbard
> NVIDIA
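
For reference, the overflow hypothesis quoted above can be put into rough numbers. This is only a back-of-the-envelope sketch, not kernel code. It assumes the page refcount is a 32-bit signed counter, that every pinned base page of a huge page is charged to the same counter, and that a 2M huge page is made of 4K system pages (512 base pages; the thread does not confirm the system page size). The helper name max_pins_before_overflow() is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the refcount is a 32-bit signed counter, so it turns
 * negative once it would exceed INT32_MAX. */
#define REFCOUNT_LIMIT ((uint64_t)INT32_MAX)

/*
 * Hypothetical helper (not a kernel function): number of logical pins that
 * fit before the counter overflows, when every pinned base page adds
 * (1 << bias_shift) and one logical pin covers pages_per_pin base pages
 * that are all charged to the same counter.
 */
static uint64_t max_pins_before_overflow(unsigned bias_shift,
                                         uint64_t pages_per_pin)
{
    assert(bias_shift < 31 && pages_per_pin > 0);

    /* Cost of one logical pin, in refcount units. */
    uint64_t cost = ((uint64_t)1 << bias_shift) * pages_per_pin;

    return REFCOUNT_LIMIT / cost;
}
```

Under these assumptions, max_pins_before_overflow(10, 512) is 4095 and max_pins_before_overflow(8, 512) is 16383, so the reduced bias gives roughly 4x more headroom for a fully pinned 2M huge page. If the failure stops reproducing with the smaller bias, that would point at overflow rather than a mismatched pin/unpin.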