On 12/27/19 1:56 PM, John Hubbard wrote:
...
>> It is ancient verification test (~10y) which is not an easy task to
>> make it understandable and standalone :).
>>
>
> Is this the only test that fails, btw? No other test failures or hints of
> problems?
>
> (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> characterization of the test, such as how many pins per page, what page
> sizes are used? I'm still hoping to write a test to trigger something
> close to this...)
>
> I do have a couple more ideas for test runs:
>
> 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> page->_refcount into a no-op, and so if all is well (it may not be!) with the
> rest of the patch, then we'd expect this problem to not reappear.
>
> 2. Activate /proc/vmstat *foll_pin* statistics unconditionally (just for these
> tests, of course), so we can see if there is a get/put mismatch. However, that
> will change the timing, and so it must be attempted independently of (1), in
> order to see if it ends up hiding the repro.
>
> I've updated this branch to implement (1), but not (2), hoping you can give
> this one a spin?
>
> git@xxxxxxxxxx:johnhubbard/linux.git pin_user_pages_tracking_v11_with_diags
>

Also, looking ahead:

a) if the problem disappears with the latest test above, then we likely have
   a huge page refcount overflow, and there are a couple of different ways to
   fix it.

b) if it still reproduces with the above, then it's some other random mistake,
   and in that case I'd be inclined to do a sort of guided (or classic,
   unguided) git bisect of the series, because it could be any of several
   patches.

   If that's too much trouble, then I'd have to fall back to submitting a few
   patches at a time and working my way up to the tracking patch...

thanks,
-- 
John Hubbard
NVIDIA