On 8/10/18 2:24 PM, Ross Zwisler wrote:
> On Fri, Aug 10, 2018 at 9:23 AM Dave Jiang <dave.jiang@xxxxxxxxx> wrote:
>> On 08/10/2018 11:31 AM, Eric Sandeen wrote:
>>> On 8/8/18 12:31 PM, Dave Jiang wrote:
>>>> This patch is the duplicate of ross's fix for ext4 for xfs.
>>>>
>>>> If the refcount of a page is lowered between the time that it is returned
>>>> by dax_busy_page() and when the refcount is again checked in
>>>> xfs_break_layouts() => ___wait_var_event(), the waiting function
>>>> xfs_wait_dax_page() will never be called. This means that
>>>> xfs_break_layouts() will still have 'retry' set to false, so we'll stop
>>>> looping and never check the refcount of other pages in this inode.
>>>>
>>>> Instead, always continue looping as long as dax_layout_busy_page() gives us
>>>> a page which it found with an elevated refcount.
>>>
>>> Hi Dave, does this have a testcase? Have you seen the issue using Ross's
>>> xfstest generic/503 or is there some other test? Apologies if I missed
>>> prior discussion on a testcase or race frequency...
>>
>> I do not have a testcase. I know Ross replicated it on ext4. And Jan
>> asked to create the same fix with XFS when he reviewed Ross's fix for ext4.
>
> In my testing I couldn't get this race to hit with XFS. I couldn't
> even get a failure with generic/503 when testing XFS before Dan's
> initial patches went in which added xfs_break_layouts() et al. I
> think that Dan had to manually insert timing delays to get the warning
> to hit for XFS when testing his patches.
>
> The race we're fixing happens consistently with ext4 and through code
> inspection we can see that the race exists in XFS.

Ok, thanks for the info Dave & Ross!

-Eric