On Thu, May 20, 2021 at 3:30 PM Jan Kara <jack@xxxxxxx> wrote:
> On Thu 20-05-21 14:25:36, Andreas Gruenbacher wrote:
> > Now that we handle self-recursion on the inode glock in gfs2_fault and
> > gfs2_page_mkwrite, we need to take care of more complex deadlock
> > scenarios like the following (example by Jan Kara):
> >
> > Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
> > M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
> > to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
> > can race like:
> >
> > P1                                P2
> > read()                            read()
> >   gfs2_file_read_iter()             gfs2_file_read_iter()
> >     gfs2_file_direct_read()           gfs2_file_direct_read()
> >       locks glock of F1                 locks glock of F2
> >       iomap_dio_rw()                    iomap_dio_rw()
> >         bio_iov_iter_get_pages()          bio_iov_iter_get_pages()
> >           <fault in M2>                     <fault in M1>
> >             gfs2_fault()                      gfs2_fault()
> >               tries to grab glock of F2         tries to grab glock of F1
> >
> > Those kinds of scenarios are much harder to reproduce than
> > self-recursion.
> >
> > We deal with such situations by using the LM_FLAG_OUTER flag to mark
> > "outer" glock taking. Then, when taking an "inner" glock, we use the
> > LM_FLAG_TRY flag so that locking attempts that don't immediately succeed
> > will be aborted. In case of a failed locking attempt, we "unroll" to
> > where the "outer" glock was taken, drop the "outer" glock, and fault in
> > the first offending user page. This will re-trigger the "inner" locking
> > attempt but without the LM_FLAG_TRY flag. Once that has happened, we
> > re-acquire the "outer" glock and retry the original operation.
> >
> > Reported-by: Jan Kara <jack@xxxxxxx>
> > Signed-off-by: Andreas Gruenbacher <agruenba@xxxxxxxxxx>
>
> ...
>
> > diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
> > index 7d88abb4629b..8b26893f8dc6 100644
> > --- a/fs/gfs2/file.c
> > +++ b/fs/gfs2/file.c
> > @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
> >  	vm_fault_t ret = VM_FAULT_LOCKED;
> >  	struct gfs2_holder gh;
> >  	unsigned int length;
> > +	u16 flags = 0;
> >  	loff_t size;
> >  	int err;
> >
> >  	sb_start_pagefault(inode->i_sb);
> >
> > -	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
> > +	if (current_holds_glock())
> > +		flags |= LM_FLAG_TRY;
> > +
> > +	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
> >  	if (likely(!outer_gh)) {
> >  		err = gfs2_glock_nq(&gh);
> >  		if (err) {
> >  			ret = block_page_mkwrite_return(err);
> > +			if (err == GLR_TRYFAILED) {
> > +				set_current_needs_retry(true);
> > +				ret = VM_FAULT_SIGBUS;
> > +			}
>
> I've checked to make sure but do_user_addr_fault() indeed calls do_sigbus()
> which raises the SIGBUS signal. So if the application does not ignore
> SIGBUS, your retry will be visible to the application and can cause all
> sorts of interesting results...

I would have noticed that, but no SIGBUS signals were actually
delivered. So we probably end up in kernelmode_fixup_or_oops() when in
kernel mode, which just does nothing in that case.

Andy Lutomirski, you've been involved with this, could you please shed
some light?

> So you probably need to add a new VM_FAULT_
> return code that will behave like VM_FAULT_SIGBUS except it will not raise
> the signal.

A new VM_FAULT_* flag might make the code easier to read, but I don't
know if we can have one.

> Otherwise it seems to me your approach should work.

Thanks a lot,
Andreas
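
For readers who want to play with the locking scheme from the patch
description outside the kernel, here is a rough user-space sketch. It is
not the gfs2 code: pthread mutexes stand in for glocks, and struct file,
fault_in(), direct_read() and run_crossed() are hypothetical names made up
for this illustration. The "resident" flag is an assumption that models
the faulted-in page staying available for the retry.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode and its glock. */
struct file {
	pthread_mutex_t glock;
	int faults;		/* faults served; protected by glock */
};

/*
 * "Inner" locking: serve a page fault on a mapping of @f.  With @try set,
 * give up instead of blocking (roughly what LM_FLAG_TRY asks for).
 */
static bool fault_in(struct file *f, bool try)
{
	if (try) {
		if (pthread_mutex_trylock(&f->glock) != 0)
			return false;	/* comparable to GLR_TRYFAILED */
	} else {
		pthread_mutex_lock(&f->glock);
	}
	f->faults++;
	pthread_mutex_unlock(&f->glock);
	return true;
}

/*
 * Direct read from @target into a buffer mapped from @buf.  Returns the
 * number of times the operation had to unroll and retry.
 */
static int direct_read(struct file *target, struct file *buf)
{
	bool resident = false;	/* assumption: page stays faulted in */
	int retries = 0;

	for (;;) {
		pthread_mutex_lock(&target->glock);	/* "outer" lock */
		if (resident || fault_in(buf, true)) {
			/* the actual I/O would happen here */
			pthread_mutex_unlock(&target->glock);
			return retries;
		}
		/* Unroll: drop the outer lock, fault in without "try". */
		pthread_mutex_unlock(&target->glock);
		fault_in(buf, false);
		resident = true;
		retries++;
	}
}

struct job {
	struct file *target, *buf;
	int rounds;
};

static void *worker(void *arg)
{
	struct job *j = arg;

	for (int i = 0; i < j->rounds; i++)
		direct_read(j->target, j->buf);
	return NULL;
}

/* P1 reads F1 into a mapping of F2 while P2 reads F2 into a mapping of F1. */
static int run_crossed(int rounds)
{
	struct file f1 = { PTHREAD_MUTEX_INITIALIZER, 0 };
	struct file f2 = { PTHREAD_MUTEX_INITIALIZER, 0 };
	struct job p1 = { &f1, &f2, rounds }, p2 = { &f2, &f1, rounds };
	pthread_t t1, t2;
	int rc;

	rc = pthread_create(&t1, NULL, worker, &p1);
	assert(rc == 0);
	rc = pthread_create(&t2, NULL, worker, &p2);
	assert(rc == 0);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return f1.faults + f2.faults;	/* one served fault per read */
}
```

Calling run_crossed() from a small main() and building with -pthread
should be enough to try it. Because the blocking fault_in() only ever runs
with the outer lock dropped, neither thread waits on one lock while
holding the other, which is the property the patch is after.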