Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

Jan Kara <jack@xxxxxxx> · Fri, 16 Feb 2024 11:15:46 +0100

On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <jack@xxxxxxx> [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara <jack@xxxxxxx> [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > > 
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > > 
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > > 
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > > 
> > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
> > > > 
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > > 
> > > If you are stopping, you should be pausing the iteration.  Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue.  We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> > 
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
> 
> Ah, I do have them - but I don't want to propagate its use as the dream
> is that it can be removed.
> 
> 
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
> 
> I'm not sure I follow the trap logic.  There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree).  If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
> 
> The limitations are outlined in the documentation as to how and when to
> lock.  I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
> 
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause().  We have the
> same on maple tree; mas_pause().  This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR