Re: locking/refcount problems in cachefiles.

NeilBrown <neilb@xxxxxxx> · Mon, 3 Feb 2014 11:32:45 +1100

On Wed, 29 Jan 2014 12:17:15 +0000 David Howells <dhowells@xxxxxxxxxx> wrote:

> NeilBrown <neilb@xxxxxxx> wrote:
> 
> >  Analysis of the crash dump suggests that fscache_object_destroy, and thus
> >  __rb_erase_colour, is being called on an object that has already been
> >  destroy and is no longer in the rb tree.  The rbtree code gets upset and
> >  crashes.
> 
> Not unreasonably...  But which rb_tree?  There are two:
> 
>  (1) struct cachefiles_cache::active_nodes.
> 
>      This is governed by struct cachefiles_cache::active_lock.
> 
>  (2) fscache_object_list.
> 
>      This is governed by fscache_object_list_lock.
> 
>      Unless you have CONFIG_FSCACHE_OBJECT_LIST=y this isn't present and
>      fscache_objlist_remove() does nothing - in which case all
>      fscache_object_destroy() does is release the cookie.

Ahhh.  Two rb_trees.  Of course.  That does invalidate my theory somewhat.

> 
> Can you poke around in the registers, see if any of them point to tree (2)
> (which is a global variable).

The only rb_tree that I've looked at the the one for the object being
destroyed when the machine crashes.
I walk up to the root and that exactly matches fscache_object_list.rb_node.
It then walked down to enumerate the whole tree and the object being deleted
isn't in it.  Nor is the parent of the object being deleted.

> 
> >   Thus you can get a race
> > ...
> >                                     cachefiles_mark_object_active increments
> >                                        ->usage (to 1) and drops the lock
> 
> This is tree (1).
> 
> >      cachefiles_put_object calls
> >         fscache_object_destroy which
> >         unlinks from the rb tree.
> 
> And this is tree (2).
> 
> >   cachefiles_objects live in an rbtree which does not imply a reference to
> >   the object.
> 
> Whilst that is true, they're not allowed to be in the rbtree unless they still
> have at least one reference outstanding.
> 
> Apart from cachefiles_walk_to_object()'s "check_error" labelled part, objects
> are only rb_erase()'d in cachefiles_drop_object().  This is called from the
> fscache object state machine (fscache_drop_object) which holds a ref on the
> cachefiles object until fscache_object_work_func() releases it just prior to
> returning.

Thanks - I see that now.  The code does look safe.

Which makes it all the more surprising that I have this crash-dump :-(

I also have a crash-dump of a crash in fscache_put_operation() because
FSCACHE_OP_DEAD is already set.  This is in 3.0 - the code is quite different
in mainline though none of the changes look like they would fix any
whatever caused the bug.

So I have two bug reports which both seem to suggest that something was freed
twice. Yet I can't see anything that would allow that to happen.
Most frustrating.

If you have any ideas, I'd love to hear them :-)

Thanks,
NeilBrown

Attachment:
signature.asc

Description: PGP signature
--
Linux-cachefs mailing list
Linux-cachefs@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cachefs