On Wed, 29 Jan 2014 12:17:15 +0000 David Howells <dhowells@xxxxxxxxxx> wrote: > NeilBrown <neilb@xxxxxxx> wrote: > > > Analysis of the crash dump suggests that fscache_object_destroy, and thus > > __rb_erase_colour, is being called on an object that has already been > > destroy and is no longer in the rb tree. The rbtree code gets upset and > > crashes. > > Not unreasonably... But which rb_tree? There are two: > > (1) struct cachefiles_cache::active_nodes. > > This is governed by struct cachefiles_cache::active_lock. > > (2) fscache_object_list. > > This is governed by fscache_object_list_lock. > > Unless you have CONFIG_FSCACHE_OBJECT_LIST=y this isn't present and > fscache_objlist_remove() does nothing - in which case all > fscache_object_destroy() does is release the cookie. Ahhh. Two rb_trees. Of course. That does invalidate my theory somewhat. > > Can you poke around in the registers, see if any of them point to tree (2) > (which is a global variable). The only rb_tree that I've looked at the the one for the object being destroyed when the machine crashes. I walk up to the root and that exactly matches fscache_object_list.rb_node. It then walked down to enumerate the whole tree and the object being deleted isn't in it. Nor is the parent of the object being deleted. > > > Thus you can get a race > > ... > > cachefiles_mark_object_active increments > > ->usage (to 1) and drops the lock > > This is tree (1). > > > cachefiles_put_object calls > > fscache_object_destroy which > > unlinks from the rb tree. > > And this is tree (2). > > > cachefiles_objects live in an rbtree which does not imply a reference to > > the object. > > Whilst that is true, they're not allowed to be in the rbtree unless they still > have at least one reference outstanding. > > Apart from cachefiles_walk_to_object()'s "check_error" labelled part, objects > are only rb_erase()'d in cachefiles_drop_object(). This is called from the > fscache object state machine (fscache_drop_object) which holds a ref on the > cachefiles object until fscache_object_work_func() releases it just prior to > returning. Thanks - I see that now. The code does look safe. Which makes it all the more surprising that I have this crash-dump :-( I also have a crash-dump of a crash in fscache_put_operation() because FSCACHE_OP_DEAD is already set. This is in 3.0 - the code is quite different in mainline though none of the changes look like they would fix any whatever caused the bug. So I have two bug reports which both seem to suggest that something was freed twice. Yet I can't see anything that would allow that to happen. Most frustrating. If you have any ideas, I'd love to hear them :-) Thanks, NeilBrown
Attachment:
signature.asc
Description: PGP signature
-- Linux-cachefs mailing list Linux-cachefs@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cachefs