On Tuesday, 25 March 2014 at 20:55 -0500, Alex Elder wrote:
> On 03/25/2014 08:50 PM, Olivier Bonvalet wrote:
> > On Wednesday, 26 March 2014 at 02:33 +0100, Olivier Bonvalet wrote:
> >> Thanks for your patch.
> >>
> >> This is an output of a crash case :
> >>
> >> Mar 26 02:31:18 alg kernel: [ 965.366895] rbd_img_obj_callback: bad image object request information:
> >> Mar 26 02:31:18 alg kernel: [ 965.366905] obj_request ffff880224bc9528
> >> Mar 26 02:31:18 alg kernel: [ 965.366909] ->object_name <(null)>
> >> Mar 26 02:31:18 alg kernel: [ 965.366913] ->offset 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366917] ->length 4096
> >> Mar 26 02:31:18 alg kernel: [ 965.366921] ->type 0x1
> >> Mar 26 02:31:18 alg kernel: [ 965.366925] ->flags 0x3
> >> Mar 26 02:31:18 alg kernel: [ 965.366929] ->img_request (null)
> >> Mar 26 02:31:18 alg kernel: [ 965.366933] ->which 4294967295
> >> Mar 26 02:31:18 alg kernel: [ 965.366936] ->xferred 4096
> >> Mar 26 02:31:18 alg kernel: [ 965.366940] ->result 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366943] ->kref 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366947] img_request ffff880222f4fb50
> >> Mar 26 02:31:18 alg kernel: [ 965.366950] ->snap 0xfffffffffffffffe
> >> Mar 26 02:31:18 alg kernel: [ 965.366954] ->offset 1417662464
> >> Mar 26 02:31:18 alg kernel: [ 965.366957] ->length 16384
> >> Mar 26 02:31:18 alg kernel: [ 965.366960] ->flags 0x0
> >> Mar 26 02:31:18 alg kernel: [ 965.366963] ->obj_request_count 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366966] ->next_completion 2
> >> Mar 26 02:31:18 alg kernel: [ 965.366969] ->xferred 16384
> >> Mar 26 02:31:18 alg kernel: [ 965.366973] ->result 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366976] ->obj_requests head ffff880222f4fbb0
> >> Mar 26 02:31:18 alg kernel: [ 965.366980] ->kref 0
> >> Mar 26 02:31:18 alg kernel: [ 965.366985]
> >> Mar 26 02:31:18 alg kernel: [ 965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
> >> Mar 26 02:31:18 alg kernel: [ 965.366985]
> >> Mar 26 02:31:18 alg kernel: [ 965.366985] rbd_assert(which == img_request->next_completion);
> >> Mar 26 02:31:18 alg kernel: [ 965.366985]
> >> Mar 26 02:31:18 alg kernel: [ 965.367185] ------------[ cut here ]------------
> >> Mar 26 02:31:18 alg kernel: [ 965.367241] kernel BUG at drivers/block/rbd.c:2165!
> >>
> >>
> >> I hope it can help.
> >>
> >>
> >
> Thanks for sending these.
>
> >
> > and a second one, very similar :
> >
> > Mar 26 02:48:27 alg kernel: [ 681.167833] rbd_img_obj_callback: bad image object request information:
> > Mar 26 02:48:27 alg kernel: [ 681.167836] obj_request ffff88022e1e2828
> > Mar 26 02:48:27 alg kernel: [ 681.167837] ->object_name <(null)>
> > Mar 26 02:48:27 alg kernel: [ 681.167838] ->offset 0
> > Mar 26 02:48:27 alg kernel: [ 681.167839] ->length 4096
> > Mar 26 02:48:27 alg kernel: [ 681.167840] ->type 0x1
> > Mar 26 02:48:27 alg kernel: [ 681.167840] ->flags 0x3
> > Mar 26 02:48:27 alg kernel: [ 681.167841] ->img_request (null)
> > Mar 26 02:48:27 alg kernel: [ 681.167842] ->which 4294967295
> > Mar 26 02:48:27 alg kernel: [ 681.167843] ->xferred 4096
> > Mar 26 02:48:27 alg kernel: [ 681.167844] ->result 0
> > Mar 26 02:48:27 alg kernel: [ 681.167844] ->kref 0
>
> This confirms the reference count of the object request has gone
> to zero. This object request has already been destroyed (yet
> we're handling a callback for it).
>
> > Mar 26 02:48:27 alg kernel: [ 681.167845] img_request ffff88021f555f10
> > Mar 26 02:48:27 alg kernel: [ 681.167846] ->snap 0xfffffffffffffffe
> > Mar 26 02:48:27 alg kernel: [ 681.167847] ->offset 28072464384
> > Mar 26 02:48:27 alg kernel: [ 681.167847] ->length 16384
> > Mar 26 02:48:27 alg kernel: [ 681.167848] ->flags 0x0
> > Mar 26 02:48:27 alg kernel: [ 681.167849] ->obj_request_count 0
> > Mar 26 02:48:27 alg kernel: [ 681.167850] ->next_completion 2
> > Mar 26 02:48:27 alg kernel: [ 681.167850] ->xferred 16384
> > Mar 26 02:48:27 alg kernel: [ 681.167851] ->result 0
> > Mar 26 02:48:27 alg kernel: [ 681.167852] ->obj_requests head ffff88021f555f70
>
> The object request list is empty.
>
> > Mar 26 02:48:27 alg kernel: [ 681.167853] ->kref 0
>
> This confirms the reference count of the image request has gone
> to zero. So not only has the object request already completed,
> the image request has as well.
>
> I'm almost done composing a very large e-mail with some detailed
> analysis. No answer quite yet, but I am certain that we're
> getting duplicate callbacks on the second object request of
> an image request that spans two objects. That should help
> narrow the search for the root cause.
>
> -Alex

Thanks again for taking the time to analyze this problem.

All my RBD images have daily snapshots; could this bug be related to
snapshots?

Maybe it's a stupid question, but is there a workaround I could use to
reduce the impact of this problem in production until a proper fix is
found?