Hi Jeff,
On 27/04/16 20:01, Jeff Darcy wrote:
One of the "rewards" of reviewing and merging people's patches is getting email if the next regression-test-burn-in should fail - even if it fails for a completely unrelated reason. Today I got one that's not among the usual suspects. The failure was a core dump in tests/bugs/disperse/bug-1304988.t, weighing in at a respectable 42 frames.
#0 0x00007fef25976cb9 in dht_rename_lock_cbk
#1 0x00007fef25955f62 in dht_inodelk_done
#2 0x00007fef25957352 in dht_blocking_inodelk_cbk
#3 0x00007fef32e02f8f in default_inodelk_cbk
#4 0x00007fef25c029a3 in ec_manager_inodelk
#5 0x00007fef25bf9802 in __ec_manager
#6 0x00007fef25bf990c in ec_manager
#7 0x00007fef25c03038 in ec_inodelk
#8 0x00007fef25bee7ad in ec_gf_inodelk
#9 0x00007fef25957758 in dht_blocking_inodelk_rec
#10 0x00007fef25957b2d in dht_blocking_inodelk
#11 0x00007fef2597713f in dht_rename_lock
#12 0x00007fef25977835 in dht_rename
#13 0x00007fef32e0f032 in default_rename
#14 0x00007fef32e0f032 in default_rename
#15 0x00007fef32e0f032 in default_rename
#16 0x00007fef32e0f032 in default_rename
#17 0x00007fef32e0f032 in default_rename
#18 0x00007fef32e07c29 in default_rename_resume
#19 0x00007fef32d8ed40 in call_resume_wind
#20 0x00007fef32d98b2f in call_resume
#21 0x00007fef24cfc568 in open_and_resume
#22 0x00007fef24cffb99 in ob_rename
#23 0x00007fef24aee482 in mdc_rename
#24 0x00007fef248d68e5 in io_stats_rename
#25 0x00007fef32e0f032 in default_rename
#26 0x00007fef2ab1b2b9 in fuse_rename_resume
#27 0x00007fef2ab12c47 in fuse_fop_resume
#28 0x00007fef2ab107cc in fuse_resolve_done
#29 0x00007fef2ab108a2 in fuse_resolve_all
#30 0x00007fef2ab10900 in fuse_resolve_continue
#31 0x00007fef2ab0fb7c in fuse_resolve_parent
#32 0x00007fef2ab1077d in fuse_resolve
#33 0x00007fef2ab10879 in fuse_resolve_all
#34 0x00007fef2ab10900 in fuse_resolve_continue
#35 0x00007fef2ab0fb7c in fuse_resolve_parent
#36 0x00007fef2ab1077d in fuse_resolve
#37 0x00007fef2ab10824 in fuse_resolve_all
#38 0x00007fef2ab1093e in fuse_resolve_and_resume
#39 0x00007fef2ab1b40e in fuse_rename
#40 0x00007fef2ab2a96a in fuse_thread_proc
#41 0x00007fef3204daa1 in start_thread
In other words we started at FUSE, went through a bunch of performance translators, through DHT to EC, and then crashed on the way back. It seems a little odd that we turn the fop around immediately in EC, and that we have default_inodelk_cbk at frame 3. Could one of the DHT or EC people please take a look at it? Thanks!
The part regarding ec seems ok. This is uncommon, but it can happen.
When ec_gf_inodelk() is called, it sends an inodelk request to all its
subvolumes. The callbacks of all these requests may be received before
ec_gf_inodelk() itself returns. In that case the callback to the caller
is executed in the caller's own thread.
default_inodelk_cbk() appears in the trace because ec uses this
function to report the result back to the caller (instead of calling
STACK_UNWIND() itself).
This seems to be what happened here.
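To illustrate why the callback can show up below the caller in the same
stack, here is a minimal toy model. It is only a sketch: all toy_* names
are hypothetical and this is not the GlusterFS API. It assumes every
child answers immediately, so the last child callback triggers the
parent callback while the winding function is still on the stack.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SUBVOLS 6

/* Hypothetical stand-in for a fop: tracks outstanding child replies. */
typedef struct toy_fop {
    int pending;          /* replies not yet received */
    int unwound;          /* set once the parent callback has run */
    int depth_at_unwind;  /* call depth when the parent callback ran */
    void (*parent_cbk)(struct toy_fop *fop);
} toy_fop_t;

/* Number of toy "wind" functions currently on the stack. */
static int call_depth = 0;

/* Plays the role of the upper xlator's callback. */
static void toy_parent_cbk(toy_fop_t *fop)
{
    fop->unwound = 1;
    fop->depth_at_unwind = call_depth;
}

/* Child reply: the last one to arrive reports back to the parent. */
static void toy_child_cbk(toy_fop_t *fop)
{
    if (--fop->pending == 0)
        fop->parent_cbk(fop);
}

/* A child that answers immediately, before returning to its caller. */
static void toy_child_inodelk(toy_fop_t *fop)
{
    call_depth++;
    toy_child_cbk(fop);
    call_depth--;
}

/* Sends the request to every child, like a fan-out to all subvolumes. */
static void toy_inodelk(toy_fop_t *fop)
{
    call_depth++;
    fop->pending = NUM_SUBVOLS;
    for (int i = 0; i < NUM_SUBVOLS; i++)
        toy_child_inodelk(fop);
    call_depth--;
}

/* Returns the call depth at which the parent callback was invoked. */
int toy_unwind_depth(void)
{
    toy_fop_t fop = { 0, 0, 0, toy_parent_cbk };

    toy_inodelk(&fop);
    assert(fop.unwound);
    return fop.depth_at_unwind;
}
```

A non-zero result from toy_unwind_depth() means the parent callback ran
while toy_inodelk() was still active, which is the same shape as the
backtrace above.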
The frames returned by ec to upper xlators are the same ones they passed
down (the frame in dht_blocking_lock() is the same one that
dht_blocking_inodelk_cbk() receives), and ec doesn't touch them. However,
the frame at 0x7fef1003ca5c is completely corrupted.
We can see the call state from the core:
(gdb) f 4
#4 0x00007fef25c029a3 in ec_manager_inodelk (fop=0x7fef1000d37c, state=5) at /home/jenkins/root/workspace/regression-test-burn-in/xlators/cluster/ec/src/ec-locks.c:645
645 fop->cbks.inodelk(fop->req_frame, fop, fop->xl,
(gdb) print fop->answer
$30 = (ec_cbk_data_t *) 0x7fef180094ac
(gdb) print fop->answer->op_ret
$31 = 0
(gdb) print fop->answer->op_errno
$32 = 0
(gdb) print fop->answer->count
$33 = 6
(gdb) print fop->answer->mask
$34 = 63
As we can see, there is an actual answer to the request with a
successful result (op_ret == 0 and op_errno == 0), built by combining
the answers from 6 subvolumes (count == 6).
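As a side note, mask == 63 is 0x3f, which has exactly six bits set. If
we assume the mask holds one bit per subvolume that contributed to the
answer (an assumption of this sketch, with hypothetical toy_* helper
names), it agrees with count == 6. A small check of that arithmetic:

```c
#include <stdint.h>

/* Count the bits set in mask (Kernighan's method). */
int toy_popcount(uint64_t mask)
{
    int n = 0;

    while (mask) {
        mask &= mask - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Returns 1 when the number of set bits in mask equals count. */
int toy_answer_is_consistent(uint64_t mask, int count)
{
    return toy_popcount(mask) == count;
}
```

With the values from the core, toy_answer_is_consistent(63, 6) holds.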
Looking at the dht code, I have been unable to see any possible cause either.
The test is doing renames where source and target directories are
different. At the same time a new ec-set is added and rebalance started.
Rebalance will cause dht to also move files between bricks. Maybe this
is causing a race in dht?
I'll try to continue investigating when I have some time.
Xavi
https://build.gluster.org/job/regression-test-burn-in/868/console
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel