Hey Kotresh,

This is a known issue. We are evaluating some possible solutions.

The failure is because of changes introduced by https://review.gluster.org/10147 .

The GlusterD rpcsvc uses the synctask framework to provide multi-threading. GlusterD uses the synclock_t provided by the synctask framework to implement its big lock. The synctask framework provides userspace M:N multiplexing, where M tasks are mapped onto N threads. When a synctask tries to acquire an already locked synclock, the synctask framework will yield the task and put it to sleep to allow other tasks to execute. Once the lock can be acquired, the synctask framework will resume the swapped-out task. The task can be resumed on a completely different thread from the one it was put to sleep on.

Review 10147 introduced changes to the GlusterD transaction framework to make peerinfo access within the framework RCU compatible. In the transaction framework, GlusterD iterates over the list of peers and sends requests to other peers. The pseudo-code for this is as below:

```
Transaction starts
Get BIG_LOCK
.
.
Do other stuff
.
.
rcu_read_lock
for each peer in peers list, do
    Release BIG_LOCK
    Send request to peer
    Get BIG_LOCK
done
rcu_read_unlock
.
other stuff
.
.
Release BIG_LOCK
```

During the iteration, we give up the big-lock when sending an RPC request to prevent a deadlock, and obtain it again after sending the request. During the period when the transaction thread has given up the big-lock, another thread could have obtained it. The transaction thread is one of the threads started by the GlusterD rpcsvc using synctask. So when the thread tries to obtain the big-lock after sending the RPC request, it could get swapped out and resumed on another thread by synctask (as explained above).

If this thread swapping happens, it means that we are calling rcu_read_lock() on one thread, but rcu_read_unlock() on another thread. This by itself is a problem, as liburcu doesn't support a read-side critical section starting in one thread and ending in another. The particular flavour of liburcu we are using, bulletproof/bp (though it no longer seems quite so bulletproof), is what leads to the crash. liburcu requires every thread that will enter a read-side critical section to register itself (call rcu_register_thread). The bp flavour does this registration automatically, if required, when rcu_read_lock is called. In this case rcu_read_lock was called on one thread, but rcu_read_unlock was called on another thread, which was unregistered. rcu_read_unlock tried to access some TLS variables, which would have been created on thread registration, and that caused the segfault.
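To make the constraint concrete, here is a small standalone urcu-bp sketch (this is not GlusterD code; the struct and variable names are made up, and the header/build line may need tweaking for your liburcu version). It shows the usage liburcu expects: the read-side critical section starts and ends on the same thread, and with the bp flavour the implicit thread registration happens on that thread's first rcu_read_lock.

```
/* Standalone urcu-bp example, unrelated to the GlusterD sources.
 * Build with something like: gcc rcu-bp-demo.c -o rcu-bp-demo -lurcu-bp -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <urcu-bp.h>        /* bp flavour: no explicit rcu_register_thread needed */
#include <urcu-pointer.h>   /* rcu_dereference, rcu_assign_pointer */

struct config {
        int value;
};

static struct config *global_cfg;

static void *reader(void *arg)
{
        struct config *cfg;

        (void)arg;

        /* bp flavour: this thread gets registered with RCU here, on first
         * use. Registration sets up the per-thread (TLS) reader state. */
        rcu_read_lock();
        cfg = rcu_dereference(global_cfg);
        if (cfg)
                printf("reader saw value %d\n", cfg->value);
        /* Must run on the same thread that called rcu_read_lock() above;
         * running it on a different, unregistered thread is what crashed
         * here (rcu_read_unlock_bp in the backtrace). */
        rcu_read_unlock();

        return NULL;
}

int main(void)
{
        struct config *old_cfg = calloc(1, sizeof(*old_cfg));
        struct config *new_cfg = calloc(1, sizeof(*new_cfg));
        pthread_t t;

        old_cfg->value = 1;
        global_cfg = old_cfg;

        pthread_create(&t, NULL, reader, NULL);

        /* Writer side: publish a new version, wait out existing readers,
         * then reclaim the old one. */
        new_cfg->value = 2;
        rcu_assign_pointer(global_cfg, new_cfg);
        synchronize_rcu();
        free(old_cfg);

        pthread_join(t, NULL);
        free(new_cfg);
        return 0;
}
```

Conceptually this is the same pattern the transaction code follows; the difference is that synctask can migrate the task to another thread between the rcu_read_lock and the rcu_read_unlock, which a plain pthread reader like this one never experiences.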
Using an alternate flavour of liburcu and manually registering every thread with urcu would lead to other problems (RCU deadlocks!), as rcu_read_lock and rcu_read_unlock could still be called from different threads due to the thread swapping.

We are currently evaluating some possible solutions to this. We are trying to see if we can prevent the task from being swapped to another thread, as this is the only way we can get correct liburcu behaviour. I'll update here once we have a better plan.

Thanks,
Kaushal

On Thu, Apr 16, 2015 at 3:29 PM, Kotresh Hiremath Ravishankar <khiremat@xxxxxxxxxx> wrote:
> Hi All,
>
> I see glusterd SEGFAULT for my patch with the following stack trace. I see that is not related to my patch.
> Could someone look into this? I will retrigger regression for my patch.
>
> #0  0x00007f86f0968d16 in rcu_read_unlock_bp () from /home/kotresh/Downloads/regression/usr/lib64/liburcu-bp.so.1
> (gdb) bt
> #0  0x00007f86f0968d16 in rcu_read_unlock_bp () from /home/kotresh/Downloads/regression/usr/lib64/liburcu-bp.so.1
> #1  0x00007f86f1235467 in gd_commit_op_phase (op=GD_OP_START_VOLUME, op_ctx=0x7f86f9d5a230, req_dict=0x7f86f9d5bf2c, op_errstr=0x7f86e0244260,
>     txn_opinfo=0x7f86e02441e0) at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-syncop.c:1360
> #2  0x00007f86f1236366 in gd_sync_task_begin (op_ctx=0x7f86f9d5a230, req=0xcb6b8c)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-syncop.c:1736
> #3  0x00007f86f123654b in glusterd_op_begin_synctask (req=0xcb6b8c, op=GD_OP_START_VOLUME, dict=0x7f86f9d5a230)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-syncop.c:1787
> #4  0x00007f86f1221402 in __glusterd_handle_cli_start_volume (req=0xcb6b8c)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-volume-ops.c:471
> #5  0x00007f86f1190291 in glusterd_big_locked_handler (req=0xcb6b8c, actor_fn=0x7f86f122110d <__glusterd_handle_cli_start_volume>)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-handler.c:83
> #6  0x00007f86f12214a3 in glusterd_handle_cli_start_volume (req=0xcb6b8c)
>     at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/xlators/mgmt/glusterd/src/glusterd-volume-ops.c:489
> #7  0x00007f86fc375f66 in synctask_wrap (old_task=0x7f86e0041760) at /home/jenkins/root/workspace/rackspace-regression-2GB-triggered/libglusterfs/src/syncop.c:375
> #8  0x00007f86fb1508f0 in ?? () from /home/kotresh/Downloads/regression/lib64/libc.so.6
> #9  0x0000000000000000 in ?? ()
>
> Link to the core file:
> http://slave27.cloud.gluster.org/archived_builds/build-install-20150416:07:11:15.tar.bz2
>
> Thanks and Regards,
> Kotresh H R
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel