Re: glusterd stuck for glusterfs with version 3.12.15

Hi,

I made a patch, and according to my testing the glusterd stuck issue disappears with it. The only change needed is to move the event_handled call to the end of the socket_event_poll_in function.

 

--- a/rpc/rpc-transport/socket/src/socket.c
+++ b/rpc/rpc-transport/socket/src/socket.c
@@ -2305,9 +2305,6 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
         }
 
-        if (notify_handled && (ret != -1))
-                event_handled (ctx->event_pool, priv->sock, priv->idx,
-                               priv->gen);
@@ -2330,6 +2327,9 @@ socket_event_poll_in (rpc_transport_t *this, gf_boolean_t notify_handled)
                 }
                 pthread_mutex_unlock (&priv->notify.lock);
         }
+        if (notify_handled && (ret != -1))
+                event_handled (ctx->event_pool, priv->sock, priv->idx,
+                               priv->gen);
         return ret;
 }
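To make the ordering concrete, here is a minimal standalone sketch (not GlusterFS code; handle_pollin, request_ctx and the other names are made up for illustration) of the principle the move enforces: with a one-shot event registration, the fd is re-armed only after the per-event state has been torn down, so another worker thread cannot be woken for the same socket while this thread is still freeing the previous request's resources.

/*
 * Illustrative sketch only -- not the GlusterFS socket code.
 * The re-arm call at the end plays the role of event_handled().
 */
#include <stdlib.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

struct request_ctx {
        char    buf[4096];
        ssize_t len;
};

static int
handle_pollin (int epfd, int fd)
{
        struct request_ctx *ctx = calloc (1, sizeof (*ctx));
        if (!ctx)
                return -1;

        ctx->len = read (fd, ctx->buf, sizeof (ctx->buf));
        /* ... hand the data off to the upper layers here ... */

        free (ctx);                       /* tear down per-event state first */

        struct epoll_event ev;
        memset (&ev, 0, sizeof (ev));
        ev.events  = EPOLLIN | EPOLLONESHOT;
        ev.data.fd = fd;
        /* only now re-arm the fd, so no second thread can race on it
         * while the context above is still being destroyed */
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &ev);
}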

 

cynthia

From: Zhou, Cynthia (NSB - CN/Hangzhou)
Sent: Tuesday, April 09, 2019 3:57 PM
To: 'Raghavendra Gowdappa' <rgowdapp@xxxxxxxxxx>
Cc: gluster-devel@xxxxxxxxxxx
Subject: RE: glusterd stuck for glusterfs with version 3.12.15

 

Can you think of a possible reason why the iobref is corrupted? Is it possible that thread 8 already handled poll-in and the iobref was released, but the lock inside it was never properly destroyed (I cannot find any lock-destroy operation in iobref_destroy), and then thread 9 called rpc_transport_pollin_destroy again and got stuck on that lock?

Also, there should not be two threads handling the same socket at the same time, although there is already a patch that claims to address this issue.

 

cynthia

 

From: Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx>
Sent: Tuesday, April 09, 2019 3:52 PM
To: Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou@xxxxxxxxxxxxxxx>
Cc: gluster-devel@xxxxxxxxxxx
Subject: Re: glusterd stuck for glusterfs with version 3.12.15

On Mon, Apr 8, 2019 at 7:42 AM Zhou, Cynthia (NSB - CN/Hangzhou) <cynthia.zhou@xxxxxxxxxxxxxxx> wrote:

Hi glusterfs experts,

Good day!

In my test environment, glusterd sometimes gets stuck and stops responding to any gluster commands. When I checked the issue I found that glusterd threads 8 and 9 were handling the same socket. I thought the following patch should solve this, but the issue still exists after merging it. Looking at the code, socket_event_poll_in calls event_handled before rpc_transport_pollin_destroy; I think this opens a window for another poll on exactly the same socket, which causes the glusterd stuck issue. I also noticed there is no LOCK_DESTROY(&iobref->lock) in iobref_destroy, and I think it would be better to add one.
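For illustration, here is a standalone sketch (not the actual iobuf.c code; buf_ref and its functions are invented names) of the pattern being suggested: destroy the lock embedded in an object inside its destroy routine, before the memory is freed, so a late unref cannot take a lock that now lives in freed memory. In iobref_destroy this would correspond to a LOCK_DESTROY(&iobref->lock) before the final free.

/* Illustrative sketch only -- not GlusterFS code. */
#include <pthread.h>
#include <stdlib.h>

struct buf_ref {
        pthread_mutex_t lock;
        int             ref;
};

static struct buf_ref *
buf_ref_new (void)
{
        struct buf_ref *br = calloc (1, sizeof (*br));
        if (!br)
                return NULL;
        pthread_mutex_init (&br->lock, NULL);
        br->ref = 1;
        return br;
}

static void
buf_ref_destroy (struct buf_ref *br)
{
        /* analogous to adding LOCK_DESTROY (&iobref->lock) in
         * iobref_destroy() before the final free of the object */
        pthread_mutex_destroy (&br->lock);
        free (br);
}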

Following is the gdb info from when this issue happened; I would like to know your opinion on it, thanks!

 

SHA-1: f747d55a7fd364e2b9a74fe40360ab3cb7b11537

 

* socket: fix issue on concurrent handle of a socket

GDB INFO:

Thread 8 is blocked on pthread_cond_wait, and thread 9 is blocked in iobref_unref, I think:

Thread 9 (Thread 0x7f9edf7fe700 (LWP 1933)):

#0  0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0

#2  0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944

#3  0x00007f9eeafd2f29 in rpc_transport_pollin_destroy (pollin=0x7f9ed00452d0) at rpc-transport.c:123

#4  0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0, notify_handled=_gf_true) at socket.c:2322

#5  0x00007f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4, data="" poll_in=1, poll_out=0, poll_err=0) at socket.c:2471

#6  0x00007f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583

#7  0x00007f9eeb2828ab in event_dispatch_epoll_worker (data="" at event-epoll.c:659

#8  0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0

#9  0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6

 

Thread 8 (Thread 0x7f9edffff700 (LWP 1932)):

#0  0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007f9ee9fd2b42 in __pthread_mutex_cond_lock () from /lib64/libpthread.so.0

#2  0x00007f9ee9fd44a8 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#3  0x00007f9ee4fbadab in socket_event_poll_err (this=0x7f9ed0049cc0, gen=4, idx=27) at socket.c:1201

#4  0x00007f9ee4fbf99c in socket_event_handler (fd=36, idx=27, gen=4, data="" poll_in=1, poll_out=0, poll_err=0) at socket.c:2480

#5  0x00007f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edfffee84) at event-epoll.c:583

#6  0x00007f9eeb2828ab in event_dispatch_epoll_worker (data="" at event-epoll.c:659

#7  0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0

#8  0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6

 

(gdb) thread 9

[Switching to thread 9 (Thread 0x7f9edf7fe700 (LWP 1933))]

#0  0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0

(gdb) bt

#0  0x00007f9ee9fd785c in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007f9ee9fda657 in __lll_lock_elision () from /lib64/libpthread.so.0

#2  0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944

#3  0x00007f9eeafd2f29 in rpc_transport_pollin_destroy (pollin=0x7f9ed00452d0) at rpc-transport.c:123

#4  0x00007f9ee4fbf319 in socket_event_poll_in (this=0x7f9ed0049cc0, notify_handled=_gf_true) at socket.c:2322

#5  0x00007f9ee4fbf932 in socket_event_handler (fd=36, idx=27, gen=4, data="" poll_in=1, poll_out=0, poll_err=0) at socket.c:2471

#6  0x00007f9eeb2825d4 in event_dispatch_epoll_handler (event_pool=0x17feb00, event=0x7f9edf7fde84) at event-epoll.c:583

#7  0x00007f9eeb2828ab in event_dispatch_epoll_worker (data="" at event-epoll.c:659

#8  0x00007f9ee9fce5da in start_thread () from /lib64/libpthread.so.0

#9  0x00007f9ee98a4eaf in clone () from /lib64/libc.so.6

(gdb) frame 2

#2  0x00007f9eeb24cae6 in iobref_unref (iobref=0x7f9ed00063b0) at iobuf.c:944

944 iobuf.c: No such file or directory.

(gdb) print *iobref

$1 = {lock = {spinlock = 2, mutex = {__data = {__lock = 2, __count = 222, __owner = -2120437760, __nusers = 1, __kind = 8960, __spins = 512,

        __elision = 0, __list = {__prev = 0x4000, __next = 0x7f9ed00063b000}},

      __size = "\002\000\000\000\336\000\000\000\000\260\234\201\001\000\000\000\000#\000\000\000\002\000\000\000@\000\000\000\000\000\000\000\260c\000О\177", __align = 953482739714}}, ref = -256, iobrefs = 0xffffffffffffffff, alloced = -1, used = -1}

 

Looks like the iobref is corrupted here. It seems to be a use-after-free issue. We need to dig into why a freed iobref is being accessed here.
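One common way to surface a use-after-free like this is to poison objects before freeing them, so that any later access trips over an obvious byte pattern in the core dump instead of half-valid data; building with AddressSanitizer (-fsanitize=address) or running under valgrind are heavier-weight alternatives. The snippet below is only a generic sketch of the poisoning idea, not existing GlusterFS code.

/* Generic debugging sketch, not GlusterFS code. */
#include <stdlib.h>
#include <string.h>

static void
poison_and_free (void *ptr, size_t size)
{
        /* a 0xDE fill makes freed objects easy to recognise in a dump */
        memset (ptr, 0xDE, size);
        free (ptr);
}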

 

(gdb) quit

A debugging session is active.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
