答复: Pipe "deadlock" in Hammer, 0.94.5

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



In our test, when one machine is under heavy packet loss, OSDs on other machines can be brought down and sometimes more than one OSD go down, because they may try to read a message that comes from that machine. So we are really concerned, please help us. Thanks


-----邮件原件-----
发件人: 许雪寒 
发送时间: 2017年1月11日 15:13
收件人: 'Sage Weil'
抄送: ceph-devel@xxxxxxxxxxxxxxx
主题: 答复: Pipe "deadlock" in Hammer, 0.94.5

Thanks for your reply, sir:-)

Actually, this case is not very rare in our test. When iptables drop ip packets with the probability around 98%~99%, this case occurs about every three times our test runs. 
I checked #14120 as you recommended, however, it doesn't seem to be our problem. Because, as http://tracker.ceph.com/issues/18184 says, #14120 is caused by a commit that changed Pipe::tcp_read_wait() to return -errno instead of "-1" and this commit is not applied in our tested hammer verion(0.94.5).
And I truly agree with you in that the invocation of "recv" function passes MSG_DONTWAIT flag, so it shouldn't block, however, every time we encounter this problem it's always the Pipe::reader_thread who's calling recv that hold the lock when suicide happens, this is really confusing.

Please help us, thank you:-)



-----邮件原件-----
发件人: Sage Weil [mailto:sage@xxxxxxxxxxxx]
发送时间: 2017年1月10日 21:12
收件人: 许雪寒
抄送: ceph-devel@xxxxxxxxxxxxxxx
主题: Re: Pipe "deadlock" in Hammer, 0.94.5

On Tue, 10 Jan 2017, 许雪寒 wrote:
> Hi, everyone.
> 
> Recently, we did some experiment to test the stability of the ceph cluster. We used Hammer version which is the mostly used version of online cluster. One of the scenarios that we simulated is poor network connectivity, in which we used iptables to drop TCP/IP packet under some probability. And sometimes, we can see that some OSD suicide themselves. 
> 
> We used gdb to debug the core dumped by linux. We found that the thread that hit the suicide time threshold is a peering thread who is trying to send a pg_notify message, the ceph-osd log file and gdb output is as follows:
> 
> Log file:
>     -3> 2017-01-10 17:02:13.469949 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had timed out after 15
>     -2> 2017-01-10 17:02:13.469952 7fd446ff7700  1 heartbeat_map 
> is_healthy 'OSD::osd_tp thread 0x7fd440bed700' had suicide timed out 
> after 150
>     -1> 2017-01-10 17:02:13.469954 7fd4451f4700  1 --
> 10.160.132.157:6818/10014122 <== osd.20 10.160.132.156:0/24908 163 
> ==== osd_ping(ping e4030 stamp 2017-01-10 17:02:13.450374) v2 ====
> 47+0+0 (3247646131 0 0) 0x7fd418ca8600 con 0x7fd413c89700
>      0> 2017-01-10 17:02:13.496895 7fd446ff7700 -1 error_msg
> common/HeartbeatMap.cc: In function 'bool 
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
> time_t)' thread 7fd446ff7700 time 2017-01-10 17:02:13.469969
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
> 
> GDB OUTPUT:
> (gdb) thread 8
> [Switching to thread 8 (Thread 0x7fd440bed700 (LWP 15302))]#0
> 0x0000003c5d80e334 in __lll_lock_wait () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80e334 in __lll_lock_wait () from
> /lib64/libpthread.so.0
> #1  0x0000003c5d8095d8 in _L_lock_854 () from /lib64/libpthread.so.0
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
> /lib64/libpthread.so.0
> #3  0x0000000001a54ae4 in Mutex::Lock (this=0x7fd426453598,
> no_lockdep=false) at common/Mutex.cc:96
> #4  0x0000000001409285 in Mutex::Locker::Locker (this=0x7fd440beb6c0,
> m=...) at common/Mutex.h:115
> #5  0x0000000001c46446 in PipeConnection::try_get_pipe 
> (this=0x7fd426453580, p=0x7fd440beb908) at
> msg/simple/PipeConnection.cc:38
> #6  0x0000000001c05809 in SimpleMessenger::submit_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580, 
> dest_addr=..., dest_type=4, already_locked=false) at
> msg/simple/SimpleMessenger.cc:443
> #7  0x0000000001c033fa in SimpleMessenger::_send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.cc:136
> #8  0x0000000001c467c7 in SimpleMessenger::send_message 
> (this=0x7fd482029400, m=0x7fd425538d00, con=0x7fd426453580) at
> msg/simple/SimpleMessenger.h:139
> #9  0x0000000001c466a1 in PipeConnection::send_message 
> (this=0x7fd426453580, m=0x7fd425538d00) at
> msg/simple/PipeConnection.cc:78
> #10 0x00000000013b3ff2 in OSDService::send_map (this=0x7fd4821e76c8, 
> m=0x7fd425538d00, con=0x7fd426453580) at osd/OSD.cc:1054
> #11 0x00000000013b45e7 in OSDService::send_incremental_map 
> (this=0x7fd4821e76c8, since=4028, con=0x7fd426453580, 
> osdmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
> osd/OSD.cc:1087
> #12 0x00000000013b215f in OSDService::share_map_peer 
> (this=0x7fd4821e76c8, peer=9, con=0x7fd426453580, 
> map=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at osd/OSD.cc:887
> #13 0x00000000013f43cc in OSD::do_notifies (this=0x7fd4821e6000, 
> notify_list=std::map with 7 elements = {...}, 
> curmap=std::tr1::shared_ptr (count 49) 0x7fd426c0f480) at
> osd/OSD.cc:7246
> #14 0x00000000013f3c99 in OSD::dispatch_context (this=0x7fd4821e6000, 
> ctx=..., pg=0x0, curmap=std::tr1::shared_ptr (count 49) 
> 0x7fd426c0f480, handle=0x7fd440becb40) at osd/OSD.cc:7198
> #15 0x000000000140043e in OSD::process_peering_events 
> (this=0x7fd4821e6000, pgs=std::list = {...}, handle=...) at
> osd/OSD.cc:8539
> #16 0x000000000141e094 in OSD::PeeringWQ::_process 
> (this=0x7fd4821e7070, pgs=std::list = {...}, handle=...) at
> osd/OSD.h:1601
> #17 0x00000000014b94bf in
> ThreadPool::BatchWorkQueue<PG>::_void_process (this=0x7fd4821e7070, 
> p=0x7fd425419040, handle=...) at common/WorkQueue.h:107
> #18 0x0000000001b2d2e8 in ThreadPool::worker (this=0x7fd4821e64b0,
> wt=0x7fd4761db430) at common/WorkQueue.cc:128
> #19 0x0000000001b313f7 in ThreadPool::WorkThread::entry
> (this=0x7fd4761db430) at common/WorkQueue.h:318
> #20 0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd4761db430) 
> at common/Thread.cc:61
> #21 0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd4761db430) at
> common/Thread.cc:45
> #22 0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #23 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> And the thread that is holding the mutex lock which this thread is trying to get is a pipe reader_thread who is trying to read a full message that sent from another OSD:
> 
> (gdb) frame 2
> #2  0x0000003c5d8094a7 in pthread_mutex_lock () from
> /lib64/libpthread.so.0
> (gdb) info reg
> rax            0xfffffffffffffe00            -512 rbx
> 0x7fd425538d00             140549136026880 rcx 0xffffffffffffffff               
> -1 rdx            0x10    16 rsi
> 0x80      128 rdi            0x7fd4264535a8
> 140549151864232 rbp            0x7fd440beb680
> 0x7fd440beb680 rsp            0x7fd440beb648            0x7fd440beb648
> r8             0x7fd4264535a8               140549151864232
> r9             0x3bc6 15302
> r10            0x1       1
> r11            0x246  582
> r12            0x4       4
> r13            0x7fd440bed9c0             140549596043712
> r14            0x0       0
> r15            0x3       3
> rip            0x3c5d8094a7   0x3c5d8094a7 <pthread_mutex_lock+55> 
> eflags         0x246               [ PF ZF IF ] cs             0x33
> 51 ss             0x2b     43 ds             0x0       0 es
> 0x0       0 fs             0x0        0 gs             0x0        0
> (gdb) p *(pthread_mutex_t*)0x7fd4264535a8
> $4 = {__data = {__lock = 2, __count = 0, __owner = 5008, __nusers = 1, 
> __kind = 2, __spins = 0, __list = {__prev = 0x0, __next = 0x0}}, 
> __size = 
> "\002\000\000\000\000\000\000\000\220\023\000\000\001\000\000\000\002"
> , '\000' <repeats 22 times>, __align = 2}
> (gdb) thread 50
> [Switching to thread 50 (Thread 0x7fd4001f1700 (LWP 5008))]#0 
> 0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> (gdb) bt
> #0  0x0000003c5d80ec2c in recv () from /lib64/libpthread.so.0
> #1  0x0000000001c3a6f8 in Pipe::do_recv (this=0x7fd426561000,
> buf=0x7fd42649c000 "\a\220\006", len=4096, flags=64) at
> msg/simple/Pipe.cc:2428
> #2  0x0000000001c3aa4d in Pipe::buffered_recv (this=0x7fd426561000,
> buf=0x7fd4224c01d3 "", len=4096, flags=64) at msg/simple/Pipe.cc:2474
> #3  0x0000000001c3ab0d in Pipe::tcp_read_nonblocking 
> (this=0x7fd426561000, buf=0x7fd4224c0000 "\t\to\030", len=4096) at
> msg/simple/Pipe.cc:2492

It's supposed to be a non-blocking read, so in principle this shouldn't block and hold the lock.

I suspect what you're seeing is a variation of #14120, fixed by 63e44e32974c9bae17bb1bfd4261dcb024ad845c.

How reproducible is this?  We can push a build that includes the fix, but it would be nice to have some confirmation that it is the right one since so far this has been a very rare case.

Thanks!
sage


> #4  0x0000000001c37a25 in Pipe::read_message (this=0x7fd426561000, 
> pm=0x7fd4001f0b80, auth_handler=0x7fd422481080) at
> msg/simple/Pipe.cc:2032
> #5  0x0000000001c33146 in Pipe::reader (this=0x7fd426561000) at
> msg/simple/Pipe.cc:1581
> #6  0x0000000001c3b0c8 in Pipe::Reader::entry (this=0x7fd426561018) at
> msg/simple/Pipe.h:50
> #7  0x0000000001b33d40 in Thread::entry_wrapper (this=0x7fd426561018) 
> at common/Thread.cc:61
> #8  0x0000000001b33cb2 in Thread::_entry_func (arg=0x7fd426561018) at
> common/Thread.cc:45
> #9  0x0000003c5d807aa1 in start_thread () from /lib64/libpthread.so.0
> #10 0x0000003c5d4e8aad in clone () from /lib64/libc.so.6
> 
> Because we are dropping a large amount of ip packets, so we think the “recv” call is blocked for a long time, which led to the suicide timeout.
> 
> We read the source code, and found that the lock that directly made the thread hit suicide timeout is Connection::lock. Why is this lock used in both the submit_message and read_message?  Is it supposed to be this way?
> 
> Please help me, thank you☺
> 
> 
> N?????r??y??????X??ǧv???)޺{.n?????z?]z????ay?ʇڙ??j ??f???h??????w???
???j:+v???w???????? ????zZ+???????j"????i
��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f




[Index of Archives]     [CEPH Users]     [Ceph Large]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux