----- Original Message -----
> From: "GuangYang" <yguang11@xxxxxxxxxxx>
> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx, ceph-users@xxxxxxxxxxxxxx
> Sent: Wednesday, June 24, 2015 2:12:23 PM
> Subject: RE: radosgw crash within libfcgi
>
> ----------------------------------------
> > Date: Wed, 24 Jun 2015 17:04:05 -0400
> > From: yehuda@xxxxxxxxxx
> > To: yguang11@xxxxxxxxxxx
> > CC: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: radosgw crash within libfcgi
> >
> >
> >
> > ----- Original Message -----
> >> From: "GuangYang" <yguang11@xxxxxxxxxxx>
> >> To: "Yehuda Sadeh-Weinraub" <yehuda@xxxxxxxxxx>
> >> Cc: ceph-devel@xxxxxxxxxxxxxxx, ceph-users@xxxxxxxxxxxxxx
> >> Sent: Wednesday, June 24, 2015 1:53:20 PM
> >> Subject: RE: radosgw crash within libfcgi
> >>
> >> Thanks Yehuda for the response.
> >>
> >> We already patched libfcgi to use poll instead of select to overcome the
> >> limitation.
> >>
> >> Thanks,
> >> Guang
> >>
> >>
> >> ----------------------------------------
> >>> Date: Wed, 24 Jun 2015 14:40:25 -0400
> >>> From: yehuda@xxxxxxxxxx
> >>> To: yguang11@xxxxxxxxxxx
> >>> CC: ceph-devel@xxxxxxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
> >>> Subject: Re: radosgw crash within libfcgi
> >>>
> >>>
> >>>
> >>> ----- Original Message -----
> >>>> From: "GuangYang" <yguang11@xxxxxxxxxxx>
> >>>> To: ceph-devel@xxxxxxxxxxxxxxx, ceph-users@xxxxxxxxxxxxxx,
> >>>> yehuda@xxxxxxxxxx
> >>>> Sent: Wednesday, June 24, 2015 10:09:58 AM
> >>>> Subject: radosgw crash within libfcgi
> >>>>
> >>>> Hello Cephers,
> >>>> Recently we have had several radosgw daemon crashes, all with the
> >>>> following kernel log:
> >>>>
> >>>> Jun 23 14:17:38 xxx kernel: radosgw[68180]: segfault at f0 ip
> >>>> 00007ffa069996f2 sp 00007ff55c432710 error 6 in
> >
> > error 6 is sigabrt, right? With invalid pointer I'd expect to get segfault.
> > Is the pointer actually invalid?
> Using (ip - {load_address_of_the_shared_library}) to find the instruction
> which caused this crash, the objdump shows the crash happened at
> instruction 46f2 (see below), which assigns -1 to FCGX_Request::ipcFd, but
> I don't quite understand how/why it could crash there.
>
> 0000000000004690 <FCGX_Free>:
>     4690: 48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
>     4695: 48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
>     469a: 48 83 ec 18             sub    $0x18,%rsp
>     469e: 48 85 ff                test   %rdi,%rdi
>     46a1: 48 89 fb                mov    %rdi,%rbx
>     46a4: 89 f5                   mov    %esi,%ebp
>     46a6: 74 28                   je     46d0 <FCGX_Free+0x40>
>     46a8: 48 8d 7f 08             lea    0x8(%rdi),%rdi
>     46ac: e8 67 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46b1: 48 8d 7b 10             lea    0x10(%rbx),%rdi
>     46b5: e8 5e e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46ba: 48 8d 7b 18             lea    0x18(%rbx),%rdi
>     46be: e8 55 e3 ff ff          callq  2a18 <FCGX_FreeStream@plt>
>     46c3: 48 8d 7b 28             lea    0x28(%rbx),%rdi
>     46c7: e8 d4 f4 ff ff          callq  3ba0 <FCGX_PutS+0x40>
>     46cc: 85 ed                   test   %ebp,%ebp
>     46ce: 75 10                   jne    46e0 <FCGX_Free+0x50>
>     46d0: 48 8b 5c 24 08          mov    0x8(%rsp),%rbx
>     46d5: 48 8b 6c 24 10          mov    0x10(%rsp),%rbp
>     46da: 48 83 c4 18             add    $0x18,%rsp
>     46de: c3                      retq
>     46df: 90                      nop
>     46e0: 31 f6                   xor    %esi,%esi
>     46e2: 83 7b 4c 00             cmpl   $0x0,0x4c(%rbx)
>     46e6: 8b 7b 30                mov    0x30(%rbx),%edi
>     46e9: 40 0f 94 c6             sete   %sil
>     46ed: e8 86 e6 ff ff          callq  2d78 <OS_IpcClose@plt>
>     46f2: c7 43 30 ff ff ff ff    movl   $0xffffffff,0x30(%rbx)

info registers? Not too familiar with the specific message, but it could be
that OS_IpcClose() aborts (not highly unlikely) and it only dumps the return
address of the current function (shouldn't be referenced as ip though).

What's rbx? Is the memory at %rbx + 0x30 valid?

Also, did you by any chance upgrade the binaries while the code was running?
Is the code running over NFS?

Yehuda

> >
> > Yehuda
> >
> >
> >>>> libfcgi.so.0.0.0[7ffa06995000+a000] in
> >>>> libfcgi.so.0.0.0[7ffa06995000+a000]
> >>>>
> >>>> Looking at the assembly, it seems to be crashing at this point -
> >>>> http://github.com/sknown/fcgi/blob/master/libfcgi/fcgiapp.c#L2035, which
> >>>> confused me. I tried to see if there is any other reference holding the
> >>>> FCGX_Request which releases the handle, without any luck.
> >>>>
> >>>> There are also other observations:
> >>>> 1> Several radosgw daemons across different hosts crashed around the
> >>>> same time.
> >>>> 2> Apache's error log has some fcgi errors complaining about
> >>>> ##idle timeout## during that time.
> >>>>
> >>>> Has anyone experienced a similar issue?
> >>>>
> >>>
> >>> In the past we've had issues with libfcgi that were related to the number
> >>> of open fds on the process (> 1024). The issue was a buggy libfcgi that
> >>> was using select() instead of poll(), so this might be the issue you're
> >>> noticing.
> >>>
> >>> Yehuda
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >>> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> _______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
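
For anyone reading this thread later: the (ip - load address) step and the "error 6" field can be checked with a few lines of Python. This is only a sketch. The function name parse_segfault and the dictionary keys are made up for illustration, the log line is the one quoted above joined onto a single line, and the bit meanings assume the usual x86 page-fault error code layout (bit 0: page was present, bit 1: write access, bit 2: user mode).

```python
import re

# Kernel log line from the original report, joined onto one line.
LOG = ("radosgw[68180]: segfault at f0 ip 00007ffa069996f2 "
       "sp 00007ff55c432710 error 6 in libfcgi.so.0.0.0[7ffa06995000+a000]")

# All numeric fields in this message are printed in hex.
PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) in "
    r"(?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Hypothetical helper: split a kernel segfault line into its fields."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "fault_addr": int(m["addr"], 16),
        "lib": m["lib"],
        # Offset inside the shared library, comparable to objdump addresses.
        "offset": ip - base,
        # Assumed x86 page-fault error code bits.
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = parse_segfault(LOG)
print(hex(info["offset"]), info["write"], info["user_mode"], info["page_present"])
```

Under those assumptions the offset comes out as 0x46f2, matching the objdump listing, and error 6 reads as a user-mode write to a page that was not mapped, rather than a signal number. If the movl at 46f2 really is the faulting instruction, a fault address of 0xf0 would put %rbx at 0xc0 at that point, which bears on the "What's rbx?" question.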