Re: Another segfault on client side (only sporadic)

Amar,

I'll check it tonight. But since the error is only sporadic, it is hard
to say whether it is gone or not.

Thank you for your work!

Bernhard

2007/8/29, Amar S. Tumballi <amar@xxxxxxxxxxxxx>:
> Hi Bernhard, Krishna,
>  There were three issues which caused all these segfaults in afr. One of
> them was in the fuse-bridge code, where inode handling was a problem; the
> other two were in unify. All these problems should be fixed in patch-469.
>
> Bernhard, can you check with the latest tla and confirm all these bugs are
> fixed?
>
> -amar
>
>
> > On 8/24/07, Bernhard J. M. Grün <bernhard.gruen@xxxxxxxxxxxxxx> wrote:
> > Hi Krishna,
> >
> > here is the information you requested. The following is from the
> > first mail:
> > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
> >     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> > 2602    afr.c: No such file or directory.
> >         in afr.c
> > (gdb) p *loc
> > $1 = {path = 0x2aaaaf21f000 "/imagecache/galerie/4197/thumbnail/419775.jpg",
> >   ino = 1744179, inode = 0x2aaab237c360}
> > (gdb) p *loc->inode
> > $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 2,
> >   ino = 1744179, st_mode = 33188, fds = {next = 0x2aaab237c38c,
> >     prev = 0x2aaab237c38c}, ctx = 0x0, dentry = {inode_list = {
> >       next = 0x2aaab237c3a4, prev = 0x2aaab237c3a4}, name_hash = {
> >       next = 0x2aaab93809a4, prev = 0x2aaac4483bf4}, inode = 0x2aaab237c360,
> >     name = 0x2aaab1533820 "419775.jpg", parent = 0x2aaab4dc0c90},
> >   inode_hash = {next = 0x2aaab65afb5c, prev = 0x2aaaac1a4fdc}, list = {
> >     next = 0x2aaabada476c, prev = 0x60c5f0}}
> >
> >
> > Now here is the information from the two crashes of the later mail.
> > The first crash:
> > Program terminated with signal 11, Segmentation fault.
> > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80,
> >     this=<value optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> > 2602    afr.c: No such file or directory.
> >         in afr.c
> > (gdb) p *loc
> > $1 = {path = 0x2aaabf51f790 "/imagecache/galerie/4482/thumbnail/448221.jpg",
> >   ino = 1879050, inode = 0x2aab132f86b0}
> > (gdb) p *loc->inode
> > $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 2,
> >   ino = 1879050, st_mode = 33188, fds = {next = 0x2aab132f86dc,
> >     prev = 0x2aab132f86dc}, ctx = 0x0, dentry = {inode_list = {
> >       next = 0x2aab132f86f4, prev = 0x2aab132f86f4}, name_hash = {
> >       next = 0x2aaacb980464, prev = 0x2aaaab567dd0}, inode = 0x2aab132f86b0,
> >     name = 0x2aab04c6b030 "448221.jpg", parent = 0x2aaabfe97300},
> >   inode_hash = {next = 0x2aaaaca36d0c, prev = 0x2aaaab523fe0}, list = {
> >     next = 0x2aaad4ba01dc, prev = 0x60c5f0}}
> >
> > The second crash:
> > Program terminated with signal 11, Segmentation fault.
> > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830,
> >     this=<value optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> > 2602    afr.c: No such file or directory.
> >         in afr.c
> > (gdb) p *loc
> > $1 = {path = 0x2aab1208aa00 "", ino = 39422758, inode = 0x2aaae69be260}
> > (gdb) p *loc->inode
> > $2 = {lock = 1, table = 0x60c590, nlookup = 1, generation = 0, ref = 3,
> >   ino = 39422758, st_mode = 16877, fds = {next = 0x2aaae69be28c,
> >     prev = 0x2aaae69be28c}, ctx = 0x0, dentry = {inode_list = {
> >       next = 0x2aaae69be2a4, prev = 0x2aaae69be2a4}, name_hash = {
> >       next = 0x2aaae69be2b4, prev = 0x2aaae69be2b4}, inode = 0x2aaae69be260,
> >     name = 0x0, parent = 0x0}, inode_hash = {next = 0x15a5c7c,
> >     prev = 0x2aaaab51a130}, list = {next = 0x2aaac0fa2fac,
> >     prev = 0x2aaab7d5ad3c}}
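> >
> > One pattern I notice across all three dumps: ctx is 0x0 every time, and
> > in this last one the path is empty and the dentry's name and parent are
> > NULL, as if the file was unlinked between the lookup and the stat. I
> > don't have the afr.c source at hand, so this is only a guess, but if
> > line 2602 reads per-inode state out of loc->inode->ctx without checking
> > it, that alone would explain the segfault. A purely hypothetical sketch
> > of the failure mode (afr_stat_guarded and its logic are made up for
> > illustration, not taken from the real afr.c):
> >
> > /* Hypothetical guard, NOT the real afr.c code. It only illustrates
> >  * what the cores suggest: loc->inode->ctx is NULL, so any lookup in
> >  * it crashes. A guard like this would turn the crash into an error. */
> > static int32_t
> > afr_stat_guarded (call_frame_t *frame, xlator_t *this, loc_t *loc)
> > {
> >   if (loc == NULL || loc->inode == NULL || loc->inode->ctx == NULL) {
> >     /* ctx == 0x0 is exactly what gdb shows in all three cores */
> >     STACK_UNWIND (frame, -1, EINVAL, NULL);
> >     return 0;
> >   }
> >   /* ... the normal fan-out to the afr children would follow here ... */
> >   return 0;
> > }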
> >
> > We can also meet in a chat if you like. I think this would speed up
> > debugging. Just give me a time frame for when and where we can meet.
> >
> > Bernhard
> >
> > 2007/8/24, Krishna Srinivas <krishna@xxxxxxxxxxxxx>:
> > > Bernhard,
> > >
> > > Can you do "p *loc" and "p *loc->inode"?
> > >
> > > Thanks
> > > Krishna
> > >
> > > On 8/24/07, Bernhard J. M. Grün <bernhard.gruen@xxxxxxxxxxxxxx> wrote:
> > > > Hi Krishna,
> > > >
> > > > Unfortunately I can't give you access to our production systems. At
> > > > least not at the moment.
> > > > What I can do is give you the compiled version of glusterfs, details
> > > > of the system (Ubuntu 7.04 x86-64) and the core dumps.
> > > >
> > > > But I have two new back traces for you. They are from the second
> > > > glusterfs client, but the binaries of both clients are the same:
> > > > First back trace:
> > > > Core was generated by `[glusterfs]'.
> > > > Program terminated with signal 11, Segmentation fault.
> > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80, this=<value
> > > > optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> > > > 2602    afr.c: No such file or directory.
> > > >         in afr.c
> > > > (gdb) bt
> > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaab0831c80, this=<value
> > > > optimized out>, loc=0x2aaaed03abf8) at afr.c:2602
> > > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aaab10bfd50,
> > > > this=0x6126d0, loc=0x2aaaed03abf8) at io-threads.c:651
> > > > #2  0x00002ab5c53e1382 in default_stat (frame=0x2aaae2a881a0,
> > > > this=0x612fe0, loc=0x2aaaed03abf8) at defaults.c:112
> > > > #3  0x00002aaaab2db252 in wb_stat (frame=0x2aaac90c5420,
> > > > this=0x613930, loc=0x2aaaed03abf8) at write-behind.c:236
> > > > #4  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > > > ino=<value optimized out>, fi=<value optimized out>) at
> > > > fuse-bridge.c:496
> > > > #5  0x0000000000407139 in fuse_transport_notify (xl=<value optimized
> > > > out>, event=<value optimized out>, data=<value optimized out>) at
> > > > fuse-bridge.c:2067
> > > > #6  0x00002ab5c53e3632 in sys_epoll_iteration (ctx=<value optimized
> > > > out>) at epoll.c:53
> > > > #7  0x000000000040356b in main (argc=5, argv=0x7fffe58f3348) at glusterfs.c:387
> > > >
> > > > Second back trace:
> > > > Program terminated with signal 11, Segmentation fault.
> > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830, this=<value
> > > > optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> > > > 2602    afr.c: No such file or directory.
> > > >         in afr.c
> > > > (gdb) bt
> > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aab10d94830, this=<value
> > > > optimized out>, loc=0x2aab1105bd68) at afr.c:2602
> > > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aab11aee060,
> > > > this=0x6126d0, loc=0x2aab1105bd68) at io-threads.c:651
> > > > #2  0x00002b56305f4382 in default_stat (frame=0x2aab15cb30a0,
> > > > this=0x612fe0, loc=0x2aab1105bd68) at defaults.c:112
> > > > #3  0x00002aaaab2db252 in wb_stat (frame=0x2aab1807e2e0,
> > > > this=0x613930, loc=0x2aab1105bd68) at write-behind.c:236
> > > > #4  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > > > ino=<value optimized out>, fi=<value optimized out>) at
> > > > fuse-bridge.c:496
> > > > #5  0x0000000000407139 in fuse_transport_notify (xl=<value optimized
> > > > out>, event=<value optimized out>, data=<value optimized out>) at
> > > > fuse-bridge.c:2067
> > > > #6  0x00002b56305f6632 in sys_epoll_iteration (ctx=<value optimized
> > > > out>) at epoll.c:53
> > > > #7  0x000000000040356b in main (argc=5, argv=0x7fff7a6de138) at glusterfs.c:387
> > > >
> > > > It seems the error is the same in all three cases.
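> > > >
> > > > The faulting address is also identical in every core
> > > > (0x00002aaaaacbc2bd), so it should be the very same instruction each
> > > > time. If it helps, I could disassemble it to see which pointer is
> > > > being dereferenced, e.g. (assuming gdb is run against the matching
> > > > binary and core):
> > > >
> > > > (gdb) x/i 0x00002aaaaacbc2bd
> > > > (gdb) info registers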
> > > >
> > > > Bernhard
> > > >
> > > 2007/8/22, Krishna Srinivas <krishna@xxxxxxxxxxxxx>:
> > > > > Hi Bernhard,
> > > > >
> > > > > We are not able to figure out the bug's cause. Is it possible for
> > > > > you to give us access to your machine for debugging the core?
> > > > >
> > > > > Thanks
> > > > > Krishna
> > > > >
> > > > > On 8/20/07, Bernhard J. M. Grün <bernhard.gruen@xxxxxxxxxxxxxx> wrote:
> > > > > > I still have the core dump of the crash I've reported. But I don't
> > > > > > know if the backtrace is the same every time. The glusterfs client
> > > > > > has now been running perfectly since 2007-08-16, so we have to wait
> > > > > > for the next crash to analyse that issue further.
> > > > > > Also, "print child_errno" does not output anything useful. It just
> > > > > > says that there is no symbol with that name in the current context.
> > > > > >
> > > > > > 2007/8/20, Krishna Srinivas <krishna@xxxxxxxxxxxxx>:
> > > > > > > Do you see the same backtrace every time it crashes?
> > > > > > > Can you do "print child_errno" at the gdb prompt when you have
> > > > > > > the core?
> > > > > > >
> > > > > > > Thanks
> > > > > > > Krishna
> > > > > > >
> > > > > > > On 8/20/07, Bernhard J. M. Grün <bernhard.gruen@xxxxxxxxxxxxxx> wrote:
> > > > > > > > Hi Krishna,
> > > > > > > >
> > > > > > > > One or both of our glusterfs clients with that version crash
> > > > > > > > every 3 to 5 days, I think. The problem is that there is a lot
> > > > > > > > of throughput (about 30MBit/s on each client, with about 99.5%
> > > > > > > > file reads and the rest file writes). This makes it hard to
> > > > > > > > debug.
> > > > > > > > We also have a core file from that crash (if I didn't delete it
> > > > > > > > because it was quite big); anyway, when the next crash occurs
> > > > > > > > I'll save the core dump for sure.
> > > > > > > > Do you have some idea how to work around that crash?
> > > > > > > > 2007/8/20, Krishna Srinivas <krishna@xxxxxxxxxxxxx>:
> > > > > > > > > Hi Bernhard,
> > > > > > > > >
> > > > > > > > > Sorry for the late response. We are not able to figure out
> > > > > > > > > the cause for this bug. Do you have the core file?
> > > > > > > > > Is the bug seen regularly?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Krishna
> > > > > > > > >
> > > > > > > > > On 8/16/07, Bernhard J. M. Grün <bernhard.gruen@xxxxxxxxxxxxxx> wrote:
> > > > > > > > > > Hello developers,
> > > > > > > > > >
> > > > > > > > > > We just discovered another segfault on the client side. At
> > > > > > > > > > the moment we can't give you more information than our
> > > > > > > > > > version number, a back trace and our client configuration.
> > > > > > > > > >
> > > > > > > > > > We use version 1.3.0 with patches up to patch-449.
> > > > > > > > > >
> > > > > > > > > > The back trace looks as follows:
> > > > > > > > > > Core was generated by `[glusterfs]'.
> > > > > > > > > > Program terminated with signal 11, Segmentation fault.
> > > > > > > > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
> > > > > > > > > >     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> > > > > > > > > > 2602    afr.c: No such file or directory.
> > > > > > > > > >         in afr.c
> > > > > > > > > > (gdb) bt
> > > > > > > > > > #0  0x00002aaaaacbc2bd in afr_stat (frame=0x2aaabce32cb0,
> > > > > > > > > >     this=<value optimized out>, loc=0x2aaaac0fe168) at afr.c:2602
> > > > > > > > > > #1  0x00002aaaaaece1bb in iot_stat (frame=0x2aaabcc00860, this=0x6126d0,
> > > > > > > > > >     loc=0x2aaaac0fe168) at io-threads.c:651
> > > > > > > > > > #2  0x00002aaaab0d2252 in wb_stat (frame=0x2aaaad05c5e0, this=0x612fe0,
> > > > > > > > > >     loc=0x2aaaac0fe168) at write-behind.c:236
> > > > > > > > > > #3  0x0000000000405fd2 in fuse_getattr (req=<value optimized out>,
> > > > > > > > > >     ino=<value optimized out>, fi=<value optimized out>) at fuse-bridge.c:496
> > > > > > > > > > #4  0x0000000000407139 in fuse_transport_notify (xl=<value optimized out>,
> > > > > > > > > >     event=<value optimized out>, data=<value optimized out>)
> > > > > > > > > >     at fuse-bridge.c:2067
> > > > > > > > > > #5  0x00002af562b6a632 in sys_epoll_iteration (ctx=<value optimized out>)
> > > > > > > > > >     at epoll.c:53
> > > > > > > > > > #6  0x000000000040356b in main (argc=9, argv=0x7fff48169b78) at glusterfs.c:387
> > > > > > > > > >
> > > > > > > > > > And here is our client configuration for that machine:
> > > > > > > > > > ### Add client feature and attach to remote subvolume
> > > > > > > > > > volume client1
> > > > > > > > > >   type protocol/client
> > > > > > > > > >   option transport-type tcp/client   # for TCP/IP transport
> > > > > > > > > >   option remote-host 10.1.1.13       # IP address of the remote brick
> > > > > > > > > >   option remote-port 9999            # default server port is 6996
> > > > > > > > > >   option remote-subvolume iothreads  # name of the remote volume
> > > > > > > > > > end-volume
> > > > > > > > > >
> > > > > > > > > > ### Add client feature and attach to remote subvolume
> > > > > > > > > > volume client2
> > > > > > > > > >   type protocol/client
> > > > > > > > > >   option transport-type tcp/client   # for TCP/IP transport
> > > > > > > > > >   option remote-host 10.1.1.14       # IP address of the remote brick
> > > > > > > > > >   option remote-port 9999            # default server port is 6996
> > > > > > > > > >   option remote-subvolume iothreads  # name of the remote volume
> > > > > > > > > > end-volume
> > > > > > > > > >
> > > > > > > > > > volume afrbricks
> > > > > > > > > >   type cluster/afr
> > > > > > > > > >   subvolumes client1 client2
> > > > > > > > > >   option replicate *:2
> > > > > > > > > >   option self-heal off
> > > > > > > > > > end-volume
> > > > > > > > > >
> > > > > > > > > > volume iothreads    # io-threads can give performance a boost
> > > > > > > > > >   type performance/io-threads
> > > > > > > > > >   option thread-count 16
> > > > > > > > > >   subvolumes afrbricks
> > > > > > > > > > end-volume
> > > > > > > > > >
> > > > > > > > > > ### Add writeback feature
> > > > > > > > > > volume bricks
> > > > > > > > > >   type performance/write-behind
> > > > > > > > > >   option aggregate-size 0  # unit in bytes
> > > > > > > > > >   subvolumes iothreads
> > > > > > > > > > end-volume
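> > > > > > > > > >
> > > > > > > > > > For reference, the frames in the back trace map onto this
> > > > > > > > > > stack from top to bottom (the fuse mount sits on the last
> > > > > > > > > > volume, "bricks"):
> > > > > > > > > >
> > > > > > > > > > fuse_getattr (fuse-bridge.c:496)
> > > > > > > > > >   -> wb_stat   (write-behind, volume "bricks")
> > > > > > > > > >   -> iot_stat  (io-threads,   volume "iothreads")
> > > > > > > > > >   -> afr_stat  (afr,          volume "afrbricks")  <- crash at afr.c:2602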
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > We hope you can easily find and fix that error. Thank you
> > > > > > > > > > in advance.
> > > > > > > > > >
> > > > > > > > > > Bernhard J. M. Grün
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > >
> > >
> >
> >
> >
>
>
>
> --
> Amar Tumballi
> Engineer - Gluster Core Team
> [bulde on #gluster/irc.gnu.org]
> http://www.zresearch.com - Commoditizing Supercomputing and Superstorage!



