I created an 8x1x2 distributed-replicated volume, started Oracle with Direct NFS enabled, then fired up a load generator. It went quite well for a while, then suddenly crashed. Nothing fancy, just a lot of load.
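Something along these lines (volume name, hostnames and brick paths below are made-up placeholders, not the actual layout):

    # 16 bricks at replica 2 = 8 distribute subvolumes of 2 replicas each
    gluster volume create oradata replica 2 \
        node1:/bricks/b1 node2:/bricks/b1 \
        node1:/bricks/b2 node2:/bricks/b2 \
        node1:/bricks/b3 node2:/bricks/b3 \
        node1:/bricks/b4 node2:/bricks/b4 \
        node1:/bricks/b5 node2:/bricks/b5 \
        node1:/bricks/b6 node2:/bricks/b6 \
        node1:/bricks/b7 node2:/bricks/b7 \
        node1:/bricks/b8 node2:/bricks/b8
    gluster volume start oradata

Oracle's Direct NFS client then gets pointed at the volume's built-in NFS export (typically via oranfstab), and the load generator runs against that.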
M.

On 13-05-02 12:33 AM, Pranith Kumar Karampuri wrote:
> Michael,
>     Could you let us know the steps to re-create the issue?
>
> Pranith
>
> ----- Original Message -----
>> From: "Michael Brown" <michael@xxxxxxxxxxxx>
>> To: gluster-devel@xxxxxxxxxx
>> Sent: Wednesday, May 1, 2013 10:57:59 PM
>> Subject: v3.4.0a3+ NFS crashing out
>>
>> My gluster NFS daemon is crashing with the following:
>>
>> pending frames:
>> <<<25592 copies of>>>
>> frame : type(0) op(0)
>>
>> patchset: git://git.gluster.com/glusterfs.git
>> signal received: 11
>> time of crash: 2013-05-01 17:02:36
>> configuration details:
>> argp 1
>> backtrace 1
>> dlfcn 1
>> fdatasync 1
>> libpthread 1
>> llistxattr 1
>> setfsid 1
>> spinlock 1
>> epoll.h 1
>> xattr.h 1
>> st_atim.tv_nsec 1
>> package-string: glusterfs 3.4git
>> /usr/local/glusterfs/sbin/glusterfs(glusterfsd_print_trace+0x1f)[0x407bd5]
>> /lib64/libc.so.6[0x3c48c32920]
>> /lib64/libc.so.6[0x3c48c7870a]
>> /usr/local/glusterfs/lib/libglusterfs.so.0(__gf_free+0x61)[0x7f80421665a9]
>> /usr/local/glusterfs/lib/libglusterfs.so.0(mem_put+0x212)[0x7f8042166fd8]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_writev_done+0xca)[0x7f803d8cf9ec]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(+0x58d7f)[0x7f803d900d7f]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(+0x58f09)[0x7f803d900f09]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(+0x59214)[0x7f803d901214]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_unlock+0x57)[0x7f803d905aeb]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_changelog_post_op_cbk+0x10a)[0x7f803d8dd01f]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_changelog_post_op_now+0x8c7)[0x7f803d8ddebf]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_delayed_changelog_post_op+0x16e)[0x7f803d8e1f36]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_changelog_post_op+0x59)[0x7f803d8e1f99]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_transaction_resume+0x87)[0x7f803d8e205e]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/cluster/replicate.so(afr_writev_wind_cbk+0x348)[0x7f803d8cf468]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/xlator/protocol/client.so(client3_3_writev_cbk+0x490)[0x7f803db53397]
>> /usr/local/glusterfs/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x1b5)[0x7f8041f14759]
>> /usr/local/glusterfs/lib/libgfrpc.so.0(rpc_clnt_notify+0x2d3)[0x7f8041f14af0]
>> /usr/local/glusterfs/lib/libgfrpc.so.0(rpc_transport_notify+0x110)[0x7f8041f1118c]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/rpc-transport/socket.so(socket_event_poll_in+0x54)[0x7f803e9a40a9]
>> /usr/local/glusterfs/lib/glusterfs/3.4git/rpc-transport/socket.so(socket_event_handler+0x1c4)[0x7f803e9a4558]
>> /usr/local/glusterfs/lib/libglusterfs.so.0(+0x72441)[0x7f8042190441]
>> /usr/local/glusterfs/lib/libglusterfs.so.0(+0x72630)[0x7f8042190630]
>> /usr/local/glusterfs/lib/libglusterfs.so.0(event_dispatch+0x6c)[0x7f8042165af3]
>> /usr/local/glusterfs/sbin/glusterfs(main+0x2c7)[0x408503]
>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x3c48c1ecdd]
>> /usr/local/glusterfs/sbin/glusterfs[0x404649]
>> ---------
>>
>> It rather looks like the NFS code isn't freeing up NULL frames from the
>> frame stack (if those words are right :D) when it's done replying to them.
>>
>> Yes, Oracle does send quite a few. Up until that point, it was behaving
>> REALLY well :)
>>
>> M.
>>
>> --
>> Michael Brown            | `One of the main causes of the fall of
>> Systems Consultant       | the Roman Empire was that, lacking zero,
>> Net Direct Inc.          | they had no way to indicate successful
>> ☎: +1 519 883 1172 x5106 | termination of their C programs.' - Firth
>>
>>
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxx
>> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>>

--
Michael Brown            | `One of the main causes of the fall of
Systems Consultant       | the Roman Empire was that, lacking zero,
Net Direct Inc.          | they had no way to indicate successful
☎: +1 519 883 1172 x5106 | termination of their C programs.' - Firth
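[Aside: the pending-frame count summarized as "<<<25592 copies of>>>" in the quoted dump can be reproduced straight from the crash dump in the NFS server's log, e.g.:

    # log path is a guess and depends on the install prefix; adjust to taste
    grep -c 'frame : type(0) op(0)' /var/log/glusterfs/nfs.log

A count that large for type(0) op(0) frames is what points at replies completing without their frames being cleaned up.]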