Re: Client un-mounting since upgrade to 3.12.9-1 version

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Nithya

Thanks for the bug report. This new crash happened only once and only at one client in the last 6 days. I will let you know if it happened again or more frequently.

Cheers

Kashif   

On Wed, Jun 20, 2018 at 12:28 PM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi Mohammad,

This is a different crash. How often does it happen?


We have managed to reproduce the first crash you reported and a bug has been filed at [1].
We will work on a fix for this.


Regards,
Nithya



On 18 June 2018 at 14:09, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi 

Problem appeared again after few days. This time, the client is glusterfs-3.10.12-1.el6.x86_64 and performance.parallel-readdir is off. The log level was set to ERROR and I got this log at the time of crash

[2018-06-14 08:45:43.551384] E [rpc-clnt.c:365:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x7fac2e66ce03] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e7)[0x7fac2e434867] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7fac2e43497e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x7fac2e434a45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x7fac2e434d68] ))))) 0-atlasglust-client-4: forced unwinding frame type(GlusterFS 3.3) op(READDIRP(40)) called at 2018-06-14 08:45:43.483303 (xid=0x7553c7

Core dump was enabled on client so it created a dump. It is here


I used a gdb trace using this command

gdb /usr/sbin/glusterfs core.1002074 -ex bt -ex quit |& tee backtrace.log_18_16_1



I haven't used gdb much so let me know if you want me to run gdb in different manner.

Thanks

Kashif


On Mon, Jun 18, 2018 at 6:27 AM, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:


On Mon, Jun 18, 2018 at 9:39 AM, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:


On Mon, Jun 18, 2018 at 8:11 AM, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:
From the bt:

#8  0x00007f6ef977e6de in rda_readdirp (frame=0x7f6eec862320, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=357, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#9  0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#10 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862210, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#11 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#12 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec862100, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#13 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#14 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ff0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#15 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#16 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861ee0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#17 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#18 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861dd0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#19 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#20 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861cc0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#21 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388
#22 0x00007f6ef977e7d7 in rda_readdirp (frame=0x7f6eec861bb0, this=0x7f6ef4019f20, fd=0x7f6ed40077b0, size=140114606084288, off=2, xdata=0x7f6eec0085a0) at readdir-ahead.c:266
#23 0x00007f6ef952db4c in dht_readdirp_cbk (frame=<value optimized out>, cookie=0x7f6ef4019f20, this=0x7f6ef40218a0, op_ret=2, op_errno=0, orig_entries=<value optimized out>, xdata=0x7f6eec0085a0) at dht-common.c:5388

It looks like an infinite recursion. Note that readdirp is wound to the same subvol (value of "this" is same in all calls to rda_readdirp) at the same offset (of value 2). This may be a bug in DHT (winding down readdirp with wrong offset) or in readdir-ahead (populating incorrect offset values in dentries it returns as readdirp response).

It looks to be a corruption. Value of size argument in rda_readdirp is too big (around 127 TB) to be sane. If you've a reproducer, please run it in valgrind or ASAN.


I spoke too early. It could be a negative value and hence it may not be a corruption. Is it possible to upload the core somewhere? Or better still access to gdb session with this core would be more helpful.


To make it explicit, ATM its not clear that there is bug in readdir-ahead or DHT as it looks to be a memory corruption. Till I get a reproducer or valgrind/ASAN output of client process when the issue occcurs, I won't be working on this problem.



On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Milind

Thanks a lot, I manage to run gdb and produced traceback as well. Its here



I am trying to understand but still not able to make sense out of it.

Thanks

Kashif

On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Milind

There is no glusterfs-debuginfo available for gluster-3.12 from http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.12/ repo. Do you know from where I can get it?
Also when I run gdb, it says

Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64

I can't find debug package for glusterfs-fuse either

Thanks from the pit of despair ;)

Kashif


On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Milind

I will send you links for logs.

I collected these core dumps at client and there is no glusterd process running on client.

Kashif



On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
Kashif,
Could you also send over the client/mount log file as Vijay suggested ?
Or maybe the lines with the crash backtrace lines

Also, you've mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs


On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:


On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Milind

The operating system is Scientific Linux 6 which is based on RHEL6. The cpu arch is Intel x86_64.

I will send you a separate email with link to core dump.


You could also grep for crash in the client log file and the lines following crash would have a backtrace in most cases.

HTH,
Vijay
 

Thanks for your help.

Kashif


On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
Kashif,
Could you share the core dump via Google Drive or something similar

Also, let me know the CPU arch and OS Distribution on which you are running gluster.

If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb



On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Milind, Vijay

Thanks, I have some more information now as I straced glusterd on client

138544      0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
138544      0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544      0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544      0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544      0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
138551      0.105048 +++ killed by SIGSEGV (core dumped) +++
138550      0.000041 +++ killed by SIGSEGV (core dumped) +++
138547      0.000008 +++ killed by SIGSEGV (core dumped) +++
138546      0.000007 +++ killed by SIGSEGV (core dumped) +++
138545      0.000007 +++ killed by SIGSEGV (core dumped) +++
138544      0.000008 +++ killed by SIGSEGV (core dumped) +++
138543      0.000007 +++ killed by SIGSEGV (core dumped) +++

As for I understand that somehow gluster is trying to access memory in appropriate manner and kernel sends SIGSEGV

I also got the core dump. I am trying gdb first time so I am not sure whether I am using it correctly

gdb /usr/sbin/glusterfs core.138536

It just tell me that program terminated with signal 11, segmentation fault .

The problem is not limited to one client but happening to many clients.

I will really appreciate any help as whole file system has become unusable

Thanks

Kashif




On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
Kashif,
You can change the log level by:
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE

and see how things fare

If you want fewer logs you can change the log-level to DEBUG instead of TRACE.



On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Vijay

Now it is unmounting every 30 mins !

The server log at /var/log/glusterfs/bricks/glusteratlas-brics001-gv0.log have this line only

2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name> -2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0

There is no other information. Is there any way to increase log verbosity?

on the client

2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0


is there a problem with server and client 1k version?

Thanks for your help.

Kashif



 

On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:


On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi

Since I have updated our gluster server and client to latest version 3.12.9-1, I am having this issue of gluster getting unmounted from client very regularly. It was not a problem before update.

Its a distributed file system with no replication. We have seven servers totaling around 480TB data. Its 97% full.

I am using following config on server


gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4

clients are mounted with this option

defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev

I can't see anything in the log file. Can someone suggest that how to troubleshoot this issue?




Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to start troubleshooting the problem.

Thanks,
Vijay 


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users



--
Milind





--
Milind






--
Milind






--
Milind



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users





_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users


_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users

[Index of Archives]     [Gluster Development]     [Linux Filesytems Development]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux