Hi Mohammad,
I was unable to reproduce this on a volume created on a system running 3.12.9.
Can you send me the FUSE volfiles for the volume atlasglust? They will be in /var/lib/glusterd/vols/atlasglust/ on any of the gluster servers hosting the volume, and are named *.tcp-fuse.vol.
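For example, something like this on any one of the servers should collect them all (the tarball name is just a suggestion):

tar czf atlasglust-fuse-volfiles.tar.gz /var/lib/glusterd/vols/atlasglust/*.tcp-fuse.vol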
Thanks,
Nithya
On 14 June 2018 at 16:42, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Nithya,

It seems that the problem can be solved either by turning parallel-readdir off or by downgrading the client to 3.10.12-1. Yesterday I downgraded some clients to 3.10.12-1 and it seems to have fixed the problem. Today, when I saw your email, I turned parallel-readdir off and the current 3.12.9-1 client started to work.

I upgraded the servers and clients to 3.12.9-1 last month, and since then clients had been intermittently unmounting about once a week. But during the last three days it started unmounting every few minutes. I don't know what triggered this sudden panic, except that the file system was quite full: around 98%. It is a 480 TB file system with almost 80 million files.

The servers have 64GB RAM and the clients have 64GB to 192GB RAM. I tested with a 192GB RAM client and it still had the same issue.
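For reference, the switch is the same volume-set syntax as in my original mail below, e.g.:

gluster volume set atlasglust performance.parallel-readdir off

Here is the volume info you asked for: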
Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 7
Transport-type: tcp
Bricks:
Brick1: pplxgluster01.X.Y.Z:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.X.Y.Z:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.X.Y.Z:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.X.Y.Z:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.X.Y.Z:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.X.Y.Z:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.X.Y.Z:/glusteratlas/brick007/gv0
Options Reconfigured:
diagnostics.client-log-level: ERROR
diagnostics.brick-log-level: ERROR
performance.cache-invalidation: on
server.event-threads: 4
client.event-threads: 4
cluster.lookup-optimize: on
performance.client-io-threads: on
performance.cache-size: 1GB
performance.parallel-readdir: off
performance.md-cache-timeout: 600
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
auth.allow: X.Y.Z.*
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on

Thanks,
Kashif

On Thu, Jun 14, 2018 at 5:39 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:

+Poornima, who works on parallel-readdir.
@Poornima, have you seen anything like this before?

On 14 June 2018 at 10:07, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:

This is not the same issue as the one you are referring to - that was in the RPC layer and caused the bricks to crash. This one is different, as it seems to be in the dht and rda layers. It does look like a stack overflow though.

@Mohammad, please send the following information:
1. gluster volume info
2. The number of entries in the directory being listed
3. System memory

Does this still happen if you turn off parallel-readdir?

Regards,
Nithya

On 13 June 2018 at 16:40, Milind Changire <mchangir@xxxxxxxxxx> wrote:

+Nithya
Nithya,
Do these logs [1] look similar to the recursive readdir() issue that you encountered just a while back, i.e. the recursive readdir() response definition in the XDR?
--
Milind

On Wed, Jun 13, 2018 at 4:29 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Milind,
Thanks a lot, I managed to run gdb and produced a traceback as well; it's here. I am trying to understand it but am still not able to make sense of it.
Thanks,
Kashif

On Wed, Jun 13, 2018 at 11:34 AM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
--
Milind

On Wed, Jun 13, 2018 at 3:21 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Milind,
1. There is no glusterfs-debuginfo available for gluster-3.12 from the http://mirror.centos.org/centos/6/storage/x86_64/gluster-3.1 repo. Do you know where I can get it?
2. Also, when I run gdb it says:
Missing separate debuginfos, use: debuginfo-install glusterfs-fuse-3.12.9-1.el6.x86_64
I can't find a debug package for glusterfs-fuse either.
Thanks from the pit of despair ;)
Kashif

On Tue, Jun 12, 2018 at 5:01 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Milind,
I will send you links for the logs. I collected these core dumps on a client, and there is no glusterd process running on the client.
Kashif

On Tue, Jun 12, 2018 at 4:14 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

Kashif,
Could you also send over the client/mount log file as Vijay suggested? Or maybe just the lines around the crash backtrace.
Also, you've mentioned that you straced glusterd, but when you ran gdb, you ran it over /usr/sbin/glusterfs.
--
Milind

On Tue, Jun 12, 2018 at 8:19 PM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

On Tue, Jun 12, 2018 at 7:40 AM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Milind,
The operating system is Scientific Linux 6, which is based on RHEL 6. The CPU arch is Intel x86_64.
I will send you a separate email with a link to the core dump.

You could also grep for crash in the client log file; the lines following crash would have a backtrace in most cases.
HTH,
Vijay

Thanks for your help.
Kashif

On Tue, Jun 12, 2018 at 3:16 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

Kashif,
Could you share the core dump via Google Drive or something similar?
Also, let me know the CPU arch and OS distribution on which you are running gluster.
If you've installed the glusterfs-debuginfo package, you'll also get the source lines in the backtrace via gdb.
--
Milind

On Tue, Jun 12, 2018 at 5:59 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Milind, Vijay
Thanks, I have some more information now, as I straced glusterd on the client:

138544 0.000131 mprotect(0x7f2f70785000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000026>
138544 0.000128 mprotect(0x7f2f70786000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000126 mprotect(0x7f2f70787000, 4096, PROT_READ|PROT_WRITE) = 0 <0.000027>
138544 0.000124 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_ACCERR, si_addr=0x7f2f7c60ef88} ---
138544 0.000051 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=0} ---
138551 0.105048 +++ killed by SIGSEGV (core dumped) +++
138550 0.000041 +++ killed by SIGSEGV (core dumped) +++
138547 0.000008 +++ killed by SIGSEGV (core dumped) +++
138546 0.000007 +++ killed by SIGSEGV (core dumped) +++
138545 0.000007 +++ killed by SIGSEGV (core dumped) +++
138544 0.000008 +++ killed by SIGSEGV (core dumped) +++
138543 0.000007 +++ killed by SIGSEGV (core dumped) +++

As far as I understand, gluster is somehow trying to access memory in an inappropriate manner and the kernel sends SIGSEGV.
I also got the core dump. I am trying gdb for the first time, so I am not sure whether I am using it correctly:

gdb /usr/sbin/glusterfs core.138536

It just tells me that the program terminated with signal 11, segmentation fault.
The problem is not limited to one client but is happening on many clients.
I would really appreciate any help, as the whole file system has become unusable.
Thanks,
Kashif

On Tue, Jun 12, 2018 at 12:26 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

Kashif,
You can change the log level by:
$ gluster volume set <vol> diagnostics.brick-log-level TRACE
$ gluster volume set <vol> diagnostics.client-log-level TRACE
and see how things fare.
If you want fewer logs you can change the log level to DEBUG instead of TRACE.
--
Milind

On Tue, Jun 12, 2018 at 3:37 PM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi Vijay,
Now it is unmounting every 30 mins!
The server log at /var/log/glusterfs/bricks/glusteratlas-brick001-gv0.log has only this line:

[2018-06-12 09:53:19.303102] I [MSGID: 115013] [server-helpers.c:289:do_fd_cleanup] 0-atlasglust-server: fd cleanup on /atlas/atlasdata/zgubic/hmumu/histograms/v14.3/Signal
[2018-06-12 09:53:19.306190] I [MSGID: 101055] [client_t.c:443:gf_client_unref] 0-atlasglust-server: Shutting down connection <server-name>-2224879-2018/06/12-09:51:01:460889-atlasglust-client-0-0-0

There is no other information. Is there any way to increase log verbosity?

On the client:

[2018-06-12 09:51:01.744980] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-5: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-06-12 09:51:01.746508] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-5: Connected to atlasglust-client-5, attached to remote volume '/glusteratlas/brick006/gv0'.
[2018-06-12 09:51:01.746543] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.746814] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-5: Server lk version = 1
[2018-06-12 09:51:01.748449] I [MSGID: 114057] [client-handshake.c:1478:select_server_supported_programs] 0-atlasglust-client-6: Using Program GlusterFS 3.3, Num (1298437), Version (330)
[2018-06-12 09:51:01.750219] I [MSGID: 114046] [client-handshake.c:1231:client_setvolume_cbk] 0-atlasglust-client-6: Connected to atlasglust-client-6, attached to remote volume '/glusteratlas/brick007/gv0'.
[2018-06-12 09:51:01.750261] I [MSGID: 114047] [client-handshake.c:1242:client_setvolume_cbk] 0-atlasglust-client-6: Server and Client lk-version numbers are not same, reopening the fds
[2018-06-12 09:51:01.750503] I [MSGID: 114035] [client-handshake.c:202:client_set_lk_version_cbk] 0-atlasglust-client-6: Server lk version = 1
[2018-06-12 09:51:01.752207] I [fuse-bridge.c:4205:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.14
[2018-06-12 09:51:01.752261] I [fuse-bridge.c:4835:fuse_graph_sync] 0-fuse: switched to graph 0

Is there a problem with the server and client lk-version?
Thanks for your help.
Kashif

On Mon, Jun 11, 2018 at 11:52 PM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:

On Mon, Jun 11, 2018 at 8:50 AM, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:

Hi,
Since I updated our gluster servers and clients to the latest version, 3.12.9-1, gluster has been getting unmounted from clients very regularly. It was not a problem before the update.
It's a distributed file system with no replication. We have seven servers totalling around 480TB of data; it is 97% full.
I am using the following config on the server:

gluster volume set atlasglust features.cache-invalidation on
gluster volume set atlasglust features.cache-invalidation-timeout 600
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust performance.cache-invalidation on
gluster volume set atlasglust performance.md-cache-timeout 600
gluster volume set atlasglust performance.parallel-readdir on
gluster volume set atlasglust performance.cache-size 1GB
gluster volume set atlasglust performance.client-io-threads on
gluster volume set atlasglust cluster.lookup-optimize on
gluster volume set atlasglust performance.stat-prefetch on
gluster volume set atlasglust client.event-threads 4
gluster volume set atlasglust server.event-threads 4

Clients are mounted with these options:

defaults,direct-io-mode=disable,attribute-timeout=600,entry-timeout=600,negative-timeout=600,fopen-keep-cache,rw,_netdev

I can't see anything in the log file. Can someone suggest how to troubleshoot this issue?

Can you please share the log file? Checking for messages related to disconnections/crashes in the log file would be a good way to start troubleshooting the problem.
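For example, something like this (assuming the standard FUSE client log location; for a hypothetical mount at /mnt/atlas the log would be /var/log/glusterfs/mnt-atlas.log, so adjust for your mount point):

grep -iA20 crash /var/log/glusterfs/mnt-atlas.log

The lines following a crash message usually contain a backtrace.

Thanks,
Vijay
_______________________________________________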
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users