Hi,
I'm running gluster 3.7.12. It's an 8-node distributed, replicated cluster (replica 2). It had been working fine for a long time when, all of a sudden, I started seeing bricks go offline. Researching further, I found messages like this:
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: pending frames:
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: frame : type(0) op(5)
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: patchset: git://git.gluster.com/glusterfs.git
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: signal received: 6
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: time of crash:
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: 2017-03-10 05:02:12
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: configuration details:
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: argp 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: backtrace 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: dlfcn 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: libpthread 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: llistxattr 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: setfsid 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: spinlock 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: epoll.h 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: xattr.h 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: st_atim.tv_nsec 1
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: package-string: glusterfs 3.7.12
Mar 10 00:02:12 HOSTNAME data-ftp_gluster_brick[23769]: ---------
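When a brick crashes like this it shows up as offline in the volume status. I've been checking with something like the following (volume name taken from the brick logs; exact output columns may differ by version):

```shell
# Bricks whose process has died show "N" in the Online column and no PID
gluster volume status ftp_volume

# The brick daemon can also be confirmed dead on the affected node
ps aux | grep glusterfsd
```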
I initially thought it was related to quota support (based on some googling), so I turned off quota and also disabled NFS support to simplify the debugging. After each crash I restarted gluster and the bricks would come back online for several hours, only to crash again later. There are lots of messages like this preceding the crash:
...
[2017-03-10 04:40:46.002225] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
[2017-03-10 04:40:46.002278] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:40:46.002225] and [2017-03-10 04:40:46.005699]
The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:40:46.002278] and [2017-03-10 04:40:46.005701]
[2017-03-10 04:50:47.002170] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
[2017-03-10 04:50:47.002219] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 04:50:47.002170] and [2017-03-10 04:50:47.005623]
The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 04:50:47.002219] and [2017-03-10 04:50:47.005625]
[2017-03-10 05:00:48.002246] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)
[2017-03-10 05:00:48.002314] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]
The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ftp_volume-posix: null gfid for path (null)" repeated 3 times between [2017-03-10 05:00:48.002246] and [2017-03-10 05:00:48.005828]
The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ftp_volume-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2017-03-10 05:00:48.002314] and [2017-03-10 05:00:48.005830]
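For reference, the commands I used to turn off quota and the built-in NFS server were along these lines (volume name from the logs above):

```shell
# Disable quota on the volume (some search results suggested the crash
# might be quota-related)
gluster volume quota ftp_volume disable

# Disable the built-in gluster NFS server to simplify debugging
gluster volume set ftp_volume nfs.disable on
```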
One important detail I noticed yesterday is that one of the nodes was running gluster version 3.7.13! I'm not sure what performed that upgrade. I downgraded the node to 3.7.12 and restarted gluster, but the crash above still happened several hours later. Then again, the crashes had already been happening before the downgrade -- possibly because of the version mismatch on that one node.
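To check for further version mismatches I queried the installed version on each node, roughly like this (hostnames are placeholders for my actual node names):

```shell
# Print the glusterfs version reported on every node in the cluster
for h in node1 node2 node3 node4 node5 node6 node7 node8; do
    echo -n "$h: "
    ssh "$h" glusterfs --version | head -1
done
```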
Anybody have any ideas?
Thanks!
Sergei
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users