I've added the lru-limit=0 parameter to the mounts, and I see it's taken effect correctly:
"/usr/sbin/glusterfs --lru-limit=0 --process-name fuse --volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>"
Let's see if it stops crashing or not.
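
For anyone who wants to script the same check instead of eyeballing the process list, here is a minimal Python sketch. It assumes the option appears on the process command line the way it does in the line quoted above. The function name `lru_limit_from_cmdline` is made up for illustration, and the handling of a space-separated `--lru-limit 0` form is an assumption, not something seen in this thread.

```python
import shlex
from typing import Optional


def lru_limit_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value passed via --lru-limit, or None if the flag is absent."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        # Form seen in this thread: --lru-limit=0
        if arg.startswith("--lru-limit="):
            return arg.split("=", 1)[1]
        # Assumption: the space-separated form might also appear.
        if arg == "--lru-limit" and i + 1 < len(args):
            return args[i + 1]
    return None


if __name__ == "__main__":
    # Command line copied from the message above (paths already snipped).
    sample = (
        "/usr/sbin/glusterfs --lru-limit=0 --process-name fuse "
        "--volfile-server=localhost --volfile-id=/<SNIP> /mnt/<SNIP>"
    )
    print(lru_limit_from_cmdline(sample))
```

Feeding it each line of `ps -eo args` output that contains "--process-name fuse" should be enough to confirm every fuse mount picked up the option.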
On Wed, Feb 6, 2019 at 10:48 AM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
Hi Nithya,

Indeed, I upgraded from 4.1 to 5.3, at which point I started seeing crashes, and no further releases have been made yet.

volume info:
Type: Replicate
Volume ID: ****SNIP****
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: ****SNIP****
Brick2: ****SNIP****
Brick3: ****SNIP****
Brick4: ****SNIP****
Options Reconfigured:
cluster.quorum-count: 1
cluster.quorum-type: fixed
network.ping-timeout: 5
network.remote-dio: enable
performance.rda-cache-limit: 256MB
performance.readdir-ahead: on
performance.parallel-readdir: on
network.inode-lru-limit: 500000
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
cluster.readdir-optimize: on
performance.io-thread-count: 32
server.event-threads: 4
client.event-threads: 4
performance.read-ahead: off
cluster.lookup-optimize: on
performance.cache-size: 1GB
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on
cluster.granular-entry-heal: enable
cluster.data-self-heal-algorithm: full

On Wed, Feb 6, 2019 at 12:20 AM Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi Artem,

Do you still see the crashes with 5.3? If yes, please try mounting the volume using the mount option lru-limit=0 and see if that helps. We are looking into the crashes and will update when we have a fix.

Also, please provide the gluster volume info for the volume in question.

regards,
Nithya

On Tue, 5 Feb 2019 at 05:31, Artem Russakovskii <archon810@xxxxxxxxx> wrote:
The fuse crash happened two more times, but this time monit helped recover within 1 minute, so it's a great workaround for now.

What's odd is that the crashes are only happening on one of 4 servers, and I don't know why.

On Sat, Feb 2, 2019 at 12:14 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
The fuse crash happened again yesterday, to another volume.
Are there any mount options that could help mitigate this?

In the meantime, I set up a monit (https://mmonit.com/monit/) task to watch and restart the mount, which works and recovers the mount point within a minute. Not ideal, but a temporary workaround.

By the way, the way to reproduce this "Transport endpoint is not connected" condition for testing purposes is to kill -9 the right "glusterfs --process-name fuse" process.

monit check:
check filesystem glusterfs_data1 with path /mnt/glusterfs_data1
start program = "/bin/mount /mnt/glusterfs_data1"
stop program = "/bin/umount /mnt/glusterfs_data1"
if space usage > 90% for 5 times within 15 cycles
then alert else if succeeded for 10 cycles then alert

stack trace:
[2019-02-01 23:22:00.312894] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
[2019-02-01 23:22:00.314051] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fa0249e4329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fa024bf5af5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fa02cf5b218] ) 0-dict: dict is NULL [Invalid argument]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 0-epoll: Failed to dispatch handler" repeated 26 times between [2019-02-01 23:21:20.857333] and [2019-02-01 23:21:56.164427]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 0-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-3" repeated 27 times between [2019-02-01 23:21:11.142467] and [2019-02-01 23:22:03.474036]
pending frames:
frame : type(1) op(LOOKUP)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-02-01 23:22:03
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fa02cf6664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fa02cf70cb6]
/lib64/libc.so.6(+0x36160)[0x7fa02c12d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fa02c12d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fa02c12e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fa02c1256fa]
/lib64/libc.so.6(+0x2e772)[0x7fa02c125772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fa02c4bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x5dc9d)[0x7fa025543c9d]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x70ba1)[0x7fa025556ba1]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x58f3f)[0x7fa0257dbf3f]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fa02cd31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fa02cd31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fa02cd2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fa02694e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fa02cfc44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fa02c4b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fa02c1ef81f]

On Fri, Feb 1, 2019 at 9:03 AM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
Hi,

The first (and so far only) crash happened at 2am the next day after we upgraded, on only one of four servers and only to one of two mounts.

I have no idea what caused it, but yeah, we do have a pretty busy site (apkmirror.com), and it caused a disruption for any uploads or downloads from that server until I woke up and fixed the mount.

I wish I could be more helpful but all I have is that stack trace.

I'm glad it's a blocker and will hopefully be resolved soon.

On Thu, Jan 31, 2019, 7:26 PM Amar Tumballi Suryanarayan <atumball@xxxxxxxxxx> wrote:
Hi Artem,

Opened https://bugzilla.redhat.com/show_bug.cgi?id=1671603 (ie, as a clone of other bugs where recent discussions happened), and marked it as a blocker for glusterfs-5.4 release.

We already have fixes for log flooding -
https://review.gluster.org/22128, and are in the process of identifying and fixing the issue seen with the crash.

Can you please tell if the crashes happened as soon as you upgraded? Or was there any particular pattern you observed before the crash?

-Amar

On Thu, Jan 31, 2019 at 11:40 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
Within 24 hours after updating from rock solid 4.1 to 5.3, I already got a crash which others have mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1313567 and had to unmount, kill gluster, and remount:

[2019-01-31 09:38:04.317604] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.319308] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320047] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
[2019-01-31 09:38:04.320677] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fcccafcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fcccb1deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fccd705b218] ) 2-dict: dict is NULL [Invalid argument]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-3" repeated 5 times between [2019-01-31 09:37:54.751905] and [2019-01-31 09:38:03.958061]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 72 times between [2019-01-31 09:37:53.746741] and [2019-01-31 09:38:04.696993]
pending frames:
frame : type(1) op(READ)
frame : type(1) op(OPEN)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 6
time of crash:
2019-01-31 09:38:04
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.3
/usr/lib64/libglusterfs.so.0(+0x2764c)[0x7fccd706664c]
/usr/lib64/libglusterfs.so.0(gf_print_trace+0x306)[0x7fccd7070cb6]
/lib64/libc.so.6(+0x36160)[0x7fccd622d160]
/lib64/libc.so.6(gsignal+0x110)[0x7fccd622d0e0]
/lib64/libc.so.6(abort+0x151)[0x7fccd622e6c1]
/lib64/libc.so.6(+0x2e6fa)[0x7fccd62256fa]
/lib64/libc.so.6(+0x2e772)[0x7fccd6225772]
/lib64/libpthread.so.0(pthread_mutex_lock+0x228)[0x7fccd65bb0b8]
/usr/lib64/glusterfs/5.3/xlator/cluster/replicate.so(+0x32c4d)[0x7fcccbb01c4d]
/usr/lib64/glusterfs/5.3/xlator/protocol/client.so(+0x65778)[0x7fcccbdd1778]
/usr/lib64/libgfrpc.so.0(+0xe820)[0x7fccd6e31820]
/usr/lib64/libgfrpc.so.0(+0xeb6f)[0x7fccd6e31b6f]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fccd6e2e063]
/usr/lib64/glusterfs/5.3/rpc-transport/socket.so(+0xa0b2)[0x7fccd0b7e0b2]
/usr/lib64/libglusterfs.so.0(+0x854c3)[0x7fccd70c44c3]
/lib64/libpthread.so.0(+0x7559)[0x7fccd65b8559]
/lib64/libc.so.6(clone+0x3f)[0x7fccd62ef81f]
---------

Do the pending patches fix the crash or only the repeated warnings? I'm running glusterfs on OpenSUSE 15.0 installed via http://download.opensuse.org/repositories/home:/glusterfs:/Leap15-5/openSUSE_Leap_15.0/, not too sure how to make it core dump.

If it's not fixed by the patches above, has anyone already opened a ticket for the crashes that I can join and monitor?
This is going to create a massive problem for us since production systems are crashing.

Thanks.

On Wed, Jan 30, 2019 at 6:37 PM Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:
On Thu, Jan 31, 2019 at 2:14 AM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
Also, not sure if related or not, but I got a ton of these "Failed to dispatch handler" in my logs as well. Many people have been commenting about this issue here https://bugzilla.redhat.com/show_bug.cgi?id=1651246.

https://review.gluster.org/#/c/glusterfs/+/22046/ addresses this.

==> mnt-SITE_data1.log <==
[2019-01-30 20:38:20.783713] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]
==> mnt-SITE_data3.log <==
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 413 times between [2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0" repeated 42 times between [2019-01-30 20:36:23.290287] and [2019-01-30 20:38:20.280306]
==> mnt-SITE_data1.log <==
The message "I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0" repeated 50 times between [2019-01-30 20:36:22.247367] and [2019-01-30 20:38:19.459789]
The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler" repeated 2654 times between [2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]
[2019-01-30 20:38:21.492319] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data1-replicate-0: selecting local read_child SITE_data1-client-0
==> mnt-SITE_data3.log <==
[2019-01-30 20:38:22.349689] I [MSGID: 108031] [afr-common.c:2543:afr_local_discovery_cbk] 2-SITE_data3-replicate-0: selecting local read_child SITE_data3-client-0
==> mnt-SITE_data1.log <==
[2019-01-30 20:38:22.762941] E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] 2-epoll: Failed to dispatch handler

I'm hoping raising the issue here on the mailing list may bring some additional eyeballs and get them both fixed.

Thanks.

On Wed, Jan 30, 2019 at 12:26 PM Artem Russakovskii <archon810@xxxxxxxxx> wrote:
I found a similar issue here: https://bugzilla.redhat.com/show_bug.cgi?id=1313567. There's a comment from 3 days ago from someone else with 5.3 who started seeing the spam.

Here's the command that repeats over and over:
[2019-01-30 20:23:24.481581] W [dict.c:761:dict_ref] (-->/usr/lib64/glusterfs/5.3/xlator/performance/quick-read.so(+0x7329) [0x7fd966fcd329] -->/usr/lib64/glusterfs/5.3/xlator/performance/io-cache.so(+0xaaf5) [0x7fd9671deaf5] -->/usr/lib64/libglusterfs.so.0(dict_ref+0x58) [0x7fd9731ea218] ) 2-dict: dict is NULL [Invalid argument]

+Milind Changire Can you check why this message is logged and send a fix?

Is there any fix for this issue?

Thanks.
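
To put numbers on the log flooding quoted above, a rough Python sketch like the following can total the "repeated N times" summary lines per MSGID. The regular expression is inferred only from the log lines quoted in this thread, so treat the pattern and the helper name `total_repeats` as assumptions rather than a description of the gluster log format.

```python
import re
from collections import Counter

# Pattern inferred from the summary lines quoted in this thread (an assumption,
# not an official description of the log format).
REPEAT_RE = re.compile(
    r'^The message "(?P<level>[A-Z]) \[MSGID: (?P<msgid>\d+)\] .*?" '
    r"repeated (?P<count>\d+) times between "
)


def total_repeats(lines):
    """Sum the 'repeated N times' counts per MSGID across the given log lines."""
    totals = Counter()
    for line in lines:
        match = REPEAT_RE.match(line)
        if match:
            totals[match.group("msgid")] += int(match.group("count"))
    return totals


if __name__ == "__main__":
    # Two of the summary lines quoted earlier in the thread.
    sample = [
        'The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] '
        '2-epoll: Failed to dispatch handler" repeated 413 times between '
        "[2019-01-30 20:36:23.881090] and [2019-01-30 20:38:20.015593]",
        'The message "E [MSGID: 101191] [event-epoll.c:671:event_dispatch_epoll_worker] '
        '2-epoll: Failed to dispatch handler" repeated 2654 times between '
        "[2019-01-30 20:36:22.667327] and [2019-01-30 20:38:20.546355]",
    ]
    for msgid, count in sorted(total_repeats(sample).items()):
        print(msgid, count)
```

Running it over the mount logs before and after applying a patch would give a quick way to tell whether the flooding actually went down.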
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
--
Amar Tumballi (amarts)