Hi,
I am a bit surprised by this response. Quite recently we created a Red Hat support case about the same issue (brick process crashing when scanned), and Red Hat's response was simply that the "solution" was not to scan the bricks and that the issue would not be resolved (RH support case #02551577). This, of course, concerns Red Hat's commercial GlusterFS version, currently at v6.0-21.el7rhgs.
Would this fix also be ported to the RHGS version?
Regards,
Nico van Roijen - ING Bank.
From: "Xavi Hernandez" <jahernan@xxxxxxxxxx>
To: "Ben Tasker" <btasker@xxxxxxxxxxxxxx>
Cc: "gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Monday, 13 January 2020 12:20:29
Subject: Re: Gluster Periodic Brick Process Deaths
Hi Ben,
We have already identified the issue that caused crashes when gluster ports were scanned. The fix is present in 6.7 and 7.1, so if this was the reason for your problem, those versions should help.
Best regards,
Xavi
On Mon, Jan 13, 2020 at 11:57 AM Ben Tasker <btasker@xxxxxxxxxxxxxx> wrote:
Hi,

Just an update on this - we made our ACLs much, much stricter around gluster ports and to my knowledge haven't seen a brick death since.

Ben

On Wed, Dec 11, 2019 at 12:43 PM Ben Tasker <btasker@xxxxxxxxxxxxxx> wrote:

Hi Xavi,

We don't that I'm explicitly aware of, *but* I can't rule it out as a probability as it's possible some of our partners do (some/most certainly have scans done as part of pentests fairly regularly).

But, that does at least give me an avenue to pursue in the meantime, thanks!

Ben

On Wed, Dec 11, 2019 at 12:16 PM Xavi Hernandez <jahernan@xxxxxxxxxx> wrote:

Hi Ben,

I've recently seen some issues that seem similar to yours (based on the stack trace in the logs). Right now it seems that in these cases the problem is caused by some port scanning tool that triggers an unhandled condition. We are still investigating what is causing this to fix it as soon as possible.

Do you have one of these tools on your network ?

Regards,
Xavi

On Tue, Dec 10, 2019 at 7:53 PM Ben Tasker <btasker@xxxxxxxxxxxxxx> wrote:

Hi,

A little while ago we had an issue with Gluster 6. As it was urgent we downgraded to Gluster 5.9 and it went away.

Some boxes are now running 5.10 and the issue has come back.

From the operators point of view, the first you know about this is getting reports that the transport endpoint is not connected:

OSError: [Errno 107] Transport endpoint is not connected: '/shared/lfd/benfusetestlfd'

If we check, we can see that the brick process has died

# gluster volume status
Status of volume: shared
Gluster process                     TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick fa01.gl:/data1/gluster        N/A       N/A        N       N/A
Brick fa02.gl:/data1/gluster        N/A       N/A        N       N/A
Brick fa01.gl:/data2/gluster        49153     0          Y       14136
Brick fa02.gl:/data2/gluster        49153     0          Y       14154
NFS Server on localhost             N/A       N/A        N       N/A
Self-heal Daemon on localhost       N/A       N/A        Y       186193
NFS Server on fa01.gl               N/A       N/A        N       N/A
Self-heal Daemon on fa01.gl         N/A       N/A        Y       6723

Looking in the brick logs, we can see that the process crashed, and we get a backtrace (of sorts)

>gen=110, slot->fd=17
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-07-04 09:42:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
/lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
/lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]

Other than that, there's not a lot in the logs. In syslog we can see the client (Gluster's FS is mounted on the boxes) complaining that the brick's gone away.

Software versions (for when this was happening with 6):

# rpm -qa | grep glus
glusterfs-libs-6.1-1.el7.x86_64
glusterfs-cli-6.1-1.el7.x86_64
centos-release-gluster6-1.0-1.el7.centos.noarch
glusterfs-6.1-1.el7.x86_64
glusterfs-api-6.1-1.el7.x86_64
glusterfs-server-6.1-1.el7.x86_64
glusterfs-client-xlators-6.1-1.el7.x86_64
glusterfs-fuse-6.1-1.el7.x86_64

This was happening pretty regularly (uncomfortably so) on boxes running Gluster 6. Grepping through the brick logs it's always a segfault or sigabrt that leads to brick death

# grep "signal received:" data*
data1-gluster.log:signal received: 11
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 11
data2-gluster.log:signal received: 6

There's no apparent correlation on times or usage levels that we could see. The issue was occurring on a wide array of hardware, spread across the globe (but always talking to local - i.e. LAN - peers). All the same, disks were checked, RAM checked etc.

Digging through the logs we were able to find the lines just as the crash occurs

[2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
[2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
[2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
[2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
[2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)

But, it's not the first time that had occurred, so may be completely unrelated.

When this happens, restarting gluster buys some time. It may just be coincidental, but our searches through the logs showed only the first brick process dying, processes for other bricks (some of the boxes have 4) don't appear to be affected by this.

As we had lots and lots of Gluster machines failing across the network, at this point we stopped investigating and I came up with a downgrade procedure so that we could get production back into a usable state. Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue just went away. Unfortunately other demands came up, so no-one was able to follow up on it.

Tonight though, there's been a brick process fail on a 5.10 machine with an all too familiar looking BT

[2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)
[2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
[2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-12-10 17:21:36
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 5.10
/lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
/lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
/usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
/lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
/lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
/lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
---------

Versions this time are

# rpm -qa | grep glus
glusterfs-server-5.10-1.el7.x86_64
centos-release-gluster5-1.0-1.el7.centos.noarch
glusterfs-fuse-5.10-1.el7.x86_64
glusterfs-libs-5.10-1.el7.x86_64
glusterfs-client-xlators-5.10-1.el7.x86_64
glusterfs-api-5.10-1.el7.x86_64
glusterfs-5.10-1.el7.x86_64
glusterfs-cli-5.10-1.el7.x86_64

These boxes have been running 5.10 for less than 48 hours

Has anyone else run into this? Assuming the root is the same (it's a fairly limited BT, so hard to say for sure), was something from 6 backported into 5.10?

Thanks

Ben
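
For anyone wanting to repeat the "signal received:" grep from the original report across many boxes, here is a minimal sketch in Python. It assumes only that the brick logs are plain text containing lines of the form "signal received: N", as in the excerpts in this thread. The function names, the "data*" file pattern and the log directory argument are illustrative assumptions, not part of any Gluster tooling.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the crash marker seen in the brick logs, e.g. "signal received: 11".
SIGNAL_LINE = re.compile(r"signal received:\s*(\d+)")


def tally_signals(log_text):
    """Count how often each signal number appears in one log's text."""
    return Counter(int(m.group(1)) for m in SIGNAL_LINE.finditer(log_text))


def tally_directory(log_dir, pattern="data*"):
    """Return {log file name: Counter of signal numbers}.

    log_dir and pattern are assumptions; point them at wherever
    your brick logs actually live.
    """
    return {
        path.name: tally_signals(path.read_text(errors="replace"))
        for path in sorted(Path(log_dir).glob(pattern))
        if path.is_file()
    }


if __name__ == "__main__":
    import sys

    # Usage: python tally_signals.py <brick-log-directory>
    for name, counts in tally_directory(sys.argv[1]).items():
        for signum, count in sorted(counts.items()):
            print(f"{name}: signal {signum} x{count}")
```

On the usual Linux numbering, signal 11 is SIGSEGV and signal 6 is SIGABRT, which matches the "segfault or sigabrt" description in the report.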
Community Meeting Calendar:
APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968
NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users