On Thu, 2017-06-29 at 17:13 +0200, Dietmar Putz wrote:
> Hello Anoop,
>
> thank you for your reply...
>
> answers inline...
>
> best regards
> Dietmar
>
>
> On 29.06.2017 10:48, Anoop C S wrote:
> > On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
> > > Hello,
> > >
> > > recently we had a partial gluster outage twice, followed by a total
> > > outage of all four nodes. Looking into the gluster mailing list I found
> > > a very similar case in
> > > http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
> >
> > If you are talking about a crash happening on bricks, were you able to find any backtraces in
> > any of the brick logs?
>
> yes, the crash happened on the bricks.
> I followed the hints in the mentioned similar case but unfortunately I
> did not find any backtrace in any of the brick logs.

Usually a backtrace is written to the logs just before the brick dies in case of a SIGSEGV.
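For reference, on a SIGSEGV the glusterfs crash handler typically prints marker lines such as "pending frames:" and "signal received: 11" into the brick log just before the process dies. A quick way to scan all brick logs for such markers is sketched below in Python; the log directory is the usual default and may differ on your setup.

```python
import glob

# Marker strings that glusterfs's crash handler writes to the brick log
# just before the process dies on a fatal signal.
CRASH_MARKERS = ("pending frames:", "signal received:")

def find_crash_reports(text):
    """Return the log lines in `text` that belong to a crash header."""
    return [line for line in text.splitlines()
            if any(marker in line for marker in CRASH_MARKERS)]

if __name__ == "__main__":
    # Default brick log location on Debian/Ubuntu; adjust if needed.
    for path in glob.glob("/var/log/glusterfs/bricks/*.log"):
        with open(path, errors="replace") as f:
            hits = find_crash_reports(f.read())
        if hits:
            print(path)
            for line in hits:
                print("  ", line)
```

If this prints nothing, the brick most likely died without its own signal handler getting a chance to log, which fits the situation described below where only an apport crash file was left behind.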
> > >
> > > best regards
> > > Dietmar
> > >
> > >
> > > the partial outage started as shown below; the very first entries
> > > occurred in the brick logs:
> > >
> > > gl-master-04, brick1-mvol1.log :
> > >
> > > [2017-06-23 16:35:11.373471] E [MSGID: 113020]
> > > [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
> > > /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
> > > [2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4f8c2aaa0b]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4f8caec62f]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4f8caec406] )
> > > 0-mvol1-posix: fd: 0x7f4ef434225c inode: 0x7f4ef430bd6c gfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
> > > ...
> > >
> > >
> > > gl-master-04 : etc-glusterfs-glusterd.vol.log
> > >
> > > [2017-06-23 16:35:18.872346] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65533
> > > [2017-06-23 16:35:18.872421] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
> > > 0-rpcsvc: rpc actor failed to complete successfully
> > >
> > > gl-master-04 : glustershd.log
> > >
> > > [2017-06-23 16:35:42.536840] E [MSGID: 108006]
> > > [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
> > > down. Going offline until atleast one of them comes back up.
> > > [2017-06-23 16:35:51.702413] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
> > >
> > >
> > > gl-master-03, brick1-mvol1.log :
> > >
> > > [2017-06-23 16:35:11.399769] E [MSGID: 113020]
> > > [posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
> > > /brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511 failed
> > > [2017-06-23 16:35:11.418559] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7ff517087a0b]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7ff5178c962f]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7ff5178c9406] )
> > > 0-mvol1-posix: fd: 0x7ff4c814a43c inode: 0x7ff4c82e1b5c gfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
> > > ...
> > >
> > >
> > > gl-master-03 : etc-glusterfs-glusterd.vol.log
> > >
> > > [2017-06-23 16:35:19.879140] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879201] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
> > > 0-rpcsvc: rpc actor failed to complete successfully
> > > [2017-06-23 16:35:19.879300] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879314] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
> > > 0-rpcsvc: rpc actor failed to complete successfully
> > > [2017-06-23 16:35:19.879845] W [rpcsvc.c:270:rpcsvc_program_actor]
> > > 0-rpc-service: RPC program not available (req 1298437 330) for 10.0.1.203:65530
> > > [2017-06-23 16:35:19.879859] E [rpcsvc.c:565:rpcsvc_check_and_reply_error]
> > > 0-rpcsvc: rpc actor failed to complete successfully
> > > [2017-06-23 16:35:42.538727] W [socket.c:596:__socket_rwv] 0-management:
> > > readv on /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket failed (No data available)
> > > [2017-06-23 16:35:42.543486] I [MSGID: 106005]
> > > [glusterd-handler.c:5037:__glusterd_brick_rpc_notify] 0-management:
> > > Brick gl-master-03-int:/brick1/mvol1 has disconnected from glusterd.
> > >
> > >
> > > gl-master-03 : glustershd.log
> > >
> > > [2017-06-23 16:35:42.537752] E [MSGID: 108006]
> > > [afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
> > > down. Going offline until atleast one of them comes back up.
> > > [2017-06-23 16:35:52.011016] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
> > > [2017-06-23 16:35:53.010620] E [socket.c:2292:socket_connect_finish]
> > > 0-mvol1-client-2: connection to 10.0.1.154:49152 failed (Connection refused)
> > >
> > >
> > > about 73 minutes later the remaining replicated pair was affected by the outage:
> > >
> > > gl-master-02, brick1-mvol1.log :
> > >
> > > [2017-06-23 17:48:30.093526] E [MSGID: 113018]
> > > [posix.c:2766:posix_create] 0-mvol1-posix: pre-operation lstat on parent
> > > /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
> > > [2017-06-23 17:48:30.093591] E [MSGID: 113018]
> > > [posix.c:1447:posix_mkdir] 0-mvol1-posix: pre-operation lstat on parent
> > > /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
> > > [2017-06-23 17:48:30.093636] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/ failed [File exists]
> > > [2017-06-23 17:48:30.093670] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan failed [File exists]
> > > [2017-06-23 17:48:30.093701] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/.trashcan/ failed [File exists]
> > > [2017-06-23 17:48:30.113559] E [MSGID: 113001]
> > > [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290 failed [No such file or directory]
> > > [2017-06-23 17:48:30.113630] E [MSGID: 113027]
> > > [posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
> > > /brick1/mvol1/.trashcan//2290 failed [File exists]
> > > [2017-06-23 17:48:30.163155] E [MSGID: 113001]
> > > [posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
> > > [2017-06-23 17:48:30.163282] E [MSGID: 113001]
> > > [posix.c:2832:posix_create] 0-mvol1-posix: setting xattrs on
> > > /brick1/mvol1/.trashcan//2290/uploads/170623_TVM_News.mp4_2017-06-23_174830
> > > failed [No such file or directory]
> > > [2017-06-23 17:48:30.165617] E [posix.c:3188:_fill_writev_xdata]
> > > (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4ec77d9a0b]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4ecc1c162f]
> > > -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4ecc1c1406] )
> > > 0-mvol1-posix: fd: 0x7f4e70429b6c inode: 0x7f4e7041f9ac gfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
> > >
> > >
> > > the file mentioned in the brick log was still available in the original
> > > directory but not in the corresponding trashcan directory:
> > >
> > > [ 14:29:29 ] - root@gl-master-01 /var/log/glusterfs $ls -lh /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT*
> > > -rw-r--r-- 1 2001 2001 386M Mar 31 13:00 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
> > > -rw-r--r-- 1 2001 2001 386M Jun  2 13:09 /sdn/2290/uploads/170221_Sendung_Lieberum_01_AT_AT.mp4
> > > [ 15:08:53 ] - root@gl-master-01 /var/log/glusterfs $
> > >
> > > [ 15:11:04 ] - root@gl-master-01 /var/log/glusterfs $ls -lh /sdn/.trashcan/2290/uploads/170221_Sendung_Lieberum_01_AT*
> > > [ 15:11:10 ] - root@gl-master-01 /var/log/glusterfs $
> > >
> > >
> > > some further information... the OS is Ubuntu 16.04.2 LTS, volume info below:
> > >
> > > [ 11:31:53 ] - root@gl-master-03 ~ $gluster volume info mvol1
> > >
> > > Volume Name: mvol1
> > > Type: Distributed-Replicate
> > > Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
> > > Status: Started
> > > Number of Bricks: 2 x 2 = 4
> > > Transport-type: tcp
> > > Bricks:
> > > Brick1: gl-master-01-int:/brick1/mvol1
> > > Brick2:
> > > gl-master-02-int:/brick1/mvol1
> > > Brick3: gl-master-03-int:/brick1/mvol1
> > > Brick4: gl-master-04-int:/brick1/mvol1
> > > Options Reconfigured:
> > > geo-replication.ignore-pid-check: on
> > > geo-replication.indexing: on
> > > nfs.disable: off
> > > diagnostics.client-log-level: ERROR
> > > changelog.changelog: on
> > > performance.cache-refresh-timeout: 32
> > > cluster.min-free-disk: 200GB
> > > network.ping-timeout: 5
> > > performance.io-thread-count: 64
> > > performance.cache-size: 8GB
> > > performance.readdir-ahead: on
> > > features.trash: off
> >
> > mvol1 has the trash feature disabled, so you should not see the above-mentioned errors in the
> > brick logs any further.
>
> yes, right after the second outage we decided to disable the trash feature...
>
> > > features.trash-max-filesize: 1GB
> > > [ 11:31:56 ] - root@gl-master-03 ~ $
> > >
> > >
> > > Host : gl-master-01
> > > -rw-r----- 1 root root 232M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-02
> > > -rw-r----- 1 root root 226M Jun 23 17:49 /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-03
> > > -rw-r----- 1 root root 254M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> > > Host : gl-master-04
> > > -rw-r----- 1 root root 239M Jun 23 16:35 /var/crash/_usr_sbin_glusterfsd.0.crash
> > > -----------------------------------------------------
> >
> > If these are the core files dumped due to the brick crash, can you please load one into gdb as
> > follows and paste the backtrace produced by the `bt` command:
> >
> > $ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
> > (gdb) bt
>
> unfortunately another problem... even though the filename ends with
> 'crash' and the creation time matches the time of the error, the file
> _usr_sbin_glusterfsd.0.crash is not recognized as a core dump.
> currently I don't know how to handle this; I tried several things with no
> success, therefore I add the 'head' of the file...
>
> [ 14:47:37 ] - root@gl-master-03 ~ $gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
> GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
> ...
> "/var/crash/_usr_sbin_glusterfsd.0.crash" is not a core dump: File format not recognised
> (gdb)
>
> [ 14:48:30 ] - root@gl-master-03 ~ $file /var/crash/_usr_sbin_glusterfsd.0.crash
> /var/crash/_usr_sbin_glusterfsd.0.crash: ASCII text, with very long lines
> [ 14:48:37 ] - root@gl-master-03 ~ $head /var/crash/_usr_sbin_glusterfsd.0.crash
> ProblemType: Crash
> Architecture: amd64
> Date: Fri Jun 23 16:35:13 2017
> DistroRelease: Ubuntu 16.04
> ExecutablePath: /usr/sbin/glusterfsd
> ExecutableTimestamp: 1481112595
> ProcCmdline: /usr/sbin/glusterfsd -s gl-master-03-int --volfile-id
> mvol1.gl-master-03-int.brick1-mvol1 -p
> /var/lib/glusterd/vols/mvol1/run/gl-master-03-int-brick1-mvol1.pid -S
> /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket --brick-name
> /brick1/mvol1 -l /var/log/glusterfs/bricks/brick1-mvol1.log
> --xlator-option *-posix.glusterd-uuid=056fb1db-9a49-422d-81fb-94e1881313fd
> --brick-port 49152 --xlator-option mvol1-server.listen-port=49152
> ProcCwd: /
> ProcEnviron:
>  LANGUAGE=en_GB:en
> [ 14:48:52 ] - root@gl-master-03 ~ $

Is there any information in the report about the signal which caused the crash, such as "Signal: 11" or similar? You may be able to use `apport-unpack` to extract the core dump (found via some searching; I'm not fully familiar with it).
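This is why gdb rejects the file: on Ubuntu, apport captures crashes as a plain-text key/value report rather than a raw ELF core dump. A raw core starts with the ELF magic bytes \x7fELF, while an apport report begins with "ProblemType: Crash", exactly as the `head` output above shows. A rough sketch in Python (the helper name is hypothetical, not part of any gluster tooling):

```python
def classify_crash_file(path):
    """Distinguish a raw ELF core dump from an Ubuntu apport text report."""
    with open(path, "rb") as f:
        head = f.read(16)
    if head.startswith(b"\x7fELF"):
        return "elf-core"        # loadable directly with gdb
    if head.startswith(b"ProblemType:"):
        return "apport-report"   # needs apport-unpack first
    return "unknown"
```

For an apport report, something along the lines of `apport-unpack /var/crash/_usr_sbin_glusterfsd.0.crash /tmp/crash` should extract a `CoreDump` file into the target directory, after which `gdb /usr/sbin/glusterfsd /tmp/crash/CoreDump` followed by `bt` should give the backtrace.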
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users