Re: setting gfid on .trashcan/... failed - total outage

Hello Anoop,

thank you for your reply.

My answers are inline below.

best regards

Dietmar


On 29.06.2017 10:48, Anoop C S wrote:
On Wed, 2017-06-28 at 14:42 +0200, Dietmar Putz wrote:
Hello,

Recently we twice had a partial Gluster outage followed by a total outage of all four nodes. Looking into the Gluster mailing list, I found a very similar case in
http://lists.gluster.org/pipermail/gluster-users/2016-June/027124.html
If you are talking about a crash happening on bricks, were you able to find any backtraces from any
of the brick logs?

Yes, the crash happened on the bricks.
I followed the hints in the mentioned similar case, but unfortunately I did not find any backtrace in any of the brick logs.
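
For reference, I searched roughly like the following; the patterns are the markers that a crashing brick process usually writes into its log (as far as I know), so please tell me if I should look for something else:

  grep -E "pending frames|signal received|time of crash" /var/log/glusterfs/bricks/brick1-mvol1.log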



But I'm not sure whether this issue is fixed...

Even though this outage happened on GlusterFS 3.7.18, which gets no more updates since about 3.7.20, may I kindly ask whether this issue is known to be fixed in 3.8 or 3.10?
Unfortunately I did not find corresponding information in the release notes...

best regards
Dietmar


The partial outage started as shown below; the very first entries occurred in the brick logs:

gl-master-04, brick1-mvol1.log :

[2017-06-23 16:35:11.373471] E [MSGID: 113020]
[posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
/brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511
failed
[2017-06-23 16:35:11.392540] E [posix.c:3188:_fill_writev_xdata]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4f8c2aaa0b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4f8caec62f]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4f8caec406] )
0-mvol1-posix: fd: 0x7f4ef434225c inode: 0x7f4ef430bd6cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...


gl-master-04 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:18.872346] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for
10.0.1.203:65533
[2017-06-23 16:35:18.872421] E
[rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
to complete successfully

gl-master-04 : glustershd.log

[2017-06-23 16:35:42.536840] E [MSGID: 108006]
[afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:51.702413] E [socket.c:2292:socket_connect_finish]
0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)



gl-master-03, brick1-mvol1.log:

[2017-06-23 16:35:11.399769] E [MSGID: 113020]
[posix.c:2839:posix_create] 0-mvol1-posix: setting gfid on
/brick1/mvol1/.trashcan//2290/uploads/170221_Sendung_Lieberum_01_AT.mp4_2017-06-23_163511
failed
[2017-06-23 16:35:11.418559] E [posix.c:3188:_fill_writev_xdata]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7ff517087a0b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7ff5178c962f]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7ff5178c9406] )
0-mvol1-posix: fd: 0x7ff4c814a43c inode: 0x7ff4c82e1b5cgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]
...


gl-master-03 : etc-glusterfs-glusterd.vol.log

[2017-06-23 16:35:19.879140] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for
10.0.1.203:65530
[2017-06-23 16:35:19.879201] E
[rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
to complete successfully
[2017-06-23 16:35:19.879300] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for
10.0.1.203:65530
[2017-06-23 16:35:19.879314] E
[rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
to complete successfully
[2017-06-23 16:35:19.879845] W [rpcsvc.c:270:rpcsvc_program_actor]
0-rpc-service: RPC program not available (req 1298437 330) for
10.0.1.203:65530
[2017-06-23 16:35:19.879859] E
[rpcsvc.c:565:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed
to complete successfully
[2017-06-23 16:35:42.538727] W [socket.c:596:__socket_rwv] 0-management:
readv on /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket failed
(No data available)
[2017-06-23 16:35:42.543486] I [MSGID: 106005]
[glusterd-handler.c:5037:__glusterd_brick_rpc_notify] 0-management:
Brick gl-master-03-int:/brick1/mvol1 has disconnected from glusterd.


gl-master-03 : glustershd.log

[2017-06-23 16:35:42.537752] E [MSGID: 108006]
[afr-common.c:4323:afr_notify] 0-mvol1-replicate-1: All subvolumes are
down. Going offline until atleast one of them comes back up.
[2017-06-23 16:35:52.011016] E [socket.c:2292:socket_connect_finish]
0-mvol1-client-3: connection to 10.0.1.156:49152 failed (Connection refused)
[2017-06-23 16:35:53.010620] E [socket.c:2292:socket_connect_finish]
0-mvol1-client-2: connection to 10.0.1.154:49152 failed (Connection refused)



About 73 minutes later, the remaining replica pair was affected by the outage:

gl-master-02, brick1-mvol1.log :

[2017-06-23 17:48:30.093526] E [MSGID: 113018]
[posix.c:2766:posix_create] 0-mvol1-posix: pre-operation lstat on parent
/brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.093591] E [MSGID: 113018]
[posix.c:1447:posix_mkdir] 0-mvol1-posix: pre-operation lstat on parent
/brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.093636] E [MSGID: 113027]
[posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of /brick1/mvol1/ failed
[File exists]
[2017-06-23 17:48:30.093670] E [MSGID: 113027]
[posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
/brick1/mvol1/.trashcan failed [File exists]
[2017-06-23 17:48:30.093701] E [MSGID: 113027]
[posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
/brick1/mvol1/.trashcan/ failed [File exists]
[2017-06-23 17:48:30.113559] E [MSGID: 113001]
[posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
/brick1/mvol1/.trashcan//2290 failed [No such file or directory]
[2017-06-23 17:48:30.113630] E [MSGID: 113027]
[posix.c:1538:posix_mkdir] 0-mvol1-posix: mkdir of
/brick1/mvol1/.trashcan//2290 failed [File exists]
[2017-06-23 17:48:30.163155] E [MSGID: 113001]
[posix.c:1562:posix_mkdir] 0-mvol1-posix: setting xattrs on
/brick1/mvol1/.trashcan//2290/uploads failed [No such file or directory]
[2017-06-23 17:48:30.163282] E [MSGID: 113001]
[posix.c:2832:posix_create] 0-mvol1-posix: setting xattrs on
/brick1/mvol1/.trashcan//2290/uploads/170623_TVM_News.mp4_2017-06-23_174830
failed  [No such file or directory]
[2017-06-23 17:48:30.165617] E [posix.c:3188:_fill_writev_xdata]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/features/trash.so(trash_truncate_readv_cbk+0x1ab) [0x7f4ec77d9a0b]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(posix_writev+0x1ff) [0x7f4ecc1c162f]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.18/xlator/storage/posix.so(_fill_writev_xdata+0x1c6) [0x7f4ecc1c1406] )
0-mvol1-posix: fd: 0x7f4e70429b6c inode: 0x7f4e7041f9acgfid:00000000-0000-0000-0000-000000000000 [Invalid argument]


The file mentioned in the brick log was still available in its original directory, but not in the corresponding trashcan directory:


[ 14:29:29 ] - root@gl-master-01  /var/log/glusterfs $ls -lh
/sdn/2290/uploads/170221_Sendung_Lieberum_01_AT*
-rw-r--r-- 1 2001 2001 386M Mar 31 13:00
/sdn/2290/uploads/170221_Sendung_Lieberum_01_AT.mp4
-rw-r--r-- 1 2001 2001 386M Jun  2 13:09
/sdn/2290/uploads/170221_Sendung_Lieberum_01_AT_AT.mp4
[ 15:08:53 ] - root@gl-master-01  /var/log/glusterfs $


[ 15:11:04 ] - root@gl-master-01  /var/log/glusterfs $ls -lh
/sdn/.trashcan/2290/uploads/170221_Sendung_Lieberum_01_AT*
[ 15:11:10 ] - root@gl-master-01  /var/log/glusterfs $


Some further information: the OS is Ubuntu 16.04.2 LTS; the volume info is below:

[ 11:31:53 ] - root@gl-master-03  ~ $gluster volume info mvol1

Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Options Reconfigured:
geo-replication.ignore-pid-check: on
geo-replication.indexing: on
nfs.disable: off
diagnostics.client-log-level: ERROR
changelog.changelog: on
performance.cache-refresh-timeout: 32
cluster.min-free-disk: 200GB
network.ping-timeout: 5
performance.io-thread-count: 64
performance.cache-size: 8GB
performance.readdir-ahead: on
features.trash: off
mvol1 has the trash feature disabled, so you should not be seeing the above-mentioned errors in the brick logs any further.

Yes, right after the second outage we decided to disable the trash feature (the command is sketched below the volume info)...


features.trash-max-filesize: 1GB
[ 11:31:56 ] - root@gl-master-03  ~ $
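
For completeness, disabling it was done with the usual volume-set command, roughly as follows (reproduced from memory, so please treat it as a sketch rather than the exact session):

  gluster volume set mvol1 features.trash off          # disable the trash translator on the volume
  gluster volume info mvol1 | grep features.trash      # verify that the option now reports 'off'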


Host : gl-master-01
-rw-r----- 1 root root 232M Jun 23 17:49
/var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-02
-rw-r----- 1 root root 226M Jun 23 17:49
/var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-03
-rw-r----- 1 root root 254M Jun 23 16:35
/var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
Host : gl-master-04
-rw-r----- 1 root root 239M Jun 23 16:35
/var/crash/_usr_sbin_glusterfsd.0.crash
-----------------------------------------------------
If these are the core files dumped due to the brick crash, can you please attach one to gdb as follows and paste the backtrace obtained by executing the `bt` command within it.

$ gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash

(gdb) bt

Unfortunately another problem... Even though the filename ends with 'crash' and its creation time matches the time of the error, the file _usr_sbin_glusterfsd.0.crash is not recognized as a core dump. Currently I don't know how to handle this; I have tried several things with no success, therefore I have added the 'head' of the file below...

[ 14:47:37 ] - root@gl-master-03 ~ $gdb /usr/sbin/glusterfsd /var/crash/_usr_sbin_glusterfsd.0.crash
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
...
"/var/crash/_usr_sbin_glusterfsd.0.crash" is not a core dump: File format not recognised
(gdb)

[ 14:48:30 ] - root@gl-master-03 ~ $file /var/crash/_usr_sbin_glusterfsd.0.crash
/var/crash/_usr_sbin_glusterfsd.0.crash: ASCII text, with very long lines
[ 14:48:37 ] - root@gl-master-03 ~ $head /var/crash/_usr_sbin_glusterfsd.0.crash
ProblemType: Crash
Architecture: amd64
Date: Fri Jun 23 16:35:13 2017
DistroRelease: Ubuntu 16.04
ExecutablePath: /usr/sbin/glusterfsd
ExecutableTimestamp: 1481112595
ProcCmdline: /usr/sbin/glusterfsd -s gl-master-03-int --volfile-id mvol1.gl-master-03-int.brick1-mvol1 -p /var/lib/glusterd/vols/mvol1/run/gl-master-03-int-brick1-mvol1.pid -S /var/run/gluster/5e23d9709b37ac7877720ac3986c48bc.socket --brick-name /brick1/mvol1 -l /var/log/glusterfs/bricks/brick1-mvol1.log --xlator-option *-posix.glusterd-uuid=056fb1db-9a49-422d-81fb-94e1881313fd --brick-port 49152 --xlator-option mvol1-server.listen-port=49152
ProcCwd: /
ProcEnviron:
 LANGUAGE=en_GB:en
[ 14:48:52 ] - root@gl-master-03  ~ $
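
One thing I still plan to try (untested so far, and the target directory below is just an example): since the .crash file appears to be an apport report rather than a raw core dump, apport-unpack should be able to extract its fields, including the embedded CoreDump, which could then be loaded into gdb:

  apport-unpack /var/crash/_usr_sbin_glusterfsd.0.crash /tmp/glusterfsd-crash   # unpack the report into one file per field
  gdb /usr/sbin/glusterfsd /tmp/glusterfsd-crash/CoreDump                       # the extracted CoreDump should be a real core file
  (gdb) bt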




--
Dietmar Putz
3Q GmbH
Wetzlarer Str. 86
D-14482 Potsdam
Telefax: +49 (0)331 / 2797 866 - 1
Telefon:  +49 (0)331 / 2797 866 - 8
Mobile:   +49 171 / 90 160 39
Mail:     dietmar.putz@xxxxxxxxx

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users


