Re: geo-rep: remote operation failed - No such file or directory

Milind Changire <mchangir@xxxxxxxxxx> · Wed, 24 Feb 2016 04:14:53 -0500 (EST)

1. You could use the script at
   https://gist.github.com/aravindavk/afb16813261794faa432
   to build a path from the gfid that you can cd to,
   in your case for gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e

2. Yes, you have to recursively set the virtual xattr
   on all entries in the directory tree.
   Also, remember to set a value as well:
   # setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <file-path>

Also, remember to set the virtual xattr via the volume
mount path and not the brick back-end path.
Geo-replication should be stopped while you are setting
the virtual xattr; start it again once you have set the
xattr on the entire directory tree.
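
The stop / set / start order can be wrapped in a small helper. This is
only a sketch under assumptions: the function name with_georep_stopped
and the names mastervol, slavehost::slavevol and /mnt/mastervol are
made up for illustration, and the slave argument must match the one
used when the geo-replication session was created.

```bash
# with_georep_stopped MASTER_VOL SLAVE COMMAND [ARGS...]
# Stops the geo-replication session, runs COMMAND, then starts the
# session again. Returns the exit status of COMMAND unless the
# stop or start itself fails.
with_georep_stopped() {
    local master_vol="$1" slave="$2"
    shift 2
    gluster volume geo-replication "$master_vol" "$slave" stop || return 1
    "$@"
    local rc=$?
    gluster volume geo-replication "$master_vol" "$slave" start || return 1
    return "$rc"
}
```

Hypothetical usage, with the tree given via the volume mount path:

  # with_georep_stopped mastervol slavehost::slavevol \
      find /mnt/mastervol/<path>/OC_DEFAULT_MODULE \
      -exec setfattr -n glusterfs.geo-rep.trigger-sync -v 1 {} \;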

--
Milind

----- Original Message -----
From: "ML mail" <mlnospam@xxxxxxxxx>
To: "Milind Changire" <mchangir@xxxxxxxxxx>
Cc: "Gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Wednesday, February 24, 2016 1:46:11 PM
Subject: Re:  geo-rep: remote operation failed - No such file or directory

Thank you for explaining how the symbolic linking works in the .glusterfs directory. Regarding your new instructions, I have two questions:

1) How can I find out which "OC_DEFAULT_MODULE" directory on my master brick I should run the setfattr command on? My problem is that there are many OC_DEFAULT_MODULE directories on my brick, not just one.

2) If I understand your last paragraph correctly, you want me to locate the correct OC_DEFAULT_MODULE directory and recursively run setfattr on every sub-directory and file inside it. Is this correct?

Regards
ML

On Wednesday, February 24, 2016 7:29 AM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
ML,
You just need to worry about the very first entry that you found with
the find command:

$ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE

Since the back-end entry is a symlink, OC_DEFAULT_MODULE is a
directory on the master, and it is missing on the slave.
If you recursively look up the parent gfid of each entry, you will
keep landing on symlinks, because a directory is always represented
as a symlink at the glusterfs back-end. Following them takes you all
the way up to the ROOT gfid.
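
The chain can also be walked mechanically to recover the directory's
path. The following is a minimal sketch, not the gist script mentioned
elsewhere in this thread. It assumes that every directory gfid under
.glusterfs is a symlink of the form ../../xx/yy/<parent-gfid>/<name>,
as in the find output above, and that the chain ends at the root gfid
00000000-0000-0000-0000-000000000001. The function name gfid_dir_path
is made up for illustration.

```bash
ROOT_GFID="00000000-0000-0000-0000-000000000001"

# gfid_dir_path BRICK GFID
# Prints the path of a directory, relative to the brick root, by
# following the .glusterfs symlinks from GFID up to the root gfid.
# Fails if an entry in the chain is missing or is not a symlink
# (i.e. the gfid does not belong to a directory).
gfid_dir_path() {
    local brick="$1" gfid="$2" path="" link target
    while [ "$gfid" != "$ROOT_GFID" ]; do
        # Back-end entry: .glusterfs/<chars 1-2>/<chars 3-4>/<gfid>
        link="$brick/.glusterfs/$(printf '%s' "$gfid" | cut -c1-2)/$(printf '%s' "$gfid" | cut -c3-4)/$gfid"
        target=$(readlink "$link") || return 1
        # Last component is this directory's name ...
        path="/$(basename "$target")$path"
        # ... and the component before it is the parent's gfid.
        gfid=$(basename "$(dirname "$target")")
    done
    printf '%s\n' "${path:-/}"
}
```

Hypothetical usage on a master brick:

  # gfid_dir_path /<path-to-brick> c4b19f1c-cc18-4727-87a4-18de8fe0089e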

-----

Now, to get the OC_DEFAULT_MODULE directory replicated on the slave,
you will have to set the virtual xattr on the entire directory tree
in pre-order listing i.e. set the virtual xattr on the directory
starting at OC_DEFAULT_MODULE and then on the entries inside the
directory, and so on down the directory tree.
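
As a rough sketch of the pre-order walk: find, when used without
-depth, visits a directory before the entries inside it, which is the
order described above. The function name trigger_sync_tree and the
DRY_RUN switch are made up for illustration; the path passed in is
assumed to be under the volume mount.

```bash
# trigger_sync_tree ROOT
# Sets the virtual xattr on ROOT first and then on everything below it
# (pre-order). With DRY_RUN=1 the setfattr commands are only printed.
trigger_sync_tree() {
    local root="$1" prefix=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        prefix="echo"
    fi
    # $prefix is intentionally unquoted: empty means "run for real".
    find "$root" -exec $prefix setfattr -n glusterfs.geo-rep.trigger-sync -v 1 {} \;
}
```

Hypothetical usage, checking the order first and then running it:

  # DRY_RUN=1 trigger_sync_tree /<volume-mount>/<path>/OC_DEFAULT_MODULE
  # trigger_sync_tree /<volume-mount>/<path>/OC_DEFAULT_MODULE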

--
Milind

----- Original Message -----
From: "ML mail" <mlnospam@xxxxxxxxx>
To: "Milind Changire" <mchangir@xxxxxxxxxx>
Cc: "Gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Wednesday, February 24, 2016 12:25:26 AM
Subject: Re:  geo-rep: remote operation failed - No such file or    directory

Hi Milind,

Thanks for the instructions for forcing the data sync of a specific file. I was not able to do that, because I discovered something even stranger while trying to find the file in question by GFID with the find command as you suggested. It looks like I have a symbolic link pointing to another one, which points to yet another, and so on, as you can see below:

$ find .glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e -ls
228215    0 lrwxrwxrwx   1 root     root           66 Feb 19 08:52 .glusterfs/c4/b1/c4b19f1c-cc18-4727-87a4-18de8fe0089e -> ../../92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb/OC_DEFAULT_MODULE

$ ls -la 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb
lrwxrwxrwx 1 root root 79 Feb 19 08:52 92/1b/921bfe8e-81ef-4579-b335-abfa2c7e6afb -> ../../d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236/160201_File_1602_XX.xls

$ ls -la d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236
lrwxrwxrwx 1 root root 53 Feb 15 07:34 d7/9f/d79f2ebd-029c-4ac5-8074-5eef7ff21236 -> ../../fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8/1602

$ ls -la fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8
lrwxrwxrwx 1 root root 55 Feb 15 07:29 fd/ea/fdea1fc6-0f2a-43d2-8776-651cc6ea73e8 -> ../../20/25/20253364-add8-4149-a7cf-cf46d237a45c/Banana

Is this normal? I don't understand this weird structure of never-ending symbolic links... or am I missing something?

Regards
ML

On Tuesday, February 23, 2016 6:31 AM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
ML,
You will have to search for the gfid c4b19f1c-cc18-4727-87a4-18de8fe0089e
at the master cluster brick back-ends and run the following command for
that specific file on the master cluster to force triggering a data sync [1]

# setfattr -n glusterfs.geo-rep.trigger-sync -v 1 <file-path>

To search for the file at the brick back-end:

# find /<path-to-brick>/.glusterfs -name c4b19f1c-cc18-4727-87a4-18de8fe0089e

Once the path to the file is found on any of the bricks, you can then use
the setfattr command described above.
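
Instead of searching, the back-end entry can also be computed directly,
since it lives at .glusterfs/<first two hex chars>/<next two>/<gfid>.
A minimal sketch; the function name gfid_backend_path and the brick
path in the usage line are made up for illustration.

```bash
# gfid_backend_path BRICK GFID
# Prints where the entry for GFID is expected below the brick's
# .glusterfs directory. Does not check that the entry exists.
gfid_backend_path() {
    local brick="$1" gfid="$2"
    local d1 d2
    d1=$(printf '%s' "$gfid" | cut -c1-2)   # first two hex characters
    d2=$(printf '%s' "$gfid" | cut -c3-4)   # next two hex characters
    printf '%s/.glusterfs/%s/%s/%s\n' "$brick" "$d1" "$d2" "$gfid"
}
```

Hypothetical usage:

  # ls -l "$(gfid_backend_path /<path-to-brick> c4b19f1c-cc18-4727-87a4-18de8fe0089e)"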

Reference:
[1] feature/changelog: Virtual xattr to trigger explicit sync in geo-rep
    http://review.gluster.org/#/c/9337/
--
Milind

----- Original Message -----
From: "ML mail" <mlnospam@xxxxxxxxx>
To: "Milind Changire" <mchangir@xxxxxxxxxx>
Cc: "Gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Monday, February 22, 2016 9:10:56 PM
Subject: Re:  geo-rep: remote operation failed - No such file or    directory

Hi Milind,

Thanks for the suggestion, I did that for a few problematic files and it seems to continue but now I am stuck at the following error message on the slave:

[2016-02-22 15:21:30.451133] W [MSGID: 114031] [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0: remote operation failed. Path: <gfid:c4b19f1c-cc18-4727-87a4-18de8fe0089e> (c4b19f1c-cc18-4727-87a4-18de8fe0089e) [No such file or directory]

As you can see, this message does not include any file or directory name, so I can't go and delete that file or directory. Any other ideas on how I may proceed here?

Or would it be easier to delete the whole directory that I think is affected and start geo-rep from there? Or will this mess things up?

Regards
ML

On Monday, February 22, 2016 12:12 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:
ML,
You could try deleting the problematic files on the slave to recover
geo-replication from the Faulty state.
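
To see which entries the slave is complaining about, the gfids can be
pulled out of the slave log. This is a sketch only: it assumes each log
message is on a single line and ends in
"(<gfid>) [No such file or directory]", as in the excerpts quoted in
this thread. The function name failed_gfids is made up for
illustration.

```bash
# failed_gfids [LOGFILE...]
# Prints the unique gfids from "remote operation failed" messages that
# ended in "No such file or directory". Reads stdin if no file is given.
failed_gfids() {
    grep 'remote operation failed' "$@" \
        | grep -oE '\([0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}\) \[No such file or directory\]' \
        | tr -d '()' \
        | cut -d' ' -f1 \
        | sort -u
}
```

Hypothetical usage on the slave:

  # failed_gfids /var/log/glusterfs/geo-replication-slaves/*.gluster.log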

However, changelogs generated by the logrotate scenario will still
push geo-replication into the Faulty state frequently whenever
geo-replication fails and restarts.

The patches mentioned in an earlier mail are being worked on and finalized.
They will be available soon in a release and will keep geo-replication
from going into the Faulty state.

--
Milind

----- Original Message -----
From: "ML mail" <mlnospam@xxxxxxxxx>
To: "Milind Changire" <mchangir@xxxxxxxxxx>, "Gluster-users" <gluster-users@xxxxxxxxxxx>
Sent: Monday, February 22, 2016 1:27:14 PM
Subject: Re:  geo-rep: remote operation failed - No such file or    directory

Hi Milind,

Any news on this issue? I was wondering how I can fix and restart my geo-replication. Can I simply delete the problematic file(s) on my slave and restart geo-rep?

Regards
ML

On Wednesday, February 17, 2016 4:30 PM, ML mail <mlnospam@xxxxxxxxx> wrote:

Hi Milind,

Thank you for your short analysis. Indeed, that's exactly what happens: as soon as I restart geo-rep, it replays the same operations over and over, as it does not succeed.

Regarding the sequence of file management operations, I am not totally sure how it works, but I can tell you that we are using ownCloud v8.2.2 (www.owncloud.org) with GlusterFS as the storage for this cloud software. So it is very probable that ownCloud works like this: when a user uploads a new file, it first creates it under a temporary name, which it then renames or moves after a successful upload.

I have the feeling this issue is related to my initial issue, which I reported earlier this month:
https://www.gluster.org/pipermail/gluster-users/2016-February/025176.html

For now my question is: how do I restart geo-replication successfully?

Regards
ML

On Wednesday, February 17, 2016 4:10 PM, Milind Changire <mchangir@xxxxxxxxxx> wrote:

As per the slave logs, there is an attempt to RENAME files
i.e. a .part file getting renamed to a name without the
.part suffix

Just restarting geo-rep isn't going to help much if
you've already hit the problem. Since the last CHANGELOG
is replayed by geo-rep on a restart, you'll most probably
encounter the same log messages in the logs.

Are the .part files CREATEd, RENAMEd and DELETEd with the
same name often? Do the operations on the geo-replication
master cluster happen in roughly the following sequence?

CREATE f1.part
RENAME f1.part f1
DELETE f1
CREATE f1.part
RENAME f1.part f1
...
...

If not, then it would help if you could send the sequence
of file management operations.
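
For testing, the sequence above can be replayed by hand. This is a
rough sketch, not a description of what ownCloud actually does; the
function name churn_part_file and the directory argument (assumed to
be a scratch directory on the master volume mount) are made up for
illustration.

```bash
# churn_part_file DIR NAME ROUNDS
# Repeats: CREATE NAME.part, RENAME NAME.part -> NAME, with a
# DELETE of NAME before every round except the first.
churn_part_file() {
    local dir="$1" name="$2" rounds="$3" i=0
    while [ "$i" -lt "$rounds" ]; do
        if [ "$i" -gt 0 ]; then
            rm -f "$dir/$name"                 # DELETE f1
        fi
        : > "$dir/$name.part"                  # CREATE f1.part
        mv "$dir/$name.part" "$dir/$name"      # RENAME f1.part f1
        i=$((i + 1))
    done
}
```

Hypothetical usage:

  # churn_part_file /<volume-mount>/<scratch-dir> f1 2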

--
Milind

----- Original Message -----
From: "Kotresh Hiremath Ravishankar" <khiremat@xxxxxxxxxx>
To: "ML mail" <mlnospam@xxxxxxxxx>
Cc: "Gluster-users" <gluster-users@xxxxxxxxxxx>, "Milind Changire" <mchangir@xxxxxxxxxx>
Sent: Tuesday, February 16, 2016 6:28:21 PM
Subject: Re:  geo-rep: remote operation failed - No such file or    directory

Cc'ing Milind; he should be able to help.

Thanks and Regards,
Kotresh H R

----- Original Message -----
> From: "ML mail" <mlnospam@xxxxxxxxx>
> To: "Gluster-users" <gluster-users@xxxxxxxxxxx>
> Sent: Monday, February 15, 2016 4:41:56 PM
> Subject:  geo-rep: remote operation failed - No such file or    directory
> 
> Hello,
> 
> I noticed that the geo-replication of a volume has STATUS "Faulty" and while
> looking in the *.gluster.log file in
> /var/log/glusterfs/geo-replication-slaves/ on my slave I can see the
> following relevant problem:
> 
> [2016-02-15 10:58:40.402516] I [rpc-clnt.c:1847:rpc_clnt_reconfig]
> 0-myvolume-geo-client-0: changing port to 49152 (from 0)
> [2016-02-15 10:58:40.403928] I [MSGID: 114057]
> [client-handshake.c:1437:select_server_supported_programs]
> 0-myvolume-geo-client-0: Using Program GlusterFS 3.3, Num (1298437), Version
> (330)
> [2016-02-15 10:58:40.404130] I [MSGID: 114046]
> [client-handshake.c:1213:client_setvolume_cbk] 0-myvolume-geo-client-0:
> Connected to myvolume-geo-client-0, attached to remote volume
> '/data/myvolume-geo/brick'.
> [2016-02-15 10:58:40.404150] I [MSGID: 114047]
> [client-handshake.c:1224:client_setvolume_cbk] 0-myvolume-geo-client-0:
> Server and Client lk-version numbers are not same, reopening the fds
> [2016-02-15 10:58:40.410150] I [fuse-bridge.c:5137:fuse_graph_setup] 0-fuse:
> switched to graph 0
> [2016-02-15 10:58:40.410223] I [MSGID: 114035]
> [client-handshake.c:193:client_set_lk_version_cbk] 0-myvolume-geo-client-0:
> Server lk version = 1
> [2016-02-15 10:58:40.410370] I [fuse-bridge.c:4030:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.22 kernel
> 7.23
> [2016-02-15 10:58:45.662416] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
> 0-myvolume-geo-dht: renaming
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0.FpKL3SIUb9vKHyjd.part
> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-0
> (hash=myvolume-geo-client-0/cache=<nul>)
> [2016-02-15 10:58:45.665144] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
> 0-myvolume-geo-dht: renaming
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1.C6l0DEurb2y3Azw4.part
> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_03_Rosen.JPG-chunking-2242590604-1
> (hash=myvolume-geo-client-0/cache=<nul>)
> [2016-02-15 10:58:45.749829] I [MSGID: 109066] [dht-rename.c:1411:dht_rename]
> 0-myvolume-geo-dht: renaming
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
> (hash=myvolume-geo-client-0/cache=myvolume-geo-client-0) =>
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
> (hash=myvolume-geo-client-0/cache=<nul>)
> [2016-02-15 10:58:45.750225] W [MSGID: 114031]
> [client-rpc-fops.c:2971:client3_3_lookup_cbk] 0-myvolume-geo-client-0:
> remote operation failed. Path:
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
> (9164caeb-740d-4429-a3bd-c85f40c35e11) [No such file or directory]
> [2016-02-15 10:58:45.750418] W [fuse-bridge.c:1777:fuse_rename_cbk]
> 0-glusterfs-fuse: 60:
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0.ajEnSguUZ7EkzjzT.part
> ->
> /.gfid/94310944-7f8a-421d-a51f-1e23e28da9cc/Bild_02_Pilz.JPG-chunking-628343631-0
> => -1 (Device or resource busy)
> [2016-02-15 10:58:45.767788] I [fuse-bridge.c:4984:fuse_thread_proc] 0-fuse:
> unmounting /tmp/gsyncd-aux-mount-bZ9SMt
> [2016-02-15 10:58:45.768063] W [glusterfsd.c:1236:cleanup_and_exit]
> (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x80a4) [0x7feb610820a4]
> -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x7feb626f45b5]
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x59) [0x7feb626f4429] ) 0-:
> received signum (15), shutting down
> [2016-02-15 10:58:45.768093] I [fuse-bridge.c:5683:fini] 0-fuse: Unmounting
> '/tmp/gsyncd-aux-mount-bZ9SMt'.
> [2016-02-15 10:58:54.871855] I [dict.c:473:dict_get]
> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
> [0x7f8313dfb166]
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0x20)
> [0x7f8313dfb060]
> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_access [Invalid
> argument]
> [2016-02-15 10:58:54.871914] I [dict.c:473:dict_get]
> (-->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(posix_acl_setxattr_cbk+0x26)
> [0x7f8313dfb166]
> -->/usr/lib/x86_64-linux-gnu/glusterfs/3.7.6/xlator/system/posix-acl.so(handling_other_acl_related_xattr+0xb0)
> [0x7f8313dfb0f0]
> -->/usr/lib/x86_64-linux-gnu/libglusterfs.so.0(dict_get+0x93)
> [0x7f831f3f40c3] ) 0-dict: !this || key=system.posix_acl_default [Invalid
> argument]
> 
> This error repeats forever, always with the same files. I tried to stop
> and restart geo-rep on the master, but the problem remains and geo
> replication does not proceed. Does anyone have an idea how to fix this?
> 
> I am using GlusterFS 3.7.6 on Debian 8 with a two node replicate volume (1
> brick per node) and one single off-site slave node for geo-rep.
> 
> Regards
> ML
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users
> 