Re: no progress in geo-replication

Hi Shwetha,

thank you for your reply...

I ran a few tests in debug mode and found no real indication of the problem. After each start of geo-replication a few files are transferred at the beginning, but after that no further transfers happen.

A few minutes after the start, the number of changelog files in <brick> looks like this:

[ 06:42:26 ] - root@gl-master-02  ~/tmp $./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed : 0                          
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing : 27                         ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history : 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed : 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing : 324859            ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current : 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current : 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync : 0
---
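
For reference, var_gluster.sh is just a small helper that counts the changelog files per gsyncd state directory, roughly along these lines (the working-directory path matches the output above):

#!/bin/bash
# Rough sketch: count changelog files per gsyncd state directory.
WD=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed \
         .history/.processing .history/.current .current xsync; do
    printf '%s : %s\n' "$WD/$d" "$(find "$WD/$d" -maxdepth 1 -type f | wc -l)"
done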

As far as I remember, at the beginning I saw a few changelog files in <brick>/.processed for a short moment, but always with size 0. Even after some hours there are no files in <brick>/.processed.

In strace there are a lot of messages like 'failed: No data available' and 'rsync error: some files/attrs were not transferred ... (code 23)' for about the first 5-10 minutes after geo-rep start.
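
The trace was captured by attaching to the gsyncd worker and following its forked children (rsync runs as a child process), roughly like this (PID is a placeholder):

# -f follows forks, -ttt prints epoch timestamps as in the output below
strace -f -ttt -s 256 -p <worker-pid> -o /tmp/gsyncd.strace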

For example, gfid 8d601e5b-180c....:

19361 1615530800.812727 select(7, [6], [], [], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0, tv_usec=235797} <unfinished ...>
19352 1615530800.816522 lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723 lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507 lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106 lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874 lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c",  <unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) = 1
19418 1615530800.823519 read(14, "rsync: get_xattr_data: lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0) failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1 <unfinished ...>
19352 1615530800.823830 <... lstat resumed> {st_mode=S_IFREG|0644, st_size=4226767, ...}) = 0
19352 1615530800.823882 lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938 lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281 lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e", {st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938 lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481 lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9",  <unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) = 1
19417 1615530800.962042 read(13, "rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1196) [sender=3.1.2]\n", 32768) = 114

According to gsyncd.log this gfid is a candidate for syncing, like many others; only very few files are reported as 'synced':

[2021-03-12 06:33:20.651147] D [master(worker /brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing    file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker /brick1/mvol1):318:a_syncdata] _GMaster: candidate for syncing    file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234

[2021-03-12 06:35:03.382977] D [master(worker /brick1/mvol1):324:regjob] _GMaster: synced    file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
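
The imbalance can be counted quickly (the log path below is an assumption based on the usual session-directory layout; adjust to your setup):

# Count 'candidate for syncing' vs. 'synced' entries in the worker log
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
grep -c 'candidate for syncing' "$LOG"
grep -c 'regjob] _GMaster: synced' "$LOG"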

When I try to fetch extended attributes, I get 'No such attribute' for the mentioned gfid, while for a synced gfid the attributes are available:

[ 09:58:54 ] - root@gl-master-02  ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234: trusted.glusterfs.mdata: No such attribute

[ 09:59:38 ] - root@gl-master-02  ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file: tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
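
To get an overview on the mount side, the candidate gfids from the log can be tested for the mdata xattr; a rough sketch (mount path from this session, log path again assumed from the usual session-directory layout):

# List candidate gfids whose trusted.glusterfs.mdata is missing on the aux mount
MNT=/tmp/gsyncd-aux-mount-46pc26b9
LOG=/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
grep -o '\.gfid/[0-9a-f-]*' "$LOG" | sort -u | while read -r g; do
    getfattr -n trusted.glusterfs.mdata "$MNT/$g" >/dev/null 2>&1 \
        || echo "missing mdata: $g"
done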

I can run a stat on that directory, for example:

[ 10:07:19 ] - root@gl-master-02  ~/tmp $stat /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  File: /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  Size: 4096          Blocks: 8          IO Block: 131072 directory
Device: 37h/55d    Inode: 9394601311212820456  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
 Birth: -

Currently I have no clue how to get this fixed. Obviously the extended attributes are missing for most entries in /tmp/gsyncd-aux-mount.../ while some others have them.

I believe rsync is not the cause; it rather fails because of the missing attributes?!

Lastly, the gfid points to a directory. When I try to get the attributes for this directory via the brick path, it succeeds:


[ 10:15:25 ] - root@gl-master-02  ~/tmp $ls -l /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov  5 17:54 /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 -> ../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547


[ 10:38:01 ] - root@gl-master-02  ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000

[ 10:38:42 ] - root@gl-master-02  ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744

[ 10:39:54 ] - root@gl-master-02  ~ $
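
To see how widespread the missing xattr is, the same check can be run directly against the brick; a rough sketch (skipping the .glusterfs tree):

# Survey the brick for entries without trusted.glusterfs.mdata
find /brick1/mvol1 -name .glusterfs -prune -o -print | while read -r f; do
    getfattr -n trusted.glusterfs.mdata "$f" >/dev/null 2>&1 \
        || echo "no mdata: $f"
done | head -20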


But geo-rep ended up in a loop, without any 'E' (error) log entries:

[2021-03-12 10:46:40.572500] D [repce(worker /brick1/mvol1):215:__call__] RepceClient: call 19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker /brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017846 seconds
[2021-03-12 10:46:41.35729] D [master(worker /brick2/mvol1):578:crawlwrap] _GMaster: Crawl info    cluster_stime=(1609281098, 0)    brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker /brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took 5.017512 seconds
[2021-03-12 10:46:46.53818] D [master(worker /brick2/mvol1):578:crawlwrap] _GMaster: Crawl info    cluster_stime=(1609281098, 0)    brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker /brick2/mvol1):195:push] RepceClient: call 19354:140476158043904:1615546008.2690222 keep_alive({'version': (1, 0), 'uuid': '2f5de6e4-66de-40a7-9f24-4762aad3ca96', 'retval': 0, 'volume_mark': (1609275788, 819193), 'timeout': 1615546128},) ...
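
For completeness, the overall session state can be inspected with (volume and host names as used in this setup):

gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail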


Does anyone have an idea how to solve this problem?

Best regards,

Dietmar


On 04.03.21 08:48, Shwetha Acharya wrote:
Hi Dietmar,

> batch-fsync-delay-usec was already set to 0, and I increased sync_jobs from 3 to 6. The moment I increased sync_jobs, the following error appeared in gsyncd.log:

> [2021-03-03 23:17:46.59727] E [syncdutils(worker /brick1/mvol1):312:log_raise_exception] <top>: connection to peer is broken
> [2021-03-03 23:17:46.59912] E [syncdutils(worker /brick2/mvol1):312:log_raise_exception] <top>: connection to peer is broken

If the geo-rep session is currently not in a faulty state, we should not be bothered by this log message. It is normal: when the config is updated, a geo-rep restart occurs and the above message pops up.

> Passive nodes became active and the content in <brick>/.processing was removed; currently new changelog files are being created in this directory. Shortly before I changed sync_jobs I had checked the <brick>/.processing directory on the master nodes; the result was the same on every master node.

> Since the last error about 12 hours ago, nearly 2400 changelog files have been created on each node, but it looks like none of them were consumed.

Processed changelogs that are synced are archived under the <brick>/.processed directory. Verify whether the latest file is created there.
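
For example, to see the newest archived changelog for one brick (working-directory path taken from your earlier output):

ls -lt /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed | head -3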

> At the moment I'm not sure what is right and what is wrong... should at least the oldest changelog files in this directory have been processed gradually?

Also, you can try to set the log level to debug for a while and then set it back to info (to avoid flooding the logs), and check the logs to get a better picture of the scenario:
#gluster volume geo-replication <primary> <ip>::<secondary> config log-level DEBUG
#gluster volume geo-replication <primary> <ip>::<secondary> config log-level INFO

Regards,
Shwetha


  
