Hi Shwetha,
thank you for your reply...
I ran a few tests in debug mode but found no real indication of the problem. After each start of the geo-replication some files are transferred at the beginning, and then no further transfers happen.
A few minutes after the start, the number of changelog files in <brick> looks like this (a sketch of such a counting script follows the listing):
[ 06:42:26 ] - root@gl-master-02 ~/tmp
$./var_gluster.sh
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processing
: 27 ### growing the whole time
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history
: 324861
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processed
: 1
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing
: 324859 ### finished building changelog files
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.current
: 0
/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/xsync
: 0
---
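For reference, var_gluster.sh itself is not shown above; a minimal sketch of what such a counting script might look like (the working-directory path is the one from the listing, everything else is an assumption):

#!/bin/bash
# count the changelog files in each gsyncd working sub-directory
# (path taken from the listing above; adjust per brick)
BASE=/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1
for d in .processed .processing .history .history/.processed .history/.processing .history/.current .current xsync; do
    printf '%s\n: %s\n' "$BASE/$d" "$(find "$BASE/$d" -type f 2>/dev/null | wc -l)"
done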
As far as I remember, at the very beginning I saw a few changelog files in <brick>/.processed for a short moment, but always with size 0. Even after some hours there are no files in <brick>/.processed.
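For reference, something like this can be used to keep an eye on that directory (path is one of the worker directories from the listing above):

# re-check the number of archived changelogs every minute
watch -n 60 'ls /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed | wc -l'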
In strace there are a lot of messages like 'failed: No data available' and 'rsync error: some files/attrs were not transferred ... (code 23)' for roughly the first 5-10 minutes after the geo-rep start.
For example, gfid 8d601e5b-180c.... :
19361 1615530800.812727 select(7, [6], [],
[], NULL <unfinished ...>
19357 1615530800.812779 select(0, NULL, NULL, NULL, {tv_sec=0,
tv_usec=235797} <unfinished ...>
19352 1615530800.816522
lstat(".gfid/f0ed7d0e-83be-4c3f-b2c8-f763e9aada12",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.817723
lstat(".gfid/c5b44852-9cf9-441b-8766-d87bfac774c8",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.819507
lstat(".gfid/b0b71bcc-7653-4ab8-b863-a83d395f5e91",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.821106
lstat(".gfid/c4e80ff5-2e08-4e68-9a4d-ea7f45ed290d",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.822874
lstat(".gfid/aaa468d8-9d6a-4aaf-8344-c57440286f5c",
<unfinished ...>
19418 1615530800.823466 <... restart_syscall resumed> ) =
1
19418 1615530800.823519 read(14, "rsync: get_xattr_data:
lgetxattr(\"\"/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234\"\",\"trusted.glusterfs.mdata\",0)
failed: No data available (61)\n", 32768) = 171
19418 1615530800.823587 poll([{fd=14, events=POLLIN}], 1, -1
<unfinished ...>
19352 1615530800.823830 <... lstat resumed>
{st_mode=S_IFREG|0644, st_size=4226767, ...}) = 0
19352 1615530800.823882
lstat(".gfid/8164ea3b-44f6-4ea2-a75f-501cea0024cc",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.897938
lstat(".gfid/01da73ae-1f88-498d-8fe5-84ea76db12f3",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.934281
lstat(".gfid/be48f891-cdc1-4e4c-a141-7001ae3f592e",
{st_mode=S_IFDIR|0777, st_size=4096, ...}) = 0
19352 1615530800.935938
lstat(".gfid/501fab77-5e83-42cb-9edf-ce30bc3a86a9",
{st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
19352 1615530800.937481
lstat(".gfid/668f6bd1-bdb0-46e0-9cd4-c7ebc38fbaf9",
<unfinished ...>
19417 1615530800.961937 <... restart_syscall resumed> ) =
1
19417 1615530800.962042 read(13, "rsync error: some files/attrs
were not transferred (see previous errors) (code 23) at
main.c(1196) [sender=3.1.2]\n", 32768) = 114
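Side note on what that message means: errno 61 on Linux is ENODATA ("No data available"), i.e. lgetxattr() simply found no such extended attribute on the file. A quick way to confirm the mapping:

# errno 61 on Linux is ENODATA; for lgetxattr() it just means the xattr is not present
python3 -c 'import errno, os; print(errno.errorcode[61], "->", os.strerror(61))'
# ENODATA -> No data available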
According to gsyncd.log this gfid is a candidate for syncing, like many others; only very few others are actually 'synced':
[2021-03-12 06:33:20.651147] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for
syncing file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:17.419920] D [master(worker
/brick1/mvol1):318:a_syncdata] _GMaster: candidate for
syncing file=.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
[2021-03-12 06:35:03.382977] D
[master(worker /brick1/mvol1):324:regjob] _GMaster: synced
file=.gfid/a3656075-784c-4377-a482-4aad8378acf0
When I try to fetch some attributes I get 'No such attribute' for the mentioned gfid, while for a synced gfid the attributes are available:
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234: trusted.glusterfs.mdata: No such attribute
[ 09:59:38 ] - root@gl-master-02 ~/tmp $getfattr -m . -d -e hex /tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
getfattr: Removing leading '/' from absolute path names
# file: tmp/gsyncd-aux-mount-46pc26b9/.gfid/a3656075-784c-4377-a482-4aad8378acf0
trusted.glusterfs.mdata=0x010000000000000000000000005d1f73ff000000000da35668000000005d1f73fd0000000015811b46000000005f29050b000000001d5363a6
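To get an idea of how widespread this is, a rough sketch along these lines could check every 'candidate for syncing' gfid from gsyncd.log for the mdata xattr (the aux-mount path is the one from above; the gsyncd.log location is an assumption and may differ per setup):

#!/bin/bash
# check each "candidate for syncing" gfid from gsyncd.log for trusted.glusterfs.mdata
MNT=/tmp/gsyncd-aux-mount-46pc26b9                      # aux mount from above
LOG=/var/log/glusterfs/geo-replication/*/gsyncd.log     # assumed default log location
grep -ho 'file=\.gfid/[0-9a-f-]*' $LOG | cut -d/ -f2 | sort -u | while read -r gfid; do
    if getfattr -n trusted.glusterfs.mdata "$MNT/.gfid/$gfid" >/dev/null 2>&1; then
        echo "mdata present: $gfid"
    else
        echo "mdata MISSING: $gfid"
    fi
done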
I can do a stat on that directory, for example:
[ 10:07:19 ] - root@gl-master-02 ~/tmp
$stat
/tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  File: /tmp/gsyncd-aux-mount-46pc26b9/.gfid/8d601e5b-180c-46c8-b64f-ae6224542234
  Size: 4096          Blocks: 8          IO Block: 131072   directory
Device: 37h/55d       Inode: 9394601311212820456   Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-01-17 07:26:09.596743288 +0000
Modify: 2021-03-12 07:34:39.382122663 +0000
Change: 2021-03-12 07:34:39.383446790 +0000
Birth: -
Currently I have no clue how to get this fixed. Obviously the extended attributes are missing for most entries in /tmp/gsyncd-aux-mount.../ while some others exist.
I believe rsync is not the root cause; rather it fails because of the missing attributes?!
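That reading fits rsync's exit code: 23 is documented in the rsync man page (EXIT VALUES) as "partial transfer due to error", so rsync is most likely just reporting the failed lgetxattr() calls. If needed, the geo-rep session config (including any rsync-related options gsyncd passes on) can be dumped; the volume and secondary names below are derived from the working-directory path earlier in this mail:

# show the full geo-rep session configuration
gluster volume geo-replication mvol1 gl-slave-01-int::svol1 config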
Lastly, the gfid points to a directory; when I try to get the attributes for this directory via the brick path, it succeeds:
[ 10:15:25 ] - root@gl-master-02 ~/tmp $ls -l /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234
lrwxrwxrwx 1 root root 56 Nov 5 17:54 /brick1/mvol1/.glusterfs/8d/60/8d601e5b-180c-46c8-b64f-ae6224542234 -> ../../f0/94/f094bf06-2806-4f90-9a79-489827c6cdf9/2217547
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11/2217547
trusted.gfid=0x8d601e5b180c46c8b64fae6224542234
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed359000000007a2d37c1a8b9af89
trusted.glusterfs.dht.mds=0x00000000
[ 10:38:42 ] - root@gl-master-02 ~ $getfattr -m . -d -e hex /brick1/mvol1/2137/files/20/11
getfattr: Removing leading '/' from absolute path names
# file: brick1/mvol1/2137/files/20/11
trusted.gfid=0xf094bf0628064f909a79489827c6cdf9
trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.xtime=0x604b198f0005e528
trusted.glusterfs.dht=0x001ed35900000000d1738834ffffffff
trusted.glusterfs.mdata=0x010000000000000000000000005fc5378000000000077ba08a000000005fc535b60000000038d942cc000000005f9ffc610000000007b08744
[ 10:39:54 ] - root@gl-master-02 ~ $
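For completeness, the brick-side path for a directory gfid can be resolved via its .glusterfs symlink and checked like this (a sketch, using the brick and gfid from above):

# resolve the directory gfid to its real path on the brick and dump its xattrs
BRICK=/brick1/mvol1
GFID=8d601e5b-180c-46c8-b64f-ae6224542234
REAL=$(readlink -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID")
echo "$REAL"
getfattr -m . -d -e hex "$REAL"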
But the geo-rep ends up in a loop, albeit without any 'E' (error) messages:
[2021-03-12 10:46:40.572500] D
[repce(worker /brick1/mvol1):215:__call__] RepceClient: call
19352:140387951818496:1615546000.5609794 keep_alive -> 256
[2021-03-12 10:46:41.23154] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took
5.017846 seconds
[2021-03-12 10:46:41.35729] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:46.41012] D [master(worker
/brick2/mvol1):554:crawlwrap] _GMaster: ... crawl #0 done, took
5.017512 seconds
[2021-03-12 10:46:46.53818] D [master(worker
/brick2/mvol1):578:crawlwrap] _GMaster: Crawl info
cluster_stime=(1609281098, 0) brick_stime=(1609281900, 0)
[2021-03-12 10:46:48.269174] D [repce(worker
/brick2/mvol1):195:push] RepceClient: call
19354:140476158043904:1615546008.2690222 keep_alive({'version':
(1, 0), 'uuid': '2f5de6e4-66de-40a7-9f24-4762aad3ca96',
'retval': 0, 'volume_mark': (1609275788, 819193), 'timeout':
1615546128},) ...
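The cluster_stime/brick_stime values in that loop are plain epoch seconds, so they can be decoded to see where the session is stuck (values taken from the log lines above):

# decode the stimes from the crawl loop
date -u -d @1609281098   # cluster_stime -> Tue Dec 29 22:31:38 UTC 2020
date -u -d @1609281900   # brick_stime   -> Tue Dec 29 22:45:00 UTC 2020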
Does anyone have an idea how to solve this problem?
best regards,
Dietmar
Hi Dietmar,

> batch-fsync-delay-usec was already set to 0 and I increased the sync_jobs from 3 to 6. The moment I increased the sync_jobs, the following error appeared in gsyncd.log:
>
> [2021-03-03 23:17:46.59727] E [syncdutils(worker /brick1/mvol1):312:log_raise_exception] <top>: connection to peer is broken
> [2021-03-03 23:17:46.59912] E [syncdutils(worker /brick2/mvol1):312:log_raise_exception] <top>: connection to peer is broken

If the geo-rep session is currently not in a Faulty state, we should not be bothered about this log message. It is normal: when the config is updated, a geo-rep restart occurs and the above message pops up.
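A quick way to confirm the session came back as Active/Passive rather than Faulty after the config change (volume and secondary names taken from the working-directory path earlier in the thread):

# show the state of each worker of the geo-rep session
gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status detail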
> Passive nodes became active and the content in <brick>/.processing was removed; currently new changelog files are created in this directory.
>
> Shortly before I changed the sync_jobs I had checked the <brick>/.processing directory on the master nodes; the result was the same on every master node.
>
> Since the last error about 12 hours ago nearly 2400 changelog files were created on each node, but it looks like none of them were consumed.
Processed changelogs that have been synced are archived under the <brick>/.processed directory. Verify whether the latest file is created there.

> At the moment I'm not sure what is right and what is wrong... should at least the oldest changelog files in this directory have been processed gradually?
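Concretely, that check could look something like this (working-directory path taken from earlier in the thread):

# list the most recently archived changelogs, newest first
ls -lt /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.processed | head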
Also, you can try to set the log-level to debug for a while and set it back to info (to avoid flooding the logs), then check the logs to get a better picture of the scenario:

# gluster volume geo-replication <primary> <ip>::<secondary> config log-level DEBUG
# gluster volume geo-replication <primary> <ip>::<secondary> config log-level INFO
Regards,
Shwetha