Hi,
I'm having a problem with geo-replication. A short summary...
About two months ago I added two further nodes to a distributed replicated volume. For that purpose I stopped the geo-replication, added two nodes on mvol and svol, and started a rebalance process on both sides. Once the rebalance process was finished I started the geo-replication again.
After a few days, and apart from some Unicode errors, the status of the newly added brick changed from hybrid crawl to history crawl. Since then there has been no progress: no files or directories were created on svol for a couple of days.
Looking for a possible reason, I noticed that there was no /var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1 directory on the newly added slave nodes.
Obviously I had forgotten to add the new svol node IP addresses to /etc/hosts on all masters. After fixing that I ran the '... execute gsec_create' and '...create push-pem force' commands again and the corresponding directories were created. Geo-replication started normally, all active sessions were in history crawl (as shown below), and for a short while some data was transferred to svol. But for about a week now nothing has changed on svol, 0 bytes transferred.
Meanwhile I have deleted (without reset-sync-time) and recreated the geo-rep session. The current status is as shown below, but without any last_synced date. An entry like "last_synced_entry": 1609283145 is still visible in /var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/*status and changelog files are continuously created in /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/<brick>/.processing.
A short time ago I changed log_level to DEBUG for a moment. Unfortunately I got an 'EOFError: Ran out of input' in gsyncd.log and the rebuild of .processing started from the beginning.
But one of the first, very long lines in gsyncd.log looks like:
[2021-03-03 11:59:39.503881] D [repce(worker /brick1/mvol1):215:__call__] RepceClient: call 9163:139944064358208:1614772779.4982471 history_getchanges -> ['/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history/.processing/CHANGELOG.1609280278',...
1609280278 corresponds to Tuesday, December 29, 2020 10:17:58 PM (UTC), which would roughly fit the last_synced date.
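The conversion can be double-checked with a few lines of Python. This is only a sketch: the helper name `epoch_to_utc` is made up for illustration, and it assumes that both the CHANGELOG suffix and the last_synced_entry value are Unix timestamps in seconds, read as UTC.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> str:
    """Format a Unix timestamp (seconds) as a UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # Suffix of the first changelog named in the history_getchanges log line.
    print(epoch_to_utc(1609280278))
    # last_synced_entry value seen in the session status file.
    print(epoch_to_utc(1609283145))
```

Under that assumption the first changelog handed to history crawl is roughly 48 minutes older than the recorded last_synced_entry.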
However, I have nearly 300k files in <brick>/.history/.processing, and from the log/trace it seems that every file in <brick>/.history/.processing will be processed and transferred to <brick>/.processing.
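To see whether that backlog is actually shrinking, one could summarize the file names in <brick>/.history/.processing from time to time. The following is a rough sketch, not part of gsyncd: the function name `summarize_changelogs` is hypothetical, and it assumes every relevant file is named CHANGELOG.<epoch> with the epoch in seconds.

```python
import re
from datetime import datetime, timezone

# Assumed naming scheme: CHANGELOG.<unix timestamp in seconds>
CHANGELOG_RE = re.compile(r"^CHANGELOG\.(\d+)$")


def _fmt(epoch_seconds):
    """Render a Unix timestamp as a UTC date string."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


def summarize_changelogs(names):
    """Return (count, oldest_utc, newest_utc) for CHANGELOG.<epoch> names.

    Names that do not match the assumed pattern are ignored.
    """
    stamps = []
    for name in names:
        match = CHANGELOG_RE.match(name)
        if match:
            stamps.append(int(match.group(1)))
    if not stamps:
        return 0, None, None
    return len(stamps), _fmt(min(stamps)), _fmt(max(stamps))
```

It could be fed with `os.listdir()` of the .history/.processing directory. Comparing the count and the oldest timestamp between two runs would give a hint whether history crawl is making progress or has restarted.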
My questions so far...
First of all, is everything still OK with this geo-replication?
Do I have to wait until all changelog files in <brick>/.history/.processing are processed before transfers to svol start?
What happens if any other error appears in geo-replication while these changelog files are being processed, i.e. while the crawl status is history crawl? Does the entire process start from the beginning? Would a checkpoint be helpful, also for future decisions?
Is there any suitable setting in the gluster environment which would influence the speed of the processing (current settings attached)?
I hope someone can help...
best regards
dietmar
[ 15:17:47 ] - root@gl-master-01 /var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/brick1-mvol1/.history $ls .processing/ | wc -l
294669
[ 12:56:31 ] - root@gl-master-01 ~ $gluster volume geo-replication mvol1 gl-slave-01-int::svol1 status
MASTER NODE         MASTER VOL    MASTER BRICK     SLAVE USER    SLAVE                     SLAVE NODE         STATUS     CRAWL STATUS     LAST_SYNCED
-------------------------------------------------------------------------------------------------------------------------------------------------------------
gl-master-01-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Active     History Crawl    2020-12-29 23:00:48
gl-master-01-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2020-12-29 23:05:45
gl-master-05-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-03-int    Active     History Crawl    2021-02-20 17:38:38
gl-master-06-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-05-int    Passive    N/A              N/A
gl-master-03-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-04-int    Active     History Crawl    2020-12-29 23:07:34
gl-master-04-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Active     History Crawl    2020-12-29 23:07:22
gl-master-04-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick1/mvol1    root          gl-slave-01-int::svol1    gl-slave-01-int    Passive    N/A              N/A
gl-master-02-int    mvol1         /brick2/mvol1    root          gl-slave-01-int::svol1    gl-slave-06-int    Passive    N/A              N/A
[ 13:14:47 ] - root@gl-master-01 ~ $
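For anyone who wants to track such a status listing over time, here is a small sketch that counts workers per STATUS and picks the active brick with the oldest LAST_SYNCED value. It is an illustration only: `summarize_status` is a made-up helper, and it assumes the whitespace-separated plain-text layout pasted above, which may differ between gluster versions.

```python
from collections import Counter


def summarize_status(lines):
    """Return ({status: count}, oldest) from plain-text status rows.

    oldest is (last_synced, master_node, master_brick) for the row with the
    earliest LAST_SYNCED, or None if no row carries a timestamp.
    """
    counts = Counter()
    oldest = None
    for line in lines:
        tokens = line.split()
        # Skip the header, the dashed rule, shell prompts and short lines.
        if len(tokens) < 9 or tokens[0] == "MASTER" or line.startswith(("-", "[")):
            continue
        counts[tokens[6]] += 1  # seventh column is STATUS
        if tokens[-1] != "N/A":
            # LAST_SYNCED is the final "date time" pair; this format sorts
            # correctly as a plain string.
            entry = (" ".join(tokens[-2:]), tokens[0], tokens[2])
            if oldest is None or entry < oldest:
                oldest = entry
    return dict(counts), oldest
```

Applied to the listing above it would report five Active and five Passive workers, with /brick1/mvol1 on gl-master-01-int as the brick lagging furthest behind.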
Option Value
------ -----
cluster.lookup-unhashed on
cluster.lookup-optimize on
cluster.min-free-disk 200GB
cluster.min-free-inodes 5%
cluster.rebalance-stats off
cluster.subvols-per-directory (null)
cluster.readdir-optimize off
cluster.rsync-hash-regex (null)
cluster.extra-hash-regex (null)
cluster.dht-xattr-name trusted.glusterfs.dht
cluster.randomize-hash-range-by-gfid off
cluster.rebal-throttle normal
cluster.lock-migration off
cluster.force-migration off
cluster.local-volume-name (null)
cluster.weighted-rebalance on
cluster.switch-pattern (null)
cluster.entry-change-log on
cluster.read-subvolume (null)
cluster.read-subvolume-index -1
cluster.read-hash-mode 1
cluster.background-self-heal-count 8
cluster.metadata-self-heal off
cluster.data-self-heal off
cluster.entry-self-heal off
cluster.self-heal-daemon on
cluster.heal-timeout 600
cluster.self-heal-window-size 1
cluster.data-change-log on
cluster.metadata-change-log on
cluster.data-self-heal-algorithm (null)
cluster.eager-lock on
disperse.eager-lock on
disperse.other-eager-lock on
disperse.eager-lock-timeout 1
disperse.other-eager-lock-timeout 1
cluster.quorum-type none
cluster.quorum-count (null)
cluster.choose-local true
cluster.self-heal-readdir-size 1KB
cluster.post-op-delay-secs 1
cluster.ensure-durability on
cluster.consistent-metadata no
cluster.heal-wait-queue-length 128
cluster.favorite-child-policy none
cluster.full-lock yes
cluster.optimistic-change-log on
diagnostics.latency-measurement off
diagnostics.dump-fd-stats off
diagnostics.count-fop-hits off
diagnostics.brick-log-level INFO
diagnostics.client-log-level ERROR
diagnostics.brick-sys-log-level CRITICAL
diagnostics.client-sys-log-level CRITICAL
diagnostics.brick-logger (null)
diagnostics.client-logger (null)
diagnostics.brick-log-format (null)
diagnostics.client-log-format (null)
diagnostics.brick-log-buf-size 5
diagnostics.client-log-buf-size 5
diagnostics.brick-log-flush-timeout 120
diagnostics.client-log-flush-timeout 120
diagnostics.stats-dump-interval 0
diagnostics.fop-sample-interval 0
diagnostics.stats-dump-format json
diagnostics.fop-sample-buf-size 65535
diagnostics.stats-dnscache-ttl-sec 86400
performance.cache-max-file-size 0
performance.cache-min-file-size 0
performance.cache-refresh-timeout 32
performance.cache-priority
performance.cache-size 16GB
performance.io-thread-count 64
performance.high-prio-threads 16
performance.normal-prio-threads 16
performance.low-prio-threads 16
performance.least-prio-threads 1
performance.enable-least-priority on
performance.iot-watchdog-secs (null)
performance.iot-cleanup-disconnected-reqs off
performance.iot-pass-through false
performance.io-cache-pass-through false
performance.cache-size 16GB
performance.qr-cache-timeout 1
performance.cache-invalidation false
performance.ctime-invalidation false
performance.flush-behind on
performance.nfs.flush-behind on
performance.write-behind-window-size 4MB
performance.resync-failed-syncs-after-fsync off
performance.nfs.write-behind-window-size 1MB
performance.strict-o-direct off
performance.nfs.strict-o-direct off
performance.strict-write-ordering off
performance.nfs.strict-write-ordering off
performance.write-behind-trickling-writes on
performance.aggregate-size 128KB
performance.nfs.write-behind-trickling-writes on
performance.lazy-open yes
performance.read-after-open yes
performance.open-behind-pass-through false
performance.read-ahead-page-count 4
performance.read-ahead-pass-through false
performance.readdir-ahead-pass-through false
performance.md-cache-pass-through false
performance.md-cache-timeout 600
performance.cache-swift-metadata true
performance.cache-samba-metadata false
performance.cache-capability-xattrs true
performance.cache-ima-xattrs true
performance.md-cache-statfs off
performance.xattr-cache-list
performance.nl-cache-pass-through false
network.frame-timeout 1800
network.ping-timeout 20
network.tcp-window-size (null)
client.ssl off
network.remote-dio disable
client.event-threads 4
client.tcp-user-timeout 0
client.keepalive-time 20
client.keepalive-interval 2
client.keepalive-count 9
network.tcp-window-size (null)
network.inode-lru-limit 200000
auth.allow *
auth.reject (null)
transport.keepalive 1
server.allow-insecure on
server.root-squash off
server.all-squash off
server.anonuid 65534
server.anongid 65534
server.statedump-path /var/run/gluster
server.outstanding-rpc-limit 64
server.ssl off
auth.ssl-allow *
server.manage-gids off
server.dynamic-auth on
client.send-gids on
server.gid-timeout 300
server.own-thread (null)
server.event-threads 4
server.tcp-user-timeout 42
server.keepalive-time 20
server.keepalive-interval 2
server.keepalive-count 9
transport.listen-backlog 1024
transport.address-family inet
performance.write-behind on
performance.read-ahead on
performance.readdir-ahead on
performance.io-cache on
performance.open-behind on
performance.quick-read on
performance.nl-cache off
performance.stat-prefetch off
performance.client-io-threads off
performance.nfs.write-behind on
performance.nfs.read-ahead on
performance.nfs.io-cache off
performance.nfs.quick-read on
performance.nfs.stat-prefetch off
performance.nfs.io-threads on
performance.force-readdirp true
performance.cache-invalidation false
performance.global-cache-invalidation true
features.uss off
features.snapshot-directory .snaps
features.show-snapshot-directory off
features.tag-namespaces off
network.compression off
network.compression.window-size -15
network.compression.mem-level 8
network.compression.min-size 0
network.compression.compression-level -1
network.compression.debug false
features.default-soft-limit 80%
features.soft-timeout 60
features.hard-timeout 5
features.alert-time 86400
features.quota-deem-statfs off
geo-replication.indexing on
geo-replication.indexing on
geo-replication.ignore-pid-check on
geo-replication.ignore-pid-check on
features.quota off
features.inode-quota off
features.bitrot disable
debug.trace off
debug.log-history no
debug.log-file no
debug.exclude-ops (null)
debug.include-ops (null)
debug.error-gen off
debug.error-failure (null)
debug.error-number (null)
debug.random-failure off
debug.error-fops (null)
nfs.disable on
features.read-only off
features.worm off
features.worm-file-level off
features.worm-files-deletable on
features.default-retention-period 120
features.retention-mode relax
features.auto-commit-period 180
storage.linux-aio off
storage.batch-fsync-mode reverse-fsync
storage.batch-fsync-delay-usec 0
storage.owner-uid -1
storage.owner-gid -1
storage.node-uuid-pathinfo off
storage.health-check-interval 30
storage.build-pgfid off
storage.gfid2path on
storage.gfid2path-separator :
storage.reserve 1
storage.reserve-size 0
storage.health-check-timeout 10
storage.fips-mode-rchecksum on
storage.force-create-mode 0000
storage.force-directory-mode 0000
storage.create-mask 0777
storage.create-directory-mask 0777
storage.max-hardlinks 100
features.ctime on
config.gfproxyd off
cluster.server-quorum-type off
cluster.server-quorum-ratio 51
changelog.changelog on
changelog.changelog-dir {{ brick.path }}/.glusterfs/changelogs
changelog.encoding ascii
changelog.rollover-time 15
changelog.fsync-interval 5
changelog.changelog-barrier-timeout 120
changelog.capture-del-path off
features.barrier disable
features.barrier-timeout 120
features.trash off
features.trash-dir .trashcan
features.trash-eliminate-path (null)
features.trash-max-filesize 5GB
features.trash-internal-op off
cluster.enable-shared-storage disable
locks.trace off
locks.mandatory-locking off
cluster.disperse-self-heal-daemon enable
cluster.quorum-reads no
client.bind-insecure (null)
features.timeout 45
features.failover-hosts (null)
features.shard off
features.shard-block-size 64MB
features.shard-lru-limit 16384
features.shard-deletion-rate 100
features.scrub-throttle lazy
features.scrub-freq biweekly
features.scrub false
features.expiry-time 120
features.cache-invalidation on
features.cache-invalidation-timeout 600
features.leases off
features.lease-lock-recall-timeout 60
disperse.background-heals 8
disperse.heal-wait-qlength 128
cluster.heal-timeout 600
dht.force-readdirp on
disperse.read-policy gfid-hash
cluster.shd-max-threads 1
cluster.shd-wait-qlength 1024
cluster.locking-scheme full
cluster.granular-entry-heal no
features.locks-revocation-secs 0
features.locks-revocation-clear-all false
features.locks-revocation-max-blocked 0
features.locks-monkey-unlocking false
features.locks-notify-contention no
features.locks-notify-contention-delay 5
disperse.shd-max-threads 1
disperse.shd-wait-qlength 1024
disperse.cpu-extensions auto
disperse.self-heal-window-size 1
cluster.use-compound-fops off
performance.parallel-readdir on
performance.rda-request-size 131072
performance.rda-low-wmark 4096
performance.rda-high-wmark 128KB
performance.rda-cache-limit 10MB
performance.nl-cache-positive-entry false
performance.nl-cache-limit 10MB
performance.nl-cache-timeout 600
cluster.brick-multiplex disable
glusterd.vol_count_per_thread 100
cluster.max-bricks-per-process 250
disperse.optimistic-change-log on
disperse.stripe-cache 4
cluster.halo-enabled False
cluster.halo-shd-max-latency 99999
cluster.halo-nfsd-max-latency 5
cluster.halo-max-latency 5
cluster.halo-max-replicas 99999
cluster.halo-min-replicas 2
features.selinux on
cluster.daemon-log-level INFO
debug.delay-gen off
delay-gen.delay-percentage 10%
delay-gen.delay-duration 100000
delay-gen.enable
disperse.parallel-writes on
features.sdfs off
features.cloudsync off
features.ctime on
ctime.noatime on
features.cloudsync-storetype (null)
features.enforce-mandatory-lock off
config.global-threading off
config.client-threads 16
config.brick-threads 16
features.cloudsync-remote-read off
features.cloudsync-store-id (null)
features.cloudsync-product-id (null)
Volume Name: mvol1
Type: Distributed-Replicate
Volume ID: 2f5de6e4-66de-40a7-9f24-4762aad3ca96
Status: Started
Snapshot Count: 0
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: gl-master-01-int:/brick1/mvol1
Brick2: gl-master-02-int:/brick1/mvol1
Brick3: gl-master-03-int:/brick1/mvol1
Brick4: gl-master-04-int:/brick1/mvol1
Brick5: gl-master-01-int:/brick2/mvol1
Brick6: gl-master-02-int:/brick2/mvol1
Brick7: gl-master-03-int:/brick2/mvol1
Brick8: gl-master-04-int:/brick2/mvol1
Brick9: gl-master-05-int:/brick1/mvol1
Brick10: gl-master-06-int:/brick1/mvol1
Options Reconfigured:
performance.parallel-readdir: on
performance.readdir-ahead: on
storage.fips-mode-rchecksum: on
performance.stat-prefetch: off
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
performance.nl-cache: off
performance.nl-cache-timeout: 600
client.event-threads: 4
server.event-threads: 4
performance.write-behind-window-size: 4MB
performance.nfs.io-threads: on
performance.nfs.quick-read: on
performance.nfs.read-ahead: on
transport.address-family: inet
features.trash-max-filesize: 5GB
features.trash: off
performance.cache-size: 16GB
performance.io-thread-count: 64
network.ping-timeout: 20
cluster.min-free-disk: 200GB
performance.cache-refresh-timeout: 32
changelog.changelog: on
diagnostics.client-log-level: ERROR
nfs.disable: on
geo-replication.indexing: on
geo-replication.ignore-pid-check: on
access_mount:true
allow_network:
change_detector:changelog
change_interval:5
changelog_archive_format:%Y%m
changelog_batch_size:727040
changelog_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/changes-${local_id}.log
changelog_log_level:INFO
checkpoint:0
cli_log_file:/var/log/glusterfs/geo-replication/cli.log
cli_log_level:INFO
connection_timeout:60
georep_session_working_dir:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/
gfid_conflict_resolution:true
gluster_cli_options:
gluster_command:gluster
gluster_command_dir:/usr/sbin
gluster_log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/mnt-${local_id}.log
gluster_log_level:INFO
gluster_logdir:/var/log/glusterfs
gluster_params:aux-gfid-mount acl
gluster_rundir:/var/run/gluster
glusterd_workdir:/var/lib/glusterd
gsyncd_miscdir:/var/lib/misc/gluster/gsyncd
ignore_deletes:false
isolated_slaves:
log_file:/var/log/glusterfs/geo-replication/mvol1_gl-slave-01-int_svol1/gsyncd.log
log_level:INFO
log_rsync_performance:false
master_disperse_count:1
master_distribution_count:2
master_replica_count:1
max_rsync_retries:10
meta_volume_mnt:/var/run/gluster/shared_storage
pid_file:/var/run/gluster/gsyncd-mvol1-gl-slave-01-int-svol1.pid
remote_gsyncd:
replica_failover_interval:1
rsync_command:rsync
rsync_opt_existing:true
rsync_opt_ignore_missing_args:true
rsync_options:
rsync_ssh_options:
slave_access_mount:false
slave_gluster_command_dir:/usr/sbin
slave_gluster_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-${master_node}-${master_brick_id}.log
slave_gluster_log_file_mbr:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/mnt-mbr-${master_node}-${master_brick_id}.log
slave_gluster_log_level:INFO
slave_gluster_params:aux-gfid-mount acl
slave_log_file:/var/log/glusterfs/geo-replication-slaves/mvol1_gl-slave-01-int_svol1/gsyncd.log
slave_log_level:INFO
slave_timeout:120
special_sync_mode:
ssh_command:ssh
ssh_options:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem
ssh_options_tar:-oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/tar_ssh.pem
ssh_port:22
state_file:/var/lib/glusterd/geo-replication/mvol1_gl-slave-01-int_svol1/monitor.status
state_socket_unencoded:
stime_xattr_prefix:trusted.glusterfs.2f5de6e4-66de-40a7-9f24-4762aad3ca96.256628ab-57c2-44a4-9367-59e1939ade64
sync_acls:true
sync_jobs:3
sync_method:rsync
sync_xattrs:true
tar_command:tar
use_meta_volume:true
use_rsync_xattrs:false
working_dir:/var/lib/misc/gluster/gsyncd/mvol1_gl-slave-01-int_svol1/