Re: Geo-replication: Entry not present on master. Fixing gfid mismatch in slave


 



Hi Sankarshan,

Thanks for that. So what should we look for to figure out what this process is doing? In /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log we see something like the following logged regularly:


[2020-06-09 02:01:19.670595] D [master(worker /nodirectwritedata/gluster/gvol0):1454:changelogs_batch_process] _GMaster: processing changes     batch=['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040', '/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055']
[2020-06-09 02:01:19.674927] D [master(worker /nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing change       changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040
[2020-06-09 02:01:19.683098] D [master(worker /nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: entries: []
[2020-06-09 02:01:19.695125] D [master(worker /nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', '.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', '.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0', '.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435'])
[2020-06-09 02:01:19.695344] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17
[2020-06-09 02:01:19.695508] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77
[2020-06-09 02:01:19.695638] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0
[2020-06-09 02:01:19.695759] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435
[2020-06-09 02:01:19.695883] D [master(worker /nodirectwritedata/gluster/gvol0):1289:process] _GMaster: processing change       changelog=/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055
[2020-06-09 02:01:19.696170] D [master(worker /nodirectwritedata/gluster/gvol0):1170:process_change] _GMaster: entries: []
[2020-06-09 02:01:19.714097] D [master(worker /nodirectwritedata/gluster/gvol0):312:a_syncdata] _GMaster: files files=set(['.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17', '.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77', '.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435'])
[2020-06-09 02:01:19.714286] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17
[2020-06-09 02:01:19.714433] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77
[2020-06-09 02:01:19.714577] D [master(worker /nodirectwritedata/gluster/gvol0):315:a_syncdata] _GMaster: candidate for syncing file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435
[2020-06-09 02:01:20.179656] D [resource(worker /nodirectwritedata/gluster/gvol0):1419:rsync] SSH: files: .gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, .gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, .gfid/779cd2b3-1571-446a-8903-48d6183d3dd0, .gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435, .gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17, .gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77, .gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435
[2020-06-09 02:01:20.738632] I [master(worker /nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken   duration=0.5588 num_files=7     job=2   return_code=0
[2020-06-09 02:01:20.739650] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17
[2020-06-09 02:01:20.740041] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77
[2020-06-09 02:01:20.740200] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/779cd2b3-1571-446a-8903-48d6183d3dd0
[2020-06-09 02:01:20.740343] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435
[2020-06-09 02:01:20.740482] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/0f98f9cd-1800-4c0f-b449-edcd7446bf17
[2020-06-09 02:01:20.740616] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/512b4710-5af7-4e5a-8f3a-0a3dece42f77
[2020-06-09 02:01:20.740741] D [master(worker /nodirectwritedata/gluster/gvol0):321:regjob] _GMaster: synced    file=.gfid/8ae32eec-f766-4cd9-a788-4561ba1fa435
[2020-06-09 02:01:20.745385] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 22499:140085349934848:1591668080.75 done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668040',) ...
[2020-06-09 02:01:20.746252] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 22499:140085349934848:1591668080.75 done -> None
[2020-06-09 02:01:20.746515] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 22499:140085349934848:1591668080.75 done('/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668055',) ...
[2020-06-09 02:01:20.747045] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 22499:140085349934848:1591668080.75 done -> None
[2020-06-09 02:01:31.364780] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:31.458915] I [gsyncd(status):308:main] <top>: Using session config file       path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:32.156028] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:32.250082] I [gsyncd(status):308:main] <top>: Using session config file       path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:33.778866] I [master(worker /nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time Taken        MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0   CRE=0   duration=0.0000 UNL=0
[2020-06-09 02:01:33.779000] I [master(worker /nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata Time Taken        SETA=0  SETX=0  meta_duration=0.0000    data_duration=14.0839   DATA="" XATT=0
[2020-06-09 02:01:33.779195] I [master(worker /nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch Completed changelog_end=1591668055        entry_stime=(1591668024, 0)     changelog_start=1591668040      stime=(1591668054, 0)   duration=14.1043        num_changelogs=2        mode=live_changelog
[2020-06-09 02:01:35.337511] I [gsyncd(config-get):308:main] <top>: Using session config file   path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:35.432674] I [gsyncd(status):308:main] <top>: Using session config file       path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
[2020-06-09 02:01:38.784449] D [master(worker /nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl #18195 done, took 32.533401 seconds
[2020-06-09 02:01:38.790709] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 22499:140085349934848:1591668098.79 scan() ...
[2020-06-09 02:01:38.791342] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 22499:140085349934848:1591668098.79 scan -> None
[2020-06-09 02:01:38.791479] D [repce(worker /nodirectwritedata/gluster/gvol0):196:push] RepceClient: call 22499:140085349934848:1591668098.79 getchanges() ...
[2020-06-09 02:01:38.791989] D [repce(worker /nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call 22499:140085349934848:1591668098.79 getchanges -> ['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668069', '/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668083', '/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591668097']
[2020-06-09 02:01:38.792114] I [master(worker /nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time      stime=(1591668054, 0)

Something very similar is logged around every 32 seconds, and I notice the log above says the crawl took 32 seconds.
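
In case it helps narrow this down, here is a rough way to confirm that cadence and to see which gfids keep turning up in every batch. This is only a sketch using grep/awk against the same gsyncd.log (untested; adjust paths as needed):

LOG=/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log
# Last 20 crawl durations, in seconds, from the "crawl #N done, took X seconds" lines:
grep 'crawl #' "$LOG" | awk '{print $(NF-1)}' | tail -20
# Count how often each gfid is picked up as a "candidate for syncing":
grep 'candidate for syncing' "$LOG" | awk -F'file=' '{print $2}' | sort | uniq -c | sort -rn | head

If the same handful of gfids sits at the top of that list, they are presumably the call recordings that are still open for writing, so rsync re-sends them on every changelog batch.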


On Mon, 8 Jun 2020 at 16:22, sankarshan <sankarshan@xxxxxxxxx> wrote:
Reading through the thread, it occurs to me that a stronger approach
would be to understand the workload (a general description of the
application) and, given the GlusterFS releases in use, assess whether
there are new issues to be addressed or whether the existing sets of
patches work. Thanks for setting the log level to debug. High CPU
usage by a geo-replication process would need to be traced back to why
it really requires that percentage of CPU if it was not doing so
previously.
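
For reference, the release and session details that feed into that assessment can be pulled with the stock CLI (volume and slave names as they appear in the logs above; a sketch, not output from this cluster):

gluster --version
gluster volume geo-replication gvol0 nvfs10::gvol0 status detail
gluster volume geo-replication gvol0 nvfs10::gvol0 config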

On Mon, 8 Jun 2020 at 05:29, David Cunningham <dcunningham@xxxxxxxxxxxxx> wrote:
>
> Hi Strahil,
>
> The CPU is still quite high, with "top" regularly showing 100% CPU usage by that process. However it's not clear whether this is really a problem, or if it's just normal geo-replication activity. While CPU usage was not previously as high on this server, it's not clear whether GlusterFS might have made this the "push" node for geo-replication when it used to be some other server. Having said that, no other server shows a sudden drop in CPU usage.
>
> Ideally we could find out what the gsyncd process is doing, and therefore whether it's expected or not. Any ideas on that? We did set the log level to DEBUG as Sunny suggested.
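
One way to see what that worker is doing at a given instant is to dump its live stack. A sketch only: py-spy is not part of GlusterFS and would have to be installed separately, and strace is the more commonly available fallback:

# Pick one gsyncd worker PID (there is one worker per brick on the node):
PID=$(pgrep -f 'gsyncd.py worker' | head -1)
# Python-level stack of the running worker (assumes py-spy is installed):
py-spy dump --pid "$PID"
# Syscall summary, printed when interrupted with Ctrl-C:
strace -c -f -p "$PID"

Either output should show whether the time is going into rsync/ssh handling, changelog parsing, or something unexpected.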
>
> Looking at /var/log/glusterfs/geo-replication-slaves/gvol0_nvfs10_gvol0/mnt-cafs30-nodirectwritedata-gluster-gvol0.log on the geo-replication slave, we have a lot of lines like the following logged.
>
> [2020-06-06 00:32:43.155856] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 875853: STAT() /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle)
> [2020-06-06 00:32:43.219759] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 875923: STAT() /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle)
> [2020-06-06 00:32:43.280357] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 876001: STAT() /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac => -1 (Stale file handle)
> The message "E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/25d78a6c-41a9-4364-84a8-31f5571223ac: failed to lookup the file on gvol0-dht [Stale file handle]" repeated 9 times between [2020-06-06 00:32:42.689780] and [2020-06-06 00:32:43.280322]
> [2020-06-06 09:03:03.660956] E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b: failed to lookup the file on gvol0-dht [Stale file handle]
> [2020-06-06 09:03:03.661057] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 965375: STAT() /.gfid/decdd552-d58b-4ddf-b27a-0da9a6fbc38b => -1 (Stale file handle)
> [2020-06-06 09:03:10.258798] E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a: failed to lookup the file on gvol0-dht [Stale file handle]
> [2020-06-06 09:03:10.258880] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 969455: STAT() /.gfid/1d90e129-d40f-4ca4-bea8-0bb3c4c7985a => -1 (Stale file handle)
> [2020-06-06 09:09:41.259362] E [MSGID: 109040] [dht-helper.c:1332:dht_migration_complete_check_task] 0-gvol0-dht: /.gfid/740efff6-74dd-45be-847f-66919d1179e0: failed to lookup the file on gvol0-dht [Stale file handle]
> [2020-06-06 09:09:41.259458] W [fuse-bridge.c:897:fuse_attr_cbk] 0-glusterfs-fuse: 1040904: STAT() /.gfid/740efff6-74dd-45be-847f-66919d1179e0 => -1 (Stale file handle)
>
> Could these errors be part of the problem?
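
For what it's worth, a gfid from those "Stale file handle" messages can be translated into a real path on the slave brick through the .glusterfs hardlink tree, which would show whether the file still exists there. A sketch only; the brick path below is an assumption and should be replaced with the slave's actual brick root:

BRICK=/nodirectwritedata/gluster/gvol0   # assumed; use the slave's brick path
GFID=25d78a6c-41a9-4364-84a8-31f5571223ac
# The gfid's backend entry lives at .glusterfs/<first 2 hex chars>/<next 2>/<gfid>:
ls -l "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
# For a regular file, find the human-readable path sharing the same inode:
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID" -print

If the .glusterfs entry is missing or dangling on the slave, that would line up with the stale handle errors.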
>
> Thanks again.
>
>
> On Sat, 6 Jun 2020 at 21:21, Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>>
>> Hey David,
>>
>> can you check the old logs for gfid mismatch and get a list of the files that were causing the high CPU?
>> Maybe they are related somehow (perhaps created by the same software, the same client version, or something else), which could help here.
>>
>> Also take a look in the geo-replication-slave logs.
>>
>> Does the issue still occur?
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> На 6 юни 2020 г. 1:21:55 GMT+03:00, David Cunningham <dcunningham@xxxxxxxxxxxxx> написа:
>> >Hi Sunny and Strahil,
>> >
>> >Thanks again for your responses. We don't have a lot of renaming
>> >activity -
>> >maybe some, but not a lot. We do have files which are open for writing
>> >for
>> >quite a while - they're call recordings being written as the call
>> >happens.
>> >
>> >We've installed GlusterFS using the Ubuntu packages and I'd really like
>> >to
>> >avoid compiling and applying patches.
>> >
>> >After enabling DEBUG the log at
>> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log
>> >doesn't
>> >seem to show anything very unusual:
>> >
>> >[2020-06-03 02:49:01.992177] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
>> >Taken
>> >   MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0CRE=0   duration=0.0000
>> >UNL=0
>> >[2020-06-03 02:49:01.992465] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
>> >Time Taken        SETA=0  SETX=0
>> >meta_duration=0.0000data_duration=13.0954
>> >   DATA="" XATT=0
>> >[2020-06-03 02:49:01.992863] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>> >Completed
>> >changelog_end=1591152508        entry_stime=(1591152352, 0)
>> >changelog_start=1591152494      stime=(1591152507, 0)
>> >duration=13.1077
>> >     num_changelogs=2        mode=live_changelog
>> >[2020-06-03 02:49:02.958687] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>> >19017:139678812452608:1591152542.96 keep_alive({'version': (1, 0),
>> >'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', 'timeout':
>> >1591152662, 'volume_mark': (1583043396, 161632)},) ...
>> >[2020-06-03 02:49:02.979139] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>> >19017:139678812452608:1591152542.96 keep_alive -> 28
>> >[2020-06-03 02:49:06.998127] D [master(worker
>> >/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl
>> >#114
>> >done, took 30.180089 seconds
>> >[2020-06-03 02:49:07.10132] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>> >19017:139679441716992:1591152547.01 scan() ...
>> >[2020-06-03 02:49:07.10781] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>> >19017:139679441716992:1591152547.01 scan -> None
>> >[2020-06-03 02:49:07.10935] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>> >19017:139679441716992:1591152547.01 getchanges() ...
>> >[2020-06-03 02:49:07.11579] D [repce(worker
>> >/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>> >19017:139679441716992:1591152547.01 getchanges ->
>> >['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522']
>> >[2020-06-03 02:49:07.11720] I [master(worker
>> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
>> >stime=(1591152507, 0)
>> >
>> >Am I looking in the right place to find out what that gsyncd.py process
>> >is
>> >doing?
>> >
>> >
>> >On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar@xxxxxxxxxx> wrote:
>> >
>> >> Hi David,
>> >>
>> >> You haven't answered my previous question regarding the type of your
>> >> workload.
>> >> ---
>> >> You can use the below command to enable debug log.
>> >>
>> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>> >log-level
>> >> DEBUG`
>> >>
>> >> and after capturing log again switch back to info mode:
>> >>
>> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>> >log-level
>> >> INFO`
>> >>
>> >> Please share the debug log and geo-rep config to debug further:
>> >> for config:
>> >>
>> >> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`
>> >>
>> >> /sunny
>> >>
>> >>
>> >> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov
>> ><hunter86_bg@xxxxxxxxx>
>> >> wrote:
>> >> >
>> >> > Hi David,
>> >> >
>> >> > in which log do you see the entries ?
>> >> >
>> >> > I think I got an explanation why you see the process  only on one
>> >of the
>> >> master nodes -  geo-rep session is established from only 1 master
>> >node  /I
>> >> hope someone corrects me if I'm wrong/ to one slave node. Thus it
>> >will be
>> >> natural to see the high CPU  usage on only 1 master node in your
>> >situation.
>> >> >
>> >> > Do you see anything else  in the :
>> >> var/log/glusterfs/geo-replication/<logs> (master nodes) or in
>> >> /var/log/glusterfs/geo-replication-slaves (slaves) that could hint of
>> >the
>> >> exact issue. I have  a vague feeling that that python script is
>> >constantly
>> >> looping over some data causing the CPU hog.
>> >> >
>> >> > Sadly, I can't find an instruction for increasing the log level of
>> >the
>> >> geo rep log .
>> >> >
>> >> >
>> >> > Best  Regards,
>> >> > Strahil  Nikolov
>> >> >
>> >> >
>> >> > На 2 юни 2020 г. 6:14:46 GMT+03:00, David Cunningham <
>> >> dcunningham@xxxxxxxxxxxxx> написа:
>> >> > >Hi Strahil and Sunny,
>> >> > >
>> >> > >Thank you for the replies. I checked the gfid on the master and
>> >slaves
>> >> > >and
>> >> > >they are the same. After moving the file away and back again it
>> >doesn't
>> >> > >seem to be having the issue with that file any more.
>> >> > >
>> >> > >We are still getting higher CPU usage on one of the master nodes
>> >than
>> >> > >the
>> >> > >others. It logs this every few seconds:
>> >> > >
>> >> > >[2020-06-02 03:10:15.637815] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry
>> >Time
>> >> > >Taken
>> >> > >   MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0CRE=0
>> >duration=0.0000
>> >> > >UNL=0
>> >> > >[2020-06-02 03:10:15.638010] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster:
>> >Data/Metadata
>> >> > >Time Taken        SETA=0  SETX=0
>> >> > >meta_duration=0.0000data_duration=12.7878
>> >> > >   DATA="" XATT=0
>> >> > >[2020-06-02 03:10:15.638286] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>> >> > >Completed
>> >> > >changelog_end=1591067378        entry_stime=(1591067167, 0)
>> >> > >changelog_start=1591067364      stime=(1591067377, 0)
>> >> > >duration=12.8068
>> >> > >     num_changelogs=2        mode=live_changelog
>> >> > >[2020-06-02 03:10:20.658601] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
>> >time
>> >> > > stime=(1591067377, 0)
>> >> > >[2020-06-02 03:10:34.21799] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time
>> >Taken
>> >> > > duration=0.3826 num_files=8     job=1   return_code=0
>> >> > >[2020-06-02 03:10:46.440535] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry
>> >Time
>> >> > >Taken
>> >> > >   MKD=0   MKN=0   LIN=0   SYM=0   REN=1   RMD=0CRE=2
>> >duration=0.1314
>> >> > >UNL=1
>> >> > >[2020-06-02 03:10:46.440809] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster:
>> >Data/Metadata
>> >> > >Time Taken        SETA=0  SETX=0
>> >> > >meta_duration=0.0000data_duration=13.0171
>> >> > >   DATA="" XATT=0
>> >> > >[2020-06-02 03:10:46.441205] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>> >> > >Completed
>> >> > >changelog_end=1591067420        entry_stime=(1591067419, 0)
>> >> > >changelog_start=1591067392      stime=(1591067419, 0)
>> >> > >duration=13.0322
>> >> > >     num_changelogs=3        mode=live_changelog
>> >> > >[2020-06-02 03:10:51.460925] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
>> >time
>> >> > > stime=(1591067419, 0)
>> >> > >
>> >> > >[2020-06-02 03:11:04.448913] I [master(worker
>> >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time
>> >Taken
>> >> > >duration=0.3466 num_files=3     job=1   return_code=0
>> >> > >
>> >> > >Whereas the other master nodes only log this:
>> >> > >
>> >> > >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main]
>> ><top>:
>> >> > >Using
>> >> > >session config file
>> >> >
>> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>> >> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>:
>> >Using
>> >> > >session config file
>> >> >
>> >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
>> >> > >
>> >> > >Can anyone help with what might cause the high CPU usage on one
>> >master
>> >> > >node? The process is this one, and is using 70-100% of CPU:
>> >> > >
>> >> > >python2
>> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>> >> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>> >> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30
>> >--local-node-id
>> >> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>> >> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
>> >--subvol-num 1
>> >> > >--resource-remote nvfs30 --resource-remote-id
>> >> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> >> > >
>> >> > >Thank you in advance!
>> >> > >
>> >> > >
>> >> > >
>> >> > >
>> >> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov
>> ><hunter86_bg@xxxxxxxxx>
>> >> > >wrote:
>> >> > >
>> >> > >> Hey David,
>> >> > >>
>> >> > >> for me a gfid  mismatch means  that the file  was
>> >replaced/recreated
>> >> > > -
>> >> > >> just like  vim in linux does (and it is expected for config
>> >file).
>> >> > >>
>> >> > >> Have  you checked the gfid  of  the file on both source and
>> >> > >destination,
>> >> > >> do they really match or they are different ?
>> >> > >>
>> >> > >> What happens  when you move away the file  from the slave ,
>> >does it
>> >> > >fixes
>> >> > >> the issue ?
>> >> > >>
>> >> > >> Best Regards,
>> >> > >> Strahil Nikolov
>> >> > >>
>> >> > >> На 30 май 2020 г. 1:10:56 GMT+03:00, David Cunningham <
>> >> > >> dcunningham@xxxxxxxxxxxxx> написа:
>> >> > >> >Hello,
>> >> > >> >
>> >> > >> >We're having an issue with a geo-replication process with
>> >unusually
>> >> > >> >high
>> >> > >> >CPU use and giving "Entry not present on master. Fixing gfid
>> >> > >mismatch
>> >> > >> >in
>> >> > >> >slave" errors. Can anyone help on this?
>> >> > >> >
>> >> > >> >We have 3 GlusterFS replica nodes (we'll call the master),
>> >which
>> >> > >also
>> >> > >> >push
>> >> > >> >data to a remote server (slave) using geo-replication. This has
>> >been
>> >> > >> >running fine for a couple of months, but yesterday one of the
>> >master
>> >> > >> >nodes
>> >> > >> >started having unusually high CPU use. It's this process:
>> >> > >> >
>> >> > >> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048
>> >> > >> >root     32048 68.7  0.6 1843140 845756 ?      Rl   02:51
>> >493:51
>> >> > >> >python2
>> >> > >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>> >> > >worker
>> >> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>> >> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30
>> >--local-node-id
>> >> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>> >> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
>> >> > >--subvol-num 1
>> >> > >> >--resource-remote nvfs30 --resource-remote-id
>> >> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> >> > >> >
>> >> > >> >Here's what is being logged in
>> >> > >>
>> >>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
>> >> > >> >
>> >> > >> >[2020-05-29 21:57:18.843524] I [master(worker
>> >> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
>> >time
>> >> > >> > stime=(1590789408, 0)
>> >> > >> >[2020-05-29 21:57:30.626172] I [master(worker
>> >> > >>
>> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> >> > >slave.
>> >> > >> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
>> >u'gfid':
>> >> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':
>> >> > >33204,
>> >> > >> >u'entry':
>> >> > >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
>> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
>> >u'gfid_mismatch':
>> >> > >True,
>> >> > >> >u'slave_name': None, u'slave_gfid':
>> >> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch':
>> >False,
>> >> > >> >u'dst':
>> >> > >> >False})
>> >> > >> >[2020-05-29 21:57:30.627893] I [master(worker
>> >> > >>
>> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> >> > >slave.
>> >> > >> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
>> >u'gfid':
>> >> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode':
>> >> > >33204,
>> >> > >> >u'entry':
>> >> > >>
>> >> >
>> >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',
>> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
>> >u'gfid_mismatch':
>> >> > >True,
>> >> > >> >u'slave_name': None, u'slave_gfid':
>> >> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch':
>> >False,
>> >> > >> >u'dst':
>> >> > >> >False})
>> >> > >> >[2020-05-29 21:57:30.629532] I [master(worker
>> >> > >>
>> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> >> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> >> > >slave.
>> >> > >> >Deleting the entry    retry_count=1   entry=({u'uid': 108,
>> >u'gfid':
>> >> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode':
>> >> > >33204,
>> >> > >> >u'entry':
>> >> > >>
>> >> >
>> >>
>> >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',
>> >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
>> >u'gfid_mismatch':
>> >> > >True,
>> >> > >> >u'slave_name': None, u'slave_gfid':
>> >> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch':
>> >False,
>> >> > >> >u'dst':
>> >> > >> >False})
>> >> > >> >[2020-05-29 21:57:30.659123] I [master(worker
>> >> > >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures]
>> >> > >_GMaster:
>> >> > >> >Sucessfully fixed entry ops with gfid mismatch
>> >retry_count=1
>> >> > >> >[2020-05-29 21:57:30.659343] I [master(worker
>> >> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change]
>> >_GMaster:
>> >> > >Retry
>> >> > >> >original entries. count = 1
>> >> > >> >[2020-05-29 21:57:30.725810] I [master(worker
>> >> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change]
>> >_GMaster:
>> >> > >> >Sucessfully fixed all entry ops with gfid mismatch
>> >> > >> >[2020-05-29 21:57:31.747319] I [master(worker
>> >> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync
>> >Time
>> >> > >Taken
>> >> > >> >duration=0.7409 num_files=18    job=1   return_code=0
>> >> > >> >
>> >> > >> >We've verified that the files like polycom_00a859f7xxxx.cfg
>> >referred
>> >> > >to
>> >> > >> >in
>> >> > >> >the error do exist on the master nodes and slave.
>> >> > >> >
>> >> > >> >We found this bug fix:
>> >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
>> >> > >> >
>> >> > >> >However that fix went in 5.1, and we're running 5.12 on the
>> >master
>> >> > >> >nodes
>> >> > >> >and slave. A couple of GlusterFS clients connected to the
>> >master
>> >> > >nodes
>> >> > >> >are
>> >> > >> >running 5.13.
>> >> > >> >
>> >> > >> >Would anyone have any suggestions? Thank you in advance.
>> >> > >>
>> >>
>> >>
>
>
>
> --
> David Cunningham, Voisonics Limited
> http://voisonics.com/
> USA: +1 213 221 1092
> New Zealand: +64 (0)28 2558 3782



--
sankarshan@xxxxxxxxx | TZ: UTC+0530 | +91 99606 03294
kadalu.io : Making it easy to provision storage in k8s!


--
David Cunningham, Voisonics Limited
http://voisonics.com/
USA: +1 213 221 1092
New Zealand: +64 (0)28 2558 3782
________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
