Re: Geo-replication: Entry not present on master. Fixing gfid mismatch in slave

Sunny Kumar <sunkumar@xxxxxxxxxx> · Tue, 2 Jun 2020 10:57:45 +0100

Hi David,

You haven't answered my previous question regarding the type of your workload.
---
You can use the below command to enable debug log.

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level DEBUG`

and after capturing log again switch back to info mode:

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config log-level INFO`

Please share the debug log and geo-rep config to debug further:
for config:

`gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`

/sunny

On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>
> Hi David,
>
> in which log do you see the entries ?
>
> I think I got an explanation why you see the process  only on one of the master nodes -  geo-rep session is established from only 1 master node  /I hope someone corrects me if I'm wrong/ to one slave node. Thus it will be natural to see the high CPU  usage on only 1 master node in your situation.
>
> Do you see anything else  in the : var/log/glusterfs/geo-replication/<logs> (master nodes) or in /var/log/glusterfs/geo-replication-slaves (slaves) that could hint of the exact issue. I have  a vague feeling that that python script is constantly  looping over some data causing the CPU hog.
>
> Sadly, I can't find an instruction for increasing the log level of the geo rep log .
>
>
> Best  Regards,
> Strahil  Nikolov
>
>
> На 2 юни 2020 г. 6:14:46 GMT+03:00, David Cunningham <dcunningham@xxxxxxxxxxxxx> написа:
> >Hi Strahil and Sunny,
> >
> >Thank you for the replies. I checked the gfid on the master and slaves
> >and
> >they are the same. After moving the file away and back again it doesn't
> >seem to be having the issue with that file any more.
> >
> >We are still getting higher CPU usage on one of the master nodes than
> >the
> >others. It logs this every few seconds:
> >
> >[2020-06-02 03:10:15.637815] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
> >Taken
> >   MKD=0   MKN=0   LIN=0   SYM=0   REN=0   RMD=0CRE=0   duration=0.0000
> >UNL=0
> >[2020-06-02 03:10:15.638010] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
> >Time Taken        SETA=0  SETX=0
> >meta_duration=0.0000data_duration=12.7878
> >   DATA=4  XATT=0
> >[2020-06-02 03:10:15.638286] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
> >Completed
> >changelog_end=1591067378        entry_stime=(1591067167, 0)
> >changelog_start=1591067364      stime=(1591067377, 0)
> >duration=12.8068
> >     num_changelogs=2        mode=live_changelog
> >[2020-06-02 03:10:20.658601] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> > stime=(1591067377, 0)
> >[2020-06-02 03:10:34.21799] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> > duration=0.3826 num_files=8     job=1   return_code=0
> >[2020-06-02 03:10:46.440535] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
> >Taken
> >   MKD=0   MKN=0   LIN=0   SYM=0   REN=1   RMD=0CRE=2   duration=0.1314
> >UNL=1
> >[2020-06-02 03:10:46.440809] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
> >Time Taken        SETA=0  SETX=0
> >meta_duration=0.0000data_duration=13.0171
> >   DATA=14 XATT=0
> >[2020-06-02 03:10:46.441205] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
> >Completed
> >changelog_end=1591067420        entry_stime=(1591067419, 0)
> >changelog_start=1591067392      stime=(1591067419, 0)
> >duration=13.0322
> >     num_changelogs=3        mode=live_changelog
> >[2020-06-02 03:10:51.460925] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> > stime=(1591067419, 0)
> >
> >[2020-06-02 03:11:04.448913] I [master(worker
> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time Taken
> >duration=0.3466 num_files=3     job=1   return_code=0
> >
> >Whereas the other master nodes only log this:
> >
> >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] <top>:
> >Using
> >session config file
> >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
> >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: Using
> >session config file
> >path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf
> >
> >Can anyone help with what might cause the high CPU usage on one master
> >node? The process is this one, and is using 70-100% of CPU:
> >
> >python2 /usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
> >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 --subvol-num 1
> >--resource-remote nvfs30 --resource-remote-id
> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> >
> >Thank you in advance!
> >
> >
> >
> >
> >On Sat, 30 May 2020 at 20:20, Strahil Nikolov <hunter86_bg@xxxxxxxxx>
> >wrote:
> >
> >> Hey David,
> >>
> >> for me a gfid  mismatch means  that the file  was  replaced/recreated
> > -
> >> just like  vim in linux does (and it is expected for config file).
> >>
> >> Have  you checked the gfid  of  the file on both source and
> >destination,
> >> do they really match or they are different ?
> >>
> >> What happens  when you move away the file  from the slave ,  does it
> >fixes
> >> the issue ?
> >>
> >> Best Regards,
> >> Strahil Nikolov
> >>
> >> На 30 май 2020 г. 1:10:56 GMT+03:00, David Cunningham <
> >> dcunningham@xxxxxxxxxxxxx> написа:
> >> >Hello,
> >> >
> >> >We're having an issue with a geo-replication process with unusually
> >> >high
> >> >CPU use and giving "Entry not present on master. Fixing gfid
> >mismatch
> >> >in
> >> >slave" errors. Can anyone help on this?
> >> >
> >> >We have 3 GlusterFS replica nodes (we'll call the master), which
> >also
> >> >push
> >> >data to a remote server (slave) using geo-replication. This has been
> >> >running fine for a couple of months, but yesterday one of the master
> >> >nodes
> >> >started having unusually high CPU use. It's this process:
> >> >
> >> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048
> >> >root     32048 68.7  0.6 1843140 845756 ?      Rl   02:51 493:51
> >> >python2
> >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
> >worker
> >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
> >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30 --local-node-id
> >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
> >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
> >--subvol-num 1
> >> >--resource-remote nvfs30 --resource-remote-id
> >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
> >> >
> >> >Here's what is being logged in
> >> >/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
> >> >
> >> >[2020-05-29 21:57:18.843524] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
> >> > stime=(1590789408, 0)
> >> >[2020-05-29 21:57:30.626172] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> >slave.
> >> >Deleting the entry    retry_count=1   entry=({u'uid': 108, u'gfid':
> >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':
> >33204,
> >> >u'entry':
> >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> >True,
> >> >u'slave_name': None, u'slave_gfid':
> >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch': False,
> >> >u'dst':
> >> >False})
> >> >[2020-05-29 21:57:30.627893] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> >slave.
> >> >Deleting the entry    retry_count=1   entry=({u'uid': 108, u'gfid':
> >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode':
> >33204,
> >> >u'entry':
> >>
> >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg',
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> >True,
> >> >u'slave_name': None, u'slave_gfid':
> >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': False,
> >> >u'dst':
> >> >False})
> >> >[2020-05-29 21:57:30.629532] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
> >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
> >slave.
> >> >Deleting the entry    retry_count=1   entry=({u'uid': 108, u'gfid':
> >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode':
> >33204,
> >> >u'entry':
> >>
> >>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml',
> >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, u'gfid_mismatch':
> >True,
> >> >u'slave_name': None, u'slave_gfid':
> >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': False,
> >> >u'dst':
> >> >False})
> >> >[2020-05-29 21:57:30.659123] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures]
> >_GMaster:
> >> >Sucessfully fixed entry ops with gfid mismatch     retry_count=1
> >> >[2020-05-29 21:57:30.659343] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):1194:process_change] _GMaster:
> >Retry
> >> >original entries. count = 1
> >> >[2020-05-29 21:57:30.725810] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):1197:process_change] _GMaster:
> >> >Sucessfully fixed all entry ops with gfid mismatch
> >> >[2020-05-29 21:57:31.747319] I [master(worker
> >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time
> >Taken
> >> >duration=0.7409 num_files=18    job=1   return_code=0
> >> >
> >> >We've verified that the files like polycom_00a859f7xxxx.cfg referred
> >to
> >> >in
> >> >the error do exist on the master nodes and slave.
> >> >
> >> >We found this bug fix:
> >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865
> >> >
> >> >However that fix went in 5.1, and we're running 5.12 on the master
> >> >nodes
> >> >and slave. A couple of GlusterFS clients connected to the master
> >nodes
> >> >are
> >> >running 5.13.
> >> >
> >> >Would anyone have any suggestions? Thank you in advance.
> >>
> ________
>
>
>
> Community Meeting Calendar:
>
> Schedule -
> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
> Bridge: https://bluejeans.com/441850968
>
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> https://lists.gluster.org/mailman/listinfo/gluster-users

________

Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users