Hey David,

Can you check the old logs for the gfid mismatch and get a list of the files that were causing the high CPU? Maybe they are related somehow (created by the same software, the same client version, or something else), which could help narrow this down; a rough grep sketch is at the bottom of this mail. Also take a look at the geo-replication-slave logs. Does the issue still occur?

Best Regards,
Strahil Nikolov

On 6 June 2020 at 1:21:55 GMT+03:00, David Cunningham <dcunningham@xxxxxxxxxxxxx> wrote:
>Hi Sunny and Strahil,
>
>Thanks again for your responses. We don't have a lot of renaming
>activity -
>maybe some, but not a lot. We do have files which are open for writing
>for
>quite a while - they're call recordings being written as the call
>happens.
>
>We've installed GlusterFS using the Ubuntu packages and I'd really like
>to
>avoid compiling and applying patches.
>
>After enabling DEBUG the log at
>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log
>doesn't
>seem to show anything very unusual:
>
>[2020-06-03 02:49:01.992177] I [master(worker
>/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
>Taken
> MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000
>UNL=0
>[2020-06-03 02:49:01.992465] I [master(worker
>/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
>Time Taken SETA=0 SETX=0
>meta_duration=0.0000data_duration=13.0954
> DATA=8 XATT=0
>[2020-06-03 02:49:01.992863] I [master(worker
>/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>Completed
>changelog_end=1591152508 entry_stime=(1591152352, 0)
>changelog_start=1591152494 stime=(1591152507, 0)
>duration=13.1077
> num_changelogs=2 mode=live_changelog
>[2020-06-03 02:49:02.958687] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139678812452608:1591152542.96 keep_alive({'version': (1, 0),
>'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', 'timeout':
>1591152662, 'volume_mark': (1583043396, 161632)},) ...
>[2020-06-03 02:49:02.979139] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139678812452608:1591152542.96 keep_alive -> 28
>[2020-06-03 02:49:06.998127] D [master(worker
>/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl
>#114
>done, took 30.180089 seconds
>[2020-06-03 02:49:07.10132] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139679441716992:1591152547.01 scan() ...
>[2020-06-03 02:49:07.10781] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139679441716992:1591152547.01 scan -> None
>[2020-06-03 02:49:07.10935] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139679441716992:1591152547.01 getchanges() ...
>[2020-06-03 02:49:07.11579] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139679441716992:1591152547.01 getchanges ->
>['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522']
>[2020-06-03 02:49:07.11720] I [master(worker
>/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
>stime=(1591152507, 0)
>
>Am I looking in the right place to find out what that gsyncd.py process
>is
>doing?
>
>
>On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar@xxxxxxxxxx> wrote:
>
>> Hi David,
>>
>> You haven't answered my previous question regarding the type of your
>> workload.
>> ---
>> You can use the below command to enable debug log. 
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>log-level
>> DEBUG`
>>
>> and after capturing the log, switch back to info mode:
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>log-level
>> INFO`
>>
>> Please share the debug log and geo-rep config to debug further:
>> for config:
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`
>>
>> /sunny
>>
>>
>> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov
><hunter86_bg@xxxxxxxxx>
>> wrote:
>> >
>> > Hi David,
>> >
>> > In which log do you see the entries?
>> >
>> > I think I have an explanation for why you see the process on only one
>of the
>> master nodes - the geo-rep session is established from only 1 master
>node /I
>> hope someone corrects me if I'm wrong/ to one slave node. Thus it
>will be
>> natural to see the high CPU usage on only 1 master node in your
>situation.
>> >
>> > Do you see anything else in
>> /var/log/glusterfs/geo-replication/<logs> (master nodes) or in
>> /var/log/glusterfs/geo-replication-slaves (slaves) that could hint at
>the
>> exact issue? I have a vague feeling that that python script is
>constantly
>> looping over some data causing the CPU hog.
>> >
>> > Sadly, I can't find an instruction for increasing the log level of
>the
>> geo-rep log.
>> >
>> >
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> >
>> > On 2 June 2020 at 6:14:46 GMT+03:00, David Cunningham <
>> dcunningham@xxxxxxxxxxxxx> wrote:
>> > >Hi Strahil and Sunny,
>> > >
>> > >Thank you for the replies. I checked the gfid on the master and
>slaves
>> > >and
>> > >they are the same. After moving the file away and back again it
>doesn't
>> > >seem to be having the issue with that file any more.
>> > >
>> > >We are still getting higher CPU usage on one of the master nodes
>than
>> > >the
>> > >others. 
It logs this every few seconds: >> > > >> > >[2020-06-02 03:10:15.637815] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry >Time >> > >Taken >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 >duration=0.0000 >> > >UNL=0 >> > >[2020-06-02 03:10:15.638010] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >Data/Metadata >> > >Time Taken SETA=0 SETX=0 >> > >meta_duration=0.0000data_duration=12.7878 >> > > DATA=4 XATT=0 >> > >[2020-06-02 03:10:15.638286] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch >> > >Completed >> > >changelog_end=1591067378 entry_stime=(1591067167, 0) >> > >changelog_start=1591067364 stime=(1591067377, 0) >> > >duration=12.8068 >> > > num_changelogs=2 mode=live_changelog >> > >[2020-06-02 03:10:20.658601] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's >time >> > > stime=(1591067377, 0) >> > >[2020-06-02 03:10:34.21799] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time >Taken >> > > duration=0.3826 num_files=8 job=1 return_code=0 >> > >[2020-06-02 03:10:46.440535] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry >Time >> > >Taken >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 >duration=0.1314 >> > >UNL=1 >> > >[2020-06-02 03:10:46.440809] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >Data/Metadata >> > >Time Taken SETA=0 SETX=0 >> > >meta_duration=0.0000data_duration=13.0171 >> > > DATA=14 XATT=0 >> > >[2020-06-02 03:10:46.441205] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch >> > >Completed >> > >changelog_end=1591067420 entry_stime=(1591067419, 0) >> > >changelog_start=1591067392 stime=(1591067419, 0) >> > >duration=13.0322 >> > > num_changelogs=3 mode=live_changelog >> > >[2020-06-02 03:10:51.460925] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's >time >> > > stime=(1591067419, 0) >> > > >> > >[2020-06-02 03:11:04.448913] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time >Taken >> > >duration=0.3466 num_files=3 job=1 return_code=0 >> > > >> > >Whereas the other master nodes only log this: >> > > >> > >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] ><top>: >> > >Using >> > >session config file >> > >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: >Using >> > >session config file >> > >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> > > >> > >Can anyone help with what might cause the high CPU usage on one >master >> > >node? The process is this one, and is using 70-100% of CPU: >> > > >> > >python2 >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py >> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path >> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30 >--local-node-id >> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id >> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 >--subvol-num 1 >> > >--resource-remote nvfs30 --resource-remote-id >> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 >> > > >> > >Thank you in advance! 
>> > >
>> > >
>> > >
>> > >
>> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov
><hunter86_bg@xxxxxxxxx>
>> > >wrote:
>> > >
>> > >> Hey David,
>> > >>
>> > >> For me, a gfid mismatch means that the file was
>replaced/recreated
>> > > -
>> > >> just like vim in Linux does (and that is expected for a config
>file).
>> > >>
>> > >> Have you checked the gfid of the file on both source and
>> > >destination,
>> > >> do they really match or are they different?
>> > >>
>> > >> What happens when you move the file away from the slave,
>does it
>> > >fix
>> > >> the issue?
>> > >>
>> > >> Best Regards,
>> > >> Strahil Nikolov
>> > >>
>> > >> On 30 May 2020 at 1:10:56 GMT+03:00, David Cunningham <
>> > >> dcunningham@xxxxxxxxxxxxx> wrote:
>> > >> >Hello,
>> > >> >
>> > >> >We're having an issue with a geo-replication process with
>unusually
>> > >> >high
>> > >> >CPU use and giving "Entry not present on master. Fixing gfid
>> > >mismatch
>> > >> >in
>> > >> >slave" errors. Can anyone help on this?
>> > >> >
>> > >> >We have 3 GlusterFS replica nodes (we'll call the master),
>which
>> > >also
>> > >> >push
>> > >> >data to a remote server (slave) using geo-replication. This has
>been
>> > >> >running fine for a couple of months, but yesterday one of the
>master
>> > >> >nodes
>> > >> >started having unusually high CPU use. It's this process:
>> > >> >
>> > >> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048
>> > >> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51
>493:51
>> > >> >python2
>> > >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>> > >worker
>> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30
>--local-node-id
>> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
>> > >--subvol-num 1
>> > >> >--resource-remote nvfs30 --resource-remote-id
>> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> > >> >
>> > >> >Here's what is being logged in
>> > >> >>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
>> > >> >
>> > >> >[2020-05-29 21:57:18.843524] I [master(worker
>> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
>time
>> > >> > stime=(1590789408, 0)
>> > >> >[2020-05-29 21:57:30.626172] I [master(worker
>> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> > >slave.
>> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108,
>u'gfid':
>> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':
>> > >33204,
>> > >> >u'entry':
>> > >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
>> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
>u'gfid_mismatch':
>> > >True,
>> > >> >u'slave_name': None, u'slave_gfid':
>> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch':
>False,
>> > >> >u'dst':
>> > >> >False})
>> > >> >[2020-05-29 21:57:30.627893] I [master(worker
>> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> > >slave. 
>> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, >u'gfid': >> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode': >> > >33204, >> > >> >u'entry': >> > >> >> > >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >u'gfid_mismatch': >> > >True, >> > >> >u'slave_name': None, u'slave_gfid': >> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': >False, >> > >> >u'dst': >> > >> >False}) >> > >> >[2020-05-29 21:57:30.629532] I [master(worker >> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] >> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in >> > >slave. >> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, >u'gfid': >> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode': >> > >33204, >> > >> >u'entry': >> > >> >> > >> >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >u'gfid_mismatch': >> > >True, >> > >> >u'slave_name': None, u'slave_gfid': >> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': >False, >> > >> >u'dst': >> > >> >False}) >> > >> >[2020-05-29 21:57:30.659123] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures] >> > >_GMaster: >> > >> >Sucessfully fixed entry ops with gfid mismatch >retry_count=1 >> > >> >[2020-05-29 21:57:30.659343] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change] >_GMaster: >> > >Retry >> > >> >original entries. count = 1 >> > >> >[2020-05-29 21:57:30.725810] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change] >_GMaster: >> > >> >Sucessfully fixed all entry ops with gfid mismatch >> > >> >[2020-05-29 21:57:31.747319] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync >Time >> > >Taken >> > >> >duration=0.7409 num_files=18 job=1 return_code=0 >> > >> > >> > >> >We've verified that the files like polycom_00a859f7xxxx.cfg >referred >> > >to >> > >> >in >> > >> >the error do exist on the master nodes and slave. >> > >> > >> > >> >We found this bug fix: >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865 >> > >> > >> > >> >However that fix went in 5.1, and we're running 5.12 on the >master >> > >> >nodes >> > >> >and slave. A couple of GlusterFS clients connected to the >master >> > >nodes >> > >> >are >> > >> >running 5.13. >> > >> > >> > >> >Would anyone have any suggestions? Thank you in advance. >> > >> >> > ________ >> > >> > >> > >> > Community Meeting Calendar: >> > >> > Schedule - >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC >> > Bridge: https://bluejeans.com/441850968 >> > >> > Gluster-users mailing list >> > Gluster-users@xxxxxxxxxxx >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> >> ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://bluejeans.com/441850968 Gluster-users mailing list Gluster-users@xxxxxxxxxxx https://lists.gluster.org/mailman/listinfo/gluster-users
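For the gfid-mismatch file list asked for at the top of this mail, something along these lines should be close. It is only a rough sketch: it assumes the gsyncd.log path shown earlier in the thread, that any rotated logs are still plain text, and that each record sits on a single line in the actual log file (the wrapping above is only from the mail), so adjust as needed:

  # count how often each entry shows up in the gfid-mismatch fix-ups
  grep 'Fixing gfid mismatch' /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log* \
      | grep -o "u'entry': u'[^']*'" \
      | sort | uniq -c | sort -rn

Each result is of the form u'entry': u'.gfid/<parent-gfid>/<file name>' with a hit count in front, which should make it easier to spot whether the problem files share a naming pattern or come from the same software. For compressed rotations, zgrep in place of the first grep should work the same way.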