Hey David,

Can you check the old logs for the gfid mismatch and get a list of the files that were causing the high CPU? Maybe they are related somehow (created by the same software, the same client version, or something else), which could help narrow this down; a rough grep sketch is at the bottom of this mail. Also take a look at the geo-replication-slave logs. Does the issue still occur?

Best Regards,
Strahil Nikolov

On 6 June 2020 at 1:21:55 GMT+03:00, David Cunningham <dcunningham@xxxxxxxxxxxxx> wrote:
>Hi Sunny and Strahil,
>
>Thanks again for your responses. We don't have a lot of renaming
>activity -
>maybe some, but not a lot. We do have files which are open for writing
>for
>quite a while - they're call recordings being written as the call
>happens.
>
>We've installed GlusterFS using the Ubuntu packages and I'd really like
>to
>avoid compiling and applying patches.
>
>After enabling DEBUG the log at
>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log
>doesn't
>seem to show anything very unusual:
>
>[2020-06-03 02:49:01.992177] I [master(worker
>/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry Time
>Taken
> MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 duration=0.0000
>UNL=0
>[2020-06-03 02:49:01.992465] I [master(worker
>/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: Data/Metadata
>Time Taken SETA=0 SETX=0
>meta_duration=0.0000data_duration=13.0954
> DATA=8 XATT=0
>[2020-06-03 02:49:01.992863] I [master(worker
>/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch
>Completed
>changelog_end=1591152508 entry_stime=(1591152352, 0)
>changelog_start=1591152494 stime=(1591152507, 0)
>duration=13.1077
> num_changelogs=2 mode=live_changelog
>[2020-06-03 02:49:02.958687] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139678812452608:1591152542.96 keep_alive({'version': (1, 0),
>'retval': 0, 'uuid': '8ee85fae-f3aa-4285-ad48-67a1dc17ed73', 'timeout':
>1591152662, 'volume_mark': (1583043396, 161632)},) ...
>[2020-06-03 02:49:02.979139] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139678812452608:1591152542.96 keep_alive -> 28
>[2020-06-03 02:49:06.998127] D [master(worker
>/nodirectwritedata/gluster/gvol0):551:crawlwrap] _GMaster: ... crawl
>#114
>done, took 30.180089 seconds
>[2020-06-03 02:49:07.10132] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139679441716992:1591152547.01 scan() ...
>[2020-06-03 02:49:07.10781] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139679441716992:1591152547.01 scan -> None
>[2020-06-03 02:49:07.10935] D [repce(worker
>/nodirectwritedata/gluster/gvol0):196:push] RepceClient: call
>19017:139679441716992:1591152547.01 getchanges() ...
>[2020-06-03 02:49:07.11579] D [repce(worker
>/nodirectwritedata/gluster/gvol0):216:__call__] RepceClient: call
>19017:139679441716992:1591152547.01 getchanges ->
>['/var/lib/misc/gluster/gsyncd/gvol0_nvfs10_gvol0/nodirectwritedata-gluster-gvol0/.processing/CHANGELOG.1591152522']
>[2020-06-03 02:49:07.11720] I [master(worker
>/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's time
>stime=(1591152507, 0)
>
>Am I looking in the right place to find out what that gsyncd.py process
>is
>doing?
>
>
>On Tue, 2 Jun 2020 at 21:58, Sunny Kumar <sunkumar@xxxxxxxxxx> wrote:
>
>> Hi David,
>>
>> You haven't answered my previous question regarding the type of your
>> workload.
>> ---
>> You can use the below command to enable debug log. 
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>log-level
>> DEBUG`
>>
>> and after capturing the log, switch back to info mode:
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config
>log-level
>> INFO`
>>
>> Please share the debug log and geo-rep config to debug further:
>> for config:
>>
>> `gluster vol geo-rep <mastervol> <slavehost>::<slavevol> config`
>>
>> /sunny
>>
>>
>> On Tue, Jun 2, 2020 at 10:18 AM Strahil Nikolov
><hunter86_bg@xxxxxxxxx>
>> wrote:
>> >
>> > Hi David,
>> >
>> > In which log do you see the entries?
>> >
>> > I think I have an explanation for why you see the process on only one
>of the
>> master nodes - the geo-rep session is established from only 1 master
>node /I
>> hope someone corrects me if I'm wrong/ to one slave node. Thus it
>will be
>> natural to see the high CPU usage on only 1 master node in your
>situation.
>> >
>> > Do you see anything else in
>> /var/log/glusterfs/geo-replication/<logs> (master nodes) or in
>> /var/log/glusterfs/geo-replication-slaves (slaves) that could hint at
>the
>> exact issue? I have a vague feeling that that python script is
>constantly
>> looping over some data causing the CPU hog.
>> >
>> > Sadly, I can't find an instruction for increasing the log level of
>the
>> geo-rep log.
>> >
>> >
>> > Best Regards,
>> > Strahil Nikolov
>> >
>> >
>> > On 2 June 2020 at 6:14:46 GMT+03:00, David Cunningham <
>> dcunningham@xxxxxxxxxxxxx> wrote:
>> > >Hi Strahil and Sunny,
>> > >
>> > >Thank you for the replies. I checked the gfid on the master and
>slaves
>> > >and
>> > >they are the same. After moving the file away and back again it
>doesn't
>> > >seem to be having the issue with that file any more.
>> > >
>> > >We are still getting higher CPU usage on one of the master nodes
>than
>> > >the
>> > >others. 
It logs this every few seconds: >> > > >> > >[2020-06-02 03:10:15.637815] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry >Time >> > >Taken >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=0 RMD=0CRE=0 >duration=0.0000 >> > >UNL=0 >> > >[2020-06-02 03:10:15.638010] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >Data/Metadata >> > >Time Taken SETA=0 SETX=0 >> > >meta_duration=0.0000data_duration=12.7878 >> > > DATA=4 XATT=0 >> > >[2020-06-02 03:10:15.638286] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch >> > >Completed >> > >changelog_end=1591067378 entry_stime=(1591067167, 0) >> > >changelog_start=1591067364 stime=(1591067377, 0) >> > >duration=12.8068 >> > > num_changelogs=2 mode=live_changelog >> > >[2020-06-02 03:10:20.658601] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's >time >> > > stime=(1591067377, 0) >> > >[2020-06-02 03:10:34.21799] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time >Taken >> > > duration=0.3826 num_files=8 job=1 return_code=0 >> > >[2020-06-02 03:10:46.440535] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1384:process] _GMaster: Entry >Time >> > >Taken >> > > MKD=0 MKN=0 LIN=0 SYM=0 REN=1 RMD=0CRE=2 >duration=0.1314 >> > >UNL=1 >> > >[2020-06-02 03:10:46.440809] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1394:process] _GMaster: >Data/Metadata >> > >Time Taken SETA=0 SETX=0 >> > >meta_duration=0.0000data_duration=13.0171 >> > > DATA=14 XATT=0 >> > >[2020-06-02 03:10:46.441205] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1404:process] _GMaster: Batch >> > >Completed >> > >changelog_end=1591067420 entry_stime=(1591067419, 0) >> > >changelog_start=1591067392 stime=(1591067419, 0) >> > >duration=13.0322 >> > > num_changelogs=3 mode=live_changelog >> > >[2020-06-02 03:10:51.460925] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's >time >> > > stime=(1591067419, 0) >> > > >> > >[2020-06-02 03:11:04.448913] I [master(worker >> > >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync Time >Taken >> > >duration=0.3466 num_files=3 job=1 return_code=0 >> > > >> > >Whereas the other master nodes only log this: >> > > >> > >[2020-06-02 03:11:33.886938] I [gsyncd(config-get):308:main] ><top>: >> > >Using >> > >session config file >> > >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> > >[2020-06-02 03:11:33.993175] I [gsyncd(status):308:main] <top>: >Using >> > >session config file >> > >>path=/var/lib/glusterd/geo-replication/gvol0_nvfs10_gvol0/gsyncd.conf >> > > >> > >Can anyone help with what might cause the high CPU usage on one >master >> > >node? The process is this one, and is using 70-100% of CPU: >> > > >> > >python2 >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py >> > >worker gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path >> > >/nodirectwritedata/gluster/gvol0 --local-node cafs30 >--local-node-id >> > >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id >> > >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13 >--subvol-num 1 >> > >--resource-remote nvfs30 --resource-remote-id >> > >1e698ccd-aeec-4ec4-96fe-383da8fc3b78 >> > > >> > >Thank you in advance! 
>> > >
>> > >
>> > >
>> > >
>> > >On Sat, 30 May 2020 at 20:20, Strahil Nikolov
><hunter86_bg@xxxxxxxxx>
>> > >wrote:
>> > >
>> > >> Hey David,
>> > >>
>> > >> For me, a gfid mismatch means that the file was
>replaced/recreated
>> > > -
>> > >> just like vim in Linux does (and that is expected for a config
>file).
>> > >>
>> > >> Have you checked the gfid of the file on both source and
>> > >destination,
>> > >> do they really match or are they different?
>> > >>
>> > >> What happens when you move the file away from the slave,
>does it
>> > >fix
>> > >> the issue?
>> > >>
>> > >> Best Regards,
>> > >> Strahil Nikolov
>> > >>
>> > >> On 30 May 2020 at 1:10:56 GMT+03:00, David Cunningham <
>> > >> dcunningham@xxxxxxxxxxxxx> wrote:
>> > >> >Hello,
>> > >> >
>> > >> >We're having an issue with a geo-replication process with
>unusually
>> > >> >high
>> > >> >CPU use and giving "Entry not present on master. Fixing gfid
>> > >mismatch
>> > >> >in
>> > >> >slave" errors. Can anyone help on this?
>> > >> >
>> > >> >We have 3 GlusterFS replica nodes (we'll call the master),
>which
>> > >also
>> > >> >push
>> > >> >data to a remote server (slave) using geo-replication. This has
>been
>> > >> >running fine for a couple of months, but yesterday one of the
>master
>> > >> >nodes
>> > >> >started having unusually high CPU use. It's this process:
>> > >> >
>> > >> >root@cafs30:/var/log/glusterfs# ps aux | grep 32048
>> > >> >root 32048 68.7 0.6 1843140 845756 ? Rl 02:51
>493:51
>> > >> >python2
>> > >> >/usr/lib/x86_64-linux-gnu/glusterfs/python/syncdaemon/gsyncd.py
>> > >worker
>> > >> >gvol0 nvfs10::gvol0 --feedback-fd 15 --local-path
>> > >> >/nodirectwritedata/gluster/gvol0 --local-node cafs30
>--local-node-id
>> > >> >b7521445-ee93-4fed-8ced-6a609fa8c7d4 --slave-id
>> > >> >cdcdb210-839c-4306-a4dc-e696b165ed17 --rpc-fd 12,11,9,13
>> > >--subvol-num 1
>> > >> >--resource-remote nvfs30 --resource-remote-id
>> > >> >1e698ccd-aeec-4ec4-96fe-383da8fc3b78
>> > >> >
>> > >> >Here's what is being logged in
>> > >> >>/var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log:
>> > >> >
>> > >> >[2020-05-29 21:57:18.843524] I [master(worker
>> > >> >/nodirectwritedata/gluster/gvol0):1470:crawl] _GMaster: slave's
>time
>> > >> > stime=(1590789408, 0)
>> > >> >[2020-05-29 21:57:30.626172] I [master(worker
>> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> > >slave.
>> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108,
>u'gfid':
>> > >> >u'7c0b75e5-d8b7-454f-8010-112d613c599e', u'gid': 117, u'mode':
>> > >33204,
>> > >> >u'entry':
>> > >> >u'.gfid/c5422396-1578-4b50-a29d-315be2a9c5d8/00a859f7xxxx.cfg',
>> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False,
>u'gfid_mismatch':
>> > >True,
>> > >> >u'slave_name': None, u'slave_gfid':
>> > >> >u'ec4b0ace-2ec4-4ea5-adbc-9f519b81917c', u'name_mismatch':
>False,
>> > >> >u'dst':
>> > >> >False})
>> > >> >[2020-05-29 21:57:30.627893] I [master(worker
>> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures]
>> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in
>> > >slave. 
>> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, >u'gfid': >> > >> >u'a4d52e40-2e2f-4885-be5f-65fe95a8ebd7', u'gid': 117, u'mode': >> > >33204, >> > >> >u'entry': >> > >> >> > >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/polycom_00a859f7xxxx.cfg', >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >u'gfid_mismatch': >> > >True, >> > >> >u'slave_name': None, u'slave_gfid': >> > >> >u'ece8da77-b5ea-45a7-9af7-7d4d8f55f74a', u'name_mismatch': >False, >> > >> >u'dst': >> > >> >False}) >> > >> >[2020-05-29 21:57:30.629532] I [master(worker >> > >> >>/nodirectwritedata/gluster/gvol0):813:fix_possible_entry_failures] >> > >> >_GMaster: Entry not present on master. Fixing gfid mismatch in >> > >slave. >> > >> >Deleting the entry retry_count=1 entry=({u'uid': 108, >u'gfid': >> > >> >u'3c525ad8-aeb2-46b6-9c41-7fb4987916f8', u'gid': 117, u'mode': >> > >33204, >> > >> >u'entry': >> > >> >> > >> >>>u'.gfid/f857c42e-22f1-4ce4-8f2e-13bdadedde45/00a859f7xxxx-directory.xml', >> > >> >u'op': u'CREATE'}, 17, {u'slave_isdir': False, >u'gfid_mismatch': >> > >True, >> > >> >u'slave_name': None, u'slave_gfid': >> > >> >u'06717b5a-d842-495d-bd25-aab9cd454490', u'name_mismatch': >False, >> > >> >u'dst': >> > >> >False}) >> > >> >[2020-05-29 21:57:30.659123] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):942:handle_entry_failures] >> > >_GMaster: >> > >> >Sucessfully fixed entry ops with gfid mismatch >retry_count=1 >> > >> >[2020-05-29 21:57:30.659343] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1194:process_change] >_GMaster: >> > >Retry >> > >> >original entries. count = 1 >> > >> >[2020-05-29 21:57:30.725810] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1197:process_change] >_GMaster: >> > >> >Sucessfully fixed all entry ops with gfid mismatch >> > >> >[2020-05-29 21:57:31.747319] I [master(worker >> > >> >/nodirectwritedata/gluster/gvol0):1954:syncjob] Syncer: Sync >Time >> > >Taken >> > >> >duration=0.7409 num_files=18 job=1 return_code=0 >> > >> > >> > >> >We've verified that the files like polycom_00a859f7xxxx.cfg >referred >> > >to >> > >> >in >> > >> >the error do exist on the master nodes and slave. >> > >> > >> > >> >We found this bug fix: >> > >> >https://bugzilla.redhat.com/show_bug.cgi?id=1642865 >> > >> > >> > >> >However that fix went in 5.1, and we're running 5.12 on the >master >> > >> >nodes >> > >> >and slave. A couple of GlusterFS clients connected to the >master >> > >nodes >> > >> >are >> > >> >running 5.13. >> > >> > >> > >> >Would anyone have any suggestions? Thank you in advance. >> > >> >> > ________ >> > >> > >> > >> > Community Meeting Calendar: >> > >> > Schedule - >> > Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC >> > Bridge: https://bluejeans.com/441850968 >> > >> > Gluster-users mailing list >> > Gluster-users@xxxxxxxxxxx >> > https://lists.gluster.org/mailman/listinfo/gluster-users >> >> ________ Community Meeting Calendar: Schedule - Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC Bridge: https://bluejeans.com/441850968 Gluster-users mailing list Gluster-users@xxxxxxxxxxx https://lists.gluster.org/mailman/listinfo/gluster-users
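For the gfid-mismatch file list asked for at the top of this mail, something along these lines should be close. It is only a rough sketch: it assumes the gsyncd.log path shown earlier in the thread, that any rotated logs are still plain text, and that each record sits on a single line in the actual log file (the wrapping above is only from the mail), so adjust as needed:

  # count how often each entry shows up in the gfid-mismatch fix-ups
  grep 'Fixing gfid mismatch' /var/log/glusterfs/geo-replication/gvol0_nvfs10_gvol0/gsyncd.log* \
      | grep -o "u'entry': u'[^']*'" \
      | sort | uniq -c | sort -rn

Each result is of the form u'entry': u'.gfid/<parent-gfid>/<file name>' with a hit count in front, which should make it easier to spot whether the problem files share a naming pattern or come from the same software. For compressed rotations, zgrep in place of the first grep should work the same way.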