The following changes/ideas have been identified to improve Geo-replication
performance. Please add your ideas/issues to the list.

1. Entry stime and Data/Meta stime
----------------------------------
Currently we use only one xattr, called stime, to maintain the sync state.
When a Geo-replication worker restarts, it starts from that stime and
syncs files.

    get_changes from <STIME> to <CURRENT TIME>
        perform <ENTRY> operations
        perform <META> operations
        perform <DATA> operations

If a data operation fails, the worker crashes, restarts, and reprocesses
the changelogs. Entry, Meta and Data operations are all retried. If we
maintain entry_stime separately, we can avoid reprocessing entry
operations that were already completed.

2. In case of Rsync/Tar failure, do not repeat Entry Operations
---------------------------------------------------------------
In case of Rsync/Tar failures, the changelogs are reprocessed. Instead,
re-trigger only the Rsync/Tar job for the list of files that failed.

3. Better Rsync Queue
---------------------
Geo-rep currently has an Rsync/Tar queue called PostBox. Sync jobs
(configurable, default is 3) empty the PostBox and feed its contents to
the Rsync/Tar process. The second sync job may not find any items to sync,
so the first job alone may be overloaded. To avoid this, introduce a batch
size for PostBox so that each sync job gets an equal number of files to
sync.

4. Handling the Tracebacks
--------------------------
Collect the list of tracebacks which are not yet handled, and look into
the possibility of handling them at run time. With this, worker crashes
will be minimized, so we can avoid the re-initialization and changelog
reprocessing effort.

5. SSH failure handling
-----------------------
If a Slave node goes down, the Master worker connected to it goes to
Faulty and restarts. If we handle SSH failures intelligently, we can
re-establish the SSH connection instead of restarting the Geo-rep worker.
With this change, the Active/Passive switch on network failures can be
avoided.

6. On Worker restart, utilizing Changelogs which are in .processing directory
--------------------------------------------------------------------
On worker restart, the start time for Geo-rep is the previously updated
stime. Geo-rep re-parses the changelogs from the Brick backend into the
working directory even though those changelogs were already parsed,
because stime was not updated due to failures in sync.

1. On Geo-rep restart, delete all files in .processing/cache and move all
   the changelogs available in the .processing directory to
   .processing/cache.
2. In the Changelog API, look for the changelog file name in the cache
   before parsing it.
3. If it is available in the cache, move it to .processing.
4. Otherwise, parse it and generate the parsed changelog in .processing.

--
regards
Aravinda
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
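For item 6, the cache lookup could be sketched roughly as below. This is only an illustration of the four steps, not the existing Changelog API: the function names `reset_cache` and `fetch_changelog`, the `parse` callback and the directory arguments are all hypothetical.

```python
import os
import shutil


def reset_cache(processing_dir):
    """Step 1: empty .processing/cache, then move leftover changelogs into it."""
    cache_dir = os.path.join(processing_dir, "cache")
    # Drop whatever an earlier run left in the cache.
    shutil.rmtree(cache_dir, ignore_errors=True)
    os.makedirs(cache_dir)
    for name in os.listdir(processing_dir):
        path = os.path.join(processing_dir, name)
        # Only regular files are moved; the cache directory itself is skipped.
        if os.path.isfile(path):
            shutil.move(path, os.path.join(cache_dir, name))


def fetch_changelog(name, brick_path, processing_dir, parse):
    """Steps 2-4: reuse a cached parsed changelog, otherwise parse it.

    `parse(brick_path, target)` is a hypothetical callback that writes the
    parsed changelog to `target`.
    """
    cached = os.path.join(processing_dir, "cache", name)
    target = os.path.join(processing_dir, name)
    if os.path.exists(cached):
        # Already parsed before the restart: just move it back.
        shutil.move(cached, target)
    else:
        parse(brick_path, target)
    return target
```

With this shape, a changelog is parsed from the Brick backend only when it is missing from the cache, so a restart after a sync failure reuses the work done before the crash.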
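For item 3, one possible way to give each sync job an equal share is to cut the pending file list into fixed-size batches. The sketch below is an assumption about how that might look, not the current PostBox code: `batch_size_for` and `split_into_batches` are hypothetical helpers.

```python
from itertools import islice


def batch_size_for(pending, sync_jobs):
    """Ceiling division, so that sync_jobs batches cover all pending files."""
    return max(1, -(-pending // sync_jobs))


def split_into_batches(files, batch_size):
    """Yield successive lists of at most batch_size files."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    it = iter(files)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    files = ["f%d" % i for i in range(7)]
    size = batch_size_for(len(files), 3)  # 3 is the default number of sync jobs
    for batch in split_into_batches(files, size):
        print(batch)
```

Each sync job would then take one batch instead of the first job draining the whole queue.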