Re: Performance improvements to Gluster Geo-replication

Aravinda <avishwan@xxxxxxxxxx> · Tue, 1 Sep 2015 14:25:15 +0530

Thanks Shyam for your inputs.

regards
Aravinda

On 08/31/2015 07:17 PM, Shyam wrote:
On 08/31/2015 03:17 AM, Aravinda wrote:
The following changes/ideas have been identified to improve
Geo-replication performance. Please add your ideas/issues to the list.

1. Entry stime and Data/Meta stime
----------------------------------
Currently we use only one xattr, called stime, to maintain the sync
state. When a Geo-replication worker restarts, it starts from that
stime and syncs files.

     get_changes from <STIME> to <CURRENT TIME>
         perform <ENTRY> operations
         perform <META> operations
         perform <DATA> operations

If a data operation fails, the worker crashes, restarts and reprocesses
the changelogs. Entry, Meta and Data operations are all retried. If we
maintain entry_stime separately, we can avoid reprocessing entry
operations that were already completed.

This seems like a good thing to do.

Here is something more that could be done (I am not well aware of 
geo-rep internals so maybe this cannot be done),
- Why not maintain a 'mark' up to which ENTRY/META operations have been 
performed, so that even when failures occur in the ENTRY/META operation 
queue, we only need to restart from the mark and not all the way from 
the beginning STIME?
Changelogs have to be processed from STIME because they contain both 
ENTRY and META, but execution of ENTRY will be skipped if entry_stime is 
ahead of STIME.

I am not sure where such a 'mark' can be maintained, unless the 
processed get_changes are ordered and written to disk, or ordered 
idempotently in memory each time.
STIME is maintained as an xattr on the Master brick root; we can 
maintain one more xattr, entry_stime.
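
The entry_stime idea can be sketched roughly as below. This is only an
illustration, not Geo-rep code: process_changelogs, the state dict (which
stands in for the two xattrs) and the shape of the changelogs mapping are
all assumptions made for the example.

```python
# Hypothetical sketch: names and data shapes are illustrative only.
# state["stime"] and state["entry_stime"] stand in for the two xattrs.

def process_changelogs(changelogs, state, do_entry, do_meta, do_data):
    """Replay changelogs newer than stime, skipping ENTRY operations
    that entry_stime shows were already applied in an earlier run."""
    for ts in sorted(changelogs):
        if ts <= state["stime"]:
            continue  # fully synced earlier
        ops = changelogs[ts]
        if ts > state["entry_stime"]:
            for op in ops.get("entry", []):
                do_entry(op)
            # Recorded before data sync, so a later failure does not
            # cause these entry operations to be replayed.
            state["entry_stime"] = ts
        for op in ops.get("meta", []):
            do_meta(op)
        for op in ops.get("data", []):
            do_data(op)  # may raise; stime then stays behind entry_stime
        state["stime"] = ts
```

If do_data raises for a changelog, the next call starts from the old stime
but skips the entry operations of that changelog, because entry_stime is
already ahead.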

2. In case of Rsync/Tar failure, do not repeat Entry Operations
---------------------------------------------------------------
In case of Rsync/Tar failures, Changelogs are reprocessed. Instead,
re-trigger only the Rsync/Tar job for the list of files that failed.

(this is more for my understanding)
I assume that this retry is within the same STIME -> NOW1 period. IOW, 
if the re-trigger of the tar/rsync is going to occur in the next sync 
interval, then I would assume that ENTRY/META for NOW1 -> NOW would be 
repeated, correct? The same is true for the above as well, i.e. all 
ENTRY/META operations that are completed between STIME and NOW1 are not 
repeated, but events between NOW1 and NOW are, correct?
Syncing files is a two-step operation: Entry creation with the same GFID 
using RPC, and syncing data using Rsync. There is an issue with the 
existing code: Entry operations also get repeated when only data (rsync) 
failed (STIME -> NOW1).
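
A minimal sketch of retrying only the data step for failed files, without
touching entry operations. sync_with_retry and the sync_batch callback
(which is assumed to return the set of files that failed) are hypothetical
names for this example, not existing Geo-rep functions.

```python
# Hypothetical sketch: sync_batch(files) is assumed to run one Rsync/Tar
# job and return the collection of files that failed to sync.

def sync_with_retry(files, sync_batch, max_attempts=3):
    """Re-run the data sync only for files that failed.
    Returns the files still failing after max_attempts."""
    pending = list(files)
    for _ in range(max_attempts):
        if not pending:
            break
        failed = set(sync_batch(pending))
        # Keep original order, drop everything that synced fine.
        pending = [f for f in pending if f in failed]
    return pending
```

Only when the returned list is non-empty would the worker need to fall
back to the existing behaviour.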

3. Better Rsync Queue
---------------------
Currently Geo-rep has an Rsync/Tar queue called PostBox. Sync jobs
(configurable, default is 3) empty the PostBox and feed its contents to
the Rsync/Tar process. The first job may take everything and become
overloaded, while the second sync job may not find any items to sync.
To avoid this, introduce a batch size to PostBox so that each sync job
gets an equal number of files to sync.

Do you want to consider round-robin of entries to the sync jobs, 
something that we did in rebalance, instead of a batch size?

A batch can again be consumed by a single sync process, the next batch 
by the next one, and so on. Maybe a round-robin distribution of files 
to sync from the post-box to each sync thread may help.
Looks like a good idea. We need to maintain N queues for N sync jobs 
and, while adding an entry to the post box, distribute it across the N 
queues. Is that right?
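
The N-queue idea could look roughly like this. RoundRobinPostBox and its
methods are made-up names for illustration, and the sketch is not
thread-safe; a real PostBox would need locking around add and drain.

```python
from collections import deque
from itertools import cycle

# Hypothetical sketch of a PostBox with one queue per sync job.
class RoundRobinPostBox:
    def __init__(self, num_jobs):
        self.queues = [deque() for _ in range(num_jobs)]
        self._next = cycle(range(num_jobs))

    def add(self, path):
        # Each new file goes to the next queue in turn.
        self.queues[next(self._next)].append(path)

    def drain(self, job_index):
        # A sync job takes only the files in its own queue.
        queue = self.queues[job_index]
        items = list(queue)
        queue.clear()
        return items
```

With 3 sync jobs and 7 files added, job 0 gets 3 files and jobs 1 and 2
get 2 each, so no job is left idle while another holds the whole backlog.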

4. Handling the Tracebacks
--------------------------
Collect the list of Tracebacks which are not yet handled, and look for
the possibility of handling them at run time. With this, worker crashes
will be minimized so that we can avoid the re-initialization and
changelog reprocessing effort.

5. SSH failure handling
-----------------------
If a Slave node goes down, the Master worker connected to it goes to
Faulty and restarts. If we can handle SSH failures intelligently, we
can re-establish the SSH connection instead of restarting the Geo-rep
worker. With this change, the Active/Passive switch on network failures
can be avoided.
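
One possible shape for this, as a sketch only: retry the connection with
a growing delay before giving up and letting the worker go Faulty. The
reconnect function and the connect callback are hypothetical; how Geo-rep
actually opens its SSH connection is not shown here.

```python
import time

# Hypothetical sketch: connect() is assumed to open the SSH connection
# and raise OSError on network failure.
def reconnect(connect, attempts=5, delay=1.0, sleep=time.sleep):
    """Try to re-establish the connection, doubling the wait each time."""
    last_error = None
    for attempt in range(attempts):
        try:
            return connect()
        except OSError as err:
            last_error = err
            sleep(delay * (2 ** attempt))
    # Only now would the worker fall back to restarting.
    raise ConnectionError("slave unreachable") from last_error
```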

6. On Worker restart, Utilizing Changelogs which are in .processing directory
------------------------------------------------------------------------------
On Worker restart, the start time for Geo-rep is the previously updated
stime. Geo-rep re-parses the Changelogs from the Brick backend to the
Working directory even though those changelogs were parsed previously,
because stime was not updated due to failures in sync.

     1. On Geo-rep restart, Delete all files in .processing/cache and
     move all the changelogs available in .processing directory to
     .processing/cache
     2. In Changelog API, look for Changelog file name in cache before
     parsing it.
     3. If available in cache, move it to .processing
     4. else parse it and generate parsed changelog in .processing
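
The four steps above can be sketched as follows. reset_cache, get_parsed
and the parse callback are hypothetical names for this example; the real
Changelog API is in C and its interface is not shown here.

```python
import os
import shutil

# Hypothetical sketch of steps 1-4; not the actual Changelog API.

def reset_cache(processing_dir):
    """Step 1: empty .processing/cache, then move leftover parsed
    changelogs from .processing into it."""
    cache = os.path.join(processing_dir, "cache")
    shutil.rmtree(cache, ignore_errors=True)
    os.makedirs(cache)
    for name in os.listdir(processing_dir):
        src = os.path.join(processing_dir, name)
        if os.path.isfile(src):
            shutil.move(src, os.path.join(cache, name))
    return cache

def get_parsed(name, processing_dir, parse):
    """Steps 2-4: reuse the cached parsed changelog if present,
    otherwise call parse(name, target) to generate it."""
    cached = os.path.join(processing_dir, "cache", name)
    target = os.path.join(processing_dir, name)
    if os.path.exists(cached):
        shutil.move(cached, target)
    else:
        parse(name, target)
    return target
```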

I did not understand the above, but that's probably just me as I am 
not fully aware of change log process yet :)
To consume the backend Changelogs, Geo-rep registers with the Changelog 
API by specifying a working directory. The Changelog API parses the 
backend changelog into a specific format understood by Geo-rep and 
copies it to the working directory. Geo-rep consumes Changelogs from 
the working directory. Each iteration runs 
"BACKEND CHANGELOGS -> PARSE TO WORKING DIR -> CONSUME".

During the parse process, the Changelog API maintains three directories 
in the working directory: ".processing", ".processed" and ".current".
.current -> Changelogs before parse
.processing -> Changelogs parsed but not yet consumed by Geo-rep
.processed -> Changelogs consumed and Synced by Geo-rep

If the Geo-rep worker restarts, we clean up the .processing directory to 
prevent Geo-rep from picking up unexpected changelogs. So "BACKEND 
CHANGELOGS -> PARSE TO WORKING DIR" is repeated even though parsed data 
is available from the previous run.

While replying to this, I got another idea to simplify Changelog 
processing.

- Do not parse/maintain Changelogs in Working directory, instead just 
maintain the list of Changelog files.
- Expose new Changelog API to parse the changelog. 
libgfchangelog.parse(FILE_PATH, CALLBACK)
- Modify Geo-rep to use this new API when it needs to parse a CHANGELOG 
file.

With this approach, on worker restart only the list of changelog files 
is lost, which is much easier to regenerate than re-parsing Changelogs.
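
A rough sketch of how Geo-rep might use such an API. The parse function
below is only a stand-in for the proposed libgfchangelog.parse(FILE_PATH,
CALLBACK), which does not exist yet, and the one-record-per-line file
format is an assumption made for the example.

```python
# Hypothetical stand-in for the proposed parse API; assumes one record
# per line, which is not the real backend changelog format.
def parse(file_path, callback):
    with open(file_path) as changelog:
        for line in changelog:
            record = line.rstrip("\n")
            if record:
                callback(record)

def consume(changelog_paths, handle_record):
    """Geo-rep keeps only the list of changelog files and parses each
    one on demand, so nothing parsed needs to live on disk."""
    for path in changelog_paths:
        parse(path, handle_record)
```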
