Some thoughts around RENAME issues in Geo-replication (new solution).
Validate all the RENAME entries from the Changelog in the Master Volume
before sending them to the Slave via RPC (a sketch follows the list below).
- Get the on-disk GFID of the RENAME target (in the Master).
- If it matches the GFID recorded in the Changelog, this is the latest
  Rename of that file. Send the entry to the Slave; the file will be
  created if it does not exist in the Slave Volume.
- If there is an IO Error while getting the disk GFID (based on
  ParentGFID/basename), the file was renamed again or unlinked after this
  rename was recorded. Stat the GFID using the aux-gfid-mount (Master):
    - If ENOENT, the file was unlinked before the rename was propagated
      to the Slave Volume; send an Unlink for the old filename.
    - If no error, the file exists but this is not the latest RENAME;
      don't send this RENAME to be replayed in the Slave Volume.
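A minimal Python sketch of this validation, assuming the Master volume is
mounted with the aux-gfid-mount option (so a GFID is reachable as
.gfid/<gfid>) and that the virtual xattr "glusterfs.gfid.string" returns a
path's GFID; the entry fields and helper names are illustrative, not
existing gsyncd APIs:

import errno
import os

MASTER_MNT = "/mnt/master-aux"   # aux-gfid mount of the Master volume (assumed path)

def gfid_path(gfid):
    # Virtual path of a GFID on an aux-gfid mount
    return os.path.join(MASTER_MNT, ".gfid", gfid)

def disk_gfid(pgfid, basename):
    # GFID stored on disk for <parent-gfid>/<basename> in the Master
    path = os.path.join(gfid_path(pgfid), basename)
    return os.getxattr(path, "glusterfs.gfid.string").decode()

def process_rename(entry):
    # Decide what to replay on the Slave for one RENAME changelog entry:
    # ("RENAME", ...), ("UNLINK", ...) or None (skip).
    try:
        if disk_gfid(entry.new_pgfid, entry.new_basename) == entry.gfid:
            # Target name still points to this GFID: latest RENAME.
            # Replay it; the Slave creates the target if the source is missing.
            return ("RENAME", entry)
    except OSError:
        # Target name is gone or reused (the mail mentions EIO here):
        # the file was renamed again or unlinked after this record.
        pass
    try:
        os.stat(gfid_path(entry.gfid))
    except OSError as e:
        if e.errno == errno.ENOENT:
            # GFID no longer exists in the Master: unlinked before this
            # RENAME reached the Slave, so send UNLINK for the old name.
            return ("UNLINK", entry.old_pgfid, entry.old_basename)
        raise
    # File still exists under some other name: not the latest RENAME, skip.
    return None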
Other Changes:
--------------
CREATE should fail if a file already exists with the same GFID but a
different name (one more getfattr for every CREATE/MKNOD). A minimal
sketch of this check follows.
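A sketch of that extra GFID check before replaying CREATE/MKNOD on the
Slave, again assuming an aux-gfid mount of the Slave volume; the mount
path and helper names are illustrative:

import errno
import os

SLAVE_MNT = "/mnt/slave-aux"   # aux-gfid mount of the Slave volume (assumed path)

def gfid_exists_on_slave(gfid):
    # The extra lookup per CREATE/MKNOD: does any file with this GFID
    # already exist on the Slave, under any name?
    try:
        os.stat(os.path.join(SLAVE_MNT, ".gfid", gfid))
        return True
    except OSError as e:
        if e.errno == errno.ENOENT:
            return False
        raise

def replay_create(entry, create_fn):
    # create_fn would actually create <parent>/<basename> with entry.gfid
    if gfid_exists_on_slave(entry.gfid):
        # A file with this GFID already exists, possibly under a different
        # name (e.g. the RENAME target): fail the CREATE with EEXIST.
        raise OSError(errno.EEXIST, "file with the same GFID already exists")
    create_fn(entry)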
Challenges(not yet addressed):
------------------------------
- If a file is renamed more than once and then deleted immediately,
  before the renames are propagated to the Slave volume, an Unlink will
  be sent for the intermediate name and the final name. The unlink will
  fail in the Slave Volume since it does not match the basename in the
  Slave.
- Renamed hardlinks that are then deleted.
- Self-heal traffic is recorded as MKNOD even for Links/RENAMEs.
We will revisit all the Rename cases below to check whether this
solution solves those issues.
Renamed file falls into other brick
-----------------------------------
Two bricks (distribute):
CREATE f1
RENAME f1 f2 -> f2 falls in the other brick
Now there is a race between b1 and b2:
In b1 CREATE f1
In b2 RENAME f1 f2
With the new solution:
If b1 is ahead of b2, there are no issues: CREATE f1 followed by RENAME f1 to f2.
If b2 is ahead of b1, it validates whether the latest name exists in the
Master; f2 exists in the Master, so f2 is created in the Slave if it does
not already exist (if f1 exists, it tries RENAME). The CREATE from b1 will
then fail since a file with the same GFID exists.
Multiple Renames
----------------
CREATE f1
RENAME f1 f2
RENAME f2 f1
f1 falls in brick1 and f2 falls in brick2; the changelogs are:
Brick1
CREATE f1
RENAME f2 f1
Brick2
RENAME f1 f2
Issue: if Brick1's changelogs are executed first and then Brick2's, the
Slave will have f2.
With the new solution:
If b1 is ahead of b2, validation in the Master shows that the final name
is f1, so the Rename from b2 is not considered.
Active Passive switch in georeplication
---------------------------------------
Setup: Distribute Replica
In any one of the replicas:
A RENAME was recorded in the Passive brick while the Active brick was
down. When the Active brick comes back, it becomes Active immediately.
Passive Brick
RENAME
Active Brick
MKNOD (From self heal traffic)
Two issues:
1. If the MKNOD is for a sticky-bit (linkto) file, MKNOD will create the
sticky-bit file in the slave (the renamed name) and the file with the old
name will still be there: two files with the same GFID, the old file and
the sticky-bit file (target name).
2. If the MKNOD is for the actual file, MKNOD will create a new file in
the slave. The Slave will have the old file as well as the new file with
the same GFID.
Hopefully this issue is minimized by the enhanced Active/Passive switch.
With the new solution the MKNOD will not succeed, since a file already
exists with the same GFID.
RENAME repeat - If two replica bricks are active
------------------------------------------------
From one brick it processes:
CREATE f1
RENAME f1 f2
From the other brick it processes the same changelogs again:
CREATE f1
RENAME f1 f2
Issue: the Slave will have both f1 and f2 with the same GFID.
Possible fix: modify MKNOD/CREATE to check the disk GFID first and then
create the file; return EEXIST when a file exists with the same GFID but
a different name.
With the new solution, the Slave Volume will have only f2, as expected.
--
regards
Aravinda
On 11/12/2014 11:34 AM, Aravinda wrote:
Updated the approaches to fix RENAME problems in geo-replication.
Please let me know if you have any suggestions.
--
regards
Aravinda
On 09/19/2014 02:09 PM, Aravinda wrote:
Hi All,
Summarized the RENAME issues we have in geo-replication, feel free to
add if I missed any :)
GlusterFS changelogs are stored in each brick and record the changes
that happened in that brick. Geo-rep runs on all the nodes of the master
and processes the changelogs independently. Changelogs are processed at
the brick level, but all the fops are replayed on the mount.
Internal fops are not recorded in the changelog. For the RENAME case,
only the RENAME is recorded in the hashed brick's changelog (DHT's
internal fops, like creating the linkto file and its unlink, are not
recorded).
We need to start working on fixing these issues to stabilize
Geo-replication. Comments and suggestions are welcome.
Renamed file falls into other brick
-----------------------------------
Two bricks (distribute):
CREATE f1
RENAME f1 f2 -> f2 falls in the other brick
Now there is a race between b1 and b2:
In b1 CREATE f1
In b2 RENAME f1 f2
Issue: actually not an issue. Geo-rep sends a stat with RENAME entry
ops; if the source itself is not there in the slave, then Geo-rep will
create the target file using the stat.
We have a problem only when the RENAME falls in the other brick and the
file is unlinked in the master.
Possible fix:
Fail CREATE with EEXIST if any file exists with the same GFID. If
neither the source nor the target file exists, create the target file
(use a default stat if the stat is not available because the file was
unlinked in the master).
Multiple Renames
----------------
CREATE f1
RENAME f1 f2
RENAME f2 f1
f1 falls in brick1 and f2 falls in brick2; the changelogs are:
Brick1
CREATE f1
RENAME f2 f1
Brick2
RENAME f1 f2
Issue: if Brick1's changelogs are executed first and then Brick2's, the
Slave will have f2.
Possible fix:
Same as the last approach, but along with the stat, send the
current_name of that GFID in the master volume (maybe using the
pathinfo xattr?), and RENAME only if the target file matches the
current_name sent by the master. A hedged sketch of the pathinfo idea
follows.
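A rough Python sketch of reading the current name via the
trusted.glusterfs.pathinfo virtual xattr on the GFID's aux-gfid path in
the master; the exact pathinfo string depends on the volume graph, so the
parsing here is only illustrative and the mount path is an assumption:

import os
import re

MASTER_MNT = "/mnt/master-aux"   # aux-gfid mount of the master volume (assumed)

def current_name(gfid):
    # pathinfo reports the backend path(s) of the file, e.g.
    # "... <POSIX(/bricks/b1):host:/bricks/b1/dir/f2> ..."; take the basename.
    path = os.path.join(MASTER_MNT, ".gfid", gfid)
    info = os.getxattr(path, "trusted.glusterfs.pathinfo").decode()
    m = re.search(r":(/[^>]+)>", info)
    return os.path.basename(m.group(1)) if m else None

# The RENAME would then be replayed on the Slave only if the changelog's
# target basename matches current_name(gfid) sent along with the entry.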
Active Passive switch in georeplication
---------------------------------------
Setup: Distribute Replica
In any one of the replicas:
A RENAME was recorded in the Passive brick while the Active brick was
down. When the Active brick comes back, it becomes Active immediately.
Passive Brick
RENAME
Active Brick
MKNOD (From self heal traffic)
Two issues:
1. If the MKNOD is for a sticky-bit (linkto) file, MKNOD will create the
sticky-bit file in the slave (the renamed name) and the file with the old
name will still be there: two files with the same GFID, the old file and
the sticky-bit file (target name).
2. If the MKNOD is for the actual file, MKNOD will create a new file in
the slave. The Slave will have the old file as well as the new file with
the same GFID.
Possible fix: if a node failed previously, do not make it Active;
continue with the current Passive. (Don't know yet how to do this; as of
now we decide Active/Passive based on the node-uuid.)
Kotresh is working on new logic to choose the Active node from the
replica pairs. With the new logic a node will not participate in sync
immediately when it comes back.
RENAME repeat - If two replica bricks are active
------------------------------------------------
From one brick it processes:
CREATE f1
RENAME f1 f2
From the other brick it processes the same changelogs again:
CREATE f1
RENAME f1 f2
Issue: the Slave will have both f1 and f2 with the same GFID.
Possible fix: modify MKNOD/CREATE to check the disk GFID first and then
create the file; return EEXIST when a file exists with the same GFID but
a different name. In other words, fail CREATE if a file already exists
with the same GFID.
--
regards
Aravinda
http://aravindavk.in
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel