Problem:
--------
Each Geo-rep worker processes the Changelogs available in its brick. When a
worker sees an RMDIR, it tries to remove that directory recursively. Since
the rmdir is recorded in all the bricks, rm -rf is executed in parallel.
Due to DHT's open issues with parallel rm -rf, some of the directories will
not get deleted in the Slave Volume (stale directory layout). If a directory
with the same name is then created in the Master, Geo-rep ends up in an
inconsistent state, since the GFID of the new directory differs from that of
the directory already present in the Slave.

Solution - Fix in DHT:
----------------------
Hold a lock during rmdir, so that parallel rmdirs are blocked and no stale
layouts are left behind.

Solution - Fix in Geo-rep:
--------------------------
Until DHT fixes this issue, we can work around it in Geo-rep. Since a Meta
Volume is available with each Cluster, Geo-rep can take a lock on the GFID
of the directory to be deleted. For example, on rmdir:

    while True:
        try:
            # fcntl lock file in the Meta volume:
            # $METAVOL/.rmdirlocks/<GFID>
            get_lock(GFID)
            recursive_delete()
            release_and_del_lock_file()
            break
        except (EACCES, EAGAIN):
            continue

One worker will succeed, and all other workers will get ENOENT/ESTALE,
which can be safely ignored.

Let us know your thoughts.

--
regards
Aravinda

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
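For illustration, here is a minimal runnable sketch of the locking scheme
described above, using a non-blocking fcntl lock on a per-GFID file. The
names `METAVOL_MOUNT` and `rmdir_with_lock` are hypothetical, and a temp
directory stands in for the actual Meta Volume mount; this is not the real
Geo-rep worker code, just the pattern under those assumptions:

```python
import errno
import fcntl
import os
import shutil
import tempfile
import time

# Hypothetical mount point for the shared Meta Volume; in a real cluster
# this would be the glusterfs mount of the meta volume on each node.
METAVOL_MOUNT = tempfile.mkdtemp()
LOCK_DIR = os.path.join(METAVOL_MOUNT, ".rmdirlocks")


def rmdir_with_lock(gfid, target_dir):
    """Delete target_dir recursively, serialized across workers via an
    fcntl lock file named after the directory's GFID."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_path = os.path.join(LOCK_DIR, gfid)
    while True:
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
        try:
            # Non-blocking exclusive lock; a losing worker retries,
            # mirroring the EACCES/EAGAIN loop in the pseudocode above.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as e:
            os.close(fd)
            if e.errno in (errno.EACCES, errno.EAGAIN):
                time.sleep(0.1)
                continue
            raise
        try:
            # Only the lock holder deletes; ENOENT seen by a racing
            # worker whose delete already happened is treated as success.
            shutil.rmtree(target_dir, ignore_errors=True)
            os.unlink(lock_path)
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
        return


# Usage: create a directory tree, then delete it under the lock.
victim = os.path.join(tempfile.mkdtemp(), "dir-to-remove")
os.makedirs(os.path.join(victim, "subdir"))
rmdir_with_lock("fake-gfid-1234", victim)
print(os.path.exists(victim))  # False
```

Note that `flock` locks are advisory, so this only serializes workers that
all go through the same lock file on the shared Meta Volume.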