Hi all,
tl;dr: to solve a bug [1] I've written a patch [2] that needs some hacks to prevent deadlocks. I think the approach is quite good and maybe we should check if the hacks can be eliminated by making use of the new behavior in DHT and other xlators.
Now a more detailed explanation:
While debugging an issue [1] in EC where directories ended up in a bad state under concurrent renames and mkdirs, I found that the problem really came from the fact that DHT does additional operations after creating a directory, and it was these operations that conflicted with the renames. This is not really a problem in DHT, but it served to expose the issue.
In fact, any concurrent modification of an entry while it's being removed can cause this bug. Normally, if you modify a file that is being deleted, the modification could fail (depending on the state it can fail with multiple errors, even EIO), but the file will be deleted anyway, so no big problem here. The problem is more important in the scenario described in the bug (though unlikely to be a real use case).
The root cause of the problem is that entry modification operations take locks on the entry itself, while entry creation/deletion operations take locks on the parent directory. This allows both kinds of operation to reach the bricks in parallel, so some bricks can execute them in one order while other bricks execute them in the reverse order.
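To make the ordering problem concrete, here is a minimal sketch in Python. It is only a toy model, not Gluster code: the Brick class, its modify/unlink methods and the entry name "dir1" are made up for illustration.

```python
import errno


class Brick:
    """Toy brick: just a set of existing entry names (hypothetical model)."""

    def __init__(self, entries):
        self.entries = set(entries)

    def modify(self, name):
        # Returns 0 on success, or ESTALE if the entry is already gone.
        return 0 if name in self.entries else errno.ESTALE

    def unlink(self, name):
        # Returns 0 on success, or ENOENT if the entry does not exist.
        if name not in self.entries:
            return errno.ENOENT
        self.entries.remove(name)
        return 0


def run(brick, ops):
    """Apply the named operations to "dir1" in order, collecting results."""
    return [(op, getattr(brick, op)("dir1")) for op in ops]


# The two operations hold different locks (entry vs. parent directory),
# so nothing forces both bricks to see them in the same order.
brick_a = Brick({"dir1"})
brick_b = Brick({"dir1"})
result_a = run(brick_a, ["modify", "unlink"])
result_b = run(brick_b, ["unlink", "modify"])
```

Both bricks end up without the entry, but the modification succeeded on one and failed on the other, so the client gets inconsistent answers for the same operation.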
In the case of EC, having an entry disappear in the middle of another operation can be very bad. It can disappear at several stages:
1. While acquiring locks. This is not too bad, but it can return unwanted errors (EIO).
2. While requesting size and version information. This is very bad and will surely lead to EIO errors (instead of ESTALE, which would be the most useful error) or mark healthy bricks as bad.
3. While the main operation is executed. In this case we can even have data corruption (a file will be deleted anyway, so this matters more for directories).
4. While updating backend info. Many combinations are possible here depending on what failed and on how many bricks.
To prevent this I've taken the following approach [2]:
1. Take ESTALE errors into account when acquiring locks, so that EC is able to report ESTALE instead of EIO in most cases.
2. Once a lock is acquired, the locks xlator prevents the entry from being deleted until the locks are released again, which gives the client side a lot of stability.
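The two points above can be sketched roughly as follows. This is a toy Python model under my own assumptions, not the code from the patch [2]: combine_lock_results, its "required" parameter, the ToyLocks class and the DEFERRED marker are all hypothetical names.

```python
import errno

DEFERRED = "deferred"  # hypothetical marker: the unlink waits for the locks


def combine_lock_results(results, required):
    """Point 1: combine per-brick lock results (0 = ok, else an errno).

    'required' is an assumed minimum number of bricks that must succeed.
    """
    ok = sum(1 for r in results if r == 0)
    if ok >= required:
        return 0
    failures = [r for r in results if r != 0]
    # If every failure says "the entry is gone", report ESTALE, not EIO.
    if all(r in (errno.ESTALE, errno.ENOENT) for r in failures):
        return errno.ESTALE
    return errno.EIO


class ToyLocks:
    """Point 2: an entry cannot be removed while a lock is held on it."""

    def __init__(self, entries):
        self.entries = set(entries)
        self.held = {}        # entry name -> number of locks held
        self.pending = set()  # unlinks delayed until the locks go away

    def lock(self, name):
        if name not in self.entries:
            return errno.ESTALE
        self.held[name] = self.held.get(name, 0) + 1
        return 0

    def unlock(self, name):
        self.held[name] -= 1
        if self.held[name] == 0:
            del self.held[name]
            if name in self.pending:
                # Last lock released: execute the delayed unlink now.
                self.pending.discard(name)
                self.entries.discard(name)

    def unlink(self, name):
        if name not in self.entries:
            return errno.ENOENT
        if name in self.held:
            self.pending.add(name)
            return DEFERRED
        self.entries.remove(name)
        return 0
```

In this model an unlink that arrives while the entry is locked simply waits, so the locked operation always sees the entry, and anybody who tries to lock it afterwards gets ESTALE.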
I've tested this approach and it works pretty well. Under heavy contention between modifications and removals, the client side sees completely stable and consistent updates, which is very good.
The problem I've seen is that it's quite easy to get into deadlocks with this approach. The main reason for the deadlocks is how DHT tries to solve the same or similar problems at the DHT level (I've found this document [3] that explains them).
To prevent deadlocks I've had to completely ignore locks coming from special PIDs (< 0), but I've also had to ignore locks from domain "dht.file.migrate". Otherwise I have no way to prevent deadlocks on concurrent renames to the same destination from multiple clients.
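Expressed as a predicate, the two exceptions look roughly like this. It is a Python sketch, not the actual check in the patch [2]: the function name and its parameters are hypothetical, and I'm assuming "ignore" means that such a lock does not hold back a removal.

```python
DHT_MIGRATE_DOMAIN = "dht.file.migrate"


def lock_delays_removal(client_pid, domain):
    """Return True if a lock with these attributes should delay a removal."""
    if client_pid < 0:
        # Special (internal) PIDs are ignored completely.
        return False
    if domain == DHT_MIGRATE_DOMAIN:
        # Locks in this DHT domain are ignored as well.
        return False
    return True
```

Only locks for which this returns True would count when deciding whether an unlink or rename has to wait.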
I don't like having these hacks in the code because they tend to be hard to maintain and prone to errors. Do you think it makes sense to see if some of the DHT synchronization methods can be simplified by making use of this feature? With some changes, we would probably be able to remove the hacks.
Other ideas on how to implement this without delaying removes (thus preventing deadlocks) are also welcome.
Xavi