Update of work on fixing POSIX compliance issues in Glusterfs

Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> · Tue, 2 Oct 2018 07:40:33 +0530

All,

There have been issues related to POSIX compliance especially while running Database workloads on Glusterfs. Recently we've worked on fixing some of them. This mail is an update on that effort.

The issues themselves can be classified into the following categories:
rename atomicity. When rename (src, dst) is done with dst already present, at no point in time should access to dst (like open, stat, chmod etc) fail. However, since the rename itself changes the association of dst-path from dst-inode to src-inode, inode based operations like open, stat etc that have already completed resolution of dst-path into dst-inode will end up not finding the dst-inode after rename, causing them to fail. VFS provides a workaround for this by doing the resolution of the path once again, provided the operations fail with ESTALE. There were some issues associated with this:
Glusterfs in some codepaths returned ENOENT even when the operation was on an inode, and hence VFS didn't retry the resolution. Much of the discussion around this topic can be found at this mail thread. This issue has been fixed by various patches.
VFS retries exactly once. So, when the retry also fails with ESTALE, VFS gives up and syscalls like open fail. We've hit this class of issues in bugs like these. The current understanding is that real world workloads won't hit this race and hence a single retry is enough. NFS relies on the same VFS mechanism and NFS developers say they've not hit bugs of this kind in real workloads.
DHT in rename codepaths acquires locks on src and dst inodes. If a parallel rename overwrote dst-inode, this locking failed and the rename operation used to fail. The issue is tracked and fixed as part of this bug.
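The ESTALE retry described above can be modelled in user space. What follows is only a toy sketch, not kernel or Glusterfs code: ToyNamespace, racy_stat and call_with_estale_retry are hypothetical names invented for illustration. It shows why an inode based operation that loses the race with rename succeeds on the second resolution when the filesystem reports ESTALE, and why returning ENOENT instead defeats the retry.

```python
import errno
import os


class ToyNamespace:
    """Hypothetical in-memory namespace: one path, 'dst', mapped to an inode number."""

    def __init__(self, missing_errno):
        self.dentries = {"dst": 1}
        self.inodes = {1}
        # errno reported when an operation targets an inode that no longer exists
        self.missing_errno = missing_errno

    def resolve(self, path):
        return self.dentries[path]

    def rename_over_dst(self, new_inode):
        # rename(src, dst): dst-path now points at src's inode, the old inode is gone
        self.inodes.discard(self.dentries["dst"])
        self.inodes.add(new_inode)
        self.dentries["dst"] = new_inode

    def stat(self, inode):
        if inode not in self.inodes:
            raise OSError(self.missing_errno, os.strerror(self.missing_errno))
        return inode


def racy_stat(ns):
    """Build a stat operation in which a rename lands between resolution and stat."""
    raced = []

    def op(inode):
        if not raced:
            raced.append(True)
            ns.rename_over_dst(new_inode=2)
        return ns.stat(inode)

    return op


def call_with_estale_retry(resolve, operate, path):
    """Resolve path, run operate(handle); on ESTALE re-resolve and retry exactly once."""
    handle = resolve(path)
    try:
        return operate(handle)
    except OSError as exc:
        if exc.errno != errno.ESTALE:
            raise  # ENOENT and other errors are not retried
    # Single retry with a fresh resolution; a second ESTALE propagates to the caller.
    return operate(resolve(path))
```

With ToyNamespace(errno.ESTALE) the call returns the post-rename inode; with ToyNamespace(errno.ENOENT) the same race surfaces to the caller as a failure, which is the behaviour the fixes to the ENOENT codepaths were meant to remove.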
Quorum imposition by afr in open fop. afr imposes Quorum on fd based operations, but not on open. This means operations can fail on a valid fd due to lack of Quorum. Not fixed yet and is tracked on this bug.
Operations on a valid fd failing after the file was deleted by rename/unlink.
Fuse-bridge used to randomly pick fds in the fstat codepath, as earlier versions of the fuse api didn't provide the filehandle as an argument of the Getattr request. This resulted in fstat failures when the file was deleted through either rename or unlink after it had been successfully opened. This is fixed in this patch and this patch.
performance/open-behind fakes an open call. Due to bugs in the rename/unlink codepath, it couldn't open the file before the file was deleted due to rename or unlink. Fixed by this patch.
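For reference, the POSIX behaviour these fixes restore can be checked with a small script. This is a minimal sketch that assumes a POSIX system and runs against whatever filesystem hosts the temporary directory; the function name is made up for illustration. An fd opened on dst must keep working after another file is renamed over dst, and after the path is unlinked.

```python
import os
import tempfile


def fd_survives_rename_and_unlink():
    """Open dst, rename src over it, unlink the path, and keep using the old fd."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "src")
        dst = os.path.join(d, "dst")
        with open(src, "wb") as f:
            f.write(b"new")
        with open(dst, "wb") as f:
            f.write(b"old!")

        fd = os.open(dst, os.O_RDONLY)
        try:
            os.rename(src, dst)              # dst-path now refers to src's inode
            size_after_rename = os.fstat(fd).st_size
            with open(dst, "rb") as f:
                path_content = f.read()      # the path shows the new file
            os.unlink(dst)                   # the path is gone entirely
            size_after_unlink = os.fstat(fd).st_size
            fd_content = os.read(fd, 16)     # the fd still reads the old file
        finally:
            os.close(fd)
    return size_after_rename, path_content, size_after_unlink, fd_content
```

Running the same function with the temporary directory placed on a Glusterfs mount (for example by passing dir= to TemporaryDirectory) is one way to exercise the fstat and open-behind codepaths mentioned above.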
Stale (meta)data cached by various performance xlators
md-cache used to cache stale fstat. Fixed by this patch.
write-behind did not provide correct stat in rename cbk when writes on src were cached in write-behind. Fixed by this patch.
write-behind did not provide correct stat in readdirp response. Fixed by this patch.
Ordering of operations done on different fds by write-behind. It considered operations on different fds as independent. So an fstat done after a write had completed, when the two operations were on different fds, didn't fetch a stat that reflected the write. This is fixed by this patch.
readdir-ahead used to provide stale stat. The issue is fixed by this patch.
Most of the caching xlators rely on ctime/mtime of stat to find out whether the current (meta)data is newer or older than the cached (meta)data. However, ctime/mtime provided by replica/afr is not always consistent as it can pick stat from any of its subvolumes. This issue can be solved once ctime generator becomes production ready and is enabled by default. Note that the ctime generator xlator can also help in fixing issues with tar, ElasticSearch etc that rely on correctness of ctime. Also, I still see a rare pgbench failure even after all the fixes to bz 1512691 due to unreliable ctime/mtime from underlying xlators.
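To illustrate why inconsistent ctime hurts the caching xlators, here is a toy model. It is not the actual md-cache logic; Stat and StatCache are hypothetical names, and the timestamps are invented. The cache accepts a stat only if its ctime is not older than the cached one, so a newer stat served from a subvolume whose clock lags gets rejected and the stale size stays cached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stat:
    size: int
    ctime: float


class StatCache:
    """Toy cache that trusts ctime to order two stats of the same file."""

    def __init__(self):
        self.cached: Optional[Stat] = None

    def update(self, fresh: Stat) -> bool:
        """Store fresh unless its ctime is older than the cached stat's ctime."""
        if self.cached is None or fresh.ctime >= self.cached.ctime:
            self.cached = fresh
            return True
        return False


def demo_inconsistent_ctime():
    cache = StatCache()
    # After the first write the stat happens to come from a subvolume whose clock runs ahead.
    cache.update(Stat(size=10, ctime=100.5))
    # After the second write the stat comes from a subvolume whose clock lags,
    # so the genuinely newer stat carries an older ctime and is dropped.
    accepted = cache.update(Stat(size=20, ctime=100.2))
    return accepted, cache.cached.size


def demo_single_ctime_source():
    cache = StatCache()
    # With one consistent time source (what a ctime generator aims to provide),
    # the later write always carries the later ctime.
    cache.update(Stat(size=10, ctime=100.5))
    accepted = cache.update(Stat(size=20, ctime=100.6))
    return accepted, cache.cached.size
```

In the first demo the cache keeps reporting size 10 although the file has grown to 20; in the second the update is accepted.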
Though this issue is not really a consistency issue, it hindered performance of read-ahead as fstats flushed the read-ahead cache. Note that fstats also have an impact on write-behind when reads and writes are interleaved on a file, as fstats wait on cached writes in write-behind. A bug has been filed on the fuse kernel module for implementation of a noatime feature so that fstats are not issued during reads.
AMQP needed flock -w to work. Tracked as part of this issue.
The issues listed above are either fixed or work is in progress to fix them. There are still more issues which are not worked upon yet and we'll provide updates on them in future. Some of the prominent known issues (the list is not exhaustive) are:
Missing dentries when performance.parallel-readdir is enabled. Note that it's a cache coherence issue; the dentries and files are still intact on the backend.
Evaluate and initiate discussion on how to propagate errors encountered during commit of cached writes to the application. A wider discussion (across different filesystems) on this topic is found at: https://lwn.net/Articles/752063/. Thanks to @csabahenk for pointing out this discussion.
Sanitize the stack to return ESTALE for inode missing and ENOENT for path missing. For example, storage/posix sometimes returns ENOENT for scenarios where gfid handles are missing, even though the correct error is ESTALE. Failing to return ESTALE can throw off the retry logic in VFS. An open failing with ENOENT is wrong as open is a gfid based operation. An easy fix would be for fuse-bridge to convert all ENOENT errors to ESTALE in _all_ inode based fop responses. Currently it's done only in the open(dir) codepath. This has to be extended to other codepaths too.
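The proposed conversion amounts to a small mapping on the response path. The sketch below is illustrative only: fuse-bridge is written in C, and the set of fop names and the function sanitize_errno are assumptions made for this example, not the actual list or API.

```python
import errno

# Assumed, illustrative set of fops that act on an already resolved inode (gfid).
INODE_BASED_FOPS = frozenset({"open", "opendir", "fstat", "setattr", "readv", "writev"})


def sanitize_errno(fop: str, err: int) -> int:
    """Map ENOENT to ESTALE for inode based fops; leave every other error untouched."""
    if fop in INODE_BASED_FOPS and err == errno.ENOENT:
        return errno.ESTALE
    return err
```

A path based fop such as lookup keeps ENOENT, since a missing path really is "no such entry", while an inode based fop such as open reports ESTALE so that VFS re-resolves the path.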
Lookup and rename in DHT are not atomic. rename is a compound operation in DHT which involves some hardlinking and in the rename window both src and dst are visible as hardlinks to each other. If lookup samples src or dst in this window, it'll perceive the file to have hardlinks.
Stale dentries of src in inode-tables (of fuse, protocol/server) after successful rename of src to dst. This can be caused by a lookup on src racing with rename. This issue is not very different from the issue of caching xlators needing a way to identify which of two (meta)data is the latest. The ctime generator xlator can be used here to compare the ctime of the parent directory as recorded in the itable with that in the lookup response, making sure only the latest dentry is linked into the inode table.
Note that stale dentries can cause corruption in applications like SAS and pgbench that rely on the pattern of creating a tmp file, writing to it and renaming it to the file to be consumed by another thread. Since src resolves to the dst inode due to stale dentries having the same stat as dst, the dst file ends up corrupted as writes of the next cycle end up on the file being consumed for the previous cycle. So, this is an important issue to be fixed.
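For readers unfamiliar with the pattern, this is roughly what such applications do on every cycle. It is a generic sketch, not taken from SAS or pgbench; publish is a hypothetical helper name. The correctness of the pattern depends on the tmp name always resolving to a fresh inode, which is exactly what a stale src dentry breaks.

```python
import os
import tempfile


def publish(path: str, data: bytes) -> None:
    """Write data to a tmp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # make the contents durable before they become visible
        os.replace(tmp, path)      # atomic switch: readers see the old or the new file
    except BaseException:
        # On failure, do not leave the tmp file behind.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A consumer that opened path before the rename keeps reading the previous cycle's file through its fd, while new opens see the new contents.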
There are a few bugs on SAS:
Issues with fcntl locking.
From my limited conversation with people who use/work on SAS, it seems to rely on fsync as a checkpoint after which the changes made by one job should be visible to other jobs, which could be running on different mounts on a different machine. This means fsync on one mount should update the caches of other mounts too with the updated data. This functionality is currently missing in Glusterfs.
regards,
Raghavendra
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel