Hi All,
In glusterfs, there is an issue with the fallocate behavior. In short, if someone runs fallocate from the mount point with a size greater than the space available in the backend filesystem where the file resides, fallocate can allocate only a subset of the required blocks and then fail in the backend filesystem with ENOSPC.
The behavior of fallocate itself is similar to what happens on a disk filesystem (at least XFS, where it was checked), i.e. it allocates a subset of the required number of blocks and then fails with ENOSPC. The stat output for the file then shows whatever number of blocks was allocated as part of fallocate. Please refer to [1], where the issue is explained.
Now, there is one small difference between the behavior of glusterfs and XFS.
In XFS, after fallocate fails, running 'stat' on the file shows the number of blocks that were allocated. In glusterfs, the number of blocks is shown as zero, which makes tools like "du" report zero consumption. This difference arises from how libglusterfs calculates the number of blocks, in particular how it handles sparse files etc. (mentioned in [1]).
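To illustrate how a block-count adjustment could end up reporting zero, here is a minimal sketch. It assumes the adjustment caps the reported block count at the number of 512-byte blocks needed to hold the file size; that rule and the function name blocks_capped_by_size are assumptions made for illustration, not a copy of the libglusterfs code (see [1] for the actual logic).

```python
def blocks_capped_by_size(st_size, st_blocks, block_unit=512):
    """Hypothetical adjustment: never report more blocks than st_size needs.

    st_blocks is the count of 512-byte units reported by the backend stat.
    """
    # Number of block_unit-sized blocks needed to hold st_size bytes (rounded up).
    max_blocks = (st_size + block_unit - 1) // block_unit
    return min(st_blocks, max_blocks)


# A file whose size is still 0 but which holds 2048 preallocated blocks
# (for example after a partially completed fallocate) is reported as empty.
print(blocks_capped_by_size(0, 2048))
```

Under this assumption, any blocks allocated beyond the current file size are invisible to stat from the mount point, which would match the zero consumption shown by "du".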
At this point I can think of 3 ways to handle this.
1) Except for the number of blocks shown in the stat output for the file from the mount point (on which fallocate was done), the rest of the behavior, i.e. attempting to allocate the requested size and failing when the filesystem becomes full, is similar to that of XFS.
Hence, what is required is a change to how libglusterfs calculates blocks for sparse files etc. (without breaking any of the existing components and features). This would make the behavior match that of the backend filesystem. Fixing the libglusterfs logic without impacting anything else might take some time.
OR
2) Once fallocate fails in the backend filesystem, make the posix xlator in the brick truncate the file back to the size it had before fallocate was attempted. A patch [2] has been sent for this. But there is an issue with this approach when writes and fallocate operations happen in parallel on the same file. It can lead to data loss:
a) statpre is obtained ===> before fallocate is attempted, get the stat and hence the size of the file
b) A parallel write fop on the same file that extends the file is successful
c) Fallocate fails
d) ftruncate truncates the file to the size given by statpre (i.e. the size obtained in step a), discarding the data written in step b
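The sequence a) to d) can be replayed on a local temporary file to see the effect. This is only a sketch: the steps run sequentially in one process rather than as truly parallel fops, the fallocate failure in step c is simulated by doing nothing, and the function name simulate_truncate_race and the byte counts are made up for illustration.

```python
import os
import tempfile


def simulate_truncate_race():
    """Replay steps a) to d) in order; return the three observed file sizes."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        os.write(fd, b"A" * 100)  # existing file contents

        # a) statpre: remember the size before fallocate is attempted
        statpre_size = os.fstat(fd).st_size

        # b) a "parallel" write extends the file and succeeds
        os.pwrite(fd, b"B" * 50, statpre_size)
        size_after_write = os.fstat(fd).st_size

        # c) fallocate fails with ENOSPC (simulated here, no syscall is made)

        # d) rollback: truncate to the size recorded in step a
        os.ftruncate(fd, statpre_size)
        size_after_truncate = os.fstat(fd).st_size

        return statpre_size, size_after_write, size_after_truncate


print(simulate_truncate_race())
```

The last size equals the first one, so the 50 bytes written in step b are gone even though that write was acknowledged as successful.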
OR
3) Make posix check the available disk space before doing fallocate, i.e. once posix gets the number of bytes to be allocated for the file from a particular offset, it checks whether that many bytes are available on the disk. If not, it fails the fallocate fop with ENOSPC (without attempting it on the backend filesystem).
There is still a chance of a parallel write happening while this fallocate is in progress, so that by the time the fallocate system call is attempted on the disk, the available space is less than what was calculated beforehand.
i.e. the following can happen:
a) statfs ===> get the available space of the backend filesystem
b) a parallel write succeeds and extends the file
c) fallocate is attempted assuming there is sufficient space in the backend
While the above situation can arise, I think we are still fine, because fallocate is attempted from the offset received in the fop. So, irrespective of whether the write extended the file or not, fallocate itself will be attempted for the requested number of bytes from that offset, which the statfs information showed to be available.
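The pre-check described in option 3 could look roughly like the following sketch. It uses Python's os.fstatvfs and os.posix_fallocate as stand-ins for what posix would do in C. The helper names has_enough_space and fallocate_with_precheck are hypothetical, and the check is a simplification: it compares the full requested length against the free space and ignores blocks already allocated in the requested range.

```python
import errno
import os


def has_enough_space(fd, length):
    """Return True if the filesystem holding fd reports >= length free bytes."""
    vfs = os.fstatvfs(fd)
    # f_bavail counts free blocks available to unprivileged users,
    # in units of f_frsize bytes.
    available = vfs.f_bavail * vfs.f_frsize
    return length <= available


def fallocate_with_precheck(fd, offset, length):
    """Fail early with ENOSPC instead of letting the backend allocate partially."""
    if not has_enough_space(fd, length):
        raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
    # The window between the check above and this call is where the
    # parallel write from step b) can still consume space.
    os.posix_fallocate(fd, offset, length)
```

With such a check, a request that cannot fit is rejected before any block is allocated, so the file is left untouched in the common case. The check cannot close the race window itself.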
Please provide feedback.
Regards,
Raghavendra