On 18/10/2018 14.00, Avi Kivity wrote:
This can happen, and indeed I see our default hint is 1MB, so our small files use a 1MB hint. Looks like we should remove that 1MB hint since it's reducing allocation flexibility for XFS without a good return.
I convinced myself that this is the root cause; it fits perfectly with your explanation. I still think that XFS should allocate *something* rather than return ENOSPC, but I can also understand someone wanting a guarantee.
On the other hand, I worry that because we bypass the page cache, XFS doesn't get to see the entire file at one time and so it will get fragmented.
That's what happens. I issued 1000 4k writes to each of 400 files, in parallel, using AIO+DIO. I got 400 perfectly fragmented files, each with 1000 extents.
So I'll remove the default hint for small files and replace it with larger buffer sizes, so that we batch more and don't get 8k-sized extents (8k is our default buffer size).
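To illustrate why interleaved small direct writes fragment so badly, and why larger buffers should help, here is a toy model. It assumes files are written round-robin and that every write is placed at the next free disk offset, with no extent size hint and no delayed allocation. That is a deliberate simplification, not XFS's actual allocator, and count_extents is a hypothetical helper name.

```python
def count_extents(num_files, file_size, write_size):
    """Toy model: return the number of extents each file ends up with.

    Assumptions (not XFS's real allocator): files are written round-robin,
    one chunk of write_size bytes per turn, and each chunk is placed at the
    next free disk offset.
    """
    next_free = 0                      # next unallocated disk offset
    last_end = [None] * num_files      # disk offset where each file's data ends
    extents = [0] * num_files
    written = 0
    while written < file_size:
        chunk = min(write_size, file_size - written)
        for f in range(num_files):
            # A new extent starts whenever this chunk is not physically
            # contiguous with the file's previous chunk.
            if last_end[f] != next_free:
                extents[f] += 1
            next_free += chunk
            last_end[f] = next_free
        written += chunk
    return extents


file_size = 1000 * 4096
print(count_extents(400, file_size, 4096)[0])        # 1000
print(count_extents(400, file_size, 8192)[0])        # 500
print(count_extents(400, file_size, 128 * 1024)[0])  # 32
```

In this model the extent count per file is simply the number of writes per file, so batching into larger buffers reduces fragmentation proportionally.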
Suppose I write a 4k file with a 1MB hint. How is the trailing (1MB-4k) region marked: as a free extent, a free extent with an extra annotation, or an allocated extent? Might we need to deallocate those extents? (Will FALLOC_FL_PUNCH_HOLE do the trick?)
I found an 11-year-old post from you that says those reservations are freed on close:
https://linux-xfs.oss.sgi.narkive.com/Bpctu4DN/reducing-memory-requirements-for-high-extent-xfs-files#post6
This is consistent with xfs_db reporting those areas are free.
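For scale, the size of the trailing region in the 4k-file, 1MB-hint example can be worked out as below. This is only arithmetic under the assumption that an allocation made with a hint is rounded up to a multiple of the hint; it says nothing about how long that space is held. hinted_allocation is a hypothetical helper name.

```python
def hinted_allocation(file_size, extsize_hint):
    """Return (covered_bytes, trailing_bytes) for a file of file_size bytes.

    Assumption: the allocation is rounded up to a multiple of extsize_hint.
    """
    if extsize_hint <= 0:
        raise ValueError("extsize_hint must be positive")
    # Ceiling division without floating point.
    covered = -(-file_size // extsize_hint) * extsize_hint
    return covered, covered - file_size


covered, trailing = hinted_allocation(4096, 1 << 20)
print(covered, trailing)  # 1048576 1044480
```

So a 4k file with a 1MB hint would leave 1044480 bytes (1MB-4k) beyond the data, which is 255 times the size of the data itself.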