RE: [RFC PATCH] ceph: fix ceph_fallocate() ignoring of FALLOC_FL_ALLOCATE_RANGE mode

On Wed, 2025-03-19 at 08:24 -0700, Gregory Farnum wrote:
> On Tue, Mar 18, 2025 at 4:48 PM Viacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
> > From: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>
> > 
> > The fio test reveals an issue when the file size is not
> > aligned to 4K (for example, 4122, 8600, 10K, etc.).
> > The reproduction path:
> > 
> > target_dir=/mnt/cephfs
> > report_dir=/report
> > size=100ki
> > nrfiles=10
> > pattern=0x74657374
> > 
> > fio --runtime=5M --rw=write --bs=4k --size=$size \
> > --nrfiles=$nrfiles --numjobs=16 --buffer_pattern=0x74657374 \
> > --iodepth=1 --direct=0 --ioengine=libaio --group_reporting \
> > --name=fiotest --directory=$target_dir \
> > --output $report_dir/sequential_write.log
> > 
> > fio --runtime=5M --verify_only --verify=pattern \
> > --verify_pattern=0x74657374 --size=$size --nrfiles=$nrfiles \
> > --numjobs=16 --bs=4k --iodepth=1 --direct=0 --name=fiotest \
> > --ioengine=libaio --group_reporting --verify_fatal=1 \
> > --verify_state_save=0 --directory=$target_dir \
> > --output $report_dir/verify_sequential_write.log
> > 
> > The essence of the issue is that the write phase calls
> > fallocate() to pre-allocate 10K of file size and then
> > writes only 8K of data. However, the CephFS code
> > in ceph_fallocate() ignores the FALLOC_FL_ALLOCATE_RANGE
> > mode, so the file ends up being only 8K in size. As a
> > result, the verification phase triggers weird behaviour in
> > the CephFS code: ceph_fallocate() is called again and the
> > file content is completely rewritten with garbage. Finally,
> > the verification phase fails because the file contains an
> > unexpected data pattern.
> 
> 
> CephFS doesn’t really support fallocate in the general case to begin
> with — we don’t want to go out and create an arbitrary number of 4MiB
> objects in response to a large allocation command.
> 

The fallocate() call simply increases the file size in the metadata; it does
not allocate or fill in any of the file's content. So no objects will be
created after the fallocate() call.

(1) Set the new file size in the metadata:
+			ceph_inode_set_size(inode, offset + length);

(2) Mark the inode dirty:
+			ceph_fallocate_mark_dirty(inode, &prealloc_cf);

Nothing more happens on the fallocate() side. The rest is managed by the write
path if we try to store some content in the file.
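
Putting the pieces together, the extend branch the patch adds looks roughly
like this (a condensed sketch, not the literal patch body; the exact error
handling is illustrative, and ceph_fallocate_mark_dirty() is the helper
introduced by the patch):

	/* Extend branch of ceph_fallocate() for FALLOC_FL_ALLOCATE_RANGE */
	if (!(mode & FALLOC_FL_KEEP_SIZE) &&
	    (offset + length) > i_size_read(inode)) {
		/* Generic VFS size-limit checks */
		ret = inode_newsize_ok(inode, offset + length);
		if (ret)
			goto unlock;
		/* Respect the max_file_size advertised by the MDS */
		if ((offset + length) > fsc->max_file_size) {
			ret = -EFBIG;
			goto unlock;
		}
		/* Respect CephFS quotas */
		if (ceph_quota_is_max_bytes_exceeded(inode,
						     offset + length)) {
			ret = -EDQUOT;
			goto unlock;
		}
		/* Only metadata changes: no RADOS objects are written */
		ceph_inode_set_size(inode, offset + length);
		ceph_fallocate_mark_dirty(inode, &prealloc_cf);
	}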

>  AFAIK the only one
> we really do is letting you set a specific file size up (or down,
> maybe?). 
> 

It's not completely correct for the kernel side. Currently, the CephFS kernel
code supports only (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE), which means
that ceph_fallocate() NEVER changes the size of the file.
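
For reference, the mode check at the top of the current ceph_fallocate() in
fs/ceph/file.c is roughly this (quoting from memory, so treat it as a sketch):

	if (mode != (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
		return -EOPNOTSUPP;

So an extending fallocate() with mode 0 (FALLOC_FL_ALLOCATE_RANGE) is not
handled at all, and the file size is never updated.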

> Do we actually want to support this specific sub-piece of the
> API? What happens if somebody uses FALLOC_FL_ALLOCATE_RANGE to set a
> size of 4MiB+1KiB? Is this synched with the current state of the user
> space client? I know Milind just made some changes around userspace
> fallocate to rationalize our behavior.
> 

Support for FALLOC_FL_ALLOCATE_RANGE simply changes the file size in the
metadata and nothing more. If we don't support the FALLOC_FL_ALLOCATE_RANGE
mode, then the current behavior of the fallocate() call is not completely
correct, because the user expects to see the correct file size. Imagine we used
fallocate() to extend a file to 10K and then wrote 8K of data into it: we would
see only an 8K file size. However, the file should be 10K in size, and we
should see the 8K of written content followed by zeroes for the rest of the
file.
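
To make the expected semantics concrete, here is a minimal userspace sketch
(the mount path and file name are illustrative):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[8192];
		struct stat st;
		int fd = open("/mnt/cephfs/demo",
			      O_CREAT | O_RDWR | O_TRUNC, 0644);

		if (fd < 0)
			return 1;
		/* mode 0 is FALLOC_FL_ALLOCATE_RANGE: extend file to 10K */
		if (fallocate(fd, 0, 0, 10 * 1024))
			perror("fallocate");
		/* write only 8K of data */
		memset(buf, 't', sizeof(buf));
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			perror("write");
		fstat(fd, &st);
		/* POSIX expects 10240 here; the report above shows 8192 */
		printf("st_size = %lld\n", (long long)st.st_size);
		close(fd);
		return 0;
	}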

But the CephFS kernel code has even more dangerous behavior right now. If we
use fallocate() to increase the file size to 10K and then write 8K, we again
see an 8K file size. But if we then try to verify the file's content, instead
of only read requests we somehow get a ceph_fallocate() call that completely
removes the file's content, and write requests then fill it with garbage. This
happens only for file sizes that are not 4K-aligned (10K, for example); it
works fine for aligned file sizes. I explained the reproduction path above.

> 
> > fio: got pattern 'd0', wanted '74'. Bad bits 3
> > fio: bad pattern block offset 0
> > pattern: verify failed at file /mnt/cephfs/fiotest.3.0 offset 0, length 2631490270 (requested block: offset=0, length=4096, flags=8)
> > fio: verify type mismatch (36969 media, 18 given)
> > fio: got pattern '25', wanted '74'. Bad bits 3
> > fio: bad pattern block offset 0
> > pattern: verify failed at file /mnt/cephfs/fiotest.4.0 offset 0, length 1694436820 (requested block: offset=0, length=4096, flags=8)
> > fio: verify type mismatch (6714 media, 18 given)
> > 
> > Expected state of the file:
> > 
> > hexdump -C ./fiotest.0.0
> > 00000000 74 65 73 74 74 65 73 74 74 65 73 74 74 65 73 74 |testtesttesttest|
> > *
> > 00002000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
> > *
> > 00002190 00 00 00 00 00 00 00 00 |........|
> > 00002198
> > 
> > Real state of the file:
> > 
> > hexdump -C ./fiotest.0.0 | head -n 2
> > 00000000 35 e0 28 cc 38 a0 99 16 06 9c 6a a9 f2 cd e9 0a |5.(.8.....j.....|
> > 00000010 80 53 2a 07 09 e5 0d 15 70 4a 25 f7 0b 39 9d 18 |.S*.....pJ%..9..|
> > 
> > The patch reworks the ceph_fallocate() method by adding
> > support for the FALLOC_FL_ALLOCATE_RANGE mode. It also adds checks
> > that the new size can be allocated, by means of inode_newsize_ok(),
> > fsc->max_file_size, and ceph_quota_is_max_bytes_exceeded().
> > The invalidation and mark-dirty logic is moved into dedicated
> > methods.
> > 
> > There is one peculiarity in the case of the generic/103 test.
> > The CephFS logic receives max_file_size from the MDS server, and it is
> > 1TB by default. As a result, generic/103 can fail if max_file_size
> > is smaller than the volume size:
> > 
> > generic/103 6s ... - output mismatch (see /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad)
> > --- tests/generic/103.out 2025-02-25 13:05:32.494668258 -0800
> > +++ /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad 2025-03-17 22:28:26.475750878 -0700
> > @@ -1,2 +1,3 @@
> > QA output created by 103
> > +fallocate: No space left on device
> > Silence is golden.
> > 
> > The solution is to set max_file_size equal to the volume size:
> 
> 
> That is really not a good idea. Is there really a test that tries to
> fallocate that much space? We probably just want to skip it…cleaning
> up files set to that size isn’t much fun.

It is generic/213 that tries to fallocate() a file bigger than the volume
size; that's not the issue here. The generic/103 test tries to fallocate() a
file of (volume_size - 512K), expecting the operation to succeed. Then it
tries to add big xattrs, expecting to receive -ENOSPC. I have a 7TB volume and
the default 1TB max_file_size. As a result, if I don't set max_file_size to
7TB, generic/103 fails because it cannot fallocate() a file of a size that
fits within the volume. I don't think that skipping generic/103 is a good
idea.
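
For reference, the knob I mean is the standard CephFS setting; for example,
for a 7TB limit (the filesystem name "cephfs" and the exact byte value are
illustrative):

ceph fs set cephfs max_file_size 7696581394432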

Thanks,
Slava.
