On Tue, Mar 18, 2025 at 4:48 PM Viacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>
>
> The fio test reveals an issue for the case when the file size
> is not aligned to 4K (for example, 4122, 8600, 10K, etc.).
> The reproducing path:
>
> target_dir=/mnt/cephfs
> report_dir=/report
> size=100ki
> nrfiles=10
> pattern=0x74657374
>
> fio --runtime=5M --rw=write --bs=4k --size=$size \
>   --nrfiles=$nrfiles --numjobs=16 --buffer_pattern=0x74657374 \
>   --iodepth=1 --direct=0 --ioengine=libaio --group_reporting \
>   --name=fiotest --directory=$target_dir \
>   --output $report_dir/sequential_write.log
>
> fio --runtime=5M --verify_only --verify=pattern \
>   --verify_pattern=0x74657374 --size=$size --nrfiles=$nrfiles \
>   --numjobs=16 --bs=4k --iodepth=1 --direct=0 --name=fiotest \
>   --ioengine=libaio --group_reporting --verify_fatal=1 \
>   --verify_state_save=0 --directory=$target_dir \
>   --output $report_dir/verify_sequential_write.log
>
> The essence of the issue is that the write phase calls
> fallocate() to pre-allocate 10K of file size and then writes
> only 8K of data. However, the CephFS code in ceph_fallocate()
> ignores the FALLOC_FL_ALLOCATE_RANGE mode, so the file ends up
> only 8K in size. As a result, the verification phase triggers
> weird behaviour in the CephFS code: ceph_fallocate() is called
> again and the file content is completely rewritten with garbage.
> Finally, the verification phase fails because the file contains
> an unexpected data pattern.

CephFS doesn’t really support fallocate in the general case to begin
with; we don’t want to go out and create an arbitrary number of 4MiB
objects in response to a large allocation command. AFAIK the only
thing we really do is let you set a specific file size up (or down,
maybe?).

Do we actually want to support this specific sub-piece of the API?
What happens if somebody uses FALLOC_FL_ALLOCATE_RANGE to set a size
of 4MiB+1KiB?

Is this in sync with the current state of the userspace client? I
know Milind just made some changes around userspace fallocate to
rationalize our behavior.

> fio: got pattern 'd0', wanted '74'. Bad bits 3
> fio: bad pattern block offset 0
> pattern: verify failed at file /mnt/cephfs/fiotest.3.0 offset 0, length 2631490270 (requested block: offset=0, length=4096, flags=8)
> fio: verify type mismatch (36969 media, 18 given)
> fio: got pattern '25', wanted '74'. Bad bits 3
> fio: bad pattern block offset 0
> pattern: verify failed at file /mnt/cephfs/fiotest.4.0 offset 0, length 1694436820 (requested block: offset=0, length=4096, flags=8)
> fio: verify type mismatch (6714 media, 18 given)
>
> Expected state of the file:
>
> hexdump -C ./fiotest.0.0
> 00000000  74 65 73 74 74 65 73 74  74 65 73 74 74 65 73 74  |testtesttesttest|
> *
> 00002000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00002190  00 00 00 00 00 00 00 00                            |........|
> 00002198
>
> Real state of the file:
>
> head -n 2 ./fiotest.0.0
> 00000000  35 e0 28 cc 38 a0 99 16  06 9c 6a a9 f2 cd e9 0a  |5.(.8.....j.....|
> 00000010  80 53 2a 07 09 e5 0d 15  70 4a 25 f7 0b 39 9d 18  |.S*.....pJ%..9..|
>
> The patch reworks the ceph_fallocate() method by adding support
> for the FALLOC_FL_ALLOCATE_RANGE mode. It also adds checks that
> the new size can be allocated, using inode_newsize_ok(),
> fsc->max_file_size, and ceph_quota_is_max_bytes_exceeded().
> The invalidation and making-dirty logic is moved into dedicated
> methods.
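For reference, based on that description the new size checks would
presumably look something like this near the top of ceph_fallocate()
(a rough sketch, not the actual patch; the exact placement, locking,
and error codes are assumptions, while fsc and inode are the
ceph_fs_client and inode the function already derives from the file
argument):

  loff_t endoff = offset + length;
  int ret;

  /* reject ranges past the max_file_size limit reported by the MDS */
  if (endoff > fsc->max_file_size)
          return -EFBIG;

  /* respect RLIMIT_FSIZE and the superblock's s_maxbytes */
  ret = inode_newsize_ok(inode, endoff);
  if (ret)
          return ret;

  /* fail if the new size would exceed the CephFS max_bytes quota */
  if (ceph_quota_is_max_bytes_exceeded(inode, endoff))
          return -EDQUOT;
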
> There is one peculiarity for the case of the generic/103 test.
> The CephFS logic receives max_file_size from the MDS server, and
> it is 1TB by default. As a result, generic/103 can fail if
> max_file_size is smaller than the volume size:
>
> generic/103 6s ... - output mismatch (see /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad)
> --- tests/generic/103.out  2025-02-25 13:05:32.494668258 -0800
> +++ /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad  2025-03-17 22:28:26.475750878 -0700
> @@ -1,2 +1,3 @@
>  QA output created by 103
> +fallocate: No space left on device
>  Silence is golden.
>
> The solution is to set the max_file_size equal to the volume size:

That is really not a good idea. Is there really a test that tries to
fallocate that much space? We probably just want to skip it...
cleaning up files set to that size isn’t much fun.
-Greg