On Tue, Mar 18, 2025 at 4:48 PM Viacheslav Dubeyko <slava@xxxxxxxxxxx> wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@xxxxxxx>
>
> The fio test reveals an issue for the case when the file size
> is not aligned to 4K (for example, 4122, 8600, 10K, etc.).
> The reproducing path:
>
> target_dir=/mnt/cephfs
> report_dir=/report
> size=100ki
> nrfiles=10
> pattern=0x74657374
>
> fio --runtime=5M --rw=write --bs=4k --size=$size \
>   --nrfiles=$nrfiles --numjobs=16 --buffer_pattern=0x74657374 \
>   --iodepth=1 --direct=0 --ioengine=libaio --group_reporting \
>   --name=fiotest --directory=$target_dir \
>   --output $report_dir/sequential_write.log
>
> fio --runtime=5M --verify_only --verify=pattern \
>   --verify_pattern=0x74657374 --size=$size --nrfiles=$nrfiles \
>   --numjobs=16 --bs=4k --iodepth=1 --direct=0 --name=fiotest \
>   --ioengine=libaio --group_reporting --verify_fatal=1 \
>   --verify_state_save=0 --directory=$target_dir \
>   --output $report_dir/verify_sequential_write.log
>
> The essence of the issue is that the write phase calls
> fallocate() to pre-allocate 10K of file size and then writes
> only 8K of data. However, the CephFS code in ceph_fallocate()
> ignores the FALLOC_FL_ALLOCATE_RANGE mode, so the file ends up
> only 8K in size. As a result, the verification phase triggers
> weird behaviour in the CephFS code: ceph_fallocate() is called
> again and the file content is completely rewritten with garbage.
> Finally, the verification phase fails because the file contains
> an unexpected data pattern.

CephFS doesn’t really support fallocate in the general case to begin
with; we don’t want to go out and create an arbitrary number of 4MiB
objects in response to a large allocation command. AFAIK the only
thing we really do is let you set a specific file size up (or down,
maybe?).

Do we actually want to support this specific sub-piece of the API?
What happens if somebody uses FALLOC_FL_ALLOCATE_RANGE to set a size
of 4MiB+1KiB?

Is this in sync with the current state of the userspace client? I
know Milind just made some changes around userspace fallocate to
rationalize our behavior.

> fio: got pattern 'd0', wanted '74'. Bad bits 3
> fio: bad pattern block offset 0
> pattern: verify failed at file /mnt/cephfs/fiotest.3.0 offset 0, length 2631490270 (requested block: offset=0, length=4096, flags=8)
> fio: verify type mismatch (36969 media, 18 given)
> fio: got pattern '25', wanted '74'. Bad bits 3
> fio: bad pattern block offset 0
> pattern: verify failed at file /mnt/cephfs/fiotest.4.0 offset 0, length 1694436820 (requested block: offset=0, length=4096, flags=8)
> fio: verify type mismatch (6714 media, 18 given)
>
> Expected state of the file:
>
> hexdump -C ./fiotest.0.0
> 00000000  74 65 73 74 74 65 73 74  74 65 73 74 74 65 73 74  |testtesttesttest|
> *
> 00002000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
> *
> 00002190  00 00 00 00 00 00 00 00                            |........|
> 00002198
>
> Real state of the file:
>
> head -n 2 ./fiotest.0.0
> 00000000  35 e0 28 cc 38 a0 99 16  06 9c 6a a9 f2 cd e9 0a  |5.(.8.....j.....|
> 00000010  80 53 2a 07 09 e5 0d 15  70 4a 25 f7 0b 39 9d 18  |.S*.....pJ%..9..|
>
> The patch reworks the ceph_fallocate() method by adding support
> for the FALLOC_FL_ALLOCATE_RANGE mode. It also adds checks that
> the new size can be allocated, using inode_newsize_ok(),
> fsc->max_file_size, and ceph_quota_is_max_bytes_exceeded().
> The invalidation and making-dirty logic is moved into dedicated
> methods.
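For reference, based on that description the new size checks would
presumably look something like this near the top of ceph_fallocate()
(a rough sketch, not the actual patch; the exact placement, locking,
and error codes are assumptions, while fsc and inode are the
ceph_fs_client and inode the function already derives from the file
argument):

  loff_t endoff = offset + length;
  int ret;

  /* reject ranges past the max_file_size limit reported by the MDS */
  if (endoff > fsc->max_file_size)
          return -EFBIG;

  /* respect RLIMIT_FSIZE and the superblock's s_maxbytes */
  ret = inode_newsize_ok(inode, endoff);
  if (ret)
          return ret;

  /* fail if the new size would exceed the CephFS max_bytes quota */
  if (ceph_quota_is_max_bytes_exceeded(inode, endoff))
          return -EDQUOT;
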
> There is one peculiarity for the case of the generic/103 test.
> The CephFS logic receives max_file_size from the MDS server, and
> it is 1TB by default. As a result, generic/103 can fail if
> max_file_size is smaller than the volume size:
>
> generic/103 6s ... - output mismatch (see /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad)
> --- tests/generic/103.out  2025-02-25 13:05:32.494668258 -0800
> +++ /home/slavad/XFSTESTS/xfstests-dev/results//generic/103.out.bad  2025-03-17 22:28:26.475750878 -0700
> @@ -1,2 +1,3 @@
>  QA output created by 103
> +fallocate: No space left on device
>  Silence is golden.
>
> The solution is to set the max_file_size equal to the volume size:

That is really not a good idea. Is there really a test that tries to
fallocate that much space? We probably just want to skip it...
cleaning up files set to that size isn’t much fun.
-Greg