On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@xxxxxxxxxx> wrote:
>>> > 1) off: for every write request QEMU initializes the cluster (64KB)
>>> >    with fallocate(ZERO_RANGE) and then writes the 4KB of data.
>>> >
>>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
>>> >    of the cluster with zeroes.
>>> >
>>> > 3) metadata: all clusters were allocated when the image was created
>>> >    but they are sparse, QEMU only writes the 4KB of data.
>>> >
>>> > 4) falloc: all clusters were allocated with fallocate() when the image
>>> >    was created, QEMU only writes 4KB of data.
>>> >
>>> > 5) full: all clusters were allocated by writing zeroes to all of them
>>> >    when the image was created, QEMU only writes 4KB of data.
>>> >
>>> > As I said in a previous message I'm not familiar with xfs, but the
>>> > parts that I don't understand are
>>> >
>>> > - Why is (4) slower than (1)?
>>>
>>> Because fallocate() is a full IO serialisation barrier at the
>>> filesystem level. If you do:
>>>
>>> fallocate(whole file)
>>> <IO>
>>> <IO>
>>> <IO>
>>> .....
>>>
>>> The IO can run concurrent and does not serialise against anything in
>>> the filesystem except unwritten extent conversions at IO completion
>>> (see answer to next question!)
>>>
>>> However, if you just use (4) you get:
>>>
>>> falloc(64k)
>>> <wait for inflight IO to complete>
>>> <allocates 64k as unwritten>
>>> <4k io>
>>> ....
>>> falloc(64k)
>>> <wait for inflight IO to complete>
>>> ....
>>> <4k IO completes, converts 4k to written>
>>> <allocates 64k as unwritten>
>>> <4k io>
>>> falloc(64k)
>>> <wait for inflight IO to complete>
>>> ....
>>> <4k IO completes, converts 4k to written>
>>> <allocates 64k as unwritten>
>>> <4k io>
>>> ....
>>>
>>
>> Option 4 is described above as initial file preallocation whereas
>> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
>> is reporting that the initial file preallocation mode is slower than
>> the per cluster prealloc mode. Berto, am I following that right?

After looking more closely at the data I can see that there is a peak
of ~30K IOPS during the first 5 or 6 seconds and then it suddenly
drops to ~7K for the rest of the test.

I was running fio with --ramp_time=5, which ignores the first 5 seconds
of data in order to let performance settle, but if I remove that I can
see the effect more clearly. I can observe it with raw files (in 'off'
and 'prealloc' modes) and with qcow2 files in 'prealloc' mode. With
qcow2 and preallocation=off the performance is stable during the whole
test.

Berto
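
P.S. For readers less familiar with the modes being compared, below is a
minimal C sketch of the two syscall patterns involved: mode (1), where
every first write to a cluster zeroes the 64KB cluster with
fallocate(FALLOC_FL_ZERO_RANGE) before the 4KB pwrite, and mode (4),
where the whole image is preallocated once with fallocate() at creation
time. This is not QEMU's actual code; the file name, image size,
offsets and error handling are made up for illustration only.

  /* Sketch only -- not QEMU code. */
  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <linux/falloc.h>
  #include <string.h>
  #include <unistd.h>

  #define CLUSTER_SIZE (64 * 1024)
  #define REQUEST_SIZE (4 * 1024)

  /* Mode (1), preallocation=off: a write to a not-yet-allocated cluster
   * first zeroes the whole 64KB cluster, then writes the 4KB of data. */
  static void write_mode_off(int fd, off_t offset, const char *buf)
  {
      off_t cluster = offset & ~(off_t)(CLUSTER_SIZE - 1);

      assert(fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster, CLUSTER_SIZE) == 0);
      assert(pwrite(fd, buf, REQUEST_SIZE, offset) == REQUEST_SIZE);
  }

  /* Mode (4), preallocation=falloc: the whole image is preallocated once
   * (as unwritten extents) when it is created... */
  static void create_mode_falloc(int fd, off_t image_size)
  {
      assert(fallocate(fd, 0, 0, image_size) == 0);
  }

  /* ...so at run time each guest write is just a plain 4KB pwrite. */
  static void write_mode_falloc(int fd, off_t offset, const char *buf)
  {
      assert(pwrite(fd, buf, REQUEST_SIZE, offset) == REQUEST_SIZE);
  }

  int main(void)
  {
      char buf[REQUEST_SIZE];
      memset(buf, 0xab, sizeof(buf));

      int fd = open("test.img", O_RDWR | O_CREAT, 0644);   /* made-up name */
      assert(fd >= 0);

      create_mode_falloc(fd, 10 * CLUSTER_SIZE);           /* mode (4) setup */
      write_mode_falloc(fd, 3 * CLUSTER_SIZE + 8192, buf); /* mode (4) write */
      write_mode_off(fd, 5 * CLUSTER_SIZE + 4096, buf);    /* mode (1) write */

      close(fd);
      return 0;
  }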