On Fri, Aug 21, 2020 at 02:12:32PM +0200, Alberto Garcia wrote:
> On Fri 21 Aug 2020 01:42:52 PM CEST, Alberto Garcia wrote:
> > On Fri 21 Aug 2020 01:05:06 PM CEST, Brian Foster <bfoster@xxxxxxxxxx> wrote:
> >>> > 1) off: for every write request QEMU initializes the cluster (64KB)
> >>> >    with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >>> >
> >>> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> >>> >    of the cluster with zeroes.
> >>> >
> >>> > 3) metadata: all clusters were allocated when the image was created
> >>> >    but they are sparse, QEMU only writes the 4KB of data.
> >>> >
> >>> > 4) falloc: all clusters were allocated with fallocate() when the image
> >>> >    was created, QEMU only writes 4KB of data.
> >>> >
> >>> > 5) full: all clusters were allocated by writing zeroes to all of them
> >>> >    when the image was created, QEMU only writes 4KB of data.
> >>> >
> >>> > As I said in a previous message I'm not familiar with xfs, but the
> >>> > parts that I don't understand are
> >>> >
> >>> > - Why is (4) slower than (1)?
> >>>
> >>> Because fallocate() is a full IO serialisation barrier at the
> >>> filesystem level. If you do:
> >>>
> >>> fallocate(whole file)
> >>> <IO>
> >>> <IO>
> >>> <IO>
> >>> .....
> >>>
> >>> The IO can run concurrently and does not serialise against anything
> >>> in the filesystem except unwritten extent conversions at IO
> >>> completion (see answer to next question!)
> >>>
> >>> However, if you just use (4) you get:
> >>>
> >>> falloc(64k)
> >>> <wait for inflight IO to complete>
> >>> <allocates 64k as unwritten>
> >>> <4k io>
> >>> ....
> >>> falloc(64k)
> >>> <wait for inflight IO to complete>
> >>> ....
> >>> <4k IO completes, converts 4k to written>
> >>> <allocates 64k as unwritten>
> >>> <4k io>
> >>> falloc(64k)
> >>> <wait for inflight IO to complete>
> >>> ....
> >>> <4k IO completes, converts 4k to written>
> >>> <allocates 64k as unwritten>
> >>> <4k io>
> >>> ....
> >>>
> >>
> >> Option 4 is described above as initial file preallocation whereas
> >> option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
> >> is reporting that the initial file preallocation mode is slower than
> >> the per cluster prealloc mode. Berto, am I following that right?
>
> After looking more closely at the data I can see that there is a peak
> of ~30K IOPS during the first 5 or 6 seconds and then it suddenly
> drops to ~7K for the rest of the test.

How big is the filesystem, and how big is the log? (xfs_info output,
please!)

In general, there are three typical causes of this.

The first is typical of the initial burst of allocations running on an
empty journal, then allocation transactions getting throttled back to
the speed at which metadata can be flushed once the journal fills up.
If you have a small filesystem and a default sized log, this is quite
likely to happen.

The second is that you have large logs and you are running on hardware
where device cache flushes and FUA writes hammer overall device
performance. Hence when the CIL initially fills up and starts flushing
(journal writes are pre-flush + FUA, so they do both), device
performance goes way down because now it has to write its cached data
to physical media rather than just cache it in volatile device RAM.
IOWs, journal writes end up forcing all volatile data to stable media,
and so that can slow the device down. Also, cache flushes might not be
queued commands, hence journal writes will also create IO pipeline
stalls...

The third is the hardware capability. Consumer hardware is designed to
have extremely fast bursty behaviour, but then steady state performance
is much lower (think "SLC" burst caches in TLC SSDs). I have some
consumer SSDs here that can sustain 400MB/s of random 4kB writes for
about 10-15s, then they drop to about 50MB/s once the burst buffer is
full. OTOH, I have enterprise SSDs that will sustain a _much_ higher
rate of random 4kB writes indefinitely than the consumer SSDs burst at.
However, most consumer workloads don't move this sort of data around,
so this sort of design tradeoff is fine for that market (Benchmarketing
101 stuff :).

IOWs, this behaviour could be filesystem config, it could be cache
flush behaviour, or it could simply be storage device design
capability. Or it could be a combination of all three things. Watching
a set of fast sampling metrics that tell you what the device and
filesystem are doing in real time (e.g. I use PCP for this and
visualise the behaviour in real time via pmchart) gives a lot of
insight into exactly what is changing during transient workload
changes like starting a benchmark...

> I was running fio with --ramp_time=5 which ignores the first 5 seconds
> of data in order to let performance settle, but if I remove that I can
> see the effect more clearly. I can observe it with raw files (in 'off'
> and 'prealloc' modes) and qcow2 files in 'prealloc' mode. With qcow2
> and preallocation=off the performance is stable during the whole test.

What does "preallocation=off" mean again? Is that using
fallocate(ZERO_RANGE) prior to the data write rather than preallocating
the metadata/entire file?

If so, I would expect the limiting factor to be the rate at which IO
can be issued because of the fallocate()-triggered pipeline bubbles.
That leaves idle device time, so you're not pushing the limits of the
hardware, and hence none of the behaviours above will be evident...
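To make that concrete, here is a minimal, untested userspace sketch of
the per-cluster pattern of mode (1) above: fallocate(ZERO_RANGE) over
the 64k cluster immediately before each 4k data write. The file name,
offset and buffer contents are made up for illustration, and this is
not QEMU's actual code, just the shape of the syscall sequence that
creates those pipeline bubbles:

/*
 * Sketch of the per-cluster allocation pattern of mode (1).
 * Not QEMU code; file name, sizes and offsets are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <err.h>
#include <linux/falloc.h>

#define CLUSTER_SIZE    (64 * 1024)
#define DATA_SIZE       (4 * 1024)

int main(void)
{
        char buf[DATA_SIZE];
        off_t cluster_off = 0;          /* hypothetical 64k cluster at offset 0 */
        int fd;

        memset(buf, 0xab, sizeof(buf));

        fd = open("image.raw", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
                err(1, "open");

        /*
         * Zero/allocate the whole 64k cluster first. This is the step
         * that serialises against all in-flight IO on the file before
         * the 4k data write below can be issued.
         */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, cluster_off, CLUSTER_SIZE) < 0)
                err(1, "fallocate");

        /* ...and only then does the 4k of guest data go out. */
        if (pwrite(fd, buf, DATA_SIZE, cluster_off) != DATA_SIZE)
                err(1, "pwrite");

        close(fd);
        return 0;
}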
Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
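For comparison with the per-cluster sketch above, this is the shape of
the whole-file preallocation pattern of mode (4)/"falloc": a single
fallocate() of the entire image up front, after which the 4k data
writes issue no further allocation calls and only serialise on
unwritten extent conversion at IO completion. Again an untested,
illustrative sketch with made-up file name and size, not QEMU's code:

/*
 * Sketch of the whole-file preallocation pattern of mode (4).
 * Illustrative only, not QEMU code; name and size are made up.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <err.h>

#define IMAGE_SIZE      ((off_t)1 << 30)        /* hypothetical 1GiB image */
#define DATA_SIZE       (4 * 1024)

int main(void)
{
        char buf[DATA_SIZE];
        int fd;

        memset(buf, 0xab, sizeof(buf));

        fd = open("image.raw", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
                err(1, "open");

        /* One allocation up front: mode 0 allocates the whole range as
         * unwritten extents, so later writes need no allocation calls. */
        if (fallocate(fd, 0, 0, IMAGE_SIZE) < 0)
                err(1, "fallocate");

        /* The 4k data writes can now run concurrently; the remaining
         * filesystem-level serialisation is the unwritten extent
         * conversion at IO completion. */
        if (pwrite(fd, buf, DATA_SIZE, 0) != DATA_SIZE)
                err(1, "pwrite");

        close(fd);
        return 0;
}

The tradeoff between the two is exactly what the numbers in this thread
are probing: per-cluster allocation pays the serialisation cost on
every write, while up-front allocation pays it once and leaves only the
unwritten extent conversion at IO completion.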