On Fri, Aug 21, 2020 at 07:58:11AM +1000, Dave Chinner wrote:
> On Thu, Aug 20, 2020 at 10:03:10PM +0200, Alberto Garcia wrote:
> > Cc: linux-xfs
> >
> > On Wed 19 Aug 2020 07:53:00 PM CEST, Brian Foster wrote:
> > > In any event, if you're seeing unclear or unexpected performance
> > > deltas between certain XFS configurations or other fs', I think the
> > > best thing to do is post a more complete description of the workload,
> > > filesystem/storage setup, and test results to the linux-xfs mailing
> > > list (feel free to cc me as well). As it is, aside from the questions
> > > above, it's not really clear to me what the storage stack looks like
> > > for this test, if/how qcow2 is involved, what the various
> > > 'preallocation=' modes actually mean, etc.
> >
> > (see [1] for a bit of context)
> >
> > I repeated the tests with a larger (125GB) filesystem. Things are a bit
> > faster but not radically different; here are the new numbers:
> >
> > |----------------------+-------+-------|
> > | preallocation mode   |   xfs |  ext4 |
> > |----------------------+-------+-------|
> > | off                  |  8139 | 11688 |
> > | off (w/o ZERO_RANGE) |  2965 |  2780 |
> > | metadata             |  7768 |  9132 |
> > | falloc               |  7742 | 13108 |
> > | full                 | 41389 | 16351 |
> > |----------------------+-------+-------|
> >
> > The numbers are I/O operations per second as reported by fio, running
> > inside a VM.
> >
> > The VM is running Debian 9.7 with Linux 4.9.130 and the fio version is
> > 2.16-1. I'm using QEMU 5.1.0.
> >
> > fio is sending random 4KB write requests to a 25GB virtual drive; this
> > is the full command line:
> >
> > fio --filename=/dev/vdb --direct=1 --randrepeat=1 --eta=always
> > --ioengine=libaio --iodepth=32 --numjobs=1 --name=test --size=25G
> > --io_limit=25G --ramp_time=5 --rw=randwrite --bs=4k --runtime=60
> >
> > The virtual drive (/dev/vdb) is a freshly created qcow2 file stored on
> > the host (on an xfs or ext4 filesystem as the table above shows), and
> > it is attached to QEMU using a virtio-blk-pci device:
> >
> > -drive if=virtio,file=image.qcow2,cache=none,l2-cache-size=200M
> You're not using AIO on this image file, so it can't do
> concurrent IO? What happens when you add "aio=native" to this?
>
> > cache=none means that the image is opened with O_DIRECT and
> > l2-cache-size is large enough so QEMU is able to cache all the
> > relevant qcow2 metadata in memory.
>
> What happens when you just use a sparse file (i.e. a raw image) with
> aio=native instead of using qcow2? XFS, ext4, btrfs, etc all support
> sparse files so using qcow2 to provide sparse image file support is
> largely an unnecessary layer of indirection and overhead...
>
> And with XFS, you don't need qcow2 for snapshots either because you
> can use reflink copies to take an atomic copy-on-write snapshot of
> the raw image file... (assuming you made the xfs filesystem with
> reflink support (which is the TOT default now)).
>
> I've been using raw sparse files on XFS for all my VMs for over a
> decade now, and using reflink to create COW copies of golden
> image files when deploying new VMs for a couple of years now...
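
The reflink copy Dave describes is a single clone ioctl at the syscall
level (it is what "cp --reflink=always golden.img vm1.img" does under
the hood). A rough sketch, with made-up file names and assuming both
files live on the same reflink-capable XFS filesystem:

  /* Clone golden.img into vm1.img by sharing its extents.  Blocks are
   * only copied later, when either file is written (copy-on-write).
   * File names here are hypothetical. */
  #include <fcntl.h>
  #include <linux/fs.h>           /* FICLONE */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
      int src = open("golden.img", O_RDONLY);
      int dst = open("vm1.img", O_WRONLY | O_CREAT | O_EXCL, 0644);

      if (src < 0 || dst < 0) {
          perror("open");
          return 1;
      }
      if (ioctl(dst, FICLONE, src) < 0) {   /* share all extents */
          perror("FICLONE");
          return 1;
      }
      close(src);
      close(dst);
      return 0;
  }

Each VM image created this way only consumes space for the blocks it
rewrites relative to the golden image.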
> > The host is running Linux 4.19.132 and has an SSD drive.
> >
> > About the preallocation modes: a qcow2 file is divided into clusters
> > of the same size (64KB in this case). That is the minimum unit of
> > allocation, so when writing 4KB to an unallocated cluster QEMU needs
> > to fill the other 60KB with zeroes. So here's what happens with the
> > different modes:
>
> Which is something that sparse files on filesystems do not need to
> do. If, on XFS, you really want 64kB allocation clusters, use an
> extent size hint of 64kB. Though for image files, I highly recommend
> using 1MB or larger extent size hints.
>
> > 1) off: for every write request QEMU initializes the cluster (64KB)
> > with fallocate(ZERO_RANGE) and then writes the 4KB of data.
> >
> > 2) off w/o ZERO_RANGE: QEMU writes the 4KB of data and fills the rest
> > of the cluster with zeroes.
> >
> > 3) metadata: all clusters were allocated when the image was created
> > but they are sparse, QEMU only writes the 4KB of data.
> >
> > 4) falloc: all clusters were allocated with fallocate() when the image
> > was created, QEMU only writes 4KB of data.
> >
> > 5) full: all clusters were allocated by writing zeroes to all of them
> > when the image was created, QEMU only writes 4KB of data.
> >
> > As I said in a previous message I'm not familiar with xfs, but the
> > parts that I don't understand are:
> >
> > - Why is (4) slower than (1)?
>
> Because fallocate() is a full IO serialisation barrier at the
> filesystem level. If you do:
>
> fallocate(whole file)
> <IO>
> <IO>
> <IO>
> .....
>
> The IO can run concurrently and does not serialise against anything in
> the filesystem except unwritten extent conversions at IO completion
> (see answer to next question!)
>
> However, if you just use (4) you get:
>
> falloc(64k)
> <wait for inflight IO to complete>
> <allocates 64k as unwritten>
> <4k io>
> ....
> falloc(64k)
> <wait for inflight IO to complete>
> ....
> <4k IO completes, converts 4k to written>
> <allocates 64k as unwritten>
> <4k io>
> falloc(64k)
> <wait for inflight IO to complete>
> ....
> <4k IO completes, converts 4k to written>
> <allocates 64k as unwritten>
> <4k io>
> ....
>

Option 4 is described above as initial file preallocation whereas
option 1 is per 64k cluster prealloc. Prealloc mode mixup aside, Berto
is reporting that the initial file preallocation mode is slower than
the per cluster prealloc mode. Berto, am I following that right?

Brian

> until all the clusters in the qcow2 file are initialised. IOWs, each
> fallocate() call serialises all IO in flight. Compare that to using
> extent size hints on a raw sparse image file for the same thing:
>
> <set 64k extent size hint>
> <4k IO>
> <allocates 64k as unwritten>
> ....
> <4k IO>
> <allocates 64k as unwritten>
> ....
> <4k IO>
> <allocates 64k as unwritten>
> ....
> ...
> <4k IO completes, converts 4k to written>
> <4k IO completes, converts 4k to written>
> <4k IO completes, converts 4k to written>
> ....
>
> See the difference in IO pipelining here? You get the same "64kB
> cluster initialised at a time" behaviour as qcow2, but you don't get
> the IO pipeline stalls caused by fallocate() having to drain all the
> IO in flight before it does the allocation.
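
For reference, the extent size hint Dave recommends is a per-file
attribute. It can be set with something like "xfs_io -c 'extsize 1m'
image.raw", or directly via the FS_IOC_FSSETXATTR ioctl. A rough
sketch, with a made-up file name and assuming the hint is applied to a
freshly created, still-empty image file on XFS:

  /* Set a 1MB extent size hint so XFS allocates space for the sparse
   * image file in 1MB chunks as it is written.  File name is
   * hypothetical. */
  #include <fcntl.h>
  #include <linux/fs.h>   /* FS_IOC_FSGETXATTR, FS_IOC_FSSETXATTR */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
      struct fsxattr fsx;
      int fd = open("image.raw", O_RDWR);

      if (fd < 0 || ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
          perror("open/FS_IOC_FSGETXATTR");
          return 1;
      }
      fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* enable the hint */
      fsx.fsx_extsize = 1024 * 1024;        /* hint size in bytes */
      if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
          perror("FS_IOC_FSSETXATTR");
          return 1;
      }
      close(fd);
      return 0;
  }

As in the diagram above, writes into holes then allocate hint-sized
unwritten extents without draining the IO already in flight.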
>
> > - Why is (5) so much faster than everything else?
>
> The full file allocation in (5) means the IO doesn't have to modify
> the extent map, hence all extent mapping uses shared locking and
> the entire IO path can run concurrently without serialisation at
> all.
>
> Thing is, once your writes into sparse image files regularly start
> hitting written extents, the performance of (1), (2) and (4) will
> trend towards (5) as writes hit already allocated ranges of the file
> and the serialisation of extent mapping changes goes away. This
> occurs with guest filesystems that perform overwrite in place (such
> as XFS) and hence overwrites of existing data will hit allocated
> space in the image file and not require further allocation.
>
> IOWs, typical "write once" benchmark testing indicates the *worst*
> performance you are going to see. As the guest filesystem ages and
> initialises more of the underlying image file, it will get faster,
> not slower.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>