On 03/01/2018 02:19, Dave Chinner wrote:
Cached writes smaller than a *page* will cause RMW cycles in the
page cache, regardless of the block size of the filesystem.
Sure, in this case a page-sized r/m/w cycle happens in the pagecache.
However, it seems to me that, when flushed to disk, writes happen at
block-level granularity, as you can see from tests [1,2] below. Am I
wrong? Am I missing something?
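
To make the block-granularity observation concrete, here is a minimal sketch of the rounding involved. It assumes, based only on the dstat output in tests [1] and [2] and not on the XFS code, that a buffered sub-block write reaches the disk as whole filesystem blocks. rmw_span is a hypothetical helper name.

```shell
# Hypothetical helper: round a buffered write out to filesystem block
# boundaries and print "<start offset> <length>" of the resulting disk I/O.
# Assumption: the I/O seen by the device covers whole filesystem blocks.
rmw_span() {
    offset=$1 length=$2 block_size=$3
    # Round the start down and the end up to a multiple of the block size.
    start=$(( offset / block_size * block_size ))
    end=$(( (offset + length + block_size - 1) / block_size * block_size ))
    echo "$start $(( end - start ))"
}

rmw_span 0 512 4096   # 4K blocks, as in test [1]
rmw_span 0 512 1024   # 1K blocks, as in test [2]
```

With a 512B write at offset 0, this gives a 4096B span for 4K blocks and a 1024B span for 1K blocks, which matches the read and write sizes dstat reports.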
Ok, there is a difference between *sector size* and *filesystem
block size*. You seem to be using them interchangeably in your
question, and that's not correct.
True, maybe I have trouble grasping the concept of sector size from the XFS
point of view. I understand sector size as a hardware property of the
underlying block device, but how does it relate to the filesystem?
I naively supposed that an XFS filesystem created with a 4k *sector* size
(i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT writes, but
my test [3] shows that even on such a filesystem a 512B direct write is
still possible.
Is the sector size information only used for XFS's own metadata and
journaling, in order to avoid costly device-level r/m/w cycles on 512e
devices? I understand that on a 4Kn device you *have* to avoid sub-sector
writes, or the transfer will fail.
.... this is not correct for direct IO. The direct IO path does not
do RMW cycles at all.
Put simply: a 512B DIO write on a (real or emulated) 512B sector
device with a 4k FSB will be serialised by the filesystem and do a
single 512B sector write to the device. However, if the device
reports as a 4k sector device then a 512B DIO write will be rejected
by the filesystem because sub-sector IO is not possible.
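
The rule described above can be sketched as a simple alignment check. This is a simplified model, not the kernel's code: dio_write_allowed is a hypothetical helper, the sector size is the one the device reports, and the sketch ignores the memory-buffer alignment that O_DIRECT typically also requires.

```shell
# Hypothetical helper: model the direct I/O alignment rule.
# A write is accepted only if both its file offset and its length are
# multiples of the sector size reported by the device.
dio_write_allowed() {
    offset=$1 length=$2 sector_size=$3
    if [ $(( offset % sector_size )) -eq 0 ] &&
       [ $(( length % sector_size )) -eq 0 ]; then
        echo accepted
    else
        echo rejected
    fi
}

dio_write_allowed 0 512 512     # 512B write, 512B sector device (real or emulated)
dio_write_allowed 0 512 4096    # 512B write, device reporting 4k sectors
dio_write_allowed 0 4096 4096   # 4k write, device reporting 4k sectors
```

The first and third calls print "accepted" and the second prints "rejected". The first case is the one tests [1] to [3] exercise, since the device there reports 512B sectors.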
Ok, this was as expected.
I want to put some context on the original question, and on why I am so
interested in r/m/w cycles. SSD flash-page sizes have, in recent years
(2014+), ballooned to 8/16/32K. I wonder whether a matching block size
and/or sector size is needed to avoid (some of the) device-level r/m/w
cycles, which can dramatically increase flash write amplification (and so
reduce endurance).
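
As a rough illustration of the concern, the arithmetic below computes a worst-case figure under a deliberately naive assumption: every host write forces the device to program whole flash pages, with no coalescing or remapping by the controller. Real devices will behave differently, so treat the numbers as an upper bound for this model only. flash_write_cost is a hypothetical helper, and the writes are assumed to be aligned to the flash page.

```shell
# Hypothetical helper: worst-case flash bytes programmed for one host write.
# Naive model: the device rewrites every flash page the write touches.
# Assumes the write starts on a flash page boundary.
flash_write_cost() {
    write_size=$1 flash_page=$2
    # Number of flash pages touched, rounded up.
    pages=$(( (write_size + flash_page - 1) / flash_page ))
    programmed=$(( pages * flash_page ))
    # Ratio of bytes programmed to bytes the host asked to write.
    ratio=$(LC_ALL=C awk -v p="$programmed" -v w="$write_size" \
        'BEGIN { printf "%.2f", p / w }')
    echo "programmed=$programmed amplification=$ratio"
}

flash_write_cost 4096 16384    # 4K filesystem block on a 16K flash page
flash_write_cost 16384 16384   # write matching the flash page size
```

Under this model a 4K write to a 16K flash page programs 16384 bytes (amplification 4.00), while a 16K write programs exactly what was asked for (amplification 1.00).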
Thanks.
------ test output below ------
# Block device properties
[root@blackhole queue]# blockdev --getss --getpbsz --getiomin --getbsz /dev/sda3
512
512
512
4096
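
For reference, the first three values above (logical sector size, physical sector size, minimum I/O size) can also be read from sysfs. The sketch below assumes the parent disk is sda, as in the transcript. show_queue_limits and classify_sectors are hypothetical helper names, and the 512n/512e/4Kn labels are the usual informal terms.

```shell
# Hypothetical helper: name the sector layout from logical/physical sizes.
classify_sectors() {
    logical=$1 physical=$2
    case "$logical/$physical" in
        512/512)   echo 512n ;;   # native 512B sectors
        512/4096)  echo 512e ;;   # 4k physical sectors, 512B emulation
        4096/4096) echo 4Kn ;;    # native 4k sectors
        *)         echo other ;;
    esac
}

# Hypothetical helper: print the limits found in a queue directory,
# for example /sys/block/sda/queue.
show_queue_limits() {
    queue_dir=$1
    for attr in logical_block_size physical_block_size minimum_io_size; do
        printf '%s=%s\n' "$attr" "$(cat "$queue_dir/$attr")"
    done
}

# Only query sysfs if the assumed disk actually exists on this machine.
if [ -d /sys/block/sda/queue ]; then
    show_queue_limits /sys/block/sda/queue
fi
classify_sectors 512 512    # the sizes reported by blockdev above
```

For the 512/512 sizes reported above, classify_sectors prints "512n".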
[1] # XFS with blocksize=4K and sectorsize=512B (default)
[root@blackhole queue]# mkfs.xfs /dev/sda3
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole queue]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole test]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B
[2] # XFS with blocksize=1K and sectorsize=512B
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -b size=1024
meta-data=/dev/sda3              isize=512    agcount=4, agsize=262144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=1024   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=1024   blocks=10240, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 1K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
1024B 1024B
1024B 1024B
1024B 1024B
# Write 512B via O_DIRECT
while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B
[3] # XFS with blocksize=4K and sectorsize=4K
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -s size=4096
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; \
    dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync \
    conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B
--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8