Re: Block size and read-modify-write

On 03/01/2018 02:19, Dave Chinner wrote:
Cached writes smaller than a *page* will cause RMW cycles in the
page cache, regardless of the block size of the filesystem.

Sure, in this case a page-sized r/m/w cycle happens in the pagecache. However, it seems to me that, when flushed to disk, writes happen at block-level granularity, as you can see from tests [1] and [2] below. Am I wrong? Am I missing something?
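
To make the numbers in tests [1] and [2] easier to follow, here is a minimal sketch of the arithmetic, assuming the buffered write is simply rounded out to whole filesystem blocks when it reaches the disk. The helper name rmw_span is made up for illustration; it is not an XFS or kernel interface.

```shell
#!/bin/sh
# rmw_span OFFSET LENGTH BLOCKSIZE  (hypothetical helper, illustration only)
# Print the block-aligned start offset and the byte count covering a write,
# assuming the write is rounded out to whole filesystem blocks.
rmw_span() {
    offset=$1
    length=$2
    bs=$3
    start=$(( offset / bs * bs ))                     # round start down
    end=$(( (offset + length + bs - 1) / bs * bs ))   # round end up
    echo "$start $(( end - start ))"
}

rmw_span 0 512 4096   # 4K blocks, as in test [1]
rmw_span 0 512 1024   # 1K blocks, as in test [2]
```

Under that assumption a 512B write at offset 0 covers 4096 bytes with 4K blocks and 1024 bytes with 1K blocks, which matches the dstat figures below.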

Ok, there is a difference between *sector size* and *filesystem
block size*. You seem to be using them interchangeably in your
question, and that's not correct.

True, maybe I have trouble grasping the concept of sector size from XFS's point of view. I understand sector size as a hardware property of the underlying block device, but how does it relate to the filesystem?

I naively supposed that an XFS filesystem created with a 4k *sector* size (i.e. mkfs.xfs -s size=4096) would prevent 512-byte O_DIRECT writes, but my test [3] shows that even on such a filesystem a 512B direct write is indeed possible.

Is the sector size information only used by XFS's own metadata and journaling, in order to avoid costly device-level r/m/w cycles on 512e devices? I understand that on a 4Kn device you *have* to avoid sub-sector writes, or the transfer will fail.
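
For reference, the device classes mentioned above differ only in the logical and physical sector sizes the device reports. A small sketch, assuming the two values come from something like blockdev --getss and blockdev --getpbsz; classify_sectors is a hypothetical helper name:

```shell
#!/bin/sh
# classify_sectors LOGICAL PHYSICAL  (hypothetical helper, illustration only)
# Map the reported logical/physical sector sizes to the usual device class.
classify_sectors() {
    case "$1/$2" in
        512/512)   echo "512n" ;;     # native 512B sectors
        512/4096)  echo "512e" ;;     # 4K physical, 512B emulated
        4096/4096) echo "4Kn" ;;      # native 4K sectors
        *)         echo "unknown" ;;
    esac
}

classify_sectors 512 512   # the values reported by /dev/sda3 in the test output
```

With the values shown in the test output below (512 and 512), the device falls in the first class, so none of my tests exercise the 512e or 4Kn cases.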


.... this is not correct for direct IO. The direct IO path does not
do RMW cycles at all.

Put simply: a 512B DIO write on a (real or emulated) 512B sector
device with a 4k FSB will be serialised by the filesystem and do a
single 512B sector write to the device.  However, if the device
reports as a 4k sector device then a 512B DIO write will be rejected
by the filesystem because sub-sector IO is not possible.

Ok, this was as expected.
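
The rule quoted above can be sketched as a simple alignment check. This is only an illustration: dio_write_ok is a made-up helper, the real check happens in the kernel, and memory buffer alignment is ignored here. The third argument is assumed to be the logical sector size the device reports (e.g. from blockdev --getss).

```shell
#!/bin/sh
# dio_write_ok OFFSET LENGTH SECTOR_SIZE  (hypothetical helper, illustration only)
# Succeeds when both the file offset and the length are multiples of the
# sector size reported by the device; fails otherwise.
dio_write_ok() {
    [ $(( $1 % $3 )) -eq 0 ] && [ $(( $2 % $3 )) -eq 0 ]
}

if dio_write_ok 0 512 512;  then echo "512B write, 512B sectors: accepted"; fi
if ! dio_write_ok 0 512 4096; then echo "512B write, 4k sectors: rejected"; fi
```

Since /dev/sda3 reports 512B sectors, this is consistent with the 512B direct write succeeding in test [3] even though the filesystem was made with -s size=4096.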

I want to give some context for the original question, and for why I am so interested in r/m/w cycles. SSD flash-page sizes have, in recent years (2014+), ballooned to 8/16/32K. I wonder if a matching block size and/or sector size is needed to avoid (some of the) device-level r/m/w cycles, which can dramatically increase flash write amplification (and so reduce endurance).
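
To put rough numbers on that worry, here is a worst-case sketch. It assumes the write starts on a flash-page boundary and that the device rewrites every flash page the write touches, with no buffering or coalescing in the controller; real SSDs may well do better, so treat this as an upper bound. flash_bytes_programmed is a hypothetical helper name.

```shell
#!/bin/sh
# flash_bytes_programmed WRITE_SIZE FLASH_PAGE_SIZE  (hypothetical helper)
# Bytes the device would program if every touched flash page were rewritten
# in full. Assumes the write starts on a flash-page boundary.
flash_bytes_programmed() {
    pages=$(( ($1 + $2 - 1) / $2 ))   # flash pages touched, rounded up
    echo $(( pages * $2 ))
}

flash_bytes_programmed 4096 16384    # 16384 bytes for 4096 written, i.e. 4x
flash_bytes_programmed 16384 16384   # 16384 bytes for 16384 written, i.e. 1x
```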

Thanks.


------ test output below ------

# Block device properties
[root@blackhole queue]# blockdev --getss --getpbsz --getiomin --getbsz /dev/sda3
512
512
512
4096

[1] # XFS with blocksize=4K and sectorsize=512B (default)
[root@blackhole queue]# mkfs.xfs /dev/sda3
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole queue]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole test]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole test]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B

[2] # XFS with blocksize=1K and sectorsize=512B
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -b size=1024
meta-data=/dev/sda3              isize=512    agcount=4, agsize=262144 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=1024   blocks=1048576, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=1024   blocks=10240, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc; sleep 1; done
# Dstat results: 1K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
1024B 1024B
1024B 1024B
1024B 1024B
# Write 512B via O_DIRECT
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B

[3] # XFS with blocksize=4K and sectorsize=4K
[root@blackhole mnt]# umount /mnt/test/
[root@blackhole mnt]# mkfs.xfs /dev/sda3 -f -s size=4096
meta-data=/dev/sda3              isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@blackhole mnt]# mount /dev/sda3 /mnt/test/
# Preallocate file to minimize metadata traffic
[root@blackhole mnt]# fallocate /mnt/test/test.img -l 256M
# Write 512B via pagecache
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc; sleep 1; done
# Dstat results: 4K reads/writes (read-modify-write)
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
4096B 4096B
4096B 4096B
4096B 4096B
# Write 512B via O_DIRECT
[root@blackhole mnt]# while true; do echo 3 > /proc/sys/vm/drop_caches; dd if=/dev/urandom of=/mnt/test/test.img bs=512 count=1 oflag=dsync conv=nocreat,notrunc oflag=direct; sleep 1; done
# Dstat results: 512B writes
[root@blackhole ~]# dstat -d -D /dev/sda3
--dsk/sda3-
 read  writ
   0   512B
   0   512B
   0   512B

--
Danti Gionatan
Supporto Tecnico
Assyoma S.r.l. - www.assyoma.it
email: g.danti@xxxxxxxxxx - info@xxxxxxxxxx
GPG public key ID: FF5F32A8