On Tue, 18 Sep 2018, Christoph Hellwig wrote:

> On Tue, Sep 18, 2018 at 10:22:15AM -0400, Mikulas Patocka wrote:
> > > On Tue, Sep 18, 2018 at 07:46:47AM -0400, Mikulas Patocka wrote:
> > > > I would ask the XFS developers about this - why does mkfs.xfs select
> > > > sector size 512 by default?
> > >
> > > Because the underlying device told it that it supported a
> > > sector size of 512 bytes?
> >
> > SSDs lie about this. They have 4k sectors internally, but report 512.
>
> SSDs can't lie about the sector size because they don't even have
> sectors in the disk sense, they have program and erase block size,

They have a remapping table that maps each 4k block to a location on the
NAND flash.

> and some kind of FTL granularity (think of it like a file system block
> size - even a 4k block size file can do smaller writes with
> read-modify-write cycles, so can SSDs).
>
> SSDs can just properly implement the guarantees they inherited from
> disk by other means. So if an SSD claims it supports 512 byte blocks
> it better can deal with them atomically. If they have issues in that
> area (like Intel did recently where they corrupted data left right
> and center if you actually did 512byte writes) they are simply buggy.
>
> SATA and SAS SSDs can always use the same trick as modern disks to
> support 512 byte access where really needed (e.g. BIOS and legacy
> OSes) but give a strong hint to modern OSes that they don't want that
> to be actually used with the physical block exponent. NVMe doesn't
> have anything like that yet, but we are working on something like
> that in the NVMe TWG.

The question is - why do you want to use 512-byte writes if they perform
badly? For example, the Kingston NVMe SSD has 242k IOPS for 4k writes
and 45k IOPS for 2k writes.

The same problem exists with dm-writecache - it can run with 512-byte
sectors, but there's unnecessary overhead. It would be much better if
XFS did 4k-aligned writes.
You can do 4k writes and assume that only 512-byte units are written
atomically - that would be safe for old disks with 512-byte sectors and
it wouldn't degrade performance on SSDs.

If the XFS data blocks are aligned on a 4k boundary, you can do
4k-aligned I/Os on the metadata as well. You could allocate metadata in
512-byte quantities and still do 4k reads and writes on it. You would
over-read and over-write a bit, but it would perform better because it
avoids the read-modify-write logic in the SSD.

I'm not an expert in the XFS journal - but could the journal writes
just be padded to a 4k boundary (even if the journal space is allocated
in 512-byte quantities)?

Mikulas