On 2020/02/20 9:55, Randy Dunlap wrote:
> Hi Damien,
>
> Typo etc. corrections below:

Thanks. Will correct these. Since this is now in the kernel, you can send a
patch too :)

>
> On 2/6/20 7:16 PM, Damien Le Moal wrote:
>> Add the new file Documentation/filesystems/zonefs.txt to document
>> zonefs principles and user-space tool usage.
>>
>> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx>
>> Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
>> ---
>> Documentation/filesystems/zonefs.txt | 404 +++++++++++++++++++++++++++
>> MAINTAINERS | 1 +
>> 2 files changed, 405 insertions(+)
>> create mode 100644 Documentation/filesystems/zonefs.txt
>>
>> diff --git a/Documentation/filesystems/zonefs.txt b/Documentation/filesystems/zonefs.txt
>> new file mode 100644
>> index 000000000000..935bf22031ca
>> --- /dev/null
>> +++ b/Documentation/filesystems/zonefs.txt
>> @@ -0,0 +1,404 @@
>> +ZoneFS - Zone filesystem for Zoned block devices
>> +
>> +Introduction
>> +============
>> +
> ...
>> +
>> +Zoned block devices
>> +-------------------
>> +
> ...
>> +
>> +Zonefs Overview
>> +===============
>> +
> ...
>
>> +
>> +On-disk metadata
>> +----------------
>> +
> ...
>
>> +
>> +Zone type sub-directories
>> +-------------------------
>> +
> ...
>
>> +
>> +Zone files
>> +----------
>> +
> ...
>
>> +
>> +Conventional zone files
>> +-----------------------
>> +
> ...
>
>> +
>> +Sequential zone files
>> +---------------------
>> +
>> +The size of sequential zone files grouped in the "seq" sub-directory represents
>> +the file's zone write pointer position relative to the zone start sector.
>> +
>> +Sequential zone files can only be written sequentially, starting from the file
>> +end, that is, write operations can only be append writes. Zonefs makes no
>> +attempt at accepting random writes and will fail any write request that has a
>> +start offset not corresponding to the end of the file, or to the end of the last
>> +write issued and still in-flight (for asynchrnous I/O operations).
> asynchronous
>
>> +
>> +Since dirty page writeback by the page cache does not guarantee a sequential
>> +write pattern, zonefs prevents buffered writes and writeable shared mappings
>> +on sequential files. Only direct I/O writes are accepted for these files.
>> +zonefs relies on the sequential delivery of write I/O requests to the device
>> +implemented by the block layer elevator. An elevator implementing the sequential
>> +write feature for zoned block device (ELEVATOR_F_ZBD_SEQ_WRITE elevator feature)
>> +must be used. This type of elevator (e.g. mq-deadline) is the set by default
>
> is set by default
>
>> +for zoned block devices on device initialization.
>> +
> ...
>
>> +
>> +Format options
>> +--------------
>> +
> ...
>
>> +
>> +IO error handling
>> +-----------------
>> +
> ...
>
>> +
>> +
>> +* Unaligned write errors: These errors result from the host issuing write
>> +  requests with a start sector that does not correspond to a zone write pointer
>> +  position when the write request is executed by the device. Even though zonefs
>> +  enforces sequential file write for sequential zones, unaligned write errors
>> +  may still happen in the case of a partial failure of a very large direct I/O
>> +  operation split into multiple BIOs/requests or asynchronous I/O operations.
>> +  If one of the write request within the set of sequential write requests
>> +  issued to the device fails, all write requests after queued after it will
>
> requests queued after it
>
>> +  become unaligned and fail.
>> +
> ...
>
>> +
>> +All I/O errors detected by zonefs are notified to the user with an error code
>> +return for the system call that trigered or detected the error. The recovery
>
> triggered
>
>> +actions taken by zonefs in response to I/O errors depend on the I/O type (read
>> +vs write) and on the reason for the error (bad sector, unaligned writes or zone
>> +condition change).
>> +
> ...
>
>> +
>> +Zonefs minimal I/O error recovery may change a file size and a file access
>
> and file access
>
>> +permissions.
>> +
>> +* File size changes:
>> +  Immediate or delayed write errors in a sequential zone file may cause the file
>> +  inode size to be inconsistent with the amount of data successfully written in
>> +  the file zone. For instance, the partial failure of a multi-BIO large write
>> +  operation will cause the zone write pointer to advance partially, even though
>> +  the entire write operation will be reported as failed to the user. In such
>> +  case, the file inode size must be advanced to reflect the zone write pointer
>> +  change and eventually allow the user to restart writing at the end of the
>> +  file.
>> +  A file size may also be reduced to reflect a delayed write error detected on
>> +  fsync(): in this case, the amount of data effectively written in the zone may
>> +  be less than originally indicated by the file inode size. After such I/O
>> +  error, zonefs always fixes a file inode size to reflect the amount of data
>
> fixes the file inode size
>
>> +  persistently stored in the file zone.
>> +
>> +* Access permission changes:
> ...
>
>> +
>> +Further notes:
>> +* The "errors=remount-ro" mount option is the default behavior of zonefs I/O
>> +  error processing if no errors mount option is specified.
>> +* With the "errors=remount-ro" mount option, the change of the file access
>> +  permissions to read-only applies to all files. The file system is remounted
>> +  read-only.
>> +* Access permission and file size changes due to the device transitioning zones
>> +  to the offline condition are permanent. Remounting or reformating the device
>
> usually: reformatting
>
>> +  with mkfs.zonefs (mkzonefs) will not change back offline zone files to a good
>> +  state.
>> +* File access permission changes to read-only due to the device transitioning
>> +  zones to the read-only condition are permanent. Remounting or reformating
>
> reformatting
>
>> +  the device will not re-enable file write access.
>> +* File access permission changes implied by the remount-ro, zone-ro and
>> +  zone-offline mount options are temporary for zones in a good condition.
>> +  Unmounting and remounting the file system will restore the previous default
>> +  (format time values) access rights to the files affected.
>> +* The repair mount option triggers only the minimal set of I/O error recovery
>> +  actions, that is, file size fixes for zones in a good condition. Zones
>> +  indicated as being read-only or offline by the device still imply changes to
>> +  the zone file access permissions as noted in the table above.
>> +
>> +Mount options
>> +-------------
>> +
>> +zonefs define the "errors=<behavior>" mount option to allow the user to specify
>> +zonefs behavior in response to I/O errors, inode size inconsistencies or zone
>> +condition chages. The defined behaviors are as follow:
>
> changes.
>
>> +* remount-ro (default)
>> +* zone-ro
>> +* zone-offline
>> +* repair
>> +
>> +The I/O error actions defined for each behavior is detailed in the previous
>
> are
>
>> +section.
>> +
>> +Zonefs User Space Tools
>> +=======================
>> +
> ...
>> +
>> +Examples
>> +--------
>> +
> ...
>
>
> HTH.
>

--
Damien Le Moal
Western Digital Research