Re: [PATCH v11 2/2] zonefs: Add documentation

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 6 Feb 2020 09:29:47 +1100

On Wed, Feb 05, 2020 at 09:08:37PM +0900, Damien Le Moal wrote:
> Add the new file Documentation/filesystems/zonefs.txt to document
> zonefs principles and user-space tool usage.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx>
.....

Just looking at the error handling text...

> +Several optional features of zonefs can be enabled at format time.
> +* Conventional zone aggregation: ranges of contiguous conventional zones can be
> +  aggregated into a single larger file instead of the default one file per zone.
> +* File ownership: The owner UID and GID of zone files is by default 0 (root)
> +  but can be changed to any valid UID/GID.
> +* File access permissions: the default 640 access permissions can be changed.
> +
> +zonefs mount options
> +--------------------

This section is really all about error handling, not so much mount
options in general...

> +
> +zonefs defines several mount options allowing the user to control the file
> +system behavior when write I/O errors occur and when inconsistencies between a
> +file size and its zone write pointer position are discovered. The handling of
> +read I/O errors is not changed by these options as long as no inode size
> +corruption is detected.
> +
> +These options are as follows.
> +* errors=remount-ro (default)
> +  All write IO errors and errors due to a zone of the device going "bad"
> +  (condition changed to offline or read-only), the file system is remounted
> +  read-only after fixing the size and access permissions of the inode that
> +  suffered the IO error.

What does "fixing the size and access permissions of the inode"
mean?

> +* errors=zone-ro
> +  Any write IO error to a file zone result in the zone being considered as in a
> +  read-only condition, preventing any further modification to the file. This
> +  option does not affect the handling of errors due to offline zones. For these
> +  zones, all accesses (read and write) are disabled.

If the zone is marked RO, then shouldn't reads still work?. Oh, hold
on, you're now talking about errors that take the zone oflfine at
the device level?

Perhaps a table describing what IO can be done to a zone vs the
device once an error occurs would be a clearer way of describing the
behaviour.

It seems to me that a table might be a better way of decribing all
the different conditions

				Post error access permissions
				   zone		    device
mountopt	zone state	read	write	read	write
--------	----------	----	-----	----	-----
remount-ro	good		yes	no	yes	no
		RO		yes	no	yes	no
		Offline		no	no	yes	no

zone-ro		good		yes	no	yes	yes
		RO		yes	no	yes	yes
		Offline		no	no	yes	yes

zone-offline	good		no	no	yes	yes
		RO		no	no	yes	yes
		Offline		no	no	yes	yes

repair		good		yes	yes	yes	yes
		RO		yes	no	yes	yes
		Offline		no	no	yes	yes

And then you can document that an offline zone will always appear to
have a size of 0, be immutable, etc, while a read-only zone will
have a size that reflects the amount of valid data in the zone that
can be read.

IOWs, you don't need to mix the definitions of zone state appearence
and behaviour with descriptions of what actions the different mount
options take when a write error occurs.

> +* errors=zone-offline
> +  Any write IO error to a file zone result in the zone being considered as in
> +  an offline condition. This implies that the file size is changed to 0 and all
> +  read/write accesses to the file disabled, preventing all accesses by the user.
> +* errors=repair
> +  Any inconsistency between an inode size and its zone amount of written data
> +  due to IO errors or external corruption are fixed without any change to file
> +  access rights. This option does not affect the processing of zones that were
> +  signaled as read-only or offline by the device. For read-only zones, the file
> +  read accesses are disabled and for offline zones, all access permissions are
> +  removed.
> +
> +For sequential zone files, inconsistencies between an inode size and the amount
> +of data writen in its zone, that is, the position of the file zone write
> +pointer, can result from different events:
> +* When the device write cache is enabled, a differed write error can occur

"a different write error"?

> +  resulting in the amount of data written in the zone being less than the inode
> +  size.

Though I suspect that what you really mean to say is that errors can
occur in ranges of previously completed writes can occur when the
cache is flushed, hence less data being physically written than the
OS has previously be told was written by the hardware. i.e. visible
inode size goes backwards.

> +* Partial failures of large write I/O operations (e.g. one BIO of a multi-bio
> +  large direct write fails) can result in the amount of data written in the
> +  zone being larger than the inode size.
> +* External action on the disk such as write, zone reset or zone finish
> +  operations will change a file zone write pointer position resulting in a
> +  reported amount of written data being different from the file inode size.

*nod*

> +Finally, defective drives may change the condition of any zone to offline (zone
> +dead) or read-only. Such changes, when discovered with the IO errors they can
> +cause, are handled automatically regardless of the options specified at mount
> +time. For offline zones, the action taken is similar to the action defined by
> +the errors=zone-offline mount option. For read-only zones, the action used is
> +as defined by the errors=zone-ro mount option.

Hmmmm. I think that's over-complicating things and takes control of
error handling away from the user. That is, regardless of the reason
for the error, if we get a write error and the user specified
errors=remount-ro, the entire device should go read-only because
that's what the user has told the filesystem to do on write error.

This seems pretty user-unfriendly - giving them a way to control
error handling behaviour and then ignoring it for specific errors
despite the fact they mean exactly the same thing to the user: the
write failed because a zone has gone bad since the last time that
zone was accessed by the application....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx