Re: [PATCH] zonefs: Always invalidate last cache page on append write

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Wed, 29 Mar 2023 18:49:13 +0900

On 3/29/23 17:27, Damien Le Moal wrote:
> On 3/29/23 17:14, Christoph Hellwig wrote:
>> On Wed, Mar 29, 2023 at 02:58:23PM +0900, Damien Le Moal wrote:
>>> +	/*
>>> +	 * If the inode block size (sector size) is smaller than the
>>> +	 * page size, we may be appending data belonging to an already
>>> +	 * cached last page of the inode. So make sure to invalidate that
>>> +	 * last cached page. This will always be a no-op for the case where
>>> +	 * the block size is equal to the page size.
>>> +	 */
>>> +	ret = invalidate_inode_pages2_range(inode->i_mapping,
>>> +					    iocb->ki_pos >> PAGE_SHIFT, -1);
>>> +	if (ret)
>>> +		return ret;
>>
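
For reference, the case the comment in the hunk describes can be sketched in
user space as below. This is only an illustration under assumed values (4 KiB
pages, 512 B blocks); EX_PAGE_SHIFT, EX_PAGE_SIZE, page_index() and
starts_mid_page() are hypothetical names, not kernel symbols.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages (the real value is per-arch). */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  (1UL << EX_PAGE_SHIFT)

/* Index of the page cache page that contains byte offset pos. */
static unsigned long page_index(long long pos)
{
	assert(pos >= 0);
	return (unsigned long)(pos >> EX_PAGE_SHIFT);
}

/* Nonzero if pos lies inside a page rather than on a page boundary,
 * i.e. an append at pos lands in a page that may already be cached. */
static int starts_mid_page(long long pos)
{
	assert(pos >= 0);
	return ((unsigned long long)pos & (EX_PAGE_SIZE - 1)) != 0;
}
```

With a write pointer at 4096 + 512 = 4608, page_index() gives 1 and
starts_mid_page() is nonzero: the append lands in the second page, which may
already be cached. When the block size equals the page size, every append
starts on a page boundary.
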
>> The missing truncate here obviously is a bug and needs fixing.
>>
>> But why does this not follow the logic in __iomap_dio_rw to return
>> -ENOTBLK for any error so that the write falls back to buffered I/O?
> 
> This is a write to sequential zones so we cannot use buffered writes. We have to
> do a direct write to ensure ordering between writes.
> 
> Note that this is the special blocking write case where we issue a zone append.
> For async regular writes, we use iomap so this bug does not exist. But then I
> now realize that __iomap_dio_rw() falling back to buffered IOs could also create
> an issue with write ordering.

Checking this, there is no issue: it is the FS calling iomap_dio_rw() that has
to fall back to buffered IO if it wants to, and zonefs does not do that.
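
To make the "caller decides" point concrete, here is a user-space mock, not
kernel code. mock_direct_write(), mock_buffered_write() and the two wrappers
are hypothetical stand-ins; the only thing taken from the discussion is that
the direct path can report -ENOTBLK and that the caller chooses whether to act
on it.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical direct I/O helper: returns the number of bytes written,
 * or -ENOTBLK when the direct path cannot be used. */
static long mock_direct_write(int can_do_direct, long len)
{
	assert(len >= 0);
	return can_do_direct ? len : -ENOTBLK;
}

/* Hypothetical buffered path: always succeeds in this mock. */
static long mock_buffered_write(long len)
{
	return len;
}

/* A caller that chooses to fall back to buffered I/O on -ENOTBLK. */
static long write_with_fallback(int can_do_direct, long len)
{
	long ret = mock_direct_write(can_do_direct, len);

	if (ret == -ENOTBLK)
		ret = mock_buffered_write(len);
	return ret;
}

/* A caller that never falls back: the error is simply returned, so no
 * buffered write can be reordered against the direct writes. */
static long write_no_fallback(int can_do_direct, long len)
{
	return mock_direct_write(can_do_direct, len);
}
```

The second wrapper is the shape that matters for sequential zones: since the
helper itself never switches to buffered IO, a caller that does not add the
fallback keeps all writes on the direct path.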

> 
>> Also as far as I can tell from reading the code, -1 is not a valid
>> end special case for invalidate_inode_pages2_range, so you'll actually
>> have to pass a valid end here.
> 
> I wondered about that but then saw:
> 
> int invalidate_inode_pages2(struct address_space *mapping)
> {
> 	return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> 
> which tends to indicate that "-1" is fine. The end is passed to
> find_get_entries() -> find_get_entry() where it becomes a "max" pgoff_t, so
> using -1 seems fine.
> 
> 
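
The conversion itself is easy to check in user space. A minimal sketch,
assuming only that pgoff_t is an unsigned long as in the kernel; range_end()
and index_in_range() are hypothetical helpers and do not reproduce the page
cache lookup code.

```c
#include <assert.h>
#include <limits.h>

/* User-space stand-in; in the kernel, pgoff_t is an unsigned long. */
typedef unsigned long pgoff_t;

/* Mimics a callee that takes the end of the range as a pgoff_t:
 * an int argument of -1 is converted to the largest pgoff_t value. */
static pgoff_t range_end(pgoff_t end)
{
	return end;
}

/* Nonzero if index lies in [start, end], both ends inclusive. */
static int index_in_range(pgoff_t index, pgoff_t start, pgoff_t end)
{
	assert(start <= end);
	return index >= start && index <= end;
}
```

range_end(-1) compares equal to ULONG_MAX, so an inclusive range ending at -1
covers every page index from the start onward.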

-- 
Damien Le Moal
Western Digital Research