Re: [PATCH 12/16] ext4: add RWF_UNCACHED write support

Jens Axboe <axboe@xxxxxxxxx> · Tue, 12 Nov 2024 11:47:57 -0700

On 11/12/24 11:11 AM, Brian Foster wrote:
> On Tue, Nov 12, 2024 at 10:13:12AM -0700, Jens Axboe wrote:
>> On 11/12/24 9:36 AM, Brian Foster wrote:
>>> On Mon, Nov 11, 2024 at 04:37:39PM -0700, Jens Axboe wrote:
>>>> IOCB_UNCACHED IO needs to prune writeback regions on IO completion,
>>>> and hence need the worker punt that ext4 also does for unwritten
>>>> extents. Add an io_end flag to manage that.
>>>>
>>>> If foliop is set to foliop_uncached in ext4_write_begin(), then set
>>>> FGP_UNCACHED so that __filemap_get_folio() will mark newly created
>>>> folios as uncached. That in turn will make writeback completion drop
>>>> these ranges from the page cache.
>>>>
>>>> Now that ext4 supports both uncached reads and writes, add the fop_flag
>>>> FOP_UNCACHED to enable it.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>> ---
>>>>  fs/ext4/ext4.h    |  1 +
>>>>  fs/ext4/file.c    |  2 +-
>>>>  fs/ext4/inline.c  |  7 ++++++-
>>>>  fs/ext4/inode.c   | 18 ++++++++++++++++--
>>>>  fs/ext4/page-io.c | 28 ++++++++++++++++------------
>>>>  5 files changed, 40 insertions(+), 16 deletions(-)
>>>>
>>> ...
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 54bdd4884fe6..afae3ab64c9e 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -1138,6 +1138,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  	int ret, needed_blocks;
>>>>  	handle_t *handle;
>>>>  	int retries = 0;
>>>> +	fgf_t fgp_flags;
>>>>  	struct folio *folio;
>>>>  	pgoff_t index;
>>>>  	unsigned from, to;
>>>> @@ -1164,6 +1165,15 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  			return 0;
>>>>  	}
>>>>  
>>>> +	/*
>>>> +	 * Set FGP_WRITEBEGIN, and FGP_UNCACHED if foliop contains
>>>> +	 * foliop_uncached. That's how generic_perform_write() informs us
>>>> +	 * that this is an uncached write.
>>>> +	 */
>>>> +	fgp_flags = FGP_WRITEBEGIN;
>>>> +	if (*foliop == foliop_uncached)
>>>> +		fgp_flags |= FGP_UNCACHED;
>>>> +
>>>>  	/*
>>>>  	 * __filemap_get_folio() can take a long time if the
>>>>  	 * system is thrashing due to memory pressure, or if the folio
>>>> @@ -1172,7 +1182,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>>>>  	 * the folio (if needed) without using GFP_NOFS.
>>>>  	 */
>>>>  retry_grab:
>>>> -	folio = __filemap_get_folio(mapping, index, FGP_WRITEBEGIN,
>>>> +	folio = __filemap_get_folio(mapping, index, fgp_flags,
>>>>  					mapping_gfp_mask(mapping));
>>>>  	if (IS_ERR(folio))
>>>>  		return PTR_ERR(folio);
>>>
>>> JFYI, I notice that ext4 cycles the folio lock here in this path and
>>> thus follows up with a couple checks presumably to accommodate that. One
>>> is whether i_mapping has changed, which I assume means uncached state
>>> would have been handled/cleared externally somewhere..? I.e., if an
>>> uncached folio is somehow truncated/freed without ever having been
>>> written back?
>>>
>>> The next is a folio_wait_stable() call "in case writeback began ..."
>>> It's not immediately clear to me if that is possible here, but taking
>>> that at face value, is it an issue if we were to create an uncached
>>> folio, drop the folio lock, then have some other task dirty and
>>> writeback the folio (due to a sync write or something), then have
>>> writeback completion invalidate the folio before we relock it here?
>>
>> I don't either of those are an issue. The UNCACHED flag will only be set
>> on a newly created folio, it does not get inherited for folios that
>> already exist.
>>
> 
> Right.. but what I was wondering for that latter case is if the folio is
> created here by ext4, so uncached is set before it is unlocked.
> 
> On second look I guess the uncached completion invalidation should clear
> mapping and thus trigger the retry logic here. That seems reasonable
> enough, but is it still possible to race with writeback?
> 
> Maybe this is a better way to ask.. what happens if a write completes to
> an uncached folio that is already under writeback? For example, uncached
> write 1 completes, submits for writeback and returns to userspace. Then
> write 2 begins and redirties the same folio before the uncached
> writeback completes.
> 
> If I follow correctly, if write 2 is also uncached, it eventually blocks
> in writeback submission (folio_prepare_writeback() ->
> folio_wait_writeback()). It looks like folio lock is held there, so
> presumably that would bypass the completion time invalidation in
> folio_end_uncached(). But what if write 2 was not uncached or perhaps
> writeback completion won the race for folio lock vs. the write side
> (between locking the folio for dirtying and later for writeback
> submission)? Does anything prevent invalidation of the folio before the
> second write is submitted for writeback?
> 
> IOW, I'm wondering if the uncached completion time invalidation also
> needs a folio dirty check..?

Ah ok, I see what you mean. If the folio is dirty, the unmapping will
fail. But I guess with the recent change, we'll actually unmap it first.
I'll add the folio dirty check, thanks!

-- 
Jens Axboe