Re: [PATCH] filemap: skip range writeback if end offset precedes start

Brian Foster <bfoster@xxxxxxxxxx> · Mon, 7 Nov 2022 11:33:46 -0500

On Sun, Oct 30, 2022 at 09:51:00PM -0700, Andrew Morton wrote:
> On Fri, 28 Oct 2022 08:54:28 -0400 Brian Foster <bfoster@xxxxxxxxxx> wrote:
> 
> > A call to file[map]_write_and_wait_range() with an end offset that
> > precedes the start offset but happens to land in the same page can
> > trigger writeback submission but fail to wait on the submitted page.
> > Writeback submission occurs because __filemap_fdatawrite_range()
> > passes both offsets down into write_cache_pages(), which rounds down
> > to page indexes before it starts processing writeback.
> > __filemap_fdatawait_range() immediately returns if the specified end
> > offset precedes the start offset, however.
> > 
> > I suspect these checks are primarily intended to handle overflow
> > conditions. I happened to notice this behavior when investigating an
> > unrelated problem and observed that a filemap_write_and_wait_range()
> > call with unexpected parameters had seemingly unpredictable latency.
> > That latency turned out to be the submission path occasionally
> > waiting on writeback state of the page (i.e. from
> > write_cache_pages()) before issuing the currently requested
> > writepage and then unconditionally failing to wait on the latter via
> > __filemap_fdatawait_range().
> > 
> > This could probably be reasonably fixed to either elide the
> > submission, as this patch does, or modify the fdatawait path to
> > check the page indexes instead of the unaligned offsets. After
> > poking around a bit, it seemed more consistent with various other
> > filemap interfaces to check the offsets in the write path and return
> > if the end offset is not >= the start. For example,
> > filemap_range_has_page() and filemap_range_has_writeback() both
> > include similar byte granularity checks.
> > 
> > ...
> >
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -418,6 +418,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start,
> >  		.range_end = end,
> >  	};
> >  
> > +	if (end < start)
> > +		return 0;
> > +
> >  	return filemap_fdatawrite_wbc(mapping, &wbc);
> >  }
> 

Hi Andrew,

Sorry for the delay..

> Is there any way in which this condition can be triggered from
> userspace?  Or from any non-buggy kernelspace?
> 

Hmm.. good question. Making a quick pass through the callers of
__filemap_fdatawrite_range(), I see the following situations:

- sync_file_range() includes a higher level end < start check that
results in skipping the operation. This appears to cover the
sync_file_range() family of syscalls.

- generic_fadvise(..., POSIX_FADV_DONTNEED) - sets end = -1 if less than
start. This essentially converts the range to cover through the end of
the file.

- filemap_fdatawrite_range() and file[map]_write_and_wait_range() are
called from quite a few places with non-deterministic inputs. It
wouldn't surprise me if some of these callers could end up passing
end < start based on unsanitized input or buggy logic. They may simply
not notice, depending on whether they wait or not.

For example, gfs2_fsync() calls filemap_fdata[write|wait]_range()
separately. In theory, if it were called with end < start and both
offsets landed in the same page, the write could submit and the wait
could be skipped, just as with filemap_write_and_wait_range() (assuming
the conditional whole file write in that path doesn't trigger).
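
FWIW, here is a rough userspace model of the mismatch, in case it helps
to see the two checks side by side. This is only a sketch and not the
kernel code: the model_*() names and MODEL_PAGE_SHIFT are made up for
illustration, 4k pages and non-negative offsets are assumed, and the
write side is reduced to the page index comparison described in the
commit log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages. Offsets are assumed to be non-negative. */
#define MODEL_PAGE_SHIFT 12

/*
 * Write side as it behaves today: the byte offsets are rounded down to
 * page indexes before the range is walked, so submission happens
 * whenever the end index does not precede the start index.
 */
bool model_write_submits(int64_t start, int64_t end)
{
	return (start >> MODEL_PAGE_SHIFT) <= (end >> MODEL_PAGE_SHIFT);
}

/* Wait side: returns immediately if the end byte offset precedes start. */
bool model_wait_waits(int64_t start, int64_t end)
{
	return end >= start;
}

/* Write side with the byte granularity check from this patch applied. */
bool model_write_submits_patched(int64_t start, int64_t end)
{
	if (end < start)
		return false;
	return model_write_submits(start, end);
}
```

With start = 4196 and end = 4100 (same page, end before start),
model_write_submits() is true while model_wait_waits() is false, which
is the write-without-wait case. With start = 8192 and end = 4095 the
offsets are in different pages and both are false. With the patched
check, the write side agrees with the wait side in both cases.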

> Should we have a WARN_ON() in there to detect this?
> 

That might make sense. I suppose it depends on what the expected
behavior is. It certainly doesn't make much sense to write and then not
wait from the write_and_wait() variants, so callers would probably want
to know about that. It might be hard to really audit all callsites to
determine whether anybody actually relies on the "end is before start
but lands on the same page" behavior for the write only case.

We could alternatively change fdatawait to compare the shifted page
indexes to match fdatawrite behavior, but that all seems a bit fragile
for two reasons: 1. the various higher level byte granularity checks,
and 2. I doubt anybody actually checks whether the range crosses a page
boundary, which is the difference between skipping just the wait and
skipping the write as well.

So I dunno, I could see various combinations of changes being considered
reasonable. Perhaps a good starting point would be to wrap the check in
this patch with a WARN_ON_ONCE() and let it soak in -next for a while?
That would avoid excessive noise from repetitive callers [1] but still
allow those callsites to be identified/fixed. If there is some really
weird fdatawrite-only caller that conflicts, the change could always be
loosened up from there (as unlikely as that seems).. Hm?
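
To illustrate the warn-once idea, here is a small userspace sketch. It
is not the kernel macro: model_warn_on_once(), model_fdatawrite_range()
and the counters are hypothetical names, and a single global flag stands
in for the per-callsite state that the real WARN_ON_ONCE() keeps. The
one property it relies on is that the macro evaluates to the condition
on every call, so the early return still happens after the first
warning has been printed.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counters, only here so the behavior can be observed. */
int model_warn_count;
int model_submit_count;

/* One global flag; the real macro tracks this per callsite. */
static bool model_warned;

/* Warn the first time cond is true, but always return cond. */
bool model_warn_on_once(bool cond)
{
	if (cond && !model_warned) {
		model_warned = true;
		model_warn_count++;
		fprintf(stderr, "WARNING: end offset precedes start\n");
	}
	return cond;
}

/* Model of the write path with the check wrapped in the warning. */
int model_fdatawrite_range(long long start, long long end)
{
	if (model_warn_on_once(end < start))
		return 0;		/* skip submission entirely */

	model_submit_count++;		/* stands in for the real submission */
	return 0;
}
```

A caller that repeatedly passes end < start gets a single warning and
never submits, while a sane range submits as usual.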

Brian

[1] The use case that identified this problem is a wonky call from the
XFS truncate path, in a workload that makes this truncate call
repeatedly. A WARN_ON_ONCE() would have most definitely been useful IMO,
but an unconditional warning would spam the logs in this particular
case.