Re: [PATCH 1/2] fsx: support reads/writes from buffers backed by hugepages

Joanne Koong <joannelkoong@xxxxxxxxx> · Mon, 23 Dec 2024 12:59:38 -0800

On Fri, Dec 20, 2024 at 3:50 AM Brian Foster <bfoster@xxxxxxxxxx> wrote:
>
> On Thu, Dec 19, 2024 at 02:34:01PM -0800, Joanne Koong wrote:
> > On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@xxxxxxxxxx> wrote:
> > >
> > > On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote:
> > > > Add support for reads/writes from buffers backed by hugepages.
> > > > This can be enabled through the '-h' flag. This flag should only be used
> > > > on systems where THP capabilities are enabled.
> > > >
> > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx>
> > > > ---
> > >
> > > Firstly, thanks for taking the time to add this. This seems like a nice
> > > idea. It might be nice to have an extra sentence or two in the commit
> > > log on the purpose/motivation. For example, has this been used to detect
> > > a certain class of problem?
> >
> > Hi Brian,
> >
> > Thanks for reviewing this. That's a good idea - I'll include the
> > sentence from the cover letter to this commit message as well: "This
> > is motivated by a recent bug that was due to faulty handling for
> > userspace buffers backed by hugepages."
> >
>
> Thanks. Got a link or anything, for my own curiosity?
>
> Also, I presume the followup fstest is a reproducer?

This is the link to the bug and the ensuing discussion:
https://lore.kernel.org/linux-fsdevel/p3iss6hssbvtdutnwmuddvdadubrhfkdoosgmbewvo674f7f3y@cwnwffjqltzw/
. The tldr is that even if a filesystem does not support hugepages, it
could still encounter hugepages in the direct io case for large folios
backing userspace buffers. I missed this in commit 3b97c3652d91, which
resulted in incorrect data being forwarded in fuse. There's currently
no fstest checking against hugepages-backed userspace buffers, so this
patchset is a followup for this case and would have caught the bug.

>
> > >
> > > A few other quick comments below...
> > >
> > > >  ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++-----
> > > >  1 file changed, 92 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/ltp/fsx.c b/ltp/fsx.c
> > > > index 41933354..3656fd9f 100644
> > > > --- a/ltp/fsx.c
> > > > +++ b/ltp/fsx.c
> > > > @@ -190,6 +190,7 @@ int       o_direct;                       /* -Z */
> > > >  int  aio = 0;
> > > > +}
> > > > +
> > > > +static void *
> > > > +init_hugepages_buf(unsigned len, long hugepage_size)
> > > > +{
> > > > +     void *buf;
> > > > +     long buf_size = roundup(len, hugepage_size);
> > > > +
> > > > +     if (posix_memalign(&buf, hugepage_size, buf_size)) {
> > > > +             prterr("posix_memalign for buf");
> > > > +             return NULL;
> > > > +     }
> > > > +     memset(buf, '\0', len);
> > >
> > > I'm assuming it doesn't matter, but did you want to use buf_size here to
> > > clear the whole buffer?
> >
> > I only saw buf being used up to len in the rest of the code so I
> > didn't think it was necessary, but I also don't feel strongly about
> > this and am happy to change this to clear the entire buffer if
> > preferred.
> >
>
> Yeah.. at first it looked like a bug to me, then I realized the same
> thing later. I suspect it might be wise to just clear it entirely to
> avoid any future landmines, but that could just be my internal bias
> talking too. No big deal either way.
>

Sounds great, I'll clear it entirely in v2.

> > >
> > > > +     if (madvise(buf, buf_size, MADV_COLLAPSE)) {
> > > > +             prterr("madvise collapse for buf");
> > > > +             free(buf);
> > > > +             return NULL;
> > > > +     }
> > > > +
> > > > +     return buf;
> > > > +}
> > > > @@ -3232,12 +3287,41 @@ main(int argc, char **argv)
> > > >       original_buf = (char *) malloc(maxfilelen);
> > > >       for (i = 0; i < maxfilelen; i++)
> > > >               original_buf[i] = random() % 256;
> > > > -     good_buf = (char *) malloc(maxfilelen + writebdy);
> > > > -     good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > > -     memset(good_buf, '\0', maxfilelen);
> > > > -     temp_buf = (char *) malloc(maxoplen + readbdy);
> > > > -     temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > > -     memset(temp_buf, '\0', maxoplen);
> > > > +     if (hugepages) {
> > > > +             long hugepage_size;
> > > > +
> > > > +             hugepage_size = get_hugepage_size();
> > > > +             if (hugepage_size == -1) {
> > > > +                     prterr("get_hugepage_size()");
> > > > +                     exit(99);
> > > > +             }
> > > > +
> > > > +             if (writebdy != 1 && writebdy != hugepage_size)
> > > > +                     prt("ignoring write alignment (since -h is enabled)");
> > > > +
> > > > +             if (readbdy != 1 && readbdy != hugepage_size)
> > > > +                     prt("ignoring read alignment (since -h is enabled)");
> > >
> > > I'm a little unclear on what these warnings mean. The alignments are
> > > still used in the read/write paths afaics. The non-huge mode seems to
> > > only really care about the max size of the buffers in this code.
> > >
> > > If your test doesn't actually use read/write alignments and the goal is
> > > just to keep things simple, perhaps it would be cleaner to add something
> > > like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after
> > > option processing and exit out as an unsupported combination..?
> >
> > My understanding of the 'writebdy' and 'readbdy' options are that
> > they're for making reads/writes aligned to the passed-in value, which
> > depends on the starting address of the buffer being aligned to that
> > value as well. However for hugepages buffers, they must be aligned to
> > the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE)
> > call will fail. As such, it is not guaranteed that the requested
> > alignment will actually be abided by. For that reason, I thought it'd
> > be useful to print this out to the user so they know requested
> > alignments will be ignored, but it didn't seem severe enough of an
> > issue to error out and exit altogether. But maybe it'd be less
> > confusing for the user if this instead does just error out if the
> > alignment isn't a multiple of the hugepage size.
> >
>
> Ahh, I see. I missed the round_ptr_up() adjustments. That makes more
> sense now.
>
> IMO it would be a little cleaner to just bail out earlier as such. But
> either way, I suppose if you could add a small comment with this
> alignment context you've explained above with the error checks then that
> is good enough for me. Thanks!
>

Sounds great, will do for v2.

Thanks,
Joanne

> Brian
>
> > >
> > > BTW, it might also be nice to factor out this whole section of buffer
> > > initialization code (including original_buf) into an init_buffers() or
> > > some such. That could be done as a prep patch, but just a suggestion
> > > either way.
> >
> > Good idea - i'll do this refactoring for v2.
> >
> >
> > Thanks,
> > Joanne
> > >
> > > Brian
> > >
> > > > +
> > > > +             good_buf = init_hugepages_buf(maxfilelen, hugepage_size);
> > > > +             if (!good_buf) {
> > > > +                     prterr("init_hugepages_buf failed for good_buf");
> > > > +                     exit(100);
> > > > +             }
> > > > +
> > > > +             temp_buf = init_hugepages_buf(maxoplen, hugepage_size);
> > > > +             if (!temp_buf) {
> > > > +                     prterr("init_hugepages_buf failed for temp_buf");
> > > > +                     exit(101);
> > > > +             }
> > > > +     } else {
> > > > +             good_buf = (char *) malloc(maxfilelen + writebdy);
> > > > +             good_buf = round_ptr_up(good_buf, writebdy, 0);
> > > > +             memset(good_buf, '\0', maxfilelen);
> > > > +
> > > > +             temp_buf = (char *) malloc(maxoplen + readbdy);
> > > > +             temp_buf = round_ptr_up(temp_buf, readbdy, 0);
> > > > +             memset(temp_buf, '\0', maxoplen);
> > > > +     }
> > > >       if (lite) {     /* zero entire existing file */
> > > >               ssize_t written;
> > > >
> > > > --
> > > > 2.47.1
> > > >
> > > >
> > >
> >
>