On Fri, Dec 20, 2024 at 3:50 AM Brian Foster <bfoster@xxxxxxxxxx> wrote: > > On Thu, Dec 19, 2024 at 02:34:01PM -0800, Joanne Koong wrote: > > On Thu, Dec 19, 2024 at 6:27 AM Brian Foster <bfoster@xxxxxxxxxx> wrote: > > > > > > On Wed, Dec 18, 2024 at 01:01:21PM -0800, Joanne Koong wrote: > > > > Add support for reads/writes from buffers backed by hugepages. > > > > This can be enabled through the '-h' flag. This flag should only be used > > > > on systems where THP capabilities are enabled. > > > > > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > > > --- > > > > > > Firstly, thanks for taking the time to add this. This seems like a nice > > > idea. It might be nice to have an extra sentence or two in the commit > > > log on the purpose/motivation. For example, has this been used to detect > > > a certain class of problem? > > > > Hi Brian, > > > > Thanks for reviewing this. That's a good idea - I'll include the > > sentence from the cover letter to this commit message as well: "This > > is motivated by a recent bug that was due to faulty handling for > > userspace buffers backed by hugepages." > > > > Thanks. Got a link or anything, for my own curiosity? > > Also, I presume the followup fstest is a reproducer? This is the link to the bug and the ensuing discussion: https://lore.kernel.org/linux-fsdevel/p3iss6hssbvtdutnwmuddvdadubrhfkdoosgmbewvo674f7f3y@cwnwffjqltzw/ . The tldr is that even if a filesystem does not support hugepages, it could still encounter hugepages in the direct io case for large folios backing userspace buffers. I missed this in commit 3b97c3652d91, which resulted in incorrect data being forwarded in fuse. There's currently no fstest checking against hugepages-backed userspace buffers, so this patchset is a followup for this case and would have caught the bug. > > > > > > > A few other quick comments below... > > > > > > > ltp/fsx.c | 100 +++++++++++++++++++++++++++++++++++++++++++++++++----- > > > > 1 file changed, 92 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/ltp/fsx.c b/ltp/fsx.c > > > > index 41933354..3656fd9f 100644 > > > > --- a/ltp/fsx.c > > > > +++ b/ltp/fsx.c > > > > @@ -190,6 +190,7 @@ int o_direct; /* -Z */ > > > > int aio = 0; > > > > +} > > > > + > > > > +static void * > > > > +init_hugepages_buf(unsigned len, long hugepage_size) > > > > +{ > > > > + void *buf; > > > > + long buf_size = roundup(len, hugepage_size); > > > > + > > > > + if (posix_memalign(&buf, hugepage_size, buf_size)) { > > > > + prterr("posix_memalign for buf"); > > > > + return NULL; > > > > + } > > > > + memset(buf, '\0', len); > > > > > > I'm assuming it doesn't matter, but did you want to use buf_size here to > > > clear the whole buffer? > > > > I only saw buf being used up to len in the rest of the code so I > > didn't think it was necessary, but I also don't feel strongly about > > this and am happy to change this to clear the entire buffer if > > preferred. > > > > Yeah.. at first it looked like a bug to me, then I realized the same > thing later. I suspect it might be wise to just clear it entirely to > avoid any future landmines, but that could just be my internal bias > talking too. No big deal either way. > Sounds great, I'll clear it entirely in v2. > > > > > > > + if (madvise(buf, buf_size, MADV_COLLAPSE)) { > > > > + prterr("madvise collapse for buf"); > > > > + free(buf); > > > > + return NULL; > > > > + } > > > > + > > > > + return buf; > > > > +} > > > > @@ -3232,12 +3287,41 @@ main(int argc, char **argv) > > > > original_buf = (char *) malloc(maxfilelen); > > > > for (i = 0; i < maxfilelen; i++) > > > > original_buf[i] = random() % 256; > > > > - good_buf = (char *) malloc(maxfilelen + writebdy); > > > > - good_buf = round_ptr_up(good_buf, writebdy, 0); > > > > - memset(good_buf, '\0', maxfilelen); > > > > - temp_buf = (char *) malloc(maxoplen + readbdy); > > > > - temp_buf = round_ptr_up(temp_buf, readbdy, 0); > > > > - memset(temp_buf, '\0', maxoplen); > > > > + if (hugepages) { > > > > + long hugepage_size; > > > > + > > > > + hugepage_size = get_hugepage_size(); > > > > + if (hugepage_size == -1) { > > > > + prterr("get_hugepage_size()"); > > > > + exit(99); > > > > + } > > > > + > > > > + if (writebdy != 1 && writebdy != hugepage_size) > > > > + prt("ignoring write alignment (since -h is enabled)"); > > > > + > > > > + if (readbdy != 1 && readbdy != hugepage_size) > > > > + prt("ignoring read alignment (since -h is enabled)"); > > > > > > I'm a little unclear on what these warnings mean. The alignments are > > > still used in the read/write paths afaics. The non-huge mode seems to > > > only really care about the max size of the buffers in this code. > > > > > > If your test doesn't actually use read/write alignments and the goal is > > > just to keep things simple, perhaps it would be cleaner to add something > > > like an if (hugepages && (writebdy != 1 || readbdy != 1)) check after > > > option processing and exit out as an unsupported combination..? > > > > My understanding of the 'writebdy' and 'readbdy' options are that > > they're for making reads/writes aligned to the passed-in value, which > > depends on the starting address of the buffer being aligned to that > > value as well. However for hugepages buffers, they must be aligned to > > the system hugepage size (eg 2 MiB) or the madvise(... MADV_COLLAPSE) > > call will fail. As such, it is not guaranteed that the requested > > alignment will actually be abided by. For that reason, I thought it'd > > be useful to print this out to the user so they know requested > > alignments will be ignored, but it didn't seem severe enough of an > > issue to error out and exit altogether. But maybe it'd be less > > confusing for the user if this instead does just error out if the > > alignment isn't a multiple of the hugepage size. > > > > Ahh, I see. I missed the round_ptr_up() adjustments. That makes more > sense now. > > IMO it would be a little cleaner to just bail out earlier as such. But > either way, I suppose if you could add a small comment with this > alignment context you've explained above with the error checks then that > is good enough for me. Thanks! > Sounds great, will do for v2. Thanks, Joanne > Brian > > > > > > > BTW, it might also be nice to factor out this whole section of buffer > > > initialization code (including original_buf) into an init_buffers() or > > > some such. That could be done as a prep patch, but just a suggestion > > > either way. > > > > Good idea - i'll do this refactoring for v2. > > > > > > Thanks, > > Joanne > > > > > > Brian > > > > > > > + > > > > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size); > > > > + if (!good_buf) { > > > > + prterr("init_hugepages_buf failed for good_buf"); > > > > + exit(100); > > > > + } > > > > + > > > > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size); > > > > + if (!temp_buf) { > > > > + prterr("init_hugepages_buf failed for temp_buf"); > > > > + exit(101); > > > > + } > > > > + } else { > > > > + good_buf = (char *) malloc(maxfilelen + writebdy); > > > > + good_buf = round_ptr_up(good_buf, writebdy, 0); > > > > + memset(good_buf, '\0', maxfilelen); > > > > + > > > > + temp_buf = (char *) malloc(maxoplen + readbdy); > > > > + temp_buf = round_ptr_up(temp_buf, readbdy, 0); > > > > + memset(temp_buf, '\0', maxoplen); > > > > + } > > > > if (lite) { /* zero entire existing file */ > > > > ssize_t written; > > > > > > > > -- > > > > 2.47.1 > > > > > > > > > > > > > >