On Thu, Jan 16, 2025 at 05:03:41PM -0800, Joanne Koong wrote: > On Wed, Jan 15, 2025 at 4:59 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > > > On Wed, Jan 15, 2025 at 04:47:30PM -0800, Joanne Koong wrote: > > > On Wed, Jan 15, 2025 at 1:37 PM Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > > > > > > > On Wed, Jan 15, 2025 at 10:31:06AM -0800, Joanne Koong wrote: > > > > > Add support for reads/writes from buffers backed by hugepages. > > > > > This can be enabled through the '-h' flag. This flag should only be used > > > > > on systems where THP capabilities are enabled. > > > > > > > > > > This is motivated by a recent bug that was due to faulty handling of > > > > > userspace buffers backed by hugepages. This patch is a mitigation > > > > > against problems like this in the future. > > > > > > > > > > Signed-off-by: Joanne Koong <joannelkoong@xxxxxxxxx> > > > > > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > > > > > --- > > > > > ltp/fsx.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++----- > > > > > 1 file changed, 108 insertions(+), 11 deletions(-) > > > > > > > > > > diff --git a/ltp/fsx.c b/ltp/fsx.c > > > > > index 41933354..8d3a2e2c 100644 > > > > > --- a/ltp/fsx.c > > > > > +++ b/ltp/fsx.c > > > > > @@ -190,6 +190,7 @@ int o_direct; /* -Z */ > > > > > int aio = 0; > > > > > int uring = 0; > > > > > int mark_nr = 0; > > > > > +int hugepages = 0; /* -h flag */ > > > > > > > > > > int page_size; > > > > > int page_mask; > > > > > @@ -2471,7 +2472,7 @@ void > > > > > usage(void) > > > > > { > > > > > fprintf(stdout, "usage: %s", > > > > > - "fsx [-dfknqxyzBEFHIJKLORWXZ0]\n\ > > > > > + "fsx [-dfhknqxyzBEFHIJKLORWXZ0]\n\ > > > > > [-b opnum] [-c Prob] [-g filldata] [-i logdev] [-j logid]\n\ > > > > > [-l flen] [-m start:end] [-o oplen] [-p progressinterval]\n\ > > > > > [-r readbdy] [-s style] [-t truncbdy] [-w writebdy]\n\ > > > > > @@ -2484,6 +2485,7 @@ usage(void) > > > > > -e: pollute post-eof on size changes (default 0)\n\ > > > > > -f: flush and invalidate cache after I/O\n\ > > > > > -g X: write character X instead of random generated data\n\ > > > > > + -h hugepages: use buffers backed by hugepages for reads/writes\n\ > > > > > > > > If this requires MADV_COLLAPSE, then perhaps the help text shouldn't > > > > describe the switch if the support wasn't compiled in? > > > > > > > > e.g. > > > > > > > > -g X: write character X instead of random generated data\n" > > > > #ifdef MADV_COLLAPSE > > > > " -h hugepages: use buffers backed by hugepages for reads/writes\n" > > > > #endif > > > > " -i logdev: do integrity testing, logdev is the dm log writes device\n\ > > > > > > > > (assuming I got the preprocessor and string construction goo right; I > > > > might be a few cards short of a deck due to zombie attack earlier) > > > > > > Sounds great, I'll #ifdef out the help text -h line. Hope you feel better. > > > > > > > > > -i logdev: do integrity testing, logdev is the dm log writes device\n\ > > > > > -j logid: prefix debug log messsages with this id\n\ > > > > > -k: do not truncate existing file and use its size as upper bound on file size\n\ > > > [...] > > > > > +} > > > > > + > > > > > +#ifdef MADV_COLLAPSE > > > > > +static void * > > > > > +init_hugepages_buf(unsigned len, int hugepage_size, int alignment) > > > > > +{ > > > > > + void *buf; > > > > > + long buf_size = roundup(len, hugepage_size) + alignment; > > > > > + > > > > > + if (posix_memalign(&buf, hugepage_size, buf_size)) { > > > > > + prterr("posix_memalign for buf"); > > > > > + return NULL; > > > > > + } > > > > > + memset(buf, '\0', buf_size); > > > > > + if (madvise(buf, buf_size, MADV_COLLAPSE)) { > > > > > > > > If the fsx runs for a long period of time, will it be necessary to call > > > > MADV_COLLAPSE periodically to ensure that reclaim doesn't break up the > > > > hugepage? > > > > > > > > > > imo, I don't think so. My understanding is that this would be a rare > > > edge case that happens when the system is constrained on memory, in > > > which case subsequent calls to MADV_COLLAPSE would most likely fail > > > anyways. > > > > Hrmmm... well I /do/ like to run memory constrained VMs to prod reclaim > > into stressing the filesystem more. But I guess there's no good way for > > fsx to know that something happened to it. Unless there's some even > > goofier way to force a hugepage, like shmem/hugetlbfs (ugh!) :) > > I can't think of a better way to force a hugepage either. I believe > shmem and hugetlbfs would both require root privileges to do so, and > if i'm not mistaken, shmem hugepages are still subject to being broken > up by reclaim. > > > > > Will have to ponder hugepage renewasl -- maybe we should madvise every > > few thousand fsxops just to be careful? > > I can add this in, but on memory constrained VMs, would this be > effective? To me, it seems like in the majority of cases, subsequent They're not /so/ memory constrained that the initial collapsed page is likely to get split/reclaimed before your regression test finishes... > attempts at collapsing the broken pages back into a hugepage would > fail due to memory still being constrained. In which case, I guess > we'd exit the test altogether? ...but I was starting to wonder if this is something we'd actually want in a long soak test, just in case there are weird effects on the system after fsx has been running for a few days? > It kind of seems to me like if the user > wants to test out hugepages functionality of their filesystem, then > the onus is on them to run the test in an environment that can > adequately and consistently support hugepages. I guess the hard part about analyzing that is is that long soak tests aren't usually supposed to die on account of memory fragmentation. Hence me wondering if there's an "easy" way to get one huge page and not let it go. Anyway don't let my blathering hold this up; I think once you fix the help text and the #ifdef indenting around exit(86) this patch is good to go. --D > > Thanks, > Joanne > > > > > --D > > > > > > > > Thanks, > > > Joanne > > > > > > > > + prterr("madvise collapse for buf"); > > > > > + free(buf); > > > > > + return NULL; > > > > > + } > > > > > + > > > > > + return buf; > > > > > +} > > > > > +#else > > > > > +static void * > > > > > +init_hugepages_buf(unsigned len, int hugepage_size, int alignment) > > > > > +{ > > > > > + return NULL; > > > > > +} > > > > > +#endif > > > > > + > > > > > +static void > > > > > +init_buffers(void) > > > > > +{ > > > > > + int i; > > > > > + > > > > > + original_buf = (char *) malloc(maxfilelen); > > > > > + for (i = 0; i < maxfilelen; i++) > > > > > + original_buf[i] = random() % 256; > > > > > + if (hugepages) { > > > > > + long hugepage_size = get_hugepage_size(); > > > > > + if (hugepage_size == -1) { > > > > > + prterr("get_hugepage_size()"); > > > > > + exit(102); > > > > > + } > > > > > + good_buf = init_hugepages_buf(maxfilelen, hugepage_size, writebdy); > > > > > + if (!good_buf) { > > > > > + prterr("init_hugepages_buf failed for good_buf"); > > > > > + exit(103); > > > > > + } > > > > > + > > > > > + temp_buf = init_hugepages_buf(maxoplen, hugepage_size, readbdy); > > > > > + if (!temp_buf) { > > > > > + prterr("init_hugepages_buf failed for temp_buf"); > > > > > + exit(103); > > > > > + } > > > > > + } else { > > > > > + unsigned long good_buf_len = maxfilelen + writebdy; > > > > > + unsigned long temp_buf_len = maxoplen + readbdy; > > > > > + > > > > > + good_buf = calloc(1, good_buf_len); > > > > > + temp_buf = calloc(1, temp_buf_len); > > > > > + } > > > > > + good_buf = round_ptr_up(good_buf, writebdy, 0); > > > > > + temp_buf = round_ptr_up(temp_buf, readbdy, 0); > > > > > +} > > > > > + > > > > > static struct option longopts[] = { > > > > > {"replay-ops", required_argument, 0, 256}, > > > > > {"record-ops", optional_argument, 0, 255}, > > > > > @@ -2883,7 +2980,7 @@ main(int argc, char **argv) > > > > > setvbuf(stdout, (char *)0, _IOLBF, 0); /* line buffered stdout */ > > > > > > > > > > while ((ch = getopt_long(argc, argv, > > > > > - "0b:c:de:fg:i:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ", > > > > > + "0b:c:de:fg:hi:j:kl:m:no:p:qr:s:t:uw:xyABD:EFJKHzCILN:OP:RS:UWXZ", > > > > > longopts, NULL)) != EOF) > > > > > switch (ch) { > > > > > case 'b': > > > > > @@ -2916,6 +3013,14 @@ main(int argc, char **argv) > > > > > case 'g': > > > > > filldata = *optarg; > > > > > break; > > > > > + case 'h': > > > > > + #ifndef MADV_COLLAPSE > > > > > > > > Preprocessor directives should start at column 0, like most of the rest > > > > of fstests. > > > > > > > > --D > > > >