Hi Chris, (CC Cavium people) Thanks for your series. On Fri, Nov 03, 2017 at 01:04:39PM -0400, Chris Metcalf wrote: > Here, finally, is a new spin of the task isolation work (v16), with > changes based on the issues that were raised at last year's Linux > Plumbers Conference and in the email discussion that followed. > > This version of the patch series cleans up a number of areas that were > a little dodgy in the previous patch series. > > - It no longer loops in the final code that prepares to return to > userspace; instead, it sets things up in the prctl() and then > validates when preparing to return to userspace, adjusting the > syscall return value to -EAGAIN at that point if something doesn't > line up quite correctly. > > - We no longer support the NOSIG mode that let you freely call into > the kernel multiple times while in task isolation. This was always > a little odd, since you really should be in sufficient control of > task isolation code that you can explicitly stop isolation with a > "prctl(PR_TASK_ISOLATION, 0)" before using the kernel for anything > else. It also made the semantics of migrating the task confusing. > More importantly, removing that support means that the only path > that sets up task isolation is the return from prctl(), which allows > us to make the simplification above. > > - We no longer try to signal the task isolation process from a remote > core when we detect that we are about to violate its isolation. > Instead, we just print a message there (and optionally dump stack), > and rely on the eventual interrupt on the core itself to trigger the > signal. We are always in a safe context to generate a signal when > we enter the kernel, unlike when we are deep in a call stack sending > an SMP IPI or whatever. > > - We notice the case of an unstable scheduler clock and return > EINVAL rather than spinning forever with EAGAIN (suggestion from > Francis Giraldeau). > > - The prctl() call requires zeros for arg3/4/5 (suggestion from > Eugene Syromiatnikov). > > The kernel internal isolation API is also now cleaner, and I have > included kerneldoc APIs for all the interfaces so that it should be > easier to port it to additional architectures; in fact looking at > include/linux/isolation.h is a good place to start understanding the > overall patch set. > > I removed Catalin's Reviewed-by for arm64, and Christoph's Tested-by > for x86, since this version is sufficiently different to merit > re-review and re-testing. > > Note that this is not rebased on top of Frederic's recent housekeeping > patch series, although it is largely orthogonal right now. After > Frederic's patch series lands, task isolation is enabled with > "isolcpus=nohz,domain,CPUS". We could add some shorthand for that > ("isolcpus=full,CPUS"?) or just use it as-is. > > The previous (v15) patch series is here: > > https://lkml.kernel.org/r/1471382376-5443-1-git-send-email-cmetcalf@xxxxxxxxxxxx > > This patch series is available at: > > git://git.kernel.org/pub/scm/linux/kernel/git/cmetcalf/linux-tile.git dataplane > > Some folks raised some good points at the LPC discussion and then in > email discussions that followed. Rather than trying to respond to > everyone in a flurry of emails, I'll answer some questions here: > > > Why not just instrument user_exit() to raise the isolation-lost signal? > > Andy pointed me in this direction. The advantage is that you catch > *everything*, by definition. There is a hook that can do this in the > current patch set, but you have to #define DEBUG_TASK_ISOLATION > manually to take advantage of it, because as written it has two issues: > > 1. You can't actually exit the kernel with prctl(PR_TASK_ISOLATION,0) > because the user_exit hook kills you first. > 2. You lose the ability to get much better diagnostics by waiting > until you are further into kernel entry and know what you're doing. > > You could work around #2 in several ways, but #1 is harder. I looked > at x86 for a while, and although you could imagine this, you really > want to generate a lost-isolation signal on any syscall that isn't > that exact prctl(), and it's awkward to try to do all of that checking > before user_exit(). Since in any case we do want to have the more > specific probes at the various kernel entry points where we generate > the diagnostics, I felt like it wasn't the right approach to enable > as a compilation-time default. I'm open to discussion on this though! > > > Can't we do all the exit-to-userspace work with irqs disabled? > > In fact, it turns out that you can do lru_add_drain() with irqs > disabled, so that's what we're doing in the patch series now. > > However, it doesn't seem possible to do the synchronous cancellation of > the vmstat deferred work with irqs disabled, though if there's a way, > it would be a little cleaner to do that; Christoph? We can certainly > update the statistics with interrupts disabled via > refresh_cpu_vm_stats(false), but that's not sufficient. For now, I > just issue the cancellation during sys_prctl() call, and then if it > isn't synchronized by the time we are exiting to userspace, we just > jam in an EAGAIN and let userspace retry. In practice, this doesn't > seem to ever happen. > > > What about using a per-cpu flag to stop doing new deferred work? > > Andy also suggested we could structure the code to have the prctl() > set a per-cpu flag to stop adding new future work (e.g. vmstat per-cpu > data, or lru page cache). Then, we could flush those structures right > from the sys_prctl() call, and when we were returning to user space, > we'd be confident that there wasn't going to be any new work added. > > With the current set of things that we are disabling for task > isolation, though, it didn't seem necessary. Quiescing the vmstat > shepherd seems like it is generally pretty safe since we will likely > be able to sync up the per-cpu cache and kill the deferred work with > high probability, with no expectation that additional work will show > up. And since we can flush the LRU page cache with interrupts > disabled, that turns out not to be an issue either. > > I could imagine that if we have to deal with some new kind of deferred > work, we might find the per-cpu flag becomes a good solution, but for > now we don't have a good use case for that approach. > > > How about stopping the dyn tick? > > Right now we try to stop it on return to userspace, but if we can't, > we just return EAGAIN to userspace. In practice, what I see is that > usually the tick stops immediately, but occasionally it doesn't; in > this case I've always seen that nr_running is >1, presumably with some > temporary kernel worker threads, and the user code just needs to call > prctl() until those threads are done. We could structure things with > a completion that we wait for, which is set by the timer code when it > finally does stop the tick, but this may be overkill, particularly > since we'll only be running this prctl() loop from userspace on cores > where we have no other useful work that we're trying to run anyway. > > > What about TLB flushing? > > We talked about this at Plumbers and some of the email discussion also > was about TLB flushing. I haven't tried to add it to this patch set, > because I really want to avoid scope creep; in any case, I think I > managed to convince Andy that he was going to work on it himself. :) > Paul McKenney already contributed some framework for such a patch, in > commit b8c17e6664c4 ("rcu: Maintain special bits at bottom of > ->dynticks counter"). > > What about that d*mn 1 Hz clock? > > It's still there, so this code still requires some further work before > it can actually get a process into long-term task isolation (without > the obvious one-line kernel hack). Frederic suggested a while ago > forcing updates on cpustats was required as the last gating factor; do > we think that is still true? Christoph was working on this at one > point - any progress from your point of view? I tested your series on ThunderX 2 machine. When I run 10 giga-ticks test, it always passed. If I run for more, the test exits like this: # time ./isolation 1000 /sys devices: OK (using task isolation cpu 100) prctl unaffinitized: OK prctl on cpu 0: OK ==> hello, world test_off: OK Received signal 11 successfully test_segv: OK test_fault: OK test_fault (SIGUSR1): OK test_syscall: OK test_syscall (SIGUSR1): OK test_schedule: OK test_schedule (SIGUSR1): OK testing task isolation jitter for 1000000000000 ticks ERROR: Program unexpectedly entered kernel. INFO: loop times: 1 cycles (count: 128606844716) 2 cycles (count: 31558352292) 3 cycles (count: 4) 29 cycles (count: 437) 30 cycles (count: 1874) 31 cycles (count: 221) 57 cycles (count: 4) 58 cycles (count: 6) 59 cycles (count: 1) real 15m58.643s user 15m58.626s sys 0m0.012s I pass task_isolation_debug to boot parameters: # cat /proc/cmdline BOOT_IMAGE=/boot/Image-isol nohz_full=100-110 isolcpus=100-110 task_isolation_debug root=UUID=75b9dd5b-58d8-4a50-8868-004cb7c1f25f ro text But dmesg looks empty... I investigate it, but at now I have no ideas what happens. Have you seen that before? Anyway, we are going to include your test in our scenario, with some modifications. I've added --dry-run option to make it easier to demonstrate the effect of isolation on jitter. If you like it, feel free to use this change. Tested-by: Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> >From 5c5823c1fc8441ab1486287679de77b8cea5154d Mon Sep 17 00:00:00 2001 From: Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> Date: Wed, 7 Mar 2018 02:41:08 +0300 Subject: [PATCH] isolation test: --dry-run mode Add dry-run mode for better understanding the effect of isolation. Also, make sanity checks on waitticks. Signed-off-by: Yury Norov <ynorov@xxxxxxxxxxxxxxxxxx> --- tools/testing/selftests/task_isolation/isolation.c | 47 +++++++++++++++++----- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/tools/testing/selftests/task_isolation/isolation.c b/tools/testing/selftests/task_isolation/isolation.c index 9c0b49619b40..e90a6c85fe2a 100644 --- a/tools/testing/selftests/task_isolation/isolation.c +++ b/tools/testing/selftests/task_isolation/isolation.c @@ -53,6 +53,7 @@ #include <unistd.h> #include <stdlib.h> #include <fcntl.h> +#include <limits.h> #include <assert.h> #include <string.h> #include <errno.h> @@ -500,7 +501,7 @@ static void jitter_handler(int sig) exit(exit_status); } -static void test_jitter(unsigned long waitticks) +static void test_jitter(unsigned long waitticks, int flags) { u_int64_t start, last, elapsed; int rc; @@ -513,7 +514,8 @@ static void test_jitter(unsigned long waitticks) rc = mlockall(MCL_CURRENT); assert(rc == 0); - set_task_isolation(PR_TASK_ISOLATION_ENABLE | + if (flags & PR_TASK_ISOLATION_ENABLE) + set_task_isolation(PR_TASK_ISOLATION_ENABLE | PR_TASK_ISOLATION_SET_SIG(SIGUSR1)); last = start = get_cycle_count(); @@ -537,26 +539,49 @@ static void test_jitter(unsigned long waitticks) } while (elapsed < waitticks); jitter_test_complete = true; - prctl_isolation(0); + + if (flags & PR_TASK_ISOLATION_ENABLE) + prctl_isolation(0); + jitter_summarize(); } int main(int argc, char **argv) { /* How many billion ticks to wait after running the other tests? */ - unsigned long waitticks; + long waitticks = 10; + int flags = PR_TASK_ISOLATION_ENABLE; char buf[100]; char *result, *end; FILE *f; - if (argc == 1) - waitticks = 10; - else if (argc == 2) - waitticks = strtol(argv[1], NULL, 10); - else { - printf("syntax: isolation [gigaticks]\n"); + errno = 0; + + switch (argc) { + case 1: + break; + case 2: + if (strcmp(argv[1], "--dry-run") == 0) + flags = 0; + else + waitticks = strtol(argv[1], NULL, 10); + break; + case 3: + if (strcmp(argv[1], "--dry-run") == 0) + flags = 0; + + waitticks = strtol(argv[2], NULL, 10); + break; + default: + printf("syntax: isolation [--dry-run] [gigaticks]\n"); ksft_exit_fail(); } + + if (errno || waitticks <= 0 || waitticks > (LONG_MAX / 1000000000)) { + printf("Gigaticks: bad number.\n"); + ksft_exit_fail(); + } + waitticks *= 1000000000; /* Get a core from the /sys nohz_full device. */ @@ -637,7 +662,7 @@ int main(int argc, char **argv) return exit_status; } - test_jitter(waitticks); + test_jitter(waitticks, flags); return exit_status; } -- 2.14.1 -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html