On Tue, 2020-11-24 at 00:21 +0100, Frederic Weisbecker wrote:
> On Mon, Nov 23, 2020 at 10:39:34PM +0000, Alex Belits wrote:
> > 
> > This is different from timers. The original design was based on the
> > idea that every CPU should be able to enter the kernel at any time
> > and run kernel code with no additional preparation. Then the only
> > solution is to always do a full broadcast and require all CPUs to
> > process it.
> > 
> > What I am trying to introduce is the idea of a CPU that is not
> > likely to run kernel code any time soon, and can afford to go
> > through an additional synchronization procedure on the next entry
> > into the kernel. The synchronization is not skipped, it simply
> > happens later, early in the kernel entry code.
> 
> Ah I see, this is ordered that way:
> 
> ll_isol_flags = ISOLATED
> 
>          CPU 0                       CPU 1
>     ------------------          -----------------
>                                 // kernel entry
>     data_to_sync = 1            ll_isol_flags = ISOLATED_BROKEN
>     smp_mb()                    smp_mb()
>     if ll_isol_flags(CPU 1) == ISOLATED    READ data_to_sync
>          smp_call(CPU 1)

The check for ll_isol_flags(CPU 1) is reversed, and it's a bit more
complex. In terms of scenarios, on entry from isolation the following
can happen:

1. Kernel entry happens simultaneously with an operation that requires
synchronization, and kernel entry processing happens before the check
for isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                       CPU 1
    ------------------          -----------------
                                // kernel entry
                                if (ll_isol_flags == ISOLATED) {
                                  ll_isol_flags = ISOLATED_BROKEN
    data_to_sync = 1              smp_mb() // data_to_sync undetermined
    smp_mb()                    }
    // ll_isol_flags(CPU 1) updated
    if ll_isol_flags(CPU 1) != ISOLATED    // interrupts enabled
         smp_call(CPU 1)                   // kernel entry again
                                           if (ll_isol_flags == ISOLATED)
                                             // nothing happens
                                           // explicit or implied barriers
                                           // data_to_sync updated
                                           // kernel exit
    // CPU 0 assumes CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

2. Kernel entry happens simultaneously with an operation that requires
synchronization, and kernel entry processing happens after the check
for isolation on the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                       CPU 1
    ------------------          -----------------
    data_to_sync = 1            // kernel entry
    smp_mb()                    // data_to_sync undetermined
                                // should not access data_to_sync here
                                if (ll_isol_flags == ISOLATED) {
                                  ll_isol_flags = ISOLATED_BROKEN
    // ll_isol_flags(CPU 1)       smp_mb() // data_to_sync updated
    //   undetermined           }
    if ll_isol_flags(CPU 1) != ISOLATED
         // possibly nothing happens
    // CPU 0 assumes CPU 1 will see        READ data_to_sync
    // data_to_sync = 1 when in kernel

3. Kernel entry processing completed before the check for isolation on
the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                       CPU 1
    ------------------          -----------------
                                // kernel entry
                                if (ll_isol_flags == ISOLATED) {
                                  ll_isol_flags = ISOLATED_BROKEN
                                  smp_mb()
                                }
                                // interrupts are enabled at some
    data_to_sync = 1            // point here, data_to_sync value
    smp_mb()                    // is undetermined, CPU 0 makes no
    // ll_isol_flags(CPU 1) updated        // assumptions about it
    if ll_isol_flags(CPU 1) != ISOLATED
         smp_call(CPU 1)                   // kernel entry again
                                           if (ll_isol_flags == ISOLATED)
                                             // nothing happens
                                           // explicit or implied barriers
                                           // data_to_sync updated
                                           // kernel exit
    // CPU 0 assumes CPU 1 will see
    // data_to_sync = 1 when in kernel     READ data_to_sync

4. Kernel entry processing happens after the check for isolation on
the sender side:

ll_isol_flags(CPU 1) = ISOLATED

         CPU 0                       CPU 1
    ------------------          -----------------
    data_to_sync = 1            // userspace or early kernel entry
    smp_mb()
    if ll_isol_flags(CPU 1) != ISOLATED
         smp_call(CPU 1) // skipped       // userspace or early kernel entry
    // CPU 0 assumes CPU 1 will see       // continues undisturbed
    // data_to_sync = 1 when in kernel    // kernel entry
                                          // data_to_sync undetermined
                                          // should not access data_to_sync here
                                          if (ll_isol_flags == ISOLATED) {
                                            ll_isol_flags = ISOLATED_BROKEN
                                            smp_mb() // data_to_sync updated
                                          }
                                          READ data_to_sync

This also applies to exit to userspace after enabling isolation -- once
ll_isol_flags is set to ISOLATED, synchronization calls will be missed,
so one final barrier should happen before returning to userspace and
enabling interrupts in the process. In that case "unlucky" timing would
result in an smp call interrupting userspace that is already supposed
to be isolated; it would trigger the normal isolation breaking
procedure but otherwise be an unremarkable synchronization call. On the
other hand, synchronization that was supposed to happen after setting
ll_isol_flags will be delayed until the next kernel entry, so nothing
that needs synchronization should run between the end of
task_isolation_exit_to_user_mode() and entering userspace (interrupts
are ok, though -- they will trigger isolation breaking).

This is why I have placed task_isolation_kernel_enter() and
task_isolation_exit_to_user_mode() in the entry/exit code, and, when I
saw any ambiguity or had any doubt, allowed duplicate calls to
task_isolation_kernel_enter() on entry. If I overdid any of that, I
would appreciate fixes and clarifications; however, it should be taken
into account that for now different architectures and even drivers may
call the common and arch-specific functions in a slightly different
order.
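To make both halves of the scenarios above concrete, here is a minimal
sketch of the CPU 1 side -- roughly what task_isolation_kernel_enter()
and task_isolation_exit_to_user_mode() are supposed to do, not the
exact code from the patches. The flag values, the per-CPU variable
layout and the generic instr_sync() (see the flush_icache_range()
discussion below) are illustrative assumptions:

#include <linux/percpu.h>	/* DEFINE_PER_CPU, this_cpu_read/write */
#include <asm/barrier.h>	/* smp_mb() */

/* Illustrative values; the real patches use their own flag encoding. */
#define ISOLATED	1
#define ISOLATED_BROKEN	2

static DEFINE_PER_CPU(int, ll_isol_flags);

/* Called as early as possible on every kernel entry path. */
static inline void task_isolation_kernel_enter(void)
{
	if (this_cpu_read(ll_isol_flags) == ISOLATED) {
		this_cpu_write(ll_isol_flags, ISOLATED_BROKEN);
		/*
		 * Pairs with smp_mb() on the sending CPU: everything
		 * published before the sender's check of ll_isol_flags
		 * becomes visible here (scenarios 1, 2 and 4).
		 */
		smp_mb();
		/*
		 * Arch-provided in the series: discard possibly stale
		 * instructions, since flush_icache_range() may have
		 * skipped this CPU (see below).
		 */
		instr_sync();
	}
}

/* Called last before returning to isolated userspace, IRQs still off. */
static inline void task_isolation_exit_to_user_mode(void)
{
	this_cpu_write(ll_isol_flags, ISOLATED);
	/*
	 * Final synchronization: senders that see ISOLATED from now on
	 * will skip the IPI and defer their updates to the next
	 * task_isolation_kernel_enter(), so nothing that needs
	 * synchronization may run between here and userspace.
	 */
	smp_mb();
}

The important property is only the ordering: the flag transition
happens before the barrier, and the barrier happens before any code
that could depend on data published by other CPUs.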
Synchronization also applies to possible effects on the pipeline (when
the originating CPU writes instructions). There is a version under
development that delays TLB flushes in this manner. On arm64 that
requires a switch to soft TLB flushes, and that's another potential
ball of wax. On architectures that already use soft TLB flushes this
will be unavoidable, because a TLB flush becomes another IPI, and IPIs
break isolation. Maybe it would be better to implement somewhat smarter
TLB handling in general, so it becomes possible to avoid bothering
unrelated CPUs -- but then, I guess, hardware may still overdo it, and
it would be closer to the situation with timers. For now, I want to
first do the obvious: exclude isolated tasks from TLB updates until
they are back in the kernel, and do a slow full flush on isolation
breaking.

> You should document that, ie: explain why what you're doing is safe.

I tried to do that in the comments; however, this is clearly
insufficient despite their verbosity. Would it make sense to create a
separate documentation text file? Current documentation seems to be
light on detailed specifications for things under heavy development,
except for something very significant that affects nearly everyone.
Would it make sense to include all scenarios like the above, plus exit
to userspace, and the existing call sequences that should lead to
synchronization calls, task_isolation_kernel_enter() or
task_isolation_exit_to_user_mode()?

> Also Beware though that the data to sync in question doesn't need to
> be visible
> in the entry code before task_isolation_kernel_enter().

Right. And after task_isolation_exit_to_user_mode() as well. This is
why I had to dig through the entry/exit code. I can produce a somewhat
usable call sequence diagram. If by any chance I am wrong somewhere
there, I welcome all relevant comments and corrections. At this point
the only things I am not sure about are the ones called when
CONFIG_TRACE_IRQFLAGS is enabled, simply because I have not yet checked
what they depend on, and virtualization-specific entry/exit, which is
not entirely clear to me. Maybe for correctness' sake I should have
declared task isolation incompatible with those until I either know
for sure or have added working support for them.

> You need to audit all
> the callers of kick_all_cpus_sync().

Right now there are just three callers.

do_tune_cpucache() does the right thing: it keeps CPUs from seeing old
values of cachep->cpu_cache as they are being deallocated.

powerpc kvm_arch_destroy_vm() cares only about CPUs with VCPUs on them,
which for now should not be used with isolation.

arm64 flush_icache_range() is the main reason why this could be
potentially problematic: it makes sure that other CPUs will not run
stale instructions, and it's called from all places where code is
modified. If in other situations some kind of delayed processing could
be acceptable, this one has to happen in the right sequence. That is
the reason for the instr_sync() after smp_mb() in
task_isolation_kernel_enter(): since flush_icache_range() could skip
this CPU, we have to synchronize it ourselves.
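For completeness, here is the sender side of the same pattern: a hedged
sketch of what a kick_all_cpus_sync()-style broadcast could look like
once isolated CPUs are excluded. It reuses the illustrative
ll_isol_flags/ISOLATED definitions from the sketch above; the function
name kick_non_isolated_cpus_sync() and the per-CPU loop are my own
placeholders, not the actual patch:

#include <linux/smp.h>		/* smp_call_function_single() */
#include <linux/cpumask.h>	/* for_each_online_cpu() */
#include <linux/percpu.h>	/* per_cpu() */

static void do_nothing(void *unused)
{
	/*
	 * The IPI itself is the synchronization point: taking the
	 * interrupt implies full barrier semantics on the target CPU.
	 */
}

/* Sketch: sync every CPU that may run kernel code without re-entering. */
void kick_non_isolated_cpus_sync(void)
{
	int cpu;

	/*
	 * The caller has already published its data (data_to_sync = 1
	 * in the diagrams above). Pairs with smp_mb() in
	 * task_isolation_kernel_enter().
	 */
	smp_mb();

	for_each_online_cpu(cpu) {
		/* Isolated CPU: sync is deferred to its next kernel entry. */
		if (per_cpu(ll_isol_flags, cpu) == ISOLATED)
			continue;
		smp_call_function_single(cpu, do_nothing, NULL, 1);
	}
}

In scenario 2 the flag is undetermined at the moment of the check, so
the kick may still be sent to a CPU that has just entered the kernel;
that is harmless -- it then acts as an ordinary synchronization call,
exactly like the "unlucky" timing on exit to userspace described above.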
There is an important issue of early kernel entry / late kernel exit
not relying on anything that may be stale (both data and code). With
the above-mentioned exceptions, for now it can be demonstrated that
this is correct. It should be taken into account that even though
static keys are used in the early kernel entry code, setting a static
key does not synchronize the resulting code flushes to isolated CPUs,
and therefore should already be done before the key is used.

Task isolation in virtualization guests can be a perfectly valid thing
to support (it just has to be propagated to the host, if permitted),
however this is something I will want to revisit in the future. For
now, I assume that virtualization-related events are not supposed to
break isolation, but it would be nice to be ready to handle that
possibility.

-- 
Alex