On Wed, Jul 31, 2019 at 12:33 PM Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> Hi Pavel,
>
> Generally, the cover letter should state up-front what the goal is (or
> what problem you're trying to solve). It would be really helpful to have
> that so that we understand what you're trying to achieve, and why.
>
> Messing with the MMU is often fraught with danger (and very painful to
> debug, as you are now aware), and so far we've tried to minimize the
> number of places where we have to do so.

Hi Mark,

I understand. This is why I first went another route to solve this
problem: pre-reserving contiguous memory and avoiding relocation
entirely (the same as what happens during crash reboot). But that
solution was not accepted because it introduces a change to the common
code to solve an ARM-specific problem. So James Morse and others
suggested that I take a look at the root of the problem, and enable the
MMU during relocation by doing what is already done during hibernate
restore.

>
> On Wed, Jul 31, 2019 at 11:38:49AM -0400, Pavel Tatashin wrote:
> > Changelog from previous RFC:
> > - Added trans_table support for both hibernate and kexec.
> > - Fixed performance issue, where enabling MMU did not yield the
> >   actual performance improvement.
> >
> > Bug:
> > With the current state, this patch series works on kernels booted with EL1
> > mode, but for some reason, when elevated to EL2 mode reboot freezes in
> > both QEMU and on real hardware.
> >
> > The freeze happens in:
> >
> > arch/arm64/kernel/relocate_kernel.S
> >       turn_on_mmu()
> >
> > Right after sctlr_el2 is written (MMU on EL2 is enabled)
> >
> >       msr     sctlr_el2, \tmp1
> >
> > I've been studying all the relevant control registers for EL2, but do not
> > see what might be causing this hang:
> >
> > MAIR_EL2 is set to be exactly the same as MAIR_EL1 0xbbff440c0400
> >
> > TCR_EL2   0x80843510
> > Enabled bits:
> > PS      Physical Address Size. (0b100 44 bits, 16TB.)
> > SH0     Shareability 11 Inner Shareable
> > ORGN0   Normal memory, Outer Write-Back Read-Allocate Write-Allocate Cach.
> > IRGN0   Normal memory, Inner Write-Back Read-Allocate Write-Allocate Cach.
> > T0SZ    01 0000
> >
> > SCTLR_EL2 0x30e5183f
> > RES1  : Reserve ones
> > M     : MMU enabled
> > A     : Align check
> > C     : Cacheability control
> > SA    : SP Alignment check enable
> > IESB  : Implicit Error Synchronization event
> > I     : Instruction access Cacheability
> >
> > TTBR0_EL2 0x1b3069000 (address of trans_table)
> >
> > Any suggestion of what else might be missing that causes this freeze when
> > MMU is enabled in EL2?
> >
> > =====
>
> > Here is the current data from the real hardware:
> > (because of bug, I forced EL1 mode by setting el2_switch always to zero in
> > cpu_soft_restart()):
> >
> > For this experiment, the size of kernel plus initramfs is 25M. If initramfs
> > was larger, than the improvements would be even greater, as time spent in
> > relocation is proportional to the size of relocation.
> >
> > Previously:
> > kernel shutdown 0.022131328s
> > relocation      0.440510736s
> > kernel startup  0.294706768s
>
> In total this takes ~0.76s...
>
> >
> > Relocation was taking: 58.2% of reboot time
> >
> > Now:
> > kernel shutdown 0.032066576s
> > relocation      0.022158152s
> > kernel startup  0.296055880s
>
> ... and this takes ~0.35s
>
> So do we really need this complexity for a few blinks of an eye?

Yes, we have an extremely tight reboot budget; 0.35s is not an
acceptable waste.

>
> Thanks,
> Mark.
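
As a cross-check of the TCR_EL2 decode quoted above, here is a minimal
Python sketch (not part of the patch series) that extracts the fields
from 0x80843510. It assumes the non-VHE TCR_EL2 layout (HCR_EL2.E2H == 0);
the names TCR_EL2_FIELDS and decode_tcr_el2 are made up for illustration.

```python
# Hypothetical helper for sanity-checking a TCR_EL2 value by hand.
# Field positions assume the non-VHE layout (HCR_EL2.E2H == 0).
TCR_EL2_FIELDS = {
    # name: (shift, width)
    "T0SZ":  (0, 6),
    "IRGN0": (8, 2),
    "ORGN0": (10, 2),
    "SH0":   (12, 2),
    "TG0":   (14, 2),
    "PS":    (16, 3),
}


def decode_tcr_el2(value):
    """Return a dict mapping each field name to its raw integer value."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in TCR_EL2_FIELDS.items()
    }


fields = decode_tcr_el2(0x80843510)
for name, raw in fields.items():
    print(f"{name:6s} = {raw:#b}")

# The input address range covered by TTBR0_EL2 is 64 - T0SZ bits wide.
va_bits = 64 - fields["T0SZ"]
print(f"VA bits = {va_bits}")
```

With the value from the cover letter, TG0 decodes to 0b00 (4KB granule)
and T0SZ to 16, i.e. a 48-bit input address range; PS, SH0, ORGN0 and
IRGN0 match the fields listed in the quoted text.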