Hi James, Thanks for taking out time to review the code. On Friday 02 June 2017 01:54 PM, James Morse wrote: > Hi Pratyush, > > On 23/05/17 06:02, Pratyush Anand wrote: >> Purgatory sha verification is very slow when D-cache is not enabled there. >> We need to enable MMU as well to enable D-Cache.Therefore,we need to >> have an identity mapped page table in purgatory. >> >> Since debugging is very difficult in purgatory therefore we prefer to do as >> much work as possible in kexec. >> >> This patch prepares page table for purgatory in advance. We support only 4K >> page table,because it will be available on most of the platform. This page >> table is passed to the purgatory as a new segment. >> >> VA bit is fixed as 48, page table level is 3 where 3rd level page table >> contains 2M block entries. > > This had me confused: You actually have 4 levels of page tables, but the last > level is never used, because the third level always contains block entries. > > If you want to use the kernel's macros for shifts and masks (which I think you > should), you must use the same terminology. > > Because you wrote this: >> +#define PGTABLE_LEVELS 3 > > The calculated PGDIR_SHIFT is what the kernel would call PUD_SHIFT with 48bit VA. > >> +#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3) >> +#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - PGTABLE_LEVELS) > > Which then explains why you have to invent a new name for PGD_SHIFT, and > calculate it manually: >> +#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3) > > > kexec-tools is closely coupled to the kernel, can we try to make this as similar > as possible so that code (and reviewers!) can easily move between the two. OK..I will try to sort out the differences ... > > (some other comments below) > > Thanks, > > > James > >> diff --git a/kexec/arch/arm64/kexec-arm64.c b/kexec/arch/arm64/kexec-arm64.c >> index 62f37585b788..af5dab331e97 100644 >> --- a/kexec/arch/arm64/kexec-arm64.c >> +++ b/kexec/arch/arm64/kexec-arm64.c >> @@ -33,6 +34,10 @@ >> #define PROP_ELFCOREHDR "linux,elfcorehdr" >> #define PROP_USABLE_MEM_RANGE "linux,usable-memory-range" >> >> +typedef unsigned long guest_addr_t; >> +typedef unsigned long host_addr_t; > > What do guest and host mean here? guest is something which will be in purgatory domain and host is something in kexec domain. I remember, you had suggested to differentiate these two address domain for better code readability. > I assumed everything that happens in purgatory would have something like > phys_addr_t (for both physical and virtual addresses). Yes, so guest_addr_t is like phys_addr_t. I can replace, if there are some better names to differentiate between these two address domain. > > >> +static int next_tbl_cnt = 1; >> + >> /* Global varables the core kexec routines expect. */ > > Nit: variables OK. > > >> >> unsigned char reuse_initrd; >> @@ -519,6 +524,118 @@ unsigned long arm64_locate_kernel_segment(struct kexec_info *info) >> return hole; >> } >> >> +static host_addr_t *create_table_entry(host_addr_t *pgtbl_buf, >> + guest_addr_t pgtbl_mem, host_addr_t *tbl, >> + guest_addr_t virt, int shift, >> + unsigned long ptrs) >> +{ >> + unsigned long index, desc, offset; >> + >> + index = (virt >> shift) & (ptrs - 1); >> + /* check if we have allocated a table already for this index */ >> + if (tbl[index]) { >> + /* >> + * table index will have entry as per purgatory (guest)page >> + * table memory. Find out corresponding buffer address of >> + * table (host). >> + */ >> + desc = tbl[index] & ~3UL; >> + offset = desc - pgtbl_mem; >> + return &pgtbl_buf[offset >> 3]; >> + } >> + >> + if (next_tbl_cnt >= MAX_PGTBLE_SZ / PAGE_SIZE) >> + die("%s:No more memory for page table\n", __func__); >> + /* >> + * Always write page table content as per page table memory >> + * allocated for purgatory(guest) area, but return corresponding >> + * buffer area allocated in kexec (host). >> + */ >> + >> + tbl[index] = (pgtbl_mem + PAGE_SIZE * next_tbl_cnt) | PMD_TYPE_TABLE; >> + >> + return &pgtbl_buf[(next_tbl_cnt++ * PAGE_SIZE) >> 3]; >> +} >> + >> +static void create_block_entry(host_addr_t *tbl, unsigned long flags, >> + guest_addr_t virt) >> +{ >> + unsigned long index; >> + unsigned long desc; >> + >> + index = pmd_index(virt); >> + desc = virt & PMD_MASK; >> + desc |= flags; >> + tbl[index] = desc; >> +} >> + >> +static void create_identity_entry(host_addr_t *pgtbl_buf, >> + guest_addr_t pgtbl_mem, guest_addr_t virt, >> + unsigned long flags) >> +{ >> + host_addr_t *tbl = pgtbl_buf; >> + tbl = create_table_entry(pgtbl_buf, pgtbl_mem, tbl, virt, >> + EXTRA_SHIFT, EXTRA_PTRS); >> + tbl = create_table_entry(pgtbl_buf, pgtbl_mem, tbl, virt, >> + IDENTITY_SHIFT, PTRS_PER_PTE); > > Are 'extra' and 'identity' names for levels in the page tables? If so can we use > the same names as the kernel. > OK,will make it similar as per 4 level page (so 3 level swapper page) in kernel. > >> + create_block_entry(tbl, flags, virt); >> +} >> + >> +/** >> + * arm64_create_pgtbl_segment - Create page table segments to be used by >> + * purgatory. Page table will have entries to access memory area of all >> + * those segments which becomes part of sha verification in purgatory. >> + * Additionally, we also create page table for purgatory segment as well. >> + */ >> + >> +static int arm64_create_pgtbl_segment(struct kexec_info *info, >> + unsigned long hole_min, unsigned long hole_max) >> +{ >> + host_addr_t *pgtbl_buf; >> + guest_addr_t pgtbl_mem, mstart, mend; >> + int i; >> + unsigned long purgatory_base, purgatory_len; >> + >> + pgtbl_buf = xmalloc(MAX_PGTBLE_SZ); >> + memset(pgtbl_buf, 0, MAX_PGTBLE_SZ); >> + pgtbl_mem = add_buffer_phys_virt(info, pgtbl_buf, MAX_PGTBLE_SZ, >> + MAX_PGTBLE_SZ, PAGE_SIZE, hole_min, hole_max, 1, 0); >> + for (i = 0; i < info->nr_segments; i++) { >> + if (info->segment[i].mem == (void *)info->rhdr.rel_addr) { >> + purgatory_base = (unsigned long)info->segment[i].mem; >> + purgatory_len = info->segment[i].memsz; >> + } >> + mstart = (unsigned long)info->segment[i].mem; >> + mend = mstart + info->segment[i].memsz; >> + mstart &= ~(SECTION_SIZE - 1); >> + while (mstart < mend) { >> + create_identity_entry(pgtbl_buf, pgtbl_mem, >> + mstart, MMU_FLAGS_NORMAL); >> + mstart += SECTION_SIZE; >> + } >> + } >> + >> + /* we will need pgtble_base in purgatory for enabling d-cache */ >> + elf_rel_set_symbol(&info->rhdr, "pgtble_base", &pgtbl_mem, >> + sizeof(pgtbl_mem)); >> + /* >> + * We need to disable d-cache before we exit from purgatory. >> + * Since, only d-cache flush by VAs is recommended, therefore we >> + * will also need memory location of all those area which will be >> + * accessed in purgatory with enabled d-cache. sha256_regions >> + * already have start and length for all the segments except >> + * purgatory. Therefore, we will need to pass start and length of >> + * purgatory additionally.n >> + */ >> + elf_rel_set_symbol(&info->rhdr, "purgatory_base", &purgatory_base, >> + sizeof(purgatory_base)); >> + elf_rel_set_symbol(&info->rhdr, "purgatory_len", &purgatory_len, >> + sizeof(purgatory_len)); >> + >> + return 0; >> +} >> + >> /** >> * arm64_load_other_segments - Prepare the dtb, initrd and purgatory segments. >> */ >> @@ -630,6 +747,8 @@ int arm64_load_other_segments(struct kexec_info *info, >> elf_rel_set_symbol(&info->rhdr, "arm64_dtb_addr", &dtb_base, >> sizeof(dtb_base)); >> >> + arm64_create_pgtbl_segment(info, hole_min, hole_max); > > hole_min->hole_max is the region from the end of the Kernel image, to the end of > usable memory. I can't see how this window is made smaller as the dtb, initramfs > then page tables are added to it... (kvmtool updates the equivalent values as it > adds things) > > Does kexec-tools have some other accounting for what memory is in use? (in which > case why are these values needed?) > Please see locate_hole() and add_segment_phys_virt(). locate_hole() finds a hole from info->memory_range lying between hole_min and hole_max. add_segment_phys_virt() assigned allocated hole to info->segment. locate_hole() uses info->segment as well to know about consumed regions. > >> return 0; >> } >> >> diff --git a/kexec/arch/arm64/kexec-mmu.h b/kexec/arch/arm64/kexec-mmu.h >> new file mode 100644 >> index 000000000000..55354b5e3002 >> --- /dev/null >> +++ b/kexec/arch/arm64/kexec-mmu.h >> @@ -0,0 +1,58 @@ >> +#if !defined(KEXEC_MMU_H) >> +#define KEXEC_MMU_H > > As lots of the symbols names and values below were copied from the kernel, we > should copy the copyright statement too. Something like: > > /* > * Based on linux v4.11's arch/arm64/include/asm/pgtable-hwdef.h > * Copyright (C) 2012 ARM Ltd. > */ > OK > >> +/* >> + * kexec creates identity page table to be used in purgatory so that > >> + * dcache verification becomes faster. > > ... so that SHA verification of the image can make use of the dcache. OK.. :-) > > >> + * >> + * These are the definitions to be used by page table creation routine. >> + * >> + * Only 4K page table, 3 level, 2M block mapping, 48bit VA is supported >> + */ >> +#define PAGE_SHIFT 12 >> +#define PGTABLE_LEVELS 3 >> +#define ARM64_HW_PGTABLE_LEVEL_SHIFT(n) ((PAGE_SHIFT - 3) * (4 - (n)) + 3) >> +#define PGDIR_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - PGTABLE_LEVELS) > >> +#define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3) >> +#define EXTRA_PTRS (1 << (48 - EXTRA_SHIFT)) > > This looks like the level above PGD? What do you need this for if you have only > 3 levels of page tables? As said above, will remove these names now. > > >> +#define PUD_SHIFT PGDIR_SHIFT >> +#define PMD_SHIFT ARM64_HW_PGTABLE_LEVEL_SHIFT(2) >> +#define PMD_SIZE (1UL << PMD_SHIFT) >> +#define PMD_MASK (~(PMD_SIZE-1)) >> +#define IDENTITY_SHIFT PUD_SHIFT >> +#define PTRS_PER_PTE (1 << (PAGE_SHIFT - 3)) >> +#define PTRS_PER_PMD PTRS_PER_PTE >> +#define pmd_index(addr) (((addr) >> PMD_SHIFT) & (PTRS_PER_PMD - 1)) >> +#define PMD_TYPE_TABLE (3UL << 0) >> +#define PMD_TYPE_SECT (1UL << 0) >> +#define PMD_SECT_AF (1UL << 10) >> +#define PMD_ATTRINDX(t) ((unsigned long)(t) << 2) >> +#define MT_NORMAL 0 >> +#define PMD_FLAGS_NORMAL (PMD_TYPE_SECT | PMD_SECT_AF) >> +#define MMU_FLAGS_NORMAL (PMD_ATTRINDX(MT_NORMAL) | PMD_FLAGS_NORMAL) >> +#define SECTION_SHIFT PMD_SHIFT >> +#define SECTION_SIZE (1UL << SECTION_SHIFT) >> +#define PAGE_SIZE (1 << PAGE_SHIFT) >> +/* >> + * Since we are using 3 level of page tables, therefore minimum number of >> + * table will be 3. Each entry in level 3 page table can map 2MB memory >> + * area. Thus a level 3 page table indexed by bit 29:21 can map a total of >> + * 1G memory area. Therefore, if any segment crosses 1G boundary, then we >> + * will need one more level 3 table. Similarly, level 2 page table indexed >> + * by bit 38:30 can map a total of 512G memory area. If segment addresses >> + * are more than 512G apart then we will need two more table for each such >> + * block. >> + * Lets consider 2G as the maximum size of crashkernel. > > crashkernel+initramfs ? OK > > A good assumption. Can we test this is true when the image is loaded? > (otherwise its going to be really fun to debug!) > That we already do. If we cross that boundary we will need additional space and it will die() during kexec load. if (next_tbl_cnt >= MAX_PGTBLE_SZ / PAGE_SIZE) die("%s:No more memory for page table\n", __func__); [It is difficult to implement, get going without MMU enabled in that case] > >> This 2G memory >> + * location might not be allocated at 1G aligned boundary, so in that case >> + * we need to have 5 table size resrved to map any location of the crash > > (Nit: reserved) OK. > > >> + * kernel region. >> + * >> + * If we will ever wish to support uart debugging in purgatory then that >> + * might cross the boundary and therefore additional 2 more table space. In >> + * that case, we will need a total of 7 table space. >> + * >> + * As of now keep it fixed at 5. Increase it if any platform either >> + * supports uart or more than 2G of crash kernel size. >> + */ >> +#define MAX_PGTBLE_SZ (5 * PAGE_SIZE) >> + >> +#endif >> ~Pratyush