Re: [PATCH] arm64/mm: Introduce a variable to hold base address of linear region

Bhupesh Sharma <bhsharma@xxxxxxxxxx> · Thu, 14 Jun 2018 13:23:44 +0530

Hello James,

Thanks for your inputs, please see my responses inline.

On Wed, Jun 13, 2018 at 3:59 PM, James Morse <james.morse@xxxxxxx> wrote:
> Hi Bhupesh,
>
> On 13/06/18 06:16, Bhupesh Sharma wrote:
>> On Tue, Jun 12, 2018 at 3:42 PM, James Morse <james.morse@xxxxxxx> wrote:
>>> On 12/06/18 09:25, Bhupesh Sharma wrote:
>>>> On Tue, Jun 12, 2018 at 12:23 PM, Ard Biesheuvel wrote:
>>>>> Userland code that assumes that the linear map cannot have a hole at
>>>>> the beginning should be fixed.
>
>>>> That is a separate case (although that needs fixing as well via a
>>>> kernel patch probably as the user-space tools rely on '/proc/iomem'
>>>> contents to determine the first System RAM/reserved range).
>>>
>>> This is for kexec-tools generating the kdump vmcore ELF headers in user-space?
>>
>> Yes, but again, I would like to reiterate that the case where I see a
>> hole at the start of the System RAM range (as I listed above) is just
>> a specific case, which probably deserves a separate patch. The current
>> patch though is for a generic issue (please see more details below).
>
>
>>>> # readelf -l vmcore
>>>>
>>>> ELF Header:
>>>> ........................
>>>>
>>>> Program Headers:
>>>>   Type           Offset             VirtAddr           PhysAddr
>>>>          FileSiz            MemSiz              Flags  Align
>>>> ........................
>>>>   LOAD        0x0000000076d40000 0xffff80017fe00000 0x0000000180000000
>>>>                 0x0000001680000000 0x0000001680000000  RWE    0
>>>>
>>>> 3. So if we do a simple calculation:
>>>>
>>>> (VirtAddr + MemSiz) = 0xffff80017fe00000 + 0x0000001680000000 =
>>>> 0xffff8017ffe00000 != 0xffff801800000000.
>>>>
>>>> which indicates that the end virtual memory nodes are not the same
>>>> between vmlinux and vmcore.
>>>
>>> If I've followed this properly: the problem is that to generate the ELF headers
>>> in the post-kdump vmcore, at kdump-load-time kexec-tools has to guess the
>>> virtual addresses of the 'System RAM' regions it can see in /proc/iomem.
>>>
>>> The problem you are hitting is an invisible hole at the beginning of RAM,
>>> meaning user-space's guess_phys_to_virt() is off by the size of this hole.
>>>
>>> Isn't KASLR a special case for this? You must have to correct for that after
>>> kdump has happened, based on an elf-note in the vmcore. Can't we always do this?
>>
>> No, I hit this issue both for the KASLR and non-KASLR boot cases.
>
> Because in both cases there is a hole at the beginning of the linear-map. KASLR
> is a special-case of this as the kernel adds a variable sized hole to do the
> randomization.
>
> Surely treating this as one case makes your user-space code simpler.

Ok.

>> Fixing this in kernel space seems better to me as the definition of
>
> Is there a kernel bug? Changing the definitions of internal kernel variables for
> the benefit of code digging in /proc/kcore|/dev/mem isn't going to fly.

Indeed, I am not advocating changing the kernel code just to
suit the user-space tools. However, in this particular case
'memstart_addr' and PHYS_OFFSET are computed as 0, which IMO does
not represent the real start of System RAM: the 1st memory block
available in Linux starts at 2MB [as confirmed by the
'memblock_start_of_DRAM()' value of 0x200000], as also indicated by
'/proc/iomem':

# head -1 /proc/iomem
00200000-0021ffff : reserved

I think that on reading the kernel code and finding 'memstart_addr' and
PHYS_OFFSET as 0, one gets the notion that System RAM starts at 0,
which is incorrect in the above case: it starts at 2MB, because the
1st block is of the type EfiReservedMemType.
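
To make the mismatch concrete, here is a minimal sketch of the
arithmetic (not the kernel's actual code). It assumes VA_BITS = 48, so
that PAGE_OFFSET is 0xffff800000000000 (consistent with the VirtAddr
values in the vmcore above), and a 1GB ARM64_MEMSTART_ALIGN. The
'sketch_' names are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration only: VA_BITS = 48, 1GB memstart alignment. */
#define SKETCH_PAGE_OFFSET    UINT64_C(0xffff800000000000)
#define SKETCH_MEMSTART_ALIGN UINT64_C(0x40000000)

/* Round addr down to a power-of-two alignment. */
static uint64_t sketch_round_down(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Linear-map virtual address for phys, given an assumed base of RAM. */
static uint64_t sketch_phys_to_virt(uint64_t phys, uint64_t phys_base)
{
    return (phys - phys_base) | SKETCH_PAGE_OFFSET;
}
```

With the values from this thread, sketch_round_down(0x200000,
SKETCH_MEMSTART_ALIGN) is 0, so with a base of 0 the physical address
0x180000000 maps to 0xffff800180000000. A tool that instead takes
0x200000 from '/proc/iomem' as the base gets 0xffff80017fe00000, i.e.
it is off by the 2MB hole.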

>> 'memstart_addr' is that it indicates the start of the physical ram,
>> but in this case there is a hole at the start of the system ram
>> visible in Linux (and thus to user-space), yet 'memstart_addr' is
>> still 0, which seems contradictory at the least. This causes PHYS_OFFSET
>> to be 0 as well, which is again contradictory.
>
>
>>>> This happens because the kexec-tools rely on '/proc/iomem' contents
>>>> while 'memstart_addr' is computed as 0 by the kernel (as the value of
>>>> memblock_start_of_DRAM() < ARM64_MEMSTART_ALIGN).
>>>
>>>> Returning back to this patch, this is a generic requirement where we
>>>> need the linear region start/base addresses in user-space applications
>>>> which is used to read addresses which lie in the linear region (for
>>>> e.g. when we read /proc/kcore contents).
>
> [...]
>
>>> This patch adds a variable that nothing uses, it's going to be removed. You can't
>>> depend on reading this via /dev/mem.
>>>
>>> Could you add the information you need as an elf-note to the vmcore instead? You
>>> must already pick these up to handle kaslr. (from memory, this is where the
>>> kaslr-offset is described to user-space after we kdump).
>
>
>> No you are mixing up the two cases (please see above), the issue which
>> this patch fixes is for use cases where we don't have the vmcore
>> available in case of 'live' debugging via makedumpfile and crash tools
>> (we only have '/proc/kcore' or 'vmlinux' available in such cases). I
>> detailed the use case in [1] better (in a reply to Ard), I will detail
>> the use-case again below:
>
> Okay, so not kdump...
>
>
>> One specific use case that I am working on at the moment is the
>> makedumpfile '--mem-usage', which allows one to see the page numbers
>> of current system (1st kernel) in different use (please see
>> MAKEDUMPFILE(8) for more details).
>
> https://linux.die.net/man/8/makedumpfile :
> | Name: makedumpfile - make a small dumpfile of kdump
>
> ... but now we are talking about kdump again ...
>
>
>> Using this we can know how many pages are dumpable when different
>> dump_level is specified when invoking the makedumpfile.
>>
>> Normally, makedumpfile analyses the contents of '/proc/kcore' (while
>> excluding the crashkernel range), and then calculates the page number
>> of different kind per vmcoreinfo.
>
> $ apt-get source makedumpfile
> $ cd makedumpfile-1.5.3
> $ grep -r "kcore" .
> $
>
> I suspect there are two pieces of software with the same name here.

Here is the makedumpfile upstream git tree -
git://git.code.sf.net/p/makedumpfile/code

$ grep -r "kcore" .

./elf_info.c:int set_kcore_vmcoreinfo(uint64_t vmcoreinfo_addr, uint64_t vmcoreinfo_len)
<..snip..>
./makedumpfile.8:# makedumpfile \-f \-\-mem\-usage /proc/kcore
<..snip..>

>> This use case requires directly reading '/proc/kcore', and
>> hence the PAGE_OFFSET value is used to determine the base address of
>> the linear region, whose value is not static in case of KASLR boot.
>
> Eh? I thought PAGE_OFFSET was a compile-time constant, and it was PHYS_OFFSET
> that has a value other than the aligned base of memory for KASLR.

Indeed, I tried to capture the dilemma in [1], just to recap:

'arch/arm64/include/asm/memory.h' defines PAGE_OFFSET as:

/*
 * PAGE_OFFSET - the virtual address of the start of the linear map (top
 *         (VA_BITS - 1))
 */
#define PAGE_OFFSET        (UL(0xffffffffffffffff) - \
    (UL(1) << (VA_BITS - 1)) + 1)
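
For reference, a stand-alone sketch that evaluates the same expression
in user space. The function name is hypothetical, and VA_BITS is passed
in as a parameter because its value depends on the kernel
configuration:

```c
#include <assert.h>
#include <stdint.h>

/* Same expression as the PAGE_OFFSET macro, with VA_BITS as a parameter. */
static uint64_t sketch_page_offset(unsigned int va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    return UINT64_C(0xffffffffffffffff) - (UINT64_C(1) << (va_bits - 1)) + 1;
}
```

Assuming VA_BITS = 48, this gives 0xffff800000000000, which is a
constant for a given build regardless of where RAM starts or whether
KASLR is active.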

However, for the KASLR case, we set 'memstart_offset_seed' from
the top 16 bits of the 'kaslr-seed', to randomize the linear region, in
'arch/arm64/kernel/kaslr.c':

u64 __init kaslr_early_init(u64 dt_phys)
{
<snip..>
    /* use the top 16 bits to randomize the linear region */
    memstart_offset_seed = seed >> 48;
<snip..>
}
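
To illustrate why the linear base then moves at runtime, below is a
rough sketch of how the 16-bit seed ends up shifting 'memstart_addr'.
It is modelled on my reading of arm64_memblock_init() in
'arch/arm64/mm/init.c', not copied from it; the function names, the
parameter passing and the example numbers are assumptions:

```c
#include <assert.h>
#include <stdint.h>

/* Top 16 bits of the 64-bit seed, as in 'seed >> 48'. */
static uint16_t sketch_seed16(uint64_t seed)
{
    return (uint16_t)(seed >> 48);
}

/*
 * Shift memstart down by a seed-dependent multiple of 'align'.
 * 'range' is the spare room in the linear region (assumed input).
 */
static int64_t sketch_randomize_memstart(int64_t memstart, uint64_t range,
                                         uint16_t seed16, uint64_t align)
{
    assert(align != 0);
    if (seed16 > 0 && range >= align) {
        range = range / align + 1;
        memstart -= (int64_t)(align * ((range * seed16) >> 16));
    }
    return memstart;
}
```

For example, with 4GB of spare room, 1GB alignment and a seed whose top
16 bits are 0x8000, 'memstart_addr' moves down by 2GB, so the same
physical address lands at a different linear-map virtual address even
though PAGE_OFFSET itself has not changed.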

So, either we should have a uniform way of representing the virtual
base of the linear region in both the KASLR and non-KASLR boot cases
(macro or variable?), or we should look at removing the PAGE_OFFSET
usage from the kernel (or at least the confusing comment from
'memory.h'). Again, please see [1] for the suggested approaches (bottom
part of the query).

>
>> Another use-case is where the crash-utility uses the PAGE_OFFSET value
>> to perform a virtual-to-physical conversion for the address lying in
>> the linear region:
>
> In all cases the problem you have is assuming the first 'System RAM' value in
> /proc/iomem is the base of DRAM, which you can use as PHYS_OFFSET in your
> user-space phys2virt() calculation.
>
> What information do you need to make this work?
>
> You can evidently read kernel variables, why can't you read memstart_addr and do:
> | #define __phys_to_virt(x)                             \
> |                       ((unsigned long)((x) - memstart_addr) | PAGE_OFFSET)
>
> based on the physical addresses in /proc/iomem, and PAGE_OFFSET pulled out of
> the vmlinux.
>
> Reading memstart_addr is fragile, we might need to rename it
> wednesday_memstart_addr. If user-space needs this value to work with
> /proc/{kcore,vmcore} we should expose something like 'p2v_offset' as an elf-note
> on those files. (looks like they both have elf-headers).

Again, I had suggested reading memstart_addr as one of the approaches
in [1], but it seems we couldn't reach a conclusion, so I sent out this
approach to trigger another round of discussion.

BTW adding 'p2v_offset' as an elf-note seems like a good idea. If this
seems suitable, I can try and spin patch(es) using this approach (both
for the kernel and user-space tools).
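
As a strawman for what the tools would do with such a note, here is a
minimal sketch. The note does not exist yet, so its name and semantics
are assumptions: I take 'p2v_offset' to be the value that, added to a
physical address, gives its linear-map virtual address (i.e.
PAGE_OFFSET - memstart_addr). The helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Physical to linear-map virtual, using the (hypothetical) note value. */
static uint64_t sketch_p2v(uint64_t phys, uint64_t p2v_offset)
{
    return phys + p2v_offset;
}

/* Linear-map virtual to physical; virt must not be below the offset. */
static uint64_t sketch_v2p(uint64_t virt, uint64_t p2v_offset)
{
    assert(virt >= p2v_offset);
    return virt - p2v_offset;
}
```

With this, user space would no longer need to guess the base of RAM
from '/proc/iomem' or read 'memstart_addr' directly; the same two
helpers would work for both the KASLR and non-KASLR cases.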

Please share your views,

[1] https://www.spinics.net/lists/arm-kernel/msg655933.html

Thanks,
Bhupesh
