Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

Anup Patel <anup@xxxxxxxxxxxxxx> · Wed, 22 Jan 2014 14:27:29 +0530

Hi Alex,

On Wed, Jan 22, 2014 at 12:11 PM, Alexander Graf <agraf@xxxxxxx> wrote:
>
>
>> Am 22.01.2014 um 07:31 schrieb Anup Patel <anup@xxxxxxxxxxxxxx>:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@xxxxxxxxxx> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a bit heated chat :) on this
>>> subject last night. Christoffer, really appreciate
>>> your time! We did not really reach agreement
>>> during the chat and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is very long email.
>>>
>>> I don't believe we can assign any endianity to
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len acts just memcpy and that is all. As
>>> memcpy does not imply any endianity of underlying
>>> data mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is array of bytes that contains memory
>>> bytes in such form, for read case, that if those
>>> bytes are placed in guest memory and guest executes
>>> the same read access instruction with address to this
>>> memory, result would be the same as real h/w device
>>> memory access. Rest of KVM host and hypervisor
>>> part of code should really take care of mmio.data[]
>>> memory so it will be delivered to vcpu registers and
>>> restored by hypervisor part in such way that guest CPU
>>> register value is the same as it would be for real
>>> non-emulated h/w read access (that is emulation part).
>>> The same goes for write access, if guest writes into
>>> memory and those bytes are just copied to emulated
>>> h/w register it would have the same effect as real
>>> mapped h/w register write.
>>>
>>> In shorter form, i.e for len=4 access: endianity of integer
>>> at &mmio.data[0] address should match endianity
>>> of emulated h/w device behind phys_addr address,
>>> regardless what is endianity of emulator, KVM host,
>>> hypervisor, and guest
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are possible values of 'len'? Previous discussions
>>> imply that is always powers of 2. Why is that? Maybe
>>> there will be CPU that would need to do 5 bytes mmio
>>> access, or 6 bytes. How do you assign endianity to
>>> such case? 'len' 5 or 6, or any works fine with
>>> memcpy semantics. I admit it is hypothetical case, but
>>> IMHO it tests how clean ABI definition is.
>>>
>>> x) Byte array does not have endianity because it
>>> does not have any structure. If one would want to
>>> imply structure why mmio is not defined in such way
>>> so structure reflected in mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len is really serves as union discriminator and
>>> only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, endianity of integer types
>>> should be defined. I believe, use of byte array strongly
>>> implies that original intent was to have semantics of
>>> byte stream copy, just like memcpy does.
>>>
>>> x) Note there is nothing wrong with user kernel ABI to
>>> use just bytes stream as parameter. There is already
>>> precedents like 'read' and 'write' system calls :).
>>>
>>> x) Consider case when KVM works with emulated memory mapped
>>> h/w devices where some devices operate in LE mode and others
>>> operate in BE mode. It is defined by semantics of real h/w
>>> device which is it, and should be emulated by emulator and KVM
>>> given all other context. As far as mmio.data[] array concerned, if the
>>> same integer value is read from these devices registers, mmio.data[]
>>> memory should contain integer in opposite endianity for these
>>> two cases, i.e MSB is data[0] in one case and MSB is
>>> data[3] is in another case. It cannot be the same, because
>>> except emulator and guest kernel, all other, like KVM host
>>> and hypervisor, have no clue what endianity of device
>>> actually is - it should treat mmio.data[] in the same way.
>>> But resulting guest target CPU register would need to contain
>>> normal integer value in one case and byteswapped in another,
>>> because guest kernel would use it directly in one case and
>>> byteswap it in another. Byte stream semantics allows to do
>>> that. I don't see how it could happen if you fixate mmio.data[]
>>> endianity in such way that it would contain integer in
>>> the same format for BE and LE emulated device types.
>>>
>>> If by this point you agree, that mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may would like to take a look at below appendix where I
>>> described in great details endianity of data at different
>>> points along mmio processing code path of existing ARM LE KVM,
>>> and proposed ARM BE KVM. Note appendix, is very long and very
>>> detailed, sorry about that, but I feel that earlier more
>>> digested explanations failed, so it driven me to write out
>>> all details how I see them. If I am wrong, I hope it would be
>>> easier for folks to point in detailed explanation places
>>> where my logic goes bad. Also, I am not sure whether this
>>> mail thread is good place to discuss all details described
>>> in the appendix. Christoffer, please advise whether I should take
>>> that one back on [1]. But I hope this bigger picture may help to
>>> see the mmio.data[] semantics issue in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@xxxxxxxxxx> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In above please consider two types of mapped emulated
>>> h/w devices: BE and LE they cannot have mmio.data in the
>>> same endianity. Currently in all observable cases LE ARM
>>> and BE PPC devices endianity matches kernel/qemu
>>> endianity but it would break when BE ARM is introduced
>>> or LE PPC or one would start emulating BE devices on LE
>>> ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow above. Maybe I am missing bigger context.
>>> What is CPU under discussion in above? On ARM V7 system
>>> when LE device is accessed as integer &mmio.data[0] address
>>> would contain integer is in LE format, ie mmio.data[0] is LSB.
>>>
>>> Here is gdb session of LE qemu running on V7 LE kernel and
>>> TC1 LE guest. Guest kernel accesses sys_cfgstat register which is
>>> arm_sysctl registers with offset of 0xa8. Note.arm_sysct is memory
>>> mapped LE device.
>>> Please check run->mmio structure after read
>>> (cpu_physical_memory_rw) completes it is in 4 bytes integer in
>>> LE format mmio.data[0] is LSB and is equal to 1
>>> (s->syscfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at adjust_endianness function and
>>> struct MemoryRegion 'endianness' field. IMHO in qemu it
>>> works quite nicely already. MemoryRegion 'read' and 'write'
>>> callbacks return/get data in native format adjust_endianness
>>> function checks whether emulated device endianness matches
>>> emulator endianness and if it is different it does byteswap
>>> according to size. As in above example arm_sysctl_ops memory
>>> region should be marked as DEVICE_LITTLE_ENDIAN when it
>>> returns s->sys_cfgstat value LE qemu sees that endianity
>>> matches and it does not byteswap of result, so integer at
>>> &mmio.data[0] address is in LE form. When qemu would
>>> run in BE mode on BE kernel, it would see that endianity
>>> mismatches and it will byteswap s->sys_cfgstat native value
>>> (BE), so mmio.data would contain integer in LE format again.
>>>
>>> Note in currently committed code arm_sysctl_ops endianity
>>> is DEVICE_NATIVE_ENDIAN, which is wrong - real vexpress
>>> arm_sysctl device always gives/receives data in LE format regardless
>>> of current CPSR E bit value, so it cannot be marked as NATIVE.
>>> LE and BE kernels always read it as LE device; BE kernel follows
>>> with byteswap. It was OK while we just run qemu in LE, but it
>>> should be fixed to be LITTLE_ENDIAN for BE qemu work correctly
>>> ... and actually that device and few other ARM specific devices
>>> endianity change to LITTLE_ENDIAN was the only change in qemu
>>> to make BE KVM to work.
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks endianity
>>> of data how it travels from emulator to guest CPU register, in
>>> case of ARM KVM. It starts with currently committed code for LE
>>> KVM host case and further discusses proposed BE KVM host
>>> arrangement.
>>>
>>> Just to restrict discussion writeup considers code path of
>>> integer (4 bytes) read from h/w mapped emulated device memory.
>>> Writeup considers endianity of essential places involved in such
>>> code path.
>>>
>>> For all cases when endianity is defined, it is assumed that
>>> values under consideration are in memory (opposite to be in
>>> register that does not have endianity). I.e even if function
>>> variable could be actually allocated in CPU register writeup
>>> will reference to it as it is in memory, just to keep
>>> discussion clean, except for final guest CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds integer value to be read, assume
>>> it would be global 'int emulated_hw_device_val' variable.
>>> Normally in emulator it is held in native endian format - i.e
>>> it is CPSR E bit is the same as kernel CPSR E bit. Just for
>>> discussion sake assume that this h/w device registers
>>> holds 5 as its value.
>>>
>>> 2) KVM_EXIT_MMIO part of 'struct kvm_run' structure, i.e
>>> mmio.data byte array. Byte array does not have endianity,
>>> but for this discussion it would track endianity of integer
>>> at &mmio.data[0] address
>>>
>>> 3) 'data' variable type of 'unsigned long' in
>>> kvm_handle_mmio_return function before vcpu_data_host_to_guest
>>> call. KVM host mmio_read_buf function is used to fill this
>>> variable from mmio.data buffer. mmio_read_buf actually
>>> acts as memcpy from mmio.data buffer address,
>>> just taking access size in account.
>>>
>>> 4) the same 'data' variable as above, but after
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest function may byteswap value of 'data'
>>> depending on current KVM host endianity and value of
>>> guest CPSR E bit.
>>>
>>> 5) guest CPU spilled register array, location of target register
>>> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally guest CPU register filled from vcpu_reg just before
>>> guest resume execution of trapped emulated instruction. Note
>>> it is done by hypervisor part of code and hypervisor EE bit is
>>> the same as KVM host CPSR E bit.
>>>
>>> Note again, KVM host, emulator, and hypervisor part of code (guest
>>> CPU registers save and restore code) always run in the same
>>> endianity. Endianity of accessed emulated devices and endianity
>>> of guest varies independently of KVM host endianity.
>>>
>>> Below sections consider all permutations of all possible cases,
>>> it maybe quite boring to read. I've created summary table at
>>> the end, you can jump to the table, after reading few cases.
>>> But if you have objections and you see things happen differently
>>> please comment inline of the use cases steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after 'ldr r1, [r0]'
>>> instruction, where r0 holds address of devices, it knows
>>> that it reads LE mapped h/w so no addition processing is
>>> needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode (E bit on), it knows that it reads
>>> LE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceed with result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode (E bit off), it knows that it
>>> reads BE device memory, it need to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> Above uses cases that is exactly what we have now after Marc's
>>> commit to support BE guest on LE KVM host. Further use
>>> cases describe how it would work with BE KVM patches I proposed.
>>> It is understood that it is subject of further discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads LE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads LE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads BE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Note that with BE kernel we actually have some initial portion
>>> of assembler code that is executed with CPSR bit off and it reads
>>> LE h/w - i.e it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
>>> Few objservations
>>> =================
>>>
>>> x) Note above table is symmetric wrt to BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) &mmio.data[0] address always holds integer in the same
>>> format as emulated device endianity
>>>
>>> x) During step 4) when vcpu_data_host_to_guest function
>>> is used, if guest E bit value different, but everything else
>>> is the same, opposite result are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you reached to this end :), again, thank you very much for
>>> reading it!
>>>
>>> - Victor
>>> _______________________________________________
>>> kvmarm mailing list
>>> kvmarm@xxxxxxxxxxxxxxxxxxxxx
>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian" which is LE for ARM and BE for PPC. There shouldn't be a qemu-system-armeb or qemu-system-ppc64le.
>
> QEMU emulates everything that comes after the CPU, so imagine the ioctl struct as a bus package. Your bus doesn't care what endianness the CPU is in - it just gets data from the CPU.
>
> A bus write on the CPU however honors the endianness setting of the CPU. So when we convert from a value in register to a value on the bus we need to take this endian configuration into account.
>
> That's exactly what we are talking about here. KVM should do the cpu configured register->bus endian mapping while QEMU does the bus->device endian map.

Thanks for the info on QEMU side handling of MMIO data.

I was not aware that we would be only have "target endian = LE"
for ARM/ARM64 in QEMU. I think Marc Z had mentioned similar
thing about MMIO this in our previous discussions on his patches.
(Please refer, http://www.spinics.net/lists/arm-kernel/msg283313.html)

This clearly means MMIO data passed to user space (QEMU) has
to of host endianness so that QEMU can take care of bust->device
endian map.

Current vcpu_data_guest_to_host() and vcpu_data_host_to_guest()
does not perform endianness conversion of MMIO data to LE when
we are running LE guest on BE host so we do need Victor's patch
for fixing vcpu_data_guest_to_host() and vcpu_data_host_to_guest().
(Already reported long time back by me,
http://www.spinics.net/lists/arm-kernel/msg283308.html)

Regards,
Anup

>
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for cross-endian
>> Guest. For supporting cross-endian Guest we need to have
>> all MMIO device with fixed ENDIANESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will have not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope above summary of my understanding is as-per your
>> description. If so then I am in-support of your description.
>>
>> I think your description (and above 5 points) takes care of
>> all use cases of cross-endianness without changing current
>> MMIO ABI.
>>
>> Regards,
>> Anup
>>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html