Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

On 21 January 2014 22:41, Alexander Graf <agraf@xxxxxxx> wrote:
>
>
>> On 22.01.2014 at 07:31, Anup Patel <anup@xxxxxxxxxxxxxx> wrote:
>>
>> On Wed, Jan 22, 2014 at 11:09 AM, Victor Kamensky
>> <victor.kamensky@xxxxxxxxxx> wrote:
>>> Hi Guys,
>>>
>>> Christoffer and I had a somewhat heated chat :) on this
>>> subject last night. Christoffer, I really appreciate
>>> your time! We did not really reach agreement
>>> during the chat, and Christoffer asked me to follow
>>> up on this thread.
>>> Here it goes. Sorry, it is a very long email.
>>>
>>> I don't believe we can assign any endianity to the
>>> mmio.data[] byte array. I believe mmio.data[] and
>>> mmio.len act just like memcpy, and that is all. As
>>> memcpy does not imply any endianity of the underlying
>>> data, mmio.data[] should not either.
>>>
>>> Here is my definition:
>>>
>>> mmio.data[] is an array of bytes that, for the read case,
>>> contains memory bytes in such a form that if those
>>> bytes were placed in guest memory and the guest executed
>>> the same read access instruction with the address of this
>>> memory, the result would be the same as a real h/w device
>>> memory access. The rest of the KVM host and hypervisor
>>> code should take care of the mmio.data[] memory so that
>>> it is delivered to the vcpu registers and restored by the
>>> hypervisor part in such a way that the guest CPU
>>> register value is the same as it would be for a real
>>> non-emulated h/w read access (that is the emulation part).
>>> The same goes for write access: if the guest writes into
>>> memory and those bytes are just copied to the emulated
>>> h/w register, it has the same effect as a real
>>> mapped h/w register write.
>>>
>>> In shorter form, e.g. for a len=4 access: the endianity of
>>> the integer at the &mmio.data[0] address should match the
>>> endianity of the emulated h/w device behind the phys_addr
>>> address, regardless of the endianity of the emulator, KVM
>>> host, hypervisor, and guest.
>>>
>>> Examples that illustrate my definition
>>> --------------------------------------
>>>
>>> 1) LE guest (E bit is off in ARM speak) reads integer
>>> (4 bytes) from mapped h/w LE device register -
>>> mmio.data[3] contains MSB, mmio.data[0] contains LSB.
>>>
>>> 2) BE guest (E bit is on in ARM speak) reads integer
>>> from mapped h/w LE device register - mmio.data[3]
>>> contains MSB, mmio.data[0] contains LSB. Note that
>>> if &mmio.data[0] memory would be placed in guest
>>> address space and instruction restarted with new
>>> address, then it would meet BE guest expectations
>>> - the guest knows that it reads LE h/w so it will byteswap
>>> register before processing it further. This is BE guest ARM
>>> case (regardless of what KVM host endianity is).
>>>
>>> 3) BE guest reads integer from mapped h/w BE device
>>> register - mmio.data[0] contains MSB, mmio.data[3]
>>> contains LSB. Note that if &mmio.data[0] memory would
>>> be placed in guest address space and instruction
>>> restarted with new address, then it would meet BE
>>> guest expectation - the guest knows that it reads
>>> BE h/w so it will proceed further without any other
>>> work. I guess, it is BE ppc case.
>>>
>>>
>>> Arguments in favor of memcpy semantics of mmio.data[]
>>> ------------------------------------------------------
>>>
>>> x) What are the possible values of 'len'? Previous discussions
>>> imply that it is always a power of 2. Why is that? Maybe
>>> there will be a CPU that needs to do a 5-byte mmio
>>> access, or a 6-byte one. How do you assign endianity in
>>> such a case? A 'len' of 5 or 6, or any other value, works
>>> fine with memcpy semantics. I admit it is a hypothetical
>>> case, but IMHO it tests how clean the ABI definition is.
>>>
>>> x) A byte array does not have endianity because it
>>> does not have any structure. If one wanted to
>>> imply structure, why is mmio not defined in such a way
>>> that the structure is reflected in the mmio definition?
>>> Something like:
>>>
>>>
>>>                /* KVM_EXIT_MMIO */
>>>                struct {
>>>                          __u64 phys_addr;
>>>                          union {
>>>                               __u8 byte;
>>>                               __u16 hword;
>>>                               __u32 word;
>>>                               __u64 dword;
>>>                          }  data;
>>>                          __u32 len;
>>>                          __u8  is_write;
>>>                } mmio;
>>>
>>> where len really serves as the union discriminator and
>>> the only allowed len values are 1, 2, 4, 8.
>>> In this case, I agree, the endianity of the integer types
>>> should be defined. I believe the use of a byte array strongly
>>> implies that the original intent was to have the semantics
>>> of a byte stream copy, just like memcpy.
>>>
>>> x) Note there is nothing wrong with a user/kernel ABI that
>>> uses just a byte stream as a parameter. There are already
>>> precedents, like the 'read' and 'write' system calls :).
>>>
>>> x) Consider the case when KVM works with emulated memory
>>> mapped h/w devices where some devices operate in LE mode and
>>> others operate in BE mode. Which one it is, is defined by the
>>> semantics of the real h/w device, and should be emulated by
>>> the emulator and KVM given all other context. As far as the
>>> mmio.data[] array is concerned, if the same integer value is
>>> read from these devices' registers, mmio.data[] memory should
>>> contain the integer in opposite endianity for these
>>> two cases, i.e. MSB is data[0] in one case and MSB is
>>> data[3] in the other. It cannot be the same, because
>>> apart from the emulator and guest kernel, everything else,
>>> like the KVM host and hypervisor, has no clue what the
>>> endianity of the device actually is, so it must treat
>>> mmio.data[] the same way in both cases.
>>> But the resulting guest target CPU register would need to
>>> contain a normal integer value in one case and a byteswapped
>>> one in the other, because the guest kernel would use it
>>> directly in one case and byteswap it in the other. Byte
>>> stream semantics allow that. I don't see how it could happen
>>> if you fix the mmio.data[] endianity in such a way that it
>>> contains the integer in the same format for BE and LE
>>> emulated device types.
>>>
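The memcpy reading of mmio.data[] argued for in the list above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: struct mmio_exit is a local stand-in that mirrors the KVM_EXIT_MMIO fields, and mmio_complete_read and device_bytes are hypothetical names, not KVM or QEMU API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in mirroring the KVM_EXIT_MMIO fields (illustrative only). */
struct mmio_exit {
    uint64_t phys_addr;
    uint8_t  data[8];
    uint32_t len;
    uint8_t  is_write;
};

/*
 * Hypothetical helper: complete a read by copying the bytes the device
 * would present at phys_addr, phys_addr + 1, ... into data[0], data[1], ...
 * No integer type is involved, so no endianness is implied, and any
 * len up to the buffer size (including 5 or 6) is handled the same way.
 */
static void mmio_complete_read(struct mmio_exit *m, const uint8_t *device_bytes)
{
    assert(m->len <= sizeof(m->data));
    memcpy(m->data, device_bytes, m->len);
}
```

With len = 5, the first five device bytes land unchanged in data[0] through data[4], and the remaining bytes of the buffer are not touched.
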
>>> If by this point you agree that the mmio.data[] user-land/kernel
>>> ABI semantics should be just memcpy, stop reading :). If not,
>>> you may want to take a look at the appendix below, where I
>>> describe in great detail the endianity of the data at different
>>> points along the mmio processing code path of the existing ARM
>>> LE KVM and the proposed ARM BE KVM. Note the appendix is very
>>> long and very detailed, sorry about that, but I feel that
>>> earlier, more digested explanations failed, which drove me to
>>> write out all the details as I see them. If I am wrong, I hope
>>> it will be easier for folks to point to the places in the
>>> detailed explanation where my logic goes bad. Also, I am not
>>> sure whether this mail thread is a good place to discuss all
>>> the details described in the appendix. Christoffer, please
>>> advise whether I should take that back to [1]. But I hope this
>>> bigger picture may help to see the mmio.data[] semantics issue
>>> in context.
>>>
>>> More inline and appendix is at the end.
>>>
>>>> On 20 January 2014 11:19, Christoffer Dall <christoffer.dall@xxxxxxxxxx> wrote:
>>>>> On Mon, Jan 20, 2014 at 03:22:11PM +0100, Alexander Graf wrote:
>>>>>
>>>>>> On 17.01.2014, at 19:52, Peter Maydell <peter.maydell@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> On 17 January 2014 17:53, Peter Maydell <peter.maydell@xxxxxxxxxx> wrote:
>>>>>>> Specifically, the KVM API says "here's a uint8_t[] byte
>>>>>>> array and a length", and the current QEMU code treats that
>>>>>>> as "this is a byte array written as if the guest CPU
>>>>>>> (a) were in TARGET_WORDS_BIGENDIAN order and (b) wrote its
>>>>>>> I/O access to this buffer rather than to the device".
>>>>>>>
>>>>>>> The KVM API docs don't actually specify the endianness
>>>>>>> semantics of the byte array, but I think that that really
>>>>>>> needs to be nailed down. I can think of a couple of options:
>>>>>>> * always LE
>>>>>>> * always BE
>>>>>>>  [these first two are non-starters because they would
>>>>>>>  break either x86 or PPC existing code]
>>>>>>> * always the endianness the guest is at the time
>>>>>>> * always some arbitrary endianness based purely on the
>>>>>>>  endianness the KVM implementation used historically
>>>>>>> * always the endianness of the host QEMU binary
>>>>>>> * something else?
>>>>>>>
>>>>>>> Any preferences? Current QEMU code basically assumes
>>>>>>> "always the endianness of TARGET_WORDS_BIGENDIAN",
>>>>>>> which is pretty random.
>>>>>>
>>>>>> Having thought a little more about this, my opinion is:
>>>>>>
>>>>>> * we should specify that the byte order of the mmio.data
>>>>>>  array is host kernel endianness (ie same endianness
>>>>>>  as the QEMU process itself) [this is what it actually
>>>>>>  is, I think, for all the cases that work today]
>>>
>>> In the above, please consider two types of mapped emulated
>>> h/w devices, BE and LE: they cannot have mmio.data in the
>>> same endianity. Currently, in all observable cases (LE ARM
>>> and BE PPC), device endianity matches kernel/qemu
>>> endianity, but that would break when BE ARM or LE PPC
>>> is introduced, or when one starts emulating BE devices
>>> on LE ARM.
>>>
>>>>>> * we should fix the code path in QEMU for handling
>>>>>>  mmio.data which currently has the implicit assumption
>>>>>>  that when using KVM TARGET_WORDS_BIGENDIAN is the same
>>>>>>  as the QEMU host process endianness (because it's using
>>>>>>  load/store functions which swap if TARGET_WORDS_BIGENDIAN
>>>>>>  is different from HOST_WORDS_BIGENDIAN)
>>>
>>> I do not follow the above. Maybe I am missing the bigger
>>> context. Which CPU is under discussion in the above? On an
>>> ARM V7 system, when an LE device is accessed as an integer,
>>> the &mmio.data[0] address contains the integer in LE format,
>>> i.e. mmio.data[0] is the LSB.
>>>
>>> Here is a gdb session of LE qemu running on a V7 LE kernel
>>> with a TC1 LE guest. The guest kernel accesses the sys_cfgstat
>>> register, which is the arm_sysctl register at offset 0xa8.
>>> Note that arm_sysctl is a memory mapped LE device.
>>> Please check the run->mmio structure after the read
>>> (cpu_physical_memory_rw) completes: it holds a 4-byte integer
>>> in LE format, mmio.data[0] is the LSB and is equal to 1
>>> (the s->sys_cfgstat value):
>>>
>>> (gdb) bt
>>> #0  arm_sysctl_read (opaque=0x95a600, offset=168, size=4) at
>>> /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> #1  0x0023b9b4 in memory_region_read_accessor (mr=0x95b8e0,
>>> addr=<optimized out>, value=0xb5c0dc18, size=4, shift=0,
>>> mask=4294967295)
>>>    at /home/root/20131219/qemu-be/memory.c:407
>>> #2  0x0023aba4 in access_with_adjusted_size (addr=4294967295,
>>> value=0xb5c0dc18, value@entry=0xb5c0dc10, size=size@entry=4,
>>> access_size_min=1,
>>>    access_size_max=2357596, access=access@entry=0x23b96c
>>> <memory_region_read_accessor>, mr=mr@entry=0x95b8e0) at
>>> /home/root/20131219/qemu-be/memory.c:477
>>> #3  0x0023f95c in memory_region_dispatch_read1 (size=4, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:944
>>> #4  memory_region_dispatch_read (size=4, pval=0xb5c0dc68, addr=168,
>>> mr=0x95b8e0) at /home/root/20131219/qemu-be/memory.c:966
>>> #5  io_mem_read (mr=mr@entry=0x95b8e0, addr=<optimized out>,
>>> pval=pval@entry=0xb5c0dc68, size=size@entry=4) at
>>> /home/root/20131219/qemu-be/memory.c:1743
>>> #6  0x001abd38 in address_space_rw (as=as@entry=0x8102d8
>>> <address_space_memory>, addr=469827752, buf=buf@entry=0xb6fd6028 "",
>>> len=4, is_write=false,
>>>    is_write@entry=true) at /home/root/20131219/qemu-be/exec.c:2025
>>> #7  0x001abf90 in cpu_physical_memory_rw (addr=<optimized out>,
>>> buf=buf@entry=0xb6fd6028 "", len=<optimized out>, is_write=0)
>>>    at /home/root/20131219/qemu-be/exec.c:2070
>>> #8  0x00239e00 in kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1701
>>> #9  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #10 0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #11 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #12 0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x s->sys_cfgstat
>>> $25 = 0x1
>>> (gdb) finish
>>> Run till exit from #0  arm_sysctl_read (opaque=0x95a600, offset=168,
>>> size=4) at /home/root/20131219/qemu-be/hw/misc/arm_sysctl.c:127
>>> memory_region_read_accessor (mr=0x95b8e0, addr=<optimized out>,
>>> value=0xb5c0dc18, size=4, shift=0, mask=4294967295) at
>>> /home/root/20131219/qemu-be/memory.c:408
>>> 408        trace_memory_region_ops_read(mr, addr, tmp, size);
>>> Value returned is $26 = 1
>>> (gdb) enable 2
>>> (gdb) cont
>>> Continuing.
>>>
>>> Breakpoint 2, kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> 1660            kvm_arch_pre_run(cpu, run);
>>> (gdb) bt
>>> #0  kvm_cpu_exec (cpu=cpu@entry=0x8758f8) at
>>> /home/root/20131219/qemu-be/kvm-all.c:1660
>>> #1  0x001a3f78 in qemu_kvm_cpu_thread_fn (arg=0x8758f8) at
>>> /home/root/20131219/qemu-be/cpus.c:874
>>> #2  0xb6cae06c in start_thread (arg=0xb5c0e310) at pthread_create.c:314
>>> #3  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> #4  0xb69f5070 in ?? () at
>>> ../ports/sysdeps/unix/sysv/linux/arm/clone.S:97 from /lib/libc.so.6
>>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>> (gdb) p /x run->mmio
>>> $27 = {phys_addr = 0x1c0100a8, data = {0x1, 0x0, 0x0, 0x0, 0x0, 0x0,
>>> 0x0, 0x0}, len = 0x4, is_write = 0x0}
>>>
>>> Also please look at the adjust_endianness function and the
>>> struct MemoryRegion 'endianness' field. IMHO in qemu it
>>> works quite nicely already. The MemoryRegion 'read' and
>>> 'write' callbacks return/get data in native format. The
>>> adjust_endianness function checks whether the emulated device
>>> endianness matches the emulator endianness, and if it is
>>> different it does a byteswap according to size. In the above
>>> example, the arm_sysctl_ops memory region should be marked as
>>> DEVICE_LITTLE_ENDIAN. When it returns the s->sys_cfgstat
>>> value, LE qemu sees that the endianity matches and does not
>>> byteswap the result, so the integer at the &mmio.data[0]
>>> address is in LE form. When qemu runs in BE mode on a BE
>>> kernel, it sees that the endianity mismatches and byteswaps
>>> the native (BE) s->sys_cfgstat value, so mmio.data contains
>>> the integer in LE format again.
>>>
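The byteswap decision described in the paragraph above can be sketched as follows. This is a paraphrase of that description for illustration, not QEMU source: the enum dev_endian values and the swap_bytes and adjust_for_device helpers are hypothetical names, and emulator_is_be stands for the endianness the emulator binary runs in.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the device endianness markings. */
enum dev_endian { DEV_NATIVE, DEV_LITTLE, DEV_BIG };

/* Reverse the low `size` bytes of `val` (size must be 1, 2, 4 or 8). */
static uint64_t swap_bytes(uint64_t val, unsigned size)
{
    uint64_t out = 0;

    assert(size == 1 || size == 2 || size == 4 || size == 8);
    for (unsigned i = 0; i < size; i++) {
        out = (out << 8) | (val & 0xff);
        val >>= 8;
    }
    return out;
}

/*
 * Device callbacks hand back a native integer. Swap it only when the
 * device marking disagrees with the emulator's own endianness, so that
 * the bytes stored in memory end up in device order.
 */
static uint64_t adjust_for_device(uint64_t val, unsigned size,
                                  enum dev_endian dev, int emulator_is_be)
{
    int dev_is_be = (dev == DEV_NATIVE) ? emulator_is_be : (dev == DEV_BIG);

    return (dev_is_be != emulator_is_be) ? swap_bytes(val, size) : val;
}
```

In this model an LE emulator leaves the value 1 from a DEV_LITTLE device untouched, while a BE emulator turns it into 0x01000000, which a BE host stores in memory as the bytes 01 00 00 00, i.e. the LE layout again.
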
>>> Note that in the currently committed code the arm_sysctl_ops
>>> endianity is DEVICE_NATIVE_ENDIAN, which is wrong: the real
>>> vexpress arm_sysctl device always gives/receives data in LE
>>> format regardless of the current CPSR E bit value, so it
>>> cannot be marked as NATIVE. LE and BE kernels always read it
>>> as an LE device; the BE kernel follows with a byteswap. It was
>>> OK while we only ran qemu in LE, but it should be changed to
>>> LITTLE_ENDIAN for BE qemu to work correctly. Actually,
>>> changing the endianity of that device and a few other
>>> ARM-specific devices to LITTLE_ENDIAN was the only change in
>>> qemu needed to make BE KVM work.
>>>
>>>>>
>>>>> Yes, I fully agree :).
>>>> Great, I'll prepare a patch for the KVM API documentation.
>>>>
>>>> -Christoffer
>>>> _______________________________________________
>>>> kvmarm mailing list
>>>> kvmarm@xxxxxxxxxxxxxxxxxxxxx
>>>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm
>>>
>>> Thanks,
>>> Victor
>>>
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-January/thread.html#223186
>>>
>>>
>>>    Appendix
>>>    Data path endianity in ARM KVM mmio
>>>    ===================================
>>>
>>> This writeup considers several scenarios and tracks the
>>> endianity of data as it travels from the emulator to the guest
>>> CPU register, in the case of ARM KVM. It starts with the
>>> currently committed code for the LE KVM host case and then
>>> discusses the proposed BE KVM host arrangement.
>>>
>>> To restrict the discussion, the writeup considers the code
>>> path of an integer (4 bytes) read from emulated memory mapped
>>> h/w device memory, and the endianity at the essential places
>>> involved in that code path.
>>>
>>> In all cases where endianity is defined, it is assumed that
>>> the values under consideration are in memory (as opposed to in
>>> a register, which does not have endianity). I.e. even if a
>>> function variable could actually be allocated in a CPU
>>> register, the writeup refers to it as if it were in memory,
>>> just to keep the discussion clean, except for the final guest
>>> CPU register.
>>>
>>> Let's consider the following places along data path from
>>> emulator to guest CPU register:
>>>
>>> 1) emulator code that holds the integer value to be read;
>>> assume it is a global 'int emulated_hw_device_val' variable.
>>> Normally the emulator holds it in native endian format, i.e.
>>> its CPSR E bit is the same as the kernel CPSR E bit. Just for
>>> the sake of discussion, assume that this h/w device register
>>> holds 5 as its value.
>>>
>>> 2) KVM_EXIT_MMIO part of 'struct kvm_run' structure, i.e
>>> mmio.data byte array. Byte array does not have endianity,
>>> but for this discussion it would track endianity of integer
>>> at &mmio.data[0] address
>>>
>>> 3) the 'data' variable of type 'unsigned long' in the
>>> kvm_handle_mmio_return function, before the
>>> vcpu_data_host_to_guest call. The KVM host mmio_read_buf
>>> function is used to fill this variable from the mmio.data
>>> buffer. mmio_read_buf actually acts as a memcpy from the
>>> mmio.data buffer address, just taking the access size into
>>> account.
>>>
>>> 4) the same 'data' variable as above, but after
>>> vcpu_data_host_to_guest function call, just before it is copied
>>> to vcpu_reg target register location. Note
>>> vcpu_data_host_to_guest function may byteswap value of 'data'
>>> depending on current KVM host endianity and value of
>>> guest CPSR E bit.
>>>
>>> 5) guest CPU spilled register array, location of target register
>>> i.e integer at vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt) address
>>>
>>> 6) finally guest CPU register filled from vcpu_reg just before
>>> guest resume execution of trapped emulated instruction. Note
>>> it is done by hypervisor part of code and hypervisor EE bit is
>>> the same as KVM host CPSR E bit.
>>>
>>> Note again that the KVM host, emulator, and hypervisor part of
>>> the code (guest CPU register save and restore code) always run
>>> in the same endianity. The endianity of the accessed emulated
>>> devices and the endianity of the guest vary independently of
>>> the KVM host endianity.
>>>
>>> The sections below consider all permutations of all possible
>>> cases, which may be quite boring to read. I've created a
>>> summary table at the end; you can jump to the table after
>>> reading a few cases. But if you have objections and you see
>>> things happening differently, please comment inline on the use
>>> case steps.
>>>
>>> LE KVM host
>>> ===========
>>>
>>> Use case 1
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format, matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution ... Let's say after the 'ldr r1, [r0]'
>>> instruction, where r0 holds the address of the device, it
>>> knows that it reads LE mapped h/w, so no additional processing
>>> is needed
>>>
>>> Use case 2
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in LE format; matches device
>>> endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode (E bit on), it knows that it reads
>>> LE device memory, it needs to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceed with result
>>>
>>> Use case 3
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and it should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is off no byteswap)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 0x05000000
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode (E bit off), it knows that it
>>> reads BE device memory, it need to byteswap r1 before further
>>> processing so it does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 4
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is LE (host CPSR E bit is off); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is LE
>>> 2) &mmio.data[0] holds integer in BE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native,
>>> and should match device endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is on, vcpu_data_host_to_guest
>>> will do byteswap: cpu_to_be)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 5 (0x00000005)
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything before further
>>> processing.
>>>
>>>
>>> The above use cases are exactly what we have now after Marc's
>>> commit to support a BE guest on an LE KVM host. The following
>>> use cases describe how it would work with the BE KVM patches I
>>> proposed. It is understood that this is subject to further
>>> discussion.
>>>
>>>
>>> BE KVM host
>>> ===========
>>>
>>> Use case 5
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is LE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads LE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Use case 6
>>> ----------
>>>
>>> Emulated h/w device gives data in LE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in LE format; emulator byteswaps
>>> it because it knows that device endianity is opposite to native;
>>> matches device endianity
>>> 3) 'data' is LE
>>> 4) 'data' is BE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads LE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 7
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in BE mode; and guest does access with CPSR E bit on
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is BE (since guest CPSR E bit is on, BE KVM host kernel
>>> does *not* do byteswap: cpu_to_be no effect in BE host kernel)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is BE
>>> 6) final guest target CPU register contains 5 (0x00000005) because
>>> hypervisor runs in BE mode, so load of BE integer will be OK
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in BE mode, it knows that it reads BE device
>>> memory, so it does not need to do anything else it just proceeds
>>>
>>> Use case 8
>>> ----------
>>>
>>> Emulated h/w device gives data in BE form; emulator and KVM
>>> host endianity is BE (host CPSR E bit is on); guest compiled
>>> in LE mode; and guest does access with CPSR E bit off
>>>
>>> 1) 'emulated_hw_device_val' emulator variable is BE
>>> 2) &mmio.data[0] holds integer in BE format; matches device
>>> endianity
>>> 3) 'data' is BE
>>> 4) 'data' is LE (since guest CPSR E bit is off, BE KVM host kernel
>>> does byteswap: cpu_to_le)
>>> 5) integer at 'vcpu_reg(vcpu, vcpu->arch.mmio_decode.rt)' is LE
>>> 6) final guest target CPU register contains 0x05000000 because
>>> hypervisor runs in BE mode, so load of LE integer will be
>>> byteswapped value in register
>>>
>>> guest resumes execution after 'ldr r1, [r0]', guest kernel
>>> knows that it runs in LE mode, it knows that it reads BE device
>>> memory, it need to byteswap r1 before further processing so it
>>> does 'rev r1, r1' and proceeds with result
>>>
>>> Note that with a BE kernel we actually have some initial
>>> portion of assembler code that is executed with the CPSR E bit
>>> off and reads LE h/w, i.e. it falls into use case 1.
>>>
>>> Summary Table (please use fixed font to see it correctly)
>>> ========================================
>>>
>>> --------------------------------------------------------------
>>> | Use Case # | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   |
>>> --------------------------------------------------------------
>>> | KVM Host,  | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> | Emulator,  |     |     |     |     |     |     |     |     |
>>> | Hypervisor |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Device     | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | LE  | BE  | LE  | BE  | BE  | LE  | BE  | LE  |
>>> | Access     |     |     |     |     |     |     |     |     |
>>> | Endianity  |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Step 1)    | LE  | LE  | LE  | LE  | BE  | BE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 2)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 3)    | LE  | LE  | BE  | BE  | LE  | LE  | BE  | BE  |
>>> --------------------------------------------------------------
>>> | Step 4)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Step 5)    | LE  | BE  | BE  | LE  | LE  | BE  | BE  | LE  |
>>> --------------------------------------------------------------
>>> | Final Reg  | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | value      |     |     |     |     |     |     |     |     |
>>> | byteswapped|     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>> | Guest      | no  | yes | yes | no  | yes | no  | no  | yes |
>>> | Follows    |     |     |     |     |     |     |     |     |
>>> | with rev   |     |     |     |     |     |     |     |     |
>>> --------------------------------------------------------------
>>>
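The summary table above can be cross-checked with a small model. This is a sketch under stated assumptions rather than KVM code: host endianness is simulated with explicit byte-order helpers so the model gives the same answer on any machine, and model_mmio_read, put32, get32 and check_summary_table are hypothetical names. Step 4 is modelled as "swap when host and guest access endianness differ", which is how the use cases describe vcpu_data_host_to_guest.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Write v into 4 bytes in the given byte order (be = 1 for big endian). */
static void put32(uint8_t *b, uint32_t v, int be)
{
    for (int i = 0; i < 4; i++)
        b[i] = (uint8_t)(v >> (8 * (be ? 3 - i : i)));
}

/* Read 4 bytes as an integer in the given byte order. */
static uint32_t get32(const uint8_t *b, int be)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++)
        v |= (uint32_t)b[i] << (8 * (be ? 3 - i : i));
    return v;
}

/* Returns the value the guest register ends up with after the read. */
static uint32_t model_mmio_read(uint32_t dev_val, int host_be,
                                int dev_be, int guest_be)
{
    uint8_t mmio_data[8] = {0};
    uint32_t data;

    put32(mmio_data, dev_val, dev_be);   /* step 2: bytes in device order   */
    data = get32(mmio_data, host_be);    /* step 3: memcpy into host 'data' */
    if (host_be != guest_be)             /* step 4: host-to-guest byteswap  */
        data = bswap32(data);
    return data;                         /* steps 5, 6: stored and loaded   */
                                         /* in host order, value unchanged  */
}

/* Compare the model with the "Final Reg value byteswapped" row. */
static void check_summary_table(void)
{
    /* Columns are use cases 1..8; 0 = LE, 1 = BE. */
    static const int host[8]    = {0, 0, 0, 0, 1, 1, 1, 1};
    static const int dev[8]     = {0, 0, 1, 1, 0, 0, 1, 1};
    static const int guest[8]   = {0, 1, 0, 1, 1, 0, 1, 0};
    static const int swapped[8] = {0, 1, 1, 0, 1, 0, 0, 1};

    for (int i = 0; i < 8; i++) {
        uint32_t reg = model_mmio_read(5, host[i], dev[i], guest[i]);
        assert(reg == (swapped[i] ? 0x05000000u : 5u));
    }
}
```

In this model the final register is byteswapped exactly when the device endianness and the guest access endianness differ, independent of the host endianness.
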
>>> A few observations
>>> ==================
>>>
>>> x) Note the above table is symmetric wrt the BE<->LE change:
>>>       1<-->7
>>>       2<-->8
>>>       3<-->5
>>>       4<-->6
>>>
>>> x) &mmio.data[0] address always holds integer in the same
>>> format as emulated device endianity
>>>
>>> x) During step 4), when the vcpu_data_host_to_guest function
>>> is used, if the guest E bit value differs but everything else
>>> is the same, opposite results are produced (1&2, 3&4, 5&6,
>>> 7&8)
>>>
>>> If you have reached the end :), again, thank you very much
>>> for reading it!
>>>
>>> - Victor
>>
>> Hi Victor,
>>
>> First of all I really appreciate the thorough description with
>> all the use-cases.
>>
>> Below would be a summary of what I understood from your
>> analysis:
>>
>> 1. Any MMIO device marked as NATIVE ENDIAN in user
>
> "Native endian" really is just a shortcut for "target endian"
> which is LE for ARM and BE for PPC. There shouldn't be
> a qemu-system-armeb or qemu-system-ppc64le.

I disagree. A fully functional ARM BE system is what we have
been working on for the last few months. 'We' is the Linaro
Networking Group Endian subteam, plus some other people
at ARM and across the community. Why we are doing that is a bit
beyond the scope of this discussion.

ARM BE patches for both V7 and V8 are already in the mainline
kernel, but the ARM BE KVM host is currently broken. It is a known
deficiency that I am trying to fix. Please look at [1]. Patches
for V7 BE KVM have been proposed and are currently under active
discussion. I am currently working on the ARM V8 BE KVM changes.

So "native endian" on ARM is the value of the CPSR register E bit.
If it is off, native endian is LE; if it is on, it is BE.

If and when we agree on the ARM BE KVM host changes, the
next step would be qemu patches, one of which introduces
qemu-system-armeb. Please see [2].

> QEMU emulates everything that comes after the CPU, so
> imagine the ioctl struct as a bus package. Your bus
> doesn't care what endianness the CPU is in - it just
> gets data from the CPU.

I am not sure that I follow the above. Suppose I have

mov r1, #1
str r1, [r0]

where r0 holds the device address. Depending on the CPSR
E bit value, the device address will receive 1 as an integer in
either LE order or BE order. That is how an ARM v7 CPU
works, regardless of whether it is emulated or not.

So if E bit is off (LE case) after str is executed
 byte at r0 address will get 1
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 0

If E bit is on (BE case) after str is executed
 byte at r0 address will get 0
 byte at r0 + 1 address will get 0
 byte at r0 + 2 address will get 0
 byte at r0 + 3 address will get 1

My point is that mmio.data[] just carries the bytes for phys_addr:
mmio.data[0] is the value of the byte at phys_addr,
mmio.data[1] is the value of the byte at phys_addr + 1, and
so on.
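
As a minimal sketch of the byte layout listed above, the helper below produces the bytes that a store of a register value would place at consecutive addresses for either E bit setting. The name store_as_guest is hypothetical and used only for illustration; it is not part of KVM or QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: lay out `reg` as the `len` bytes a store would
 * write starting at the target address. bytes[i] is the byte that ends
 * up at address + i, which is what mmio.data[i] would carry.
 */
static void store_as_guest(uint8_t *bytes, uint64_t reg, size_t len, int e_bit)
{
    assert(len >= 1 && len <= 8);
    for (size_t i = 0; i < len; i++) {
        /* E bit off: lowest address gets the LSB.
         * E bit on:  lowest address gets the MSB. */
        size_t shift = e_bit ? (len - 1 - i) : i;

        bytes[i] = (uint8_t)(reg >> (8 * shift));
    }
}
```

Calling it with reg = 1, len = 4 gives the bytes 1, 0, 0, 0 with the E bit off and 0, 0, 0, 1 with the E bit on, matching the two listings.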

> A bus write on the CPU however honors the endianness
> setting of the CPU. So when we convert from a value in
> register to a value on the bus we need to take this endian
> configuration into account.

For a read it is the same: mmio.data[] just carries the memory
bytes for the emulated phys_addr, as in the write case.

But if one wanted to look at the endianity of the integer
at the &mmio.data[0] address, its endianity would be
defined by the endianity of the emulated memory mapped h/w
device.

I am not sure; maybe I am missing your point.

Also please consider that the endianity of device memory could be
BE or LE, and it does not depend on "native endianity". The two
can exist in any combination, and it works because the code
that accesses device memory of the opposite endianity
executes an explicit byteswap in all the proper places.
Admittedly, for ARM the dominant case now is
LE devices, but nothing prevents us from attaching memory
mapped devices that work in BE mode. For example,
my parent company, Cisco, whose Linaro assignee I am,
has a lot of fabric chips that operate in BE, and once attached
to the system they would be treated properly: read
without byteswap in BE mode and read with byteswap in LE mode.
Note that the last point is an oversimplified picture.
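
A rough sketch of that driver-side rule, under the same oversimplification: the device presents its register bytes in its own order, the load interprets them according to the E bit, and the driver byteswaps only when the two differ. All names here (rev32, device_bytes, load32, read_device_reg) are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Byte-reverse a 32-bit value, as the ARM 'rev' instruction does. */
static uint32_t rev32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Bytes a device of the given endianness presents for register `val`. */
static void device_bytes(uint8_t out[4], uint32_t val, int device_be)
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint8_t)(val >> (8 * (device_be ? 3 - i : i)));
}

/* Register contents after a 4-byte load with the given E bit. */
static uint32_t load32(const uint8_t in[4], int e_bit)
{
    uint32_t v = 0;

    for (int i = 0; i < 4; i++)
        v |= (uint32_t)in[i] << (8 * (e_bit ? 3 - i : i));
    return v;
}

/* Driver-style read: load, then swap only if CPU mode and device differ. */
static uint32_t read_device_reg(uint32_t val, int device_be, int e_bit)
{
    uint8_t bytes[4];
    uint32_t reg;

    assert((device_be == 0 || device_be == 1) && (e_bit == 0 || e_bit == 1));
    device_bytes(bytes, val, device_be);
    reg = load32(bytes, e_bit);
    return (device_be != e_bit) ? rev32(reg) : reg;
}
```

For every combination of device endianness and E bit the driver recovers the original register value, which is the property the memcpy view of mmio.data[] needs to preserve.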

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/220973.html

[2] https://git.linaro.org/people/victor.kamensky/qemu-be.git/shortlog/refs/heads/armv7be

> That's exactly what we are talking about here. KVM
> should do the cpu configured register->bus endian
> mapping while QEMU does the bus->device endian map.
>
> Alex
>
>> space tool (QEMU or KVMTOOL) is bad for cross-endian
>> Guest. For supporting cross-endian Guest we need to have
>> all MMIO devices with fixed ENDIANNESS.
>>
>> 2. We don't need to do any endianness conversions in KVM
>> for MMIO writes that are being forwarded to user space. It is
>> the job of user space (QEMU or KVMTOOL) to interpret the
>> endianness of MMIO write data based on device endianness.
>>
>> 3. The MMIO read operation is the one which will need
>> explicit handling in KVM because the target VCPU register
>> of MMIO read operation should be loaded with MMIO data
>> (returned from user space) based upon current VCPU
>> endianness (i.e. VCPU CPSR.E bit).
>>
>> 4. In-kernel emulated devices (such as VGIC) will not
>> require any explicit endianness conversion of MMIO data for
>> MMIO write operations (same as point 2).
>>
>> 5. In-kernel emulated devices (such as VGIC) will have to do
>> explicit endianness conversion of MMIO data for MMIO read
>> operations based on device endianness (same as point 3).
>>
>> I hope the above summary of my understanding matches your
>> description. If so, then I am in support of your description.
>>
>> I think your description (and above 5 points) takes care of
>> all use cases of cross-endianness without changing current
>> MMIO ABI.
>>
>> Regards,
>> Anup
>>