Re: [Qemu-ppc] KVM and variable-endianness guest CPUs

On 22.01.2014, at 08:26, Victor Kamensky <victor.kamensky@xxxxxxxxxx> wrote:

> On 21 January 2014 22:41, Alexander Graf <agraf@xxxxxxx> wrote:
>> 
>> 
>> "Native endian" really is just a shortcut for "target endian"
>> which is LE for ARM and BE for PPC. There shouldn't be
>> a qemu-system-armeb or qemu-system-ppc64le.
> 
> I disagree. Fully functional ARM BE system is what we've
> been working on for last few months. 'We' is Linaro
> Networking Group, Endian subteam and some other guys
> in ARM and across community. Why we do that is a bit
> beyond of this discussion.
> 
> ARM BE patches for both V7 and V8 are already in mainline
> kernel. But ARM BE KVM host is broken now. It is known
> deficiency that I am trying to fix. Please look at [1]. Patches
> for V7 BE KVM were proposed and currently under active
> discussion. Currently I work on ARM V8 BE KVM changes.
> 
> So "native endian" in ARM is value of CPSR register E bit.
> If it is off native endian is LE, if it is on it is BE.
> 
> Once and if we agree on ARM BE KVM host changes, the
> next step would be patches in qemu one of which introduces
> qemu-system-armeb. Please see [2].

I think we're facing an ideology conflict here. Yes, there should be a qemu-system-arm that is BE capable. There should also be a qemu-system-ppc64 that is LE capable. But there is no point in changing the "default endianness" for the virtual CPUs that we plug in there. Both CPUs are perfectly capable of running in LE or BE mode; the question is just what we declare the "default".

Think about the PPC bootstrap. We start off with a BE firmware, then boot into the Linux kernel, which makes a hypercall to have the LE bit set on every interrupt. But there's no reason this little endian kernel couldn't theoretically have big endian user space running with access to emulated device registers.

As Peter already pointed out, the actual breakage behind this is that we have a "default endianness" at all. But that's a very difficult thing to resolve and I don't think it should be our primary goal. Just live with the fact that we declare ARM little endian in QEMU and swap things accordingly - then everyone's happy.

This really only ever becomes a problem if you have devices that are aware of the CPU's endian mode. The only one on PPC that I'm aware of that falls into this category is virtio, and there are patches pending to solve that. I don't know if there are any QEMU emulated devices outside of virtio with this issue on ARM, but if there are, you'll have to make the emulation code for those look at the CPU state.

> 
>> QEMU emulates everything that comes after the CPU, so
>> imagine the ioctl struct as a bus package. Your bus
>> doesn't care what endianness the CPU is in - it just
>> gets data from the CPU.
> 
> I am not sure that I follow above. Suppose I have
> 
> mov r1, #1
> str r1, [r0]
> 
> where r0 is device address. Now depending on CPSR
> E bit value device address will receive 1 as integer either
> in LE order or in BE order. That is how ARM v7 CPU
> works, regardless whether it is emulated or not.
> 
> So if E bit is off (LE case) after str is executed
> byte at r0 address will get 1
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 0
> 
> If E bit is on (BE case) after str is executed
> byte at r0 address will get 0
> byte at r0 + 1 address will get 0
> byte at r0 + 2 address will get 0
> byte at r0 + 3 address will get 1
> 
> my point that mmio.data[] just carries bytes for phys_addr
> mmio.data[0] would be value for byte at phys_addr,
> mmio.data[1] would be value for byte at phys_addr + 1, and
> so on.

What we get is an instruction that traps because it wants to "write r1 (which has value=1) into address x". So at that point we get the register value.

Then we need to take a look at the E bit to see whether the write was supposed to be in non-host endianness because we need to emulate exactly the LE/BE difference you're indicating above. The way we implement this on PPC is that we simply byte swap the register value when guest_endian != host_endian.
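
As a minimal sketch of that swap (not the actual PPC KVM code; `swap32` and `mmio_value_for_host` are hypothetical names, and only the 32-bit case is shown), the decision could look roughly like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

/*
 * Hypothetical helper: the value handed on for a trapped 32-bit store.
 * 'reg' is the guest register content; it is swapped only when the
 * guest's current endian mode differs from the host's.
 */
static uint32_t mmio_value_for_host(uint32_t reg, bool guest_is_be,
                                    bool host_is_be)
{
    return (guest_is_be != host_is_be) ? swap32(reg) : reg;
}
```

With r1 = 1, a guest running in the same mode as the host passes 1 through unchanged, while a mismatched guest yields 0x01000000.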

With this in place, QEMU can just memcpy() the value into a local register and feed it into its emulation code which expects a "register value as if the CPU was running in native endianness" as parameter - with "native" meaning "little endian" for qemu-system-arm. Device emulation code doesn't know what to do with a byte array.
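
A minimal sketch of the memcpy() idea, assuming the buffer holds the register value in host byte order (`mmio_buf_to_reg` is a hypothetical name, not a QEMU function): copy into a host integer of exactly the access size, so that no byte-order interpretation takes place.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: recover a register value from an MMIO data buffer
 * that holds it in host byte order. Copying into a variable of exactly
 * the access size keeps this correct on both LE and BE hosts.
 * Returns 0 for unsupported lengths.
 */
static uint64_t mmio_buf_to_reg(const uint8_t *data, size_t len)
{
    uint8_t  v8;
    uint16_t v16;
    uint32_t v32;
    uint64_t v64;

    switch (len) {
    case 1: memcpy(&v8,  data, sizeof(v8));  return v8;
    case 2: memcpy(&v16, data, sizeof(v16)); return v16;
    case 4: memcpy(&v32, data, sizeof(v32)); return v32;
    case 8: memcpy(&v64, data, sizeof(v64)); return v64;
    default: return 0;
    }
}
```

Copying a 4-byte access straight into a 64-bit variable would only work on an LE host, which is why the sketch switches on the length.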

Take a look at QEMU's MMIO handler:

        case KVM_EXIT_MMIO:
            DPRINTF("handle_mmio\n");
            cpu_physical_memory_rw(run->mmio.phys_addr,
                                   run->mmio.data,
                                   run->mmio.len,
                                   run->mmio.is_write);
            ret = 0;
            break;

which translates to

                switch (l) {
                case 8:
                    /* 64 bit write access */
                    val = ldq_p(buf);
                    error |= io_mem_write(mr, addr1, val, 8);
                    break;
                case 4:
                    /* 32 bit write access */
                    val = ldl_p(buf);
                    error |= io_mem_write(mr, addr1, val, 4);
                    break;
                case 2:
                    /* 16 bit write access */
                    val = lduw_p(buf);
                    error |= io_mem_write(mr, addr1, val, 2);
                    break;
                case 1:
                    /* 8 bit write access */
                    val = ldub_p(buf);
                    error |= io_mem_write(mr, addr1, val, 1);
                    break;
                default:
                    abort();
                }

which calls the ldx_p primitives

#if defined(TARGET_WORDS_BIGENDIAN)
#define lduw_p(p) lduw_be_p(p)
#define ldsw_p(p) ldsw_be_p(p)
#define ldl_p(p) ldl_be_p(p)
#define ldq_p(p) ldq_be_p(p)
#define ldfl_p(p) ldfl_be_p(p)
#define ldfq_p(p) ldfq_be_p(p)
#define stw_p(p, v) stw_be_p(p, v)
#define stl_p(p, v) stl_be_p(p, v)
#define stq_p(p, v) stq_be_p(p, v)
#define stfl_p(p, v) stfl_be_p(p, v)
#define stfq_p(p, v) stfq_be_p(p, v)
#else
#define lduw_p(p) lduw_le_p(p)
#define ldsw_p(p) ldsw_le_p(p)
#define ldl_p(p) ldl_le_p(p)
#define ldq_p(p) ldq_le_p(p)
#define ldfl_p(p) ldfl_le_p(p)
#define ldfq_p(p) ldfq_le_p(p)
#define stw_p(p, v) stw_le_p(p, v)
#define stl_p(p, v) stl_le_p(p, v)
#define stq_p(p, v) stq_le_p(p, v)
#define stfl_p(p, v) stfl_le_p(p, v)
#define stfq_p(p, v) stfq_le_p(p, v)
#endif

and then passes the result as "originating register access" to the device emulation part of QEMU.
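
To make the effect of that #if concrete, here is a sketch of what a little-endian and a big-endian 32-bit load do with the same four bytes (`load32_le` and `load32_be` are hypothetical stand-ins written for illustration, not QEMU's implementations):

```c
#include <stdint.h>

/* Hypothetical stand-in for a little-endian 32-bit load: buf[0] is the LSB. */
static uint32_t load32_le(const uint8_t *buf)
{
    return (uint32_t)buf[0]       |
           (uint32_t)buf[1] << 8  |
           (uint32_t)buf[2] << 16 |
           (uint32_t)buf[3] << 24;
}

/* Hypothetical stand-in for a big-endian 32-bit load: buf[0] is the MSB. */
static uint32_t load32_be(const uint8_t *buf)
{
    return (uint32_t)buf[0] << 24 |
           (uint32_t)buf[1] << 16 |
           (uint32_t)buf[2] << 8  |
           (uint32_t)buf[3];
}
```

For the bytes {1, 0, 0, 0}, the LE load returns 1 and the BE load returns 0x01000000, so the macro branch that gets compiled in decides which value the device model sees.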


Maybe it becomes clearer if you understand the code flow that TCG is going through. With TCG, whenever a write traps into MMIO we go through these functions:

void
glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                         DATA_TYPE val, int mmu_idx)
{
    helper_te_st_name(env, addr, val, mmu_idx, GETRA());
}

#ifdef TARGET_WORDS_BIGENDIAN
# define TGT_BE(X)  (X)
# define TGT_LE(X)  BSWAP(X)
#else
# define TGT_BE(X)  BSWAP(X)
# define TGT_LE(X)  (X)
#endif

void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                       int mmu_idx, uintptr_t retaddr)
{
[...]
    /* Handle an IO access.  */
    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
        hwaddr ioaddr;
        if ((addr & (DATA_SIZE - 1)) != 0) {
            goto do_unaligned_access;
        }
        ioaddr = env->iotlb[mmu_idx][index];

        /* ??? Note that the io helpers always read data in the target
           byte ordering.  We should push the LE/BE request down into io.  */
        val = TGT_LE(val);
        glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
        return;
    }
    [...]
}

static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                          hwaddr physaddr,
                                          DATA_TYPE val,
                                          target_ulong addr,
                                          uintptr_t retaddr)
{
    MemoryRegion *mr = iotlb_to_region(physaddr);

    physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
    if (mr != &io_mem_rom && mr != &io_mem_notdirty && !can_do_io(env)) {
        cpu_io_recompile(env, retaddr);
    }

    env->mem_io_vaddr = addr;
    env->mem_io_pc = retaddr;
    io_mem_write(mr, physaddr, val, 1 << SHIFT);
}

which at the end of the chain means that if you're running the same endianness on guest and host, you get the original register value as function parameter. If you run a different endianness, you get a swapped value as function parameter.

So at the end of all of this, if you're running qemu-system-arm (TCG) on a BE host, the request will come in as the register value and stay that way until it reaches the IO callback function, unless you define a specific endianness for your device, in which case the callback may swizzle it again. But if your device defines DEVICE_LITTLE_ENDIAN or DEVICE_NATIVE_ENDIAN, it won't swizzle it.
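
The device-endianness rule can be sketched as follows. This is a simplified model with hypothetical names (`sketch_device_endian`, `callback_swaps`), assuming the swap decision depends only on the device's declared endianness and the target's default endianness:

```c
#include <stdbool.h>

/* Hypothetical mirror of the three device endianness declarations. */
enum sketch_device_endian {
    SKETCH_NATIVE_ENDIAN,
    SKETCH_BIG_ENDIAN,
    SKETCH_LITTLE_ENDIAN,
};

/*
 * Hypothetical helper: does the IO callback path swap the value?
 * A native-endian device never swaps; a fixed-endian device swaps only
 * when its endianness differs from the target's default.
 */
static bool callback_swaps(enum sketch_device_endian dev, bool target_is_be)
{
    if (dev == SKETCH_NATIVE_ENDIAN) {
        return false;
    }
    return target_is_be ? (dev == SKETCH_LITTLE_ENDIAN)
                        : (dev == SKETCH_BIG_ENDIAN);
}
```

For qemu-system-arm the target default is little endian, so in this model only a device declared big endian gets a swap.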

What happens when you switch your guest to BE mode (or LE for PPC)? Very simple. The TCG frontend swizzles every memory read and write before it hits TCG's memory operations.

If you're running qemu-system-arm (KVM) on a BE host, the request will come into kvm-all.c, get read with swapped endianness (ldq_p) and then get passed that way into the IO callback function. That's where the bug lies. It should behave the same way as TCG, so it needs to know the value the register originally had. So instead of doing an ldq_p() it should go through a different path that does memcpy().
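
The mismatch can be modelled in a few lines. The sketch below simulates the host byte order explicitly so that it behaves the same on any build machine; all function names are hypothetical, and it assumes the data buffer holds the register value in host byte order:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: lay out 'reg' in buf the way a host of the given
 * endianness stores a 32-bit integer in memory. */
static void store32_host(uint8_t *buf, uint32_t reg, bool host_is_be)
{
    for (int i = 0; i < 4; i++) {
        int shift = host_is_be ? 8 * (3 - i) : 8 * i;
        buf[i] = (uint8_t)(reg >> shift);
    }
}

/* Hypothetical: read buf back as that same host would (the memcpy() path). */
static uint32_t load32_host(const uint8_t *buf, bool host_is_be)
{
    uint32_t v = 0;
    for (int i = 0; i < 4; i++) {
        int shift = host_is_be ? 8 * (3 - i) : 8 * i;
        v |= (uint32_t)buf[i] << shift;
    }
    return v;
}

/* Hypothetical: read buf as fixed little endian, as a target-LE load does. */
static uint32_t load32_target_le(const uint8_t *buf)
{
    return load32_host(buf, false);
}
```

On a simulated BE host, storing 1 and reading it back with the fixed LE load gives 0x01000000, while the host-order read returns 1.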

But that doesn't fix the other-endian issue yet, right? Every value now would come in as the register value.

Well, unless you do the same thing TCG does inside the kernel. So the kernel would swap the reads and writes before it accesses the ioctl struct that connects kvm with QEMU. Then all abstraction layers work just fine again and we don't need any qemu-system-armeb.


Alex
