Re: Testing the RK3288 VPU with static data on mainline kernels (Re: VPU tests)

On 18/08/18 07:15, Miouyouyou (Myy) wrote:
I'm adding the Linux Rockchip LKML and Linux IOMMU LKML since mimicking
old 4.4 code leads me to other issues.

ayaka wrote:
Have you tried my new driver?
MPP_service ? I'd like to but since the 4.4 Rockchip branch is being a
bit difficult to recompile these days, I have to make do with the old
prepackaged Rockchip-linux specific 4.4 kernel and its "vpu-service" driver.
I could try to port the code, but then I'll have other issues, as stated
below.

I don't see the configure to the iommu and the iommu is not set to
bypass either.
Well, a simple call to iommu_get_dma_cookie returned an ENODEV error.
Which leads me to an old issue with RK3288 systems and mainline kernels :
CONFIG_IOMMU_DMA is not enabled by default when you select the Rockchip
IOMMU driver.
It's only enabled if you also enable the MediaTek IOMMU driver. So, I
guess that it's only enabled when using global configuration files that
target many boards at once.

I'm adding the Rockchip LKML, since I'd like to know why
CONFIG_IOMMU_DMA is not enabled, nor tested, by default when selecting
the Rockchip IOMMU driver ?
The old 4.4 drivers seem to rely heavily on it, making the whole
porting process more difficult.

Because the 32-bit Arm code has its own implementation and doesn't use IOMMU_DMA. The Arm DMA ops rely on a domain explicitly created by arm_setup_iommu_dma_ops() and wrapped in the dma_iommu_mapping address allocator.

Arm will *eventually* get converted over to IOMMU_DMA, there's just a fair few fiddly bits still to resolve.

Now, forcing CONFIG_IOMMU_DMA on mainline kernels breaks the Video
Output MMU initialization, which leads to a lot of BUG_ON from the DRM
drivers.
Unplugging the screen before the system starts allows me to boot the
system correctly (but without screen) and SSH into it.

From a mainline perspective, enabling IOMMU_DMA on 32-bit Arm is still pretty much an untested and unsupported configuration. Without any corresponding dma_map_ops to complete the glue layer, there's not an awful lot of point.

That said, enabling this option doesn't solve my issues with my VPU
driver. Meaning that the VPU starts, triggers the IRQ, stops and nothing
is written in the output...
And now I also have no useable screen.

I tried adding the good old dance :
iommu_domain_alloc(vpu_dev);
iommu_get_dma_cookie(driver_data->iommu_domain);
iommu_group_get(vpu_dev);
iommu_dma_init_domain(driver_data->iommu_domain, 0x10000000, SZ_2G,
vpu_dev);
iommu_group_put(group);

But that doesn't change anything. The output DMA buffer is still
untouched and my custom IOMMU Fault handler is not triggered.
I'll give the DMA-Debug API a try.

FWIW you're not actually attaching the group to the new domain in that sequence, but as above that still wouldn't make anything magically succeed because the Arm DMA ops won't understand an IOMMU_DMA domain anyway.

Meanwhile, I'm also adding the Linux IOMMU LKML, since I'd like to know
what's the recommended way to initialize a device to perform DMA
operations, when there's an IOMMU, on mainline kernels ?
I see a lot of legacy code (from 4.4 kernels) that tends to use the
IOMMU and DMA API in ways that have been removed, or seem rather unused
(grep or bootlin doesn't show much use).

For example, do I still need to do iommu_get_dma_cookie ?
rk_iommu_domain_alloc seems to perform the operation automatically, and
the domain allocation is also done automatically with
iommu_get_domain_for_dev .
Should I still call iommu_dma_init_domain ?
Also, does calling dma_set_max_seg_size make sense for a device driver
? That function seems to be reserved for DMA drivers, yet I saw it in
multiple implementations of the VPU driver, in the 4.4 kernels :
https://github.com/rockchip-linux/kernel/blob/release-4.4/drivers/video/rockchip/vpu/vpu_iommu_drm.c#L139
https://github.com/rockchip-linux/kernel/blob/release-4.4/drivers/media/platform/rockchip-vpu/rockchip_vpu_hw.c#L179

Do you still need to attach the device you're using, using
iommu_attach_device, if the attached IOMMU device is declared in its DTS
node ?

As far as I'm aware, unless you want to explicitly manage the IOMMU address space within your driver (which is beyond the scope of everything above) you shouldn't need to do anything - since the Rockchip IOMMU uses the generic "iommus" DT binding it should get picked up by dma_configure() and configured with appropriate IOMMU ops by arch_setup_dma_ops(). In general this is all designed to be transparently handled by the arch code, so touching any of it in a device driver is a sign of doing something wrong. Given that it apparently works fine for the VOP MMUs, I can't see any obvious reason why the VPU MMU would behave differently.

Robin.

On 08/18/2018 09:41 AM, Miouyouyou (Myy) wrote:
Greetings,

I'm currently testing the RK3288 VPU driver on mainline kernels 4.18+
(soon 4.19-rc1).
The boards I'm using to perform the tests are :
* A Tinkerboard with a mainline kernel patched by myself (
https://github.com/Miouyouyou/RockMyy )
* A MiQi with 4.4 kernel packaged by Armbian, MPV and a modified version
of RKMPP, version 20171218 .

Right now I'm testing the unit that decodes H264 frames. This unit seems
to be referred to as "hw_vpu_4831" in the old VPU "vcodec_service.c"
driver used on Rockchip 4.4 kernels.
My current goal is to perform a single H264 decode pass using static
data, in order to avoid being bothered by issues that are not directly
related to the VPU.
If that works, then it means that the main part works and I can use this
as a basis to port the MPP Service driver, and the V4L2 Chromium driver.
Static data allows for determinism, which is extremely useful when
dealing with something as complex as H264 decoders.


To get this static data, I did the following :

1. Modify an old version of RKMPP ( mpp-release_20171218 ) to take
snapshots of :
   * the 101 registers sent to the VPU;
   * the encoded frame to decode;
   * the quantization table used for this frame;
when decoding the first 120 frames of an H264 movie (played through MPV,
with the RKMPP backend).

2. Write a kernel driver that :

   * Incorporates these snapshots (registers, encoded frame, generated
     quantization table) as static arrays.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test_static_data.h )

   * Allocates 3 DMA buffers for the encoded frame, the quantization
     table and the output.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L755 )

   * Copies the encoded frame and the quantization table into their
     respective DMA buffers.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L771 )

   * Modifies the register snapshot, replacing the file descriptor
     references with the actual IOVAs of the respective DMA buffers.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L305 )

   * Sets up the clocks and the IRQ handlers.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L445 )
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L812 )

   * Executes a decode pass.
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L830 )
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L372 )
     ( https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L424 )
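
The register patching step in the list above can be sketched in plain C
as follows. This is only an illustration, not the code from the linked
driver. The register indices (12 for the input, 13 for the output, 40 for
the quantization table) are read off the register dump quoted in this
mail. The "fd | (offset << 10)" packing of the snapshot values is an
assumption about how the 4.4 userspace encodes buffer references, and the
names patch_snapshot and struct iova_patch are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Register indices as they appear in the dump quoted in this mail. */
#define REG_INPUT_BASE   12
#define REG_OUTPUT_BASE  13
#define REG_QTABLE_BASE  40

/*
 * ASSUMPTION: each address register of the snapshot holds
 * "fd | (offset << 10)", so the byte offset sits above bit 10.
 */
#define SNAPSHOT_OFFSET_SHIFT 10

/* Hypothetical helper type: which register gets which IOVA. */
struct iova_patch {
    size_t   reg_index;
    uint32_t iova;
};

/*
 * Replace the fd-based reference in every listed register with
 * "iova + offset". Returns 0 on success, -1 if an index is out of range.
 */
static int patch_snapshot(uint32_t *regs, size_t n_regs,
                          const struct iova_patch *patches, size_t n_patches)
{
    for (size_t i = 0; i < n_patches; i++) {
        size_t idx = patches[i].reg_index;

        if (idx >= n_regs)
            return -1;

        uint32_t offset = regs[idx] >> SNAPSHOT_OFFSET_SHIFT;
        regs[idx] = patches[i].iova + offset;
    }
    return 0;
}
```

With the IOVAs reported later in this mail, the patch table would contain
{ REG_INPUT_BASE, 0x007ea000 }, { REG_OUTPUT_BASE, 0x00000000 } and
{ REG_QTABLE_BASE, 0x007e9000 }.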

What currently happens after the decode pass is that the IRQ handler
gets called.
When checking the state of the first register (SwReg01) in this handler,
it is always set to 0x00010100 .
I write 0 to this register (SwReg01) in order to end the current VPU
job.
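
To make such a status value easier to read, it can be split into named
bits. The sketch below assumes that SwReg01 follows the interrupt
register layout of the Hantro G1 core, with the bit names used in the
RKMPP sources. The bit positions are an assumption that should be checked
against the RKMPP register definitions, and the helper names are
hypothetical. If the assumption holds, 0x00010100 would decode to dec_irq
plus dec_error_int, and not dec_rdy_int, which would point to the
hardware rejecting the stream rather than finishing the frame.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct irq_bit {
    uint32_t    mask;
    const char *name;
};

/* ASSUMPTION: bit positions follow the Hantro G1 interrupt register. */
static const struct irq_bit vdpu_irq_bits[] = {
    { 1u << 0,  "dec_e" },
    { 1u << 4,  "dec_irq_dis" },
    { 1u << 8,  "dec_irq" },
    { 1u << 12, "dec_rdy_int" },
    { 1u << 13, "dec_bus_int" },
    { 1u << 14, "dec_buffer_int" },
    { 1u << 15, "dec_aso_int" },
    { 1u << 16, "dec_error_int" },
    { 1u << 17, "dec_slice_int" },
    { 1u << 18, "dec_timeout" },
};

/* Non-zero if one of the (assumed) failure bits is set. */
static int vdpu_irq_is_error(uint32_t status)
{
    return (status & ((1u << 13) | (1u << 16) | (1u << 18))) != 0;
}

/*
 * Write the space-separated names of all set bits into out.
 * Returns the number of characters written, excluding the NUL.
 * Stops at the first name that does not fit.
 */
static size_t describe_irq_status(uint32_t status, char *out, size_t out_len)
{
    size_t used = 0;

    if (out_len == 0)
        return 0;
    out[0] = '\0';

    for (size_t i = 0; i < sizeof(vdpu_irq_bits) / sizeof(vdpu_irq_bits[0]); i++) {
        if (!(status & vdpu_irq_bits[i].mask))
            continue;

        int n = snprintf(out + used, out_len - used, "%s%s",
                         used ? " " : "", vdpu_irq_bits[i].name);
        if (n < 0 || (size_t)n >= out_len - used) {
            out[used] = '\0';   /* drop the partially written name */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

Calling describe_irq_status(status, buf, sizeof(buf)) from the printk in
the IRQ handler would then log the bit names next to the raw value.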

However, my issue is that the output buffer remains untouched.
The output buffer is memset to 0xff on initialization and then checked
by mmap'ing the DMA buffer from user-space and writing its content into
a file, using the following simple program :
https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/user-mode/test-mmap.c
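
The "untouched" check itself amounts to counting the bytes that no longer
hold the 0xff fill value. A minimal sketch, independent of the linked
test program (the function name is hypothetical) :

```c
#include <stddef.h>
#include <stdint.h>

/* Fill value written to the output buffer before the decode pass. */
#define OUTPUT_SENTINEL 0xffu

/*
 * Count the bytes that differ from the sentinel.
 * A result of 0 means the VPU wrote nothing that can be detected.
 */
static size_t count_modified_bytes(const uint8_t *buf, size_t len)
{
    size_t modified = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != OUTPUT_SENTINEL)
            modified++;
    }
    return modified;
}
```

A decoded picture can legitimately contain some 0xff bytes, so the count
is a lower bound on what was written, but a result of exactly 0 over the
whole buffer is a strong sign that no write reached memory.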


This simple program is also used to check the VPU first 60 registers,
which are always :

uint32_t regs[60] = {
          0x67313688, 0x00000000, 0xfff80510, 0x00081201,
          0x3c022004, 0x00ef4000, 0xa40017f0, 0xb8040000,
          0x50050000, 0x00090007, 0x128398a4, 0x1ee6b16a,
          0x007ea00d, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x00000000, 0x00000000, 0x00000000, 0x00000000,
          0x007e9000, 0x0002fd00, 0x04208400, 0x0a521063,
          0x10839cc6, 0x16b52929, 0x1ce6b58c, 0x062081ef,
          0x00000000, 0x007fb050, 0xfbb56f80, 0x00000000,
          0x00000000, 0x00000000, 0xe5da0000, 0x00000008,
          0x00000000, 0x000000de, 0x00000001, 0x00000000,
};

The IOVAs used during the pass are :
Output : 0x00000000 ( 1920 * 1080 * 4 bytes long )
QTable : 0x007e9000
Input : 0x007ea000

Note that the IOVA of the output buffer is 0x00000000 .
That's why regs[13] to regs[29] are set to 0x00000000 .
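
As a sanity check on these addresses : 1920 * 1080 * 4 = 8294400 bytes,
which is 0x007e9000, so the QTable IOVA starts exactly where the output
buffer ends. The short sketch below redoes this arithmetic and checks
that two IOVA ranges do not overlap. The helper names are hypothetical,
and the 0x1000 length used for the QTable in the usage note is only the
gap between the two listed IOVAs, not a size stated in this mail.

```c
#include <stdint.h>

#define FRAME_WIDTH   1920u
#define FRAME_HEIGHT  1080u
#define BYTES_PER_PX  4u

/* Size of the output buffer as described above. */
static uint32_t output_buffer_size(void)
{
    return FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PX;
}

/* Returns 1 if [a_start, a_start + a_len) and [b_start, b_start + b_len) overlap. */
static int iova_ranges_overlap(uint32_t a_start, uint32_t a_len,
                               uint32_t b_start, uint32_t b_len)
{
    /* 64-bit ends avoid wrap-around near the top of the 32-bit space. */
    uint64_t a_end = (uint64_t)a_start + a_len;
    uint64_t b_end = (uint64_t)b_start + b_len;

    return a_start < b_end && b_start < a_end;
}
```

For example, iova_ranges_overlap(0, output_buffer_size(), 0x007e9000,
0x1000) returns 0 for the layout listed above.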

I see that :
* regs[0] (SwReg00) is set to some value, but the register is not
  documented.
* regs[3] (SwReg03) is set to 0x00081201 instead of 0x00081200.
  The last bit set is named "sw_dec_axi_wr_id" in the RKMPP sources but
  I have no idea what it means.
* regs[12] (SwReg12) is set to 0x007ea00d after the decode pass.
  Before the decode pass, it was set to 0x007ea000, the Input IOVA.
  What does the "d" (0b1101) mean here ?
* regs[50] (SwReg50) and regs[54] (SwReg54) are set to some value. Do
  these values have any meaning ?
* regs[58] (SwReg58) is set to 1. What does it mean ?
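
For the SwReg12 observation, one way to look at the value is as a
difference from the Input IOVA instead of as a bit pattern. The sketch
below only does that subtraction. Reading the result as "bytes of the
stream consumed by the hardware" is an unconfirmed guess, and the
function name is hypothetical.

```c
#include <stdint.h>

/*
 * Difference between the value read back after the pass and the value
 * written before it. ASSUMPTION: if the hardware advances this register
 * as a stream pointer, the result is a byte count.
 */
static uint32_t stream_reg_delta(uint32_t before, uint32_t after)
{
    return after - before;
}
```

With the values above, stream_reg_delta(0x007ea000, 0x007ea00d) is 13.
If the guess is right, the decoder stopped after 13 bytes of input.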


I've set up an IOMMU fault handler to catch potential DMA issues but the
fault handler is never called.
(
https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L740

)


So, basically, I got a VPU that runs, calls the IRQ handler and provides
zero output for reasons I do not understand.
And I got no useful error messages. No crashes. No freezes. No warnings
in dmesg logs.
Nothing. It just runs, calls the IRQ handler, stops and does nothing
useful.
The only messages I get in the logs are from the "printk" I set up in the
IRQ handler. (IRQ : 60 - State : 0x00010100).
https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/test-devicetree-dma-to-from-user.c#L140


Since I'm using only static data, the result is deterministic. Meaning
that there should not be any random changes.

Therefore I got a few questions, since you are more knowledgeable than
me about the internals of the VPU.


1. If the VPU fails to decode a frame, which registers are set ?
Or to rephrase it : How do I know that the VPU failed to decode a
frame ?
And does the VPU provide some information about why it failed ?

2. What needs to be enabled to perform a VPU decode pass, besides setting
the VPU registers ?
Meaning :
* What clocks are needed and what are their default rates ?
    The "Aclk" and "Iface" clocks are enabled and set to 200 MHz and
50 MHz respectively, in my driver.
    The video power domain (pd_video) is also set up during the
initialization of the VPU IOMMU but I have no idea of its clock rate.
    Note that I'm not using the HEVC unit, so I don't enable the
HEVC-related clocks.
* What else needs to be enabled ?
    In the Chromium V4L2 driver for the RK3288 VPU, it seems that Tomasz
Figa only enables these two clocks, sets up the IOMMU (this seems to be
done automatically now in mainline kernels, but I have to contact Jeffy
Chen just to be sure), sets up the registers, writes them and gets the
result.
https://github.com/rockchip-linux/kernel/blob/release-4.4/drivers/media/platform/rockchip-vpu/rockchip_vpu_hw.c

https://github.com/rockchip-linux/kernel/blob/release-4.4/drivers/media/platform/rockchip-vpu/rk3288_vpu_hw_h264d.c


3. To rephrase question 2 : Is there a checklist of actions to perform
to be sure that the RK3288 VPU will decode correctly ?

4. Do you have any files to perform Single H264 Frame Decoding tests ? I
see that the recent RKMPP releases have a "Single Frame Decoding" IOCTL.
Are there any files to test this with ?



Note that the snapshots I'm using are available here :
https://github.com/Miouyouyou/Mainline-Rockchip-VPU/blob/dev/refs_dumps/refs.tar.xz

https://github.com/Miouyouyou/Mainline-Rockchip-VPU/tree/dev/refs_dumps

The snapshots were done by modifying :
mpp/hal/rkdec/h264d/hal_h264d_vdpu1.c

And adding the following function :

/* Needs <fcntl.h>, <stdint.h>, <stdio.h> and <unistd.h>. */
static void myy_dump_frame_and_regs(
      H264dHalCtx_t *p_hal,
      H264dVdpu1Regs_t *p_regs)
{
      static uint8_t dumps = 0;
      /* "/tmp/mpp_dump_NNNN_frame" is 24 characters, plus the NUL. */
      char regs_name[25];
      char frame_name[25];
      char qtable_name[25];

      //mpp_err_f("%s", "dumping");
      if (dumps < 120)
      {
          /* Pass the full buffer size, or the "_frame" name loses
           * its last character. */
          snprintf(regs_name, sizeof(regs_name),
              "/tmp/mpp_dump_%04d_regs", dumps);
          snprintf(frame_name, sizeof(frame_name),
              "/tmp/mpp_dump_%04d_frame", dumps);
          snprintf(qtable_name, sizeof(qtable_name),
              "/tmp/mpp_dump_%04d_qtbl", dumps);

          /* open() returns -1 on failure; 0 is a valid descriptor. */
          int fd = open(regs_name, O_CREAT | O_RDWR, 00644);
          if (fd >= 0) {
              int const bytes_written = write(fd,
                  p_regs, sizeof(H264dVdpu1Regs_t));
              //mpp_err_f("Logging regs to %s", regs_name);
              //mpp_err_f("Wrote %d bytes", bytes_written);
              close(fd);
          }
          fd = open(frame_name, O_CREAT | O_RDWR, 00644);
          if (fd >= 0) {
              int const bytes_written = write(fd,
                  p_hal->bitstream, p_hal->strm_len);
              //mpp_err_f("Logging frames to %s", frame_name);
              //mpp_err_f("Wrote %d bytes", bytes_written);
              close(fd);
          }
          fd = open(qtable_name, O_CREAT | O_RDWR, 00644);
          if (fd >= 0) {
              int const bytes_written = write(fd,
                  p_hal->cabac_buf,
                  VDPU_CABAC_TAB_SIZE
                  + VDPU_SCALING_LIST_SIZE
                  + VDPU_POC_BUF_SIZE);
              //mpp_err_f("Logging qtable to %s", qtable_name);
              //mpp_err_f("Wrote %d bytes", bytes_written);
              close(fd);
          }
          dumps++;
      }
}

And executing this at the end of the vdpu1_h264d_gen_regs phase :
MPP_RET vdpu1_h264d_gen_regs(void *hal, HalTaskInfo *task)
{
      // ...

      myy_dump_frame_and_regs(p_hal, (H264dVdpu1Regs_t *) p_hal->regs);
__RETURN:
      return ret = MPP_OK;
__FAILED:
      return ret;
}

And then using the RKMPP backend of MPV to read an H264 movie.
My modified copy of RKMPP to perform the snapshots is available here :
https://github.com/Miouyouyou/rkmpp-reverse-engineering



