Re: [PATCH] drm/radeon: add user pointer support v2

On 30.06.2014 19:35, Jerome Glisse wrote:
On Mon, Jun 30, 2014 at 03:04:16PM +0200, Christian König wrote:
From: Christian König <christian.koenig@xxxxxxx>

This patch adds an IOCTL for turning a pointer supplied by
userspace into a buffer object.

It works similarly to the recently added userptr support for i915,
so it also imposes several restrictions upon the memory being mapped:

1. It must be page aligned (both start/end addresses, i.e. ptr and size).

2. It must be normal system memory, not a pointer into another map of IO
space (e.g. it must not be a GTT mmapping of another object).

3. The BO is mapped into GTT, so the maximum amount of memory mapped at
all times is still the GTT limit.

Exporting and sharing as well as mapping of buffer objects created by
this function is forbidden and results in -EPERM.

In contrast to the i915 implementation we don't use an interval tree to
manage all the buffer objects created; instead we just register a separate
MMU notifier callback for each BO, so that buffer objects are invalidated
whenever any modification is made to the process's page table. This probably
needs further optimization.
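
To make the proposed interface a bit more tangible, here is a minimal
userspace sketch of how the new IOCTL could be driven. It assumes a kernel
with this patch applied and the matching uapi header installed;
DRM_IOCTL_RADEON_GEM_IMPORT and struct drm_radeon_gem_import come from the
uapi hunk further below, everything else is ordinary libc, and the error
handling is reduced to the bare minimum.

/* Minimal sketch of calling the proposed IOCTL from userspace.
 * Assumes a kernel with this patch and the updated radeon_drm.h uapi
 * header installed under <drm/>. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <drm/radeon_drm.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t size = 4 * page;			/* restriction 1: size is page aligned */
	struct drm_radeon_gem_import args;
	void *ptr = NULL;
	int fd, r;

	fd = open("/dev/dri/card0", O_RDWR);	/* first radeon device, for illustration */
	if (fd < 0)
		return 1;

	/* restriction 1: start address is page aligned,
	 * restriction 2: plain anonymous system memory */
	if (posix_memalign(&ptr, page, size)) {
		close(fd);
		return 1;
	}

	memset(&args, 0, sizeof(args));
	args.addr = (uintptr_t)ptr;
	args.size = size;
	args.flags = 0;

	r = ioctl(fd, DRM_IOCTL_RADEON_GEM_IMPORT, &args);
	if (r == 0)
		printf("user pointer wrapped into GEM handle %u\n", args.handle);

	close(fd);
	free(ptr);
	return r ? 1 : 0;
}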

Please review and / or comment.
NAK, this is kind of a horror show. I should probably take a look at
the Intel code too.

Yeah, I'm not too keen on the approach either and would rather prefer using the IOMMU, but unfortunately that isn't available in all the use cases we need to be able to handle.

BTW: Here is the link to the initial commit of the Intel implementation as it is in Linus tree: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5cc9ed4b9a7ac579362ccebac67f7a4cdb36de06

First, registering one notifier per bo is a bad idea; it is not what you
want to do.

I know, the Intel guys use a single registration for each process and an interval tree to look up all the affected BOs.

I just wanted to keep the implementation simple and stupid for now and first get feedback on the approach in general (to save time in case the whole approach wouldn't be accepted at all; your reaction indeed confirms that this was a good idea).
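
For comparison, a very rough sketch of that scheme could look like the
following: one notifier registered per mm, and an interval tree keyed by
the userptr ranges so that only the BOs overlapping the invalidated range
are touched. None of these structures or functions exist in radeon; the
names are made up here purely for illustration.

/* Illustrative sketch only -- a single per-process notifier plus an
 * interval tree, roughly what the i915 userptr code does.  None of
 * these names exist in the radeon driver. */
#include <linux/kernel.h>
#include <linux/mmu_notifier.h>
#include <linux/interval_tree.h>

struct radeon_mn {			/* one instance per struct mm_struct */
	struct mmu_notifier	mn;
	struct rb_root		objects; /* interval tree of userptr BOs */
	/* a lock protecting 'objects' is omitted for brevity */
};

struct radeon_mn_node {			/* one instance per userptr BO */
	struct interval_tree_node it;	/* covers [userptr, userptr + size - 1] */
	struct radeon_bo	*bo;
};

static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
					     struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end)
{
	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
	struct interval_tree_node *it;

	/* only visit the BOs that actually overlap the invalidated range */
	it = interval_tree_iter_first(&rmn->objects, start, end - 1);
	while (it) {
		struct radeon_mn_node *node =
			container_of(it, struct radeon_mn_node, it);

		/* hypothetical helper doing what the per-BO callback
		 * in this patch does: evict node->bo to the CPU domain */
		radeon_bo_evict_userptr(node->bo);

		it = interval_tree_iter_next(it, start, end - 1);
	}
}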

[SNIP]
@@ -176,8 +181,15 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
  	if (p->cs_flags & RADEON_CS_USE_VM)
  		p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
  					      &p->validated);
+	if (need_mmap_lock)
+		down_read(&current->mm->mmap_sem);
+
+	r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
-	return radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
+	if (need_mmap_lock)
+		up_read(&current->mm->mmap_sem);
+
+	return r;
  }
So the mmap lock protects almost nothing here. Once things are validated
nothing forbids userspace from unmapping the range containing the user bo.

As far as I understand it (and I'm probably not as deep into the MM code as you, so correct me if I'm wrong), unmapping the range would result in an invalidate_range_start callback, and this callback in turn validates the BO into the CPU domain again. So it actually blocks until the GPU operation is completed, marks the pages in question as dirty and then unpins them.

This should protect us from anything nasty happening in the case of an unmap(), fork() etc...

static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 6e30174..355aa67 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -112,6 +112,9 @@ int radeon_gem_object_open(struct drm_gem_object *obj,
  				struct drm_file *file_priv);
  void radeon_gem_object_close(struct drm_gem_object *obj,
  				struct drm_file *file_priv);
+struct dma_buf *radeon_gem_prime_export(struct drm_device *dev,
+                                        struct drm_gem_object *gobj,
+                                        int flags);
Use tabs, not spaces; there is already enough broken tab/space in the radeon
kernel as it is. I always tried to keep it consistent.

Fixed, thanks for the hint.

[SNIP]
+static void radeon_bo_mn_release(struct mmu_notifier *mn,
+				 struct mm_struct *mm)
+{
+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
+
+	bo->mn.ops = NULL;
+}
+
+static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
+						struct mm_struct *mm,
+						unsigned long start,
+						unsigned long end)
+{
+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
+	int r;
+
+	/* TODO: optimize! */
+
+	r = radeon_bo_reserve(bo, true);
+	if (r) {
+		dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
+		return;
+	}
+
+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
+	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
+	if (r) {
+		dev_err(bo->rdev->dev, "(%d) failed to validate user bo\n", r);
+		return;
+	}
+
+	radeon_bo_unreserve(bo);
+}
So you most likely want ttm_bo_validate to be interruptible, but you also
want to call something like io_schedule so you are not just wasting the
process's time slice for the selfishness of the gpu driver.

Also, the invalidate_range callback is one that is commonly invoked during
the process lifetime.

I know that this is horribly inefficient and error prone, but we can't make ttm_bo_validate interruptible, because invalidate_range_start doesn't allow us to return an error code and so we can't rely on userspace repeating the syscall that caused the mapping change in case of a signal.

The Intel code does it the same way, and it actually sends a cold shiver down my back that this got upstream, but hey, I'm not the one who reviewed it and I don't have a better idea either.

Moreover, nothing prevents a CS that is submitted after the range_start callback
from pinning the buffer again before the range_end callback is made. Thus this
is extremely racy.

Oh really? Crap, I thought that for all the cases we are interested in the mmap_sem would be held between range_start and range_end.
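
Purely as an illustration of one way such a window could be closed (nothing
like this is in the patch; the lock and counter below are invented for this
sketch): range_start could bump a per-BO invalidation counter, range_end
could drop it, and the CS path could refuse to pin the pages again while it
is non-zero.

/* Hypothetical sketch, not part of the patch: keep the CS path from
 * re-pinning a userptr BO while a range_start/range_end invalidation
 * is still in flight.  'userptr_lock' and 'invalidate_count' would be
 * new fields in struct radeon_bo. */
static void radeon_bo_mn_range_start_sketch(struct radeon_bo *bo)
{
	spin_lock(&bo->userptr_lock);
	bo->invalidate_count++;
	spin_unlock(&bo->userptr_lock);

	/* ... evict the BO to the CPU domain as the patch does ... */
}

static void radeon_bo_mn_range_end_sketch(struct radeon_bo *bo)
{
	spin_lock(&bo->userptr_lock);
	bo->invalidate_count--;
	spin_unlock(&bo->userptr_lock);
}

/* called from the CS path before pinning the user pages again */
static int radeon_bo_userptr_may_pin_sketch(struct radeon_bo *bo)
{
	bool busy;

	spin_lock(&bo->userptr_lock);
	busy = bo->invalidate_count != 0;
	spin_unlock(&bo->userptr_lock);

	return busy ? -EAGAIN : 0;
}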

+
+static const struct mmu_notifier_ops radeon_bo_notifier = {
+	.release = radeon_bo_mn_release,
+	.invalidate_range_start = radeon_bo_mn_invalidate_range_start,
+};
What about the invalidate_page callback? You are breaking writeback to
disk, which is not a welcome move.

Why? Keep in mind that the pages are pinned down as soon as we make the first CS which uses them and are marked as dirty when we are done with them. In between those two events the pages shouldn't be written back to disk.

[SNIP]
+static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
+{
+	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
+	struct radeon_ttm_tt *gtt = (void *)ttm;
+	unsigned pinned = 0, nents;
+	int r;
+
+	/* prepare the sg table with the user pages */
+	if (current->mm != gtt->usermm)
+		return -EPERM;
+
+	do {
+		unsigned num_pages = ttm->num_pages - pinned;
+		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
+		struct page **pages = ttm->pages + pinned;
+
+		r = get_user_pages(current, current->mm, userptr, num_pages,
+				   1, 0, pages, NULL);
+		if (r < 0) {
+			release_pages(ttm->pages, pinned, 0);
+			return r;
+		}
+		pinned += r;
+
+	} while (pinned < ttm->num_pages);
+
+	r = -ENOMEM;
+	ttm->sg = kcalloc(1, sizeof(struct sg_table), GFP_KERNEL);
+	if (!ttm->sg)
+		goto release_pages;
+
+	r = sg_alloc_table_from_pages(ttm->sg, ttm->pages, ttm->num_pages, 0,
+				      ttm->num_pages << PAGE_SHIFT,
+				      GFP_KERNEL);
+	if (r)
+		goto release_sg;
+
+	r = -ENOMEM;
+	nents = dma_map_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
+			   DMA_BIDIRECTIONAL);
+	if (nents != ttm->sg->nents)
+		goto release_sg;
+
+	return 0;
+
+release_sg:
+	kfree(ttm->sg);
+
+release_pages:
+	release_pages(ttm->pages, pinned, 0);
+	return r;
+}
So, as said above, the pages backing the user mapping might change. As the gup
comment explains, there is no guarantee that the pages returned by gup are the
ones actually in use to back the address. Well, for things that will be used as
a read source by the gpu this is not much of an issue, as you can probably
assume that by the time you pin the bo the user mapping is done writing
to it.

But the case of the GPU writing to it and then userspace expecting to read
the result back through its user space mapping is just a game of luck. You
will be lucky a lot of the time, because the kernel does not migrate pages
just for the kick of it. But in case it does, you will have an inconsistency,
and users of that API will not understand why the hell they are having an issue.

Yeah, completely agree. I was aware of those limitations even before starting to work on it, but hoped that registering an invalidate_range_start callback like the Intel code does would solve the issue (well, that code indeed got upstream).

Unfortunately I'm not the one who decides if we do it or not; I'm just the one who volunteered to figure out how it could be done in a clean way.

This kind of API is really a no-go; the way the mm code works does not allow
for such a thing. I will go look at the Intel code, but I fear the worst.

Please keep me updated on this; if we can't get a solution for this upstream I really need to be able to explain why.

Thanks for the comments,
Christian.


Cheers,
Jérôme

+
+static void radeon_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
+{
+	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
+	unsigned i;
+
+	/* free the sg table and pages again */
+	dma_unmap_sg(rdev->dev, ttm->sg->sgl,
+		     ttm->sg->nents, DMA_BIDIRECTIONAL);
+
+	sg_free_table(ttm->sg);
+	kfree(ttm->sg);
+
+	for (i = 0; i < ttm->num_pages; ++i)
+		set_page_dirty(ttm->pages[i]);
+
+	release_pages(ttm->pages, ttm->num_pages, 0);
+}
+
  static int radeon_ttm_tt_populate(struct ttm_tt *ttm)
  {
  	struct radeon_device *rdev;
@@ -599,6 +673,13 @@ static int radeon_ttm_tt_populate(struct ttm_tt *ttm)
  	if (ttm->state != tt_unpopulated)
  		return 0;
+	if (gtt->userptr) {
+		r = radeon_ttm_tt_pin_userptr(ttm);
+		if (r)
+			return r;
+		slave = true;
+	}
+
  	if (slave && ttm->sg) {
  		drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages,
  						 gtt->ttm.dma_address, ttm->num_pages);
@@ -648,6 +729,11 @@ static void radeon_ttm_tt_unpopulate(struct ttm_tt *ttm)
  	unsigned i;
  	bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG);
+	if (gtt->userptr) {
+		radeon_ttm_tt_unpin_userptr(ttm);
+		return;
+	}
+
  	if (slave)
  		return;
@@ -676,6 +762,28 @@ static void radeon_ttm_tt_unpopulate(struct ttm_tt *ttm)
  	ttm_pool_unpopulate(ttm);
  }
+int radeon_ttm_tt_set_userptr(struct ttm_tt *ttm, uint64_t userptr)
+{
+	struct radeon_ttm_tt *gtt = (void *)ttm;
+
+	if (gtt == NULL)
+		return -EINVAL;
+
+	gtt->userptr = userptr;
+	gtt->usermm = current->mm;
+	return 0;
+}
+
+bool radeon_ttm_tt_has_userptr(struct ttm_tt *ttm)
+{
+	struct radeon_ttm_tt *gtt = (void *)ttm;
+
+	if (gtt == NULL)
+		return false;
+
+	return !!gtt->userptr;
+}
+
  static struct ttm_bo_driver radeon_bo_driver = {
  	.ttm_tt_create = &radeon_ttm_tt_create,
  	.ttm_tt_populate = &radeon_ttm_tt_populate,
diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h
index 1cc0b61..54e7421 100644
--- a/include/uapi/drm/radeon_drm.h
+++ b/include/uapi/drm/radeon_drm.h
@@ -511,6 +511,7 @@ typedef struct {
  #define DRM_RADEON_GEM_BUSY		0x2a
  #define DRM_RADEON_GEM_VA		0x2b
  #define DRM_RADEON_GEM_OP		0x2c
+#define DRM_RADEON_GEM_IMPORT		0x2d
#define DRM_IOCTL_RADEON_CP_INIT DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_CP_INIT, drm_radeon_init_t)
  #define DRM_IOCTL_RADEON_CP_START   DRM_IO(  DRM_COMMAND_BASE + DRM_RADEON_CP_START)
@@ -554,6 +555,7 @@ typedef struct {
  #define DRM_IOCTL_RADEON_GEM_BUSY	DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy)
  #define DRM_IOCTL_RADEON_GEM_VA		DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va)
  #define DRM_IOCTL_RADEON_GEM_OP		DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op)
+#define DRM_IOCTL_RADEON_GEM_IMPORT	DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_IMPORT, struct drm_radeon_gem_import)
typedef struct drm_radeon_init {
  	enum {
@@ -806,6 +808,13 @@ struct drm_radeon_gem_create {
  	uint32_t	flags;
  };
+struct drm_radeon_gem_import {
+	uint64_t		addr;
+	uint64_t		size;
+	uint32_t		flags;
+	uint32_t		handle;
+};
+
  #define RADEON_TILING_MACRO				0x1
  #define RADEON_TILING_MICRO				0x2
  #define RADEON_TILING_SWAP_16BIT			0x4
--
1.9.1

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel
