On Tue, Jul 01, 2014 at 11:49:18AM +0200, Christian König wrote:
> Am 30.06.2014 19:35, schrieb Jerome Glisse:
> >On Mon, Jun 30, 2014 at 03:04:16PM +0200, Christian König wrote:
> >>From: Christian König <christian.koenig@xxxxxxx>
> >>
> >>This patch adds an IOCTL for turning a pointer supplied by
> >>userspace into a buffer object.
> >>
> >>It works similarly to the recently added userptr support for i915,
> >>so it also imposes several restrictions upon the memory being mapped:
> >>
> >>1. It must be page aligned (both start/end addresses, i.e. ptr and size).
> >>
> >>2. It must be normal system memory, not a pointer into another map of IO
> >>space (e.g. it must not be a GTT mmapping of another object).
> >>
> >>3. The BO is mapped into GTT, so the maximum amount of memory mapped at
> >>all times is still the GTT limit.
> >>
> >>Exporting and sharing as well as mapping of buffer objects created by
> >>this function is forbidden and results in -EPERM.
> >>
> >>In contrast to the i915 implementation we don't use an interval tree to
> >>manage all the buffer objects created, and instead just register a separate
> >>MMU notifier callback for each BO so that buffer objects are invalidated
> >>whenever any modification is made to the process's page table. This probably
> >>needs further optimization.
> >>
> >>Please review and / or comment.
> >
> >NAK, this is kind of a horror show. I should probably take a look at
> >the Intel code too.
>
> Yeah, I'm not too keen on the approach either and would rather
> prefer using the IOMMU, but unfortunately that isn't available under
> all use cases we need to be able to handle.
>
> BTW: Here is the link to the initial commit of the Intel
> implementation as it is in Linus' tree:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=5cc9ed4b9a7ac579362ccebac67f7a4cdb36de06
>
> >First, registering one notifier per bo is a bad idea; it is not what you
> >want to do.
>
> I know, the Intel guys use a single registration for each process
> and an interval tree to look up all the BOs affected.
>
> I just wanted to keep the implementation simple and stupid for now
> and first get feedback about the approach in general (to save time
> if the whole approach won't be accepted at all, and your reaction
> indeed confirms that this was a good idea).
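Just so we are talking about the same shape of solution: one mmu_notifier
registration per mm plus an interval tree keyed on the user virtual
address, so an invalidation only walks the BOs it actually overlaps. The
following is only an untested sketch; every structure and function name
in it is made up for illustration, it is neither the i915 code nor
something that drops into radeon as-is:

#include <linux/mmu_notifier.h>
#include <linux/interval_tree.h>
#include <linux/mutex.h>

struct userptr_notifier {
        struct mmu_notifier mn;         /* one registration per mm, not per BO */
        struct rb_root objects;         /* interval tree of userptr objects */
        struct mutex lock;
};

struct userptr_object {
        struct interval_tree_node it;   /* [it.start, it.last] = user VA range */
        /* driver specific BO pointer would live here */
};

static void userptr_invalidate_range_start(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
{
        struct userptr_notifier *un =
                container_of(mn, struct userptr_notifier, mn);
        struct interval_tree_node *it;

        /* Visit only the objects overlapping [start, end) instead of
         * being called once per BO for every PTE change in the process. */
        mutex_lock(&un->lock);
        for (it = interval_tree_iter_first(&un->objects, start, end - 1); it;
             it = interval_tree_iter_next(it, start, end - 1)) {
                struct userptr_object *obj =
                        container_of(it, struct userptr_object, it);
                /* wait for the GPU and kick obj out of GTT here */
                (void)obj;
        }
        mutex_unlock(&un->lock);
}

static const struct mmu_notifier_ops userptr_notifier_ops = {
        .invalidate_range_start = userptr_invalidate_range_start,
};

static void userptr_add_object(struct userptr_notifier *un,
                               struct userptr_object *obj,
                               unsigned long addr, unsigned long size)
{
        obj->it.start = addr;
        obj->it.last = addr + size - 1;

        mutex_lock(&un->lock);
        interval_tree_insert(&obj->it, &un->objects);
        mutex_unlock(&un->lock);
}

That does not solve any of the coherency problems below, it only gets rid
of the one-notifier-per-BO overhead.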
> >>[SNIP]
> >>@@ -176,8 +181,15 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
> >> 	if (p->cs_flags & RADEON_CS_USE_VM)
> >> 		p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
> >> 					      &p->validated);
> >>+	if (need_mmap_lock)
> >>+		down_read(&current->mm->mmap_sem);
> >>+
> >>+	r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>-	return radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
> >>+	if (need_mmap_lock)
> >>+		up_read(&current->mm->mmap_sem);
> >>+
> >>+	return r;
> >> }
> >
> >So the mmap lock protects almost nothing here. Once things are validated,
> >nothing forbids userspace from unmapping the range containing the user bo.
>
> As far as I understand it (and I'm probably not as deep into the MM
> code as you, so correct me if I'm wrong) unmapping the range would
> result in an invalidate_range_start callback and this callback in
> turn validates the BO into the CPU domain again. So it actually
> blocks until the GPU operation is completed, marks the pages in
> question as dirty and then unpins them.
>
> This should protect us from anything nasty happening in the case of an
> unmap(), fork() etc...

Thing is, nothing forbids a new cs ioctl from happening right after the
radeon range_start callback but before the core mm code does whatever it
was about to do. The mmap_sem will protect you from fork or munmap, but
not from other things.

> >> static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
> >>diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
> >>index 6e30174..355aa67 100644
> >>--- a/drivers/gpu/drm/radeon/radeon_drv.c
> >>+++ b/drivers/gpu/drm/radeon/radeon_drv.c
> >>@@ -112,6 +112,9 @@ int radeon_gem_object_open(struct drm_gem_object *obj,
> >> 					struct drm_file *file_priv);
> >> void radeon_gem_object_close(struct drm_gem_object *obj,
> >> 				struct drm_file *file_priv);
> >>+struct dma_buf *radeon_gem_prime_export(struct drm_device *dev,
> >>+					struct drm_gem_object *gobj,
> >>+					int flags);
> >
> >Use tabs not spaces, there is already enough broken tab/space usage in the
> >radeon kernel code as it is. I always tried to keep it consistent.
>
> Fixed, thanks for the hint.
>
> >>[SNIP]
> >>+static void radeon_bo_mn_release(struct mmu_notifier *mn,
> >>+				 struct mm_struct *mm)
> >>+{
> >>+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+
> >>+	bo->mn.ops = NULL;
> >>+}
> >>+
> >>+static void radeon_bo_mn_invalidate_range_start(struct mmu_notifier *mn,
> >>+						struct mm_struct *mm,
> >>+						unsigned long start,
> >>+						unsigned long end)
> >>+{
> >>+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, mn);
> >>+	int r;
> >>+
> >>+	/* TODO: optimize! */
> >>+
> >>+	r = radeon_bo_reserve(bo, true);
> >>+	if (r) {
> >>+		dev_err(bo->rdev->dev, "(%d) failed to reserve user bo\n", r);
> >>+		return;
> >>+	}
> >>+
> >>+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
> >>+	r = ttm_bo_validate(&bo->tbo, &bo->placement, false, false);
> >>+	if (r) {
> >>+		dev_err(bo->rdev->dev, "(%d) failed to validate user bo\n", r);
> >>+		return;
> >>+	}
> >>+
> >>+	radeon_bo_unreserve(bo);
> >>+}
> >
> >So you most likely want ttm_bo_validate to be interruptible, but you also
> >want to call something like io_schedule so you are not just wasting the
> >time slice of the process for the selfishness of the gpu driver.
> >
> >Also, invalidate_range is a callback that is commonly called during the
> >process lifetime.
>
> I know that this is horribly inefficient and error prone, but we
> can't make ttm_bo_validate interruptible because
> invalidate_range_start doesn't allow us to return an error number,
> and so we can't rely on userspace repeating the syscall that caused
> the mapping change in case of a signal.

My thinking was to restart inside the notifier callback, with an
io_schedule() in between.

> The Intel code does it the same way and it actually sends a cold
> shiver down my back that this got upstream, but hey, I'm not the one
> who reviewed it and I don't have a better idea either.
>
> >Moreover, nothing prevents a cs that is called after the range_start
> >callback from pinning the buffer again before the range_end callback is
> >made. Thus this is extremely racy.
>
> Oh really? Crap, I thought for all the cases that we are interested
> in the mmap_sem would be locked between range_start and range_end.

It can be taken in read mode. Not all calls to range_start/end happen
with mmap_sem held in write mode.

> >>+
> >>+static const struct mmu_notifier_ops radeon_bo_notifier = {
> >>+	.release = radeon_bo_mn_release,
> >>+	.invalidate_range_start = radeon_bo_mn_invalidate_range_start,
> >>+};
> >
> >What about the invalidate_page callback? You are breaking writeback to
> >disk, which is not a welcome move.
>
> Why? Keep in mind that the pages are pinned down as soon as we make
> the first CS which uses them and marked as dirty when we are done
> with them. In between those two events the page shouldn't be written
> back to disk.

A page could be dirty prior to being pinned, or dirtied afterwards (CPU
access). I am just pointing out that the GPU might be writing to the
page while the page is in use as the source for disk I/O.
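If the pinning approach stays, the ops table should at the very least
also wire up invalidate_page so the driver hears about single-page
unmappings such as the ones done for writeback. A rough, untested sketch
along the lines of the callbacks already in the patch (it does nothing
for the races discussed above):

static void radeon_bo_mn_invalidate_page(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long address)
{
        /* treat a single page invalidation like a one-page range */
        radeon_bo_mn_invalidate_range_start(mn, mm, address,
                                            address + PAGE_SIZE);
}

static const struct mmu_notifier_ops radeon_bo_notifier = {
        .release                = radeon_bo_mn_release,
        .invalidate_page        = radeon_bo_mn_invalidate_page,
        .invalidate_range_start = radeon_bo_mn_invalidate_range_start,
};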
> >>[SNIP]
> >>+static int radeon_ttm_tt_pin_userptr(struct ttm_tt *ttm)
> >>+{
> >>+	struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
> >>+	struct radeon_ttm_tt *gtt = (void *)ttm;
> >>+	unsigned pinned = 0, nents;
> >>+	int r;
> >>+
> >>+	/* prepare the sg table with the user pages */
> >>+	if (current->mm != gtt->usermm)
> >>+		return -EPERM;
> >>+
> >>+	do {
> >>+		unsigned num_pages = ttm->num_pages - pinned;
> >>+		uint64_t userptr = gtt->userptr + pinned * PAGE_SIZE;
> >>+		struct page **pages = ttm->pages + pinned;
> >>+
> >>+		r = get_user_pages(current, current->mm, userptr, num_pages,
> >>+				   1, 0, pages, NULL);
> >>+		if (r < 0) {
> >>+			release_pages(ttm->pages, pinned, 0);
> >>+			return r;
> >>+		}
> >>+		pinned += r;
> >>+
> >>+	} while (pinned < ttm->num_pages);
> >>+
> >>+	r = -ENOMEM;
> >>+	ttm->sg = kcalloc(1, sizeof(struct sg_table), GFP_KERNEL);
> >>+	if (!ttm->sg)
> >>+		goto release_pages;
> >>+
> >>+	r = sg_alloc_table_from_pages(ttm->sg, ttm->pages, ttm->num_pages, 0,
> >>+				      ttm->num_pages << PAGE_SHIFT,
> >>+				      GFP_KERNEL);
> >>+	if (r)
> >>+		goto release_sg;
> >>+
> >>+	r = -ENOMEM;
> >>+	nents = dma_map_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
> >>+			   DMA_BIDIRECTIONAL);
> >>+	if (nents != ttm->sg->nents)
> >>+		goto release_sg;
> >>+
> >>+	return 0;
> >>+
> >>+release_sg:
> >>+	kfree(ttm->sg);
> >>+
> >>+release_pages:
> >>+	release_pages(ttm->pages, pinned, 0);
> >>+	return r;
> >>+}
> >
> >So, as said above, the pages backing the user mapping might change. As
> >the gup comment explains, there is no guarantee that the pages returned
> >by gup are the ones in use to back the address. Well, for things that
> >will be used as a read source by the gpu this is not much of an issue,
> >as you can probably assume that by the time you pin the bo the user
> >mapping is done being written to.
> >
> >But the case of the GPU writing to it and then userspace expecting to
> >read it back through its user space mapping is just a game of luck. You
> >will be lucky a lot of the time because the kernel does not migrate
> >pages just for the kick of it. But in case it does, you will have
> >inconsistencies, and users of that API will not understand why the hell
> >they are having an issue.
>
> Yeah, completely agree. I was aware of those limitations even before
> starting to work on it, but hoped that registering an
> invalidate_range_start callback like the Intel code does would
> solve the issue (well, this code indeed got upstream).
>
> Unfortunately I'm not the one who decides if we do it or not, I'm
> just the one who volunteered to figure out how it is possible to do
> it in a clean way.
>
> >This kind of API is really a no-go; the way the mm code works does not
> >allow for such a thing. I will go look at the Intel code, but I fear the
> >worst.
>
> Please keep me updated on this, if we can't get a solution for this
> upstream I really need to be able to explain why.
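Will do. For the record, the unpin side implied by the "marks the pages
as dirty and then unpins them" behaviour described above would have to
look roughly like the sketch below. It reuses the naming of the patch but
is not actual code from it, and it assumes the pin path above succeeded;
without the set_page_dirty() any data the GPU wrote is simply lost the
next time the page goes through writeback or reclaim.

static void radeon_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
{
        struct radeon_device *rdev = radeon_get_rdev(ttm->bdev);
        unsigned long i;

        /* undo the dma_map_sg() done at pin time */
        dma_unmap_sg(rdev->dev, ttm->sg->sgl, ttm->sg->nents,
                     DMA_BIDIRECTIONAL);

        for (i = 0; i < ttm->num_pages; ++i) {
                struct page *page = ttm->pages[i];

                set_page_dirty(page);   /* the GPU may have written to it */
                mark_page_accessed(page);
                put_page(page);         /* drop the get_user_pages reference */
        }

        sg_free_table(ttm->sg);
        kfree(ttm->sg);
        ttm->sg = NULL;
}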
The biggest issue with it is the pinning of memory; this violates the
memcg. Direct I/O pins memory, but limits itself to a reasonable amount
at a time. A GPU buffer object might be several hundred megabytes, which
obviously can be nasty from a memory starvation point of view. Most
other issues can be worked out.

Cheers,
Jérôme
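PS: If pinning survives review, then at minimum the pinned pages should
be accounted against RLIMIT_MEMLOCK, similar in spirit to what the
infiniband umem code does, so a single process cannot lock down an
arbitrary amount of memory. That still does not fix the memcg problem,
it only bounds the damage. A rough sketch (the helper name is made up,
error unwinding and the matching decrement on unpin are omitted):

#include <linux/sched.h>
#include <linux/capability.h>

static int userptr_account_pinned(unsigned long npages)
{
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        unsigned long locked;
        int ret = 0;

        down_write(&current->mm->mmap_sem);
        locked = npages + current->mm->pinned_vm;
        if (locked > lock_limit && !capable(CAP_IPC_LOCK))
                ret = -ENOMEM;          /* over the mlock limit, refuse to pin */
        else
                current->mm->pinned_vm = locked;
        up_write(&current->mm->mmap_sem);

        return ret;
}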