On 03/21/2014 01:12 PM, Maarten Lankhorst wrote:
> Hey,
>
> op 21-03-14 09:27, Thomas Hellstrom schreef:
>> On 03/21/2014 12:55 AM, Dave Airlie wrote:
>>> On Fri, Oct 19, 2012 at 3:04 AM, Jerome Glisse <j.glisse@xxxxxxxxx> wrote:
>>>> On Thu, Oct 18, 2012 at 06:43:40PM +0200, Thomas Hellstrom wrote:
>>>>> On 10/18/2012 04:45 PM, Maarten Lankhorst wrote:
>>>>>> Op 18-10-12 13:55, Thomas Hellstrom schreef:
>>>>>>> On 10/18/2012 01:38 PM, Maarten Lankhorst wrote:
>>>>>>>> Op 18-10-12 13:02, Thomas Hellstrom schreef:
>>>>>>>>> On 10/18/2012 10:37 AM, Maarten Lankhorst wrote:
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> Op 18-10-12 09:59, Thomas Hellstrom schreef:
>>>>>>>>>>> On 10/18/2012 09:28 AM, Thomas Hellstrom wrote:
>>>>>>>>>>>> Hi, Maarten,
>>>>>>>>>>>>
>>>>>>>>>>>> As you know I have been having my doubts about this change.
>>>>>>>>>>>> To me it seems insane to be forced to read the fence pointer under a
>>>>>>>>>>>> reserved lock, simply because when you take the reserve lock, another
>>>>>>>>>>>> process may have it, and there is a substantial chance that that process
>>>>>>>>>>>> will also be waiting for idle while holding the reserve lock.
>>>>>>>>>> I think it makes perfect sense: the only time you want to read the fence
>>>>>>>>>> is when you want to change the members protected by the reservation.
>>>>>>>>> No, that's not true. A typical case (or the only case)
>>>>>>>>> is where you want to do a map with no_wait semantics. You will want
>>>>>>>>> to be able to access a buffer's results even if the eviction code
>>>>>>>>> is about to schedule an unbind from the GPU, and have the buffer
>>>>>>>>> reserved?
>>>>>>>> Well, either block on reserve or return -EBUSY if reserved, presumably
>>>>>>>> the former.
>>>>>>>>
>>>>>>>> ttm_bo_vm_fault does the latter already, anyway.
>>>>>>> ttm_bo_vm_fault only trylocks to avoid a deadlock with mmap_sem; it's really
>>>>>>> a waiting reserve, but for different reasons. Typically a user-space app will keep
>>>>>>> asynchronous maps to TTM during a buffer lifetime, and the buffer lifetime may
>>>>>>> be long, as user-space apps keep caches.
>>>>>>> That's not the same as accessing a buffer after the GPU is done with it.
>>>>>> Ah, indeed.
>>>>>>>> You don't need to hold the reservation while performing the wait itself, though;
>>>>>>>> you could check if ttm_bo_wait(no_wait_gpu = true) returns -EBUSY, and if so
>>>>>>>> take a ref to the sync_obj member and then wait after unreserving. You won't
>>>>>>>> reset the sync_obj member to NULL in that case, but that should be harmless.
>>>>>>>> This will allow you to keep the reservations fast and short. Maybe a helper
>>>>>>>> would be appropriate for this, since radeon and nouveau both seem to do it.
>>>>>>>>
>>>>>>> The problem is that as long as other users are waiting for idle with the
>>>>>>> reservation held, for example to switch GPU engine or to delete a GPU bind,
>>>>>>> waiting for reserve will in many cases mean waiting for the GPU.
>>>>>> This example sounds inefficient; I know nouveau can do this, but it essentially
>>>>>> just stalls the gpu entirely. I guess you mean functions like
>>>>>> nouveau_gem_object_close?
>>>>>> It wouldn't surprise me if performance in nouveau could be improved simply by
>>>>>> fixing those cases up instead, since it stalls the application completely for
>>>>>> other uses too.
>>>>>>
>>>>> Please see the Nouveau cpu_prep patch as well.
>>>>>
>>>>> While there are a number of cases that can be fixed up, also in
>>>>> Radeon, there's no way around the fact that reservation
>>>>> is a heavyweight lock that, particularly on simpler hardware without
>>>>> support for fence ordering with barriers and/or "semaphores" and
>>>>> accelerated eviction, will be held while waiting for idle.
>>>>>
>>>>> As such, it is unsuitable to protect read access to the fence
>>>>> pointer. If the intention is to keep a single fence pointer,
>>>>> I think the best solution is to allow reading the fence pointer
>>>>> outside reservation, but make sure this can be done
>>>>> atomically. If the intention is to protect an array or list of fence
>>>>> pointers, I think a spinlock is the better solution.
>>>>>
>>>>> /Thomas
>>>> Just wanted to point out that, like Thomas, I am concerned about having
>>>> to have the object reserved when waiting on its associated fence. I fear
>>>> it will hurt us somehow.
>>>>
>>>> I will try to spend a couple of days stress-testing your branch on radeon,
>>>> trying to see if it hurts performance in any current use case.
>>>>
>>> I've been trying to figure out what to do with Maarten's patches going
>>> forward, since they are essential for all kinds of SoC people.
>>>
>>> However, I'm still not 100% sure I believe that the problem is anything
>>> more than a micro-optimisation, and possibly a premature one at that.
>>> This needs someone to start suggesting benchmarks we can run and a
>>> driver set to run them on; otherwise I'm starting to lean towards this
>>> being an optimisation we can fix later.
>>>
>>> The other option is to somehow merge this stuff under an option that
>>> allows us to test it using a command-line option, but I don't think
>>> that is sane either.
>>>
>>> So Maarten, Jerome, Thomas: please start discussing actual real-world
>>> tests you think merging this code will affect, and then I can make a
>>> better-informed decision.
>>>
>>> Dave.
>> Dave,
>>
>> This is IMHO a fundamental design discussion, not a micro-optimization.
>>
>> I'm pretty sure all sorts of horrendous things could be done to the DRM
>> design without affecting real-world application performance.
>>
>> In TTM, data protection is primarily done with spinlocks. This serves
>> two purposes.
>>
>> a) You can't unnecessarily hold a data-protection lock while waiting for
>> the GPU, which is typically a very stupid thing to do (perhaps not so in
>> this particular case), but when the sleep-while-atomic or locking-inversion
>> kernel warning turns up, that should at least ring a bell. Trying to
>> implement dma-buf fencing using the TTM fencing callbacks will AFAICT
>> cause a locking inversion.
>>
>> b) Spinlocks essentially go away on UP systems. The whole reservation
>> path was essentially lock-free on UP systems pre dma-buf integration,
>> and with very few atomic locking operations even on SMP systems. It was
>> all prompted by a test some years ago (by Eric Anholt, IIRC) showing that
>> locking operations turned up quite high in Intel driver profiling.
>>
>> If we start protecting data with reservations, and we also export the
>> reservation locks so that people find them a convenient "serialize work
>> on this buffer" lock, all kinds of interesting things might happen, and
>> we might end up in a situation similar to protecting everything with
>> struct_mutex.
>>
>> This is why I dislike this change, in particular when there is a very
>> simple remedy.
>>
>> But if I can get an ACK to convert the reservation object fence pointers
>> and the associated operations on them to be RCU-safe when I have some
>> time left, I'd be prepared to accept this patch series in its current
>> state.
> RCU could be a useful way to deal with this. But in my case I've shown
> there are very few places where it's needed; core ttm does not need it
> at all.
> This is why I wanted to leave it to individual drivers to implement it.
>
> I think kfree_rcu for the free in the driver itself should be enough,
> and obtaining a reference in the driver would require something like
> this, similar to what's in rcuref.txt:
>
> rcu_read_lock();
> f = rcu_dereference(bo->fence);
> if (f && !kref_get_unless_zero(&f->kref))
>         f = NULL;
> rcu_read_unlock();
>
> if (f) {
>         /* do stuff here */
>         ...
>
>         /* free fence */
>         kref_put(&f->kref, fence_put_with_kfree_rcu);
> }
>
> Am I wrong, or is something like this enough to make sure the fence is
> still alive?

No, you're correct. And a bo check for idle would amount to (since we
don't care if the fence refcount is zero):

rcu_read_lock();
f = rcu_dereference(bo->fence);
signaled = !f || f->signaled;
rcu_read_unlock();

/Thomas

> There might still be a small bug when bo->fence's refcount is
> decreased before bo->fence is set to NULL. I haven't checked core ttm,
> so that might need fixing.
>
> I added some more people to CC. It might be worthwhile having this in
> the core fence code and deleting all fences with RCU, but I'm not
> completely certain about that yet.
>
> ~Maarten

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel