On Mon, Apr 8, 2013 at 8:58 PM, Sebastian Andrzej Siewior <bigeasy at linutronix.de> wrote:
> * Daniel Vetter | 2013-04-02 15:47:02 [+0200]:
>>On Tue, Apr 02, 2013 at 03:30:58PM +0200, Sebastian Andrzej Siewior wrote:
>>> mutex_is_locked_by() checks the owner of the lock against current. This
>>> is done by accessing a private member of struct mutex which works on
>>> mainline but does not on RT.
>>> I did not figure out why this "lock-owner-check" and the "lock stealing
>>> flag" are required. If the lock cannot be acquired (because acquiring it
>>> would deadlock), then the function can return -1.
>>> The lock stealing actually makes no sense to me. If
>>> shrinker_no_lock_stealing is true, then the same functions
>>> (i915_gem_purge(), i915_gem_shrink_all()) are called from
>>> i915_gem_object_create_mmap_offset() as from i915_gem_inactive_shrink().
>>> I haven't found a path in which i915_gem_inactive_shrink() is invoked
>>> from i915_gem_object_create_mmap_offset(), which means there is no way
>>> shrinker_no_lock_stealing is true _and_ the lock is owned by the current
>>> process.
>>
>>Every invocation of a memory allocation function can potentially recurse
>>into the shrinker callbacks. Since we currently protect almost all our gpu
>>buffer object state with one single lock, and we can also easily hold onto
>>tons of memory, we can easily OOM without that lock stealing trick.
>
> But this only occurs when you allocate memory from within the driver, or
> is this mutex so contended that it is almost always taken?

It protects the entire gem state atm, so yes, it's grabbed everywhere. On
top of that it's interleaved with tons of legacy drm core code locking,
which almost exclusively relies on the same mutex. I advise you to bring
along your dragon-grade broadsword when venturing into this area.

>>The patch you're reverting here was added to work around spurious OOMs
>>when trying to allocate sg tables, so this is real.
>
> Couldn't you simply call the two shrink functions before allocating the
> sg table?

We'd need to hand-roll our own sg table allocator. We already have such
hand-rolled code for a lot of these cases, and it has started to get
_really_ ugly.

>>I fully realize that this is an awful hack and I'll burn on a pretty big
>>pyre for merging it. I also know that it doesn't really work if someone
>>else is trying to allocate while another thread is hogging our single
>>mutex. But without a replacement fix, this is nacked.
>
> I'm sorry to point it out, but this "fix" has staging quality.

I never denied that, but laid out our plans to eventually fix this for
real. Unfortunately, to fix this for real you have to rework 10 years of
horrible drm locking, all more-or-less interacting in surprising ways with
crazy ioctl interfaces. I'm working on this, slowly but steadily. Patches
welcome.

>>The no_lock_stealing hack is probably even worse; we use it to protect
>>critical sections which are touching the gpu lru.
>
> Maybe another reason to have a few spots where you call the shrink
> functions manually?

Manual shrink handling already makes up more than 50% of our logic in some
cases. It has really gotten out of hand, so I've opted for this hack to
fix up the remaining parts until we have a better solution.

>>Long term we're working towards that goal (also for a bunch of other
>>reasons) and aim for more fine-grained locking, so that we can free memory
>>in smaller chunks. But that's gonna be a lot of work.
>
> Could you please define "long term" in time units? I would really like
> to see the owner check gone asap.

The baseline we need first is the ww_mutex work Maarten Lankhorst is
working on. Then comes converting gem and our entire driver codebase over
to the new locking paradigm, plus fixing up anything which needs fixing in
drm, plus any drivers those drm core changes affect. All while keeping up
our usual pace of new hw enabling, other feature work and normal
bugfixing.
If no one chips in, I'd say a few kernel releases at least.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch