On Fri, Jun 23, 2017 at 3:01 PM, Christian König <deathsimple at vodafone.de> wrote:
> The key point here is that while optimizing this is nice, the much
> bigger pile is the locking done for each BO.
>
> In other words, even when we optimize all the other locks involved
> into atomics or RCU, the BO reservation lock will still dominate
> everything.
>
> One possible solution to this would be per-process resources, as I
> have suggested multiple times now.

Mesa can set a per-process resource flag on all resources except
displayable ones. The question is: would it help if an IB contained
1000 per-process resources and 1-2 inter-process sharable ones?

Marek

>
> Christian.
>
>
> Am 23.06.2017 um 13:37 schrieb Marek Olšák:
>>
>> I agree with you about the spinlock. You seem to be good at this.
>>
>> It's always good to do measurements to validate that a code change
>> improves something, especially when the code size and code complexity
>> have to be increased. A CPU profiler such as sysprof can show you
>> improvements on the order of 1/10000th = 0.01% if you record enough
>> samples. Sometimes you have to un-inline a function to make it
>> visible there. If you see a function that takes 0.3% of CPU time and
>> you optimize it down to 0.1% using the profiler as the measurement
>> tool, you have evidence that the improvement is there and nobody can
>> reject the idea anymore. It also proves that the code size increase
>> is worth it. It's always "added code size and loss of simplicity"
>> vs. benefit. It's a transaction. You trade one for the other. You
>> lose something to get something else. OK, we know the code
>> complexity. Now, what's the benefit? Can you do some measurements?
>> An accuracy of 1/10000th should be enough for anybody.
>>
>> I know the feeling when you spend many days working on something,
>> adding 100s or 1000s of lines of code, solving many problems to get
>> there and increasing code complexity significantly, and then you do
>> the measurement and it doesn't improve anything. I know the feeling
>> very well. It sucks. The frustration comes from the investment of
>> time and getting no return on the investment. Many frustrations in
>> life are like that.
>>
>> Marek
>>
>>
>> On Fri, Jun 23, 2017 at 4:23 AM, axie <axie at amd.com> wrote:
>>>
>>> Hi Marek,
>>>
>>>
>>> So do you agree that a spinlock disables CPU preemption, contrary
>>> to your original idea?
>>>
>>>
>>> If you have a new reason why this patch does not improve things,
>>> please speak out.
>>>
>>>
>>> Many patches in the GPU driver aim at improving performance and
>>> power efficiency. Do most patches submitted to AMDGPU require
>>> benchmarking first?
>>>
>>> If all developers are required to always answer your questions
>>> during code review, I am afraid that most open source community
>>> developers cannot meet that requirement and will stop working on
>>> AMDGPU.
>>>
>>>
>>> To improve performance, there are many bottlenecks to clear. Only
>>> when the last several bottlenecks are cleared does the performance
>>> gain show up significantly.
>>>
>>> My past profiling experience told me that clearing a lock can
>>> improve performance for a driver by anywhere from 0.3% to a much
>>> bigger percentage. It depends on many factors, even on the
>>> application itself.
>>>
>>>
>>> This is not the first bottleneck fixed. It is surely not the last
>>> one.
>>>
>>>
>>> Thanks,
>>>
>>> Alex Bin
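
To make the per-process idea from the top of the thread concrete: BOs
that are never shared across processes can all sit behind a single
per-VM reservation object, so a submission with 1000 local BOs and 1-2
shared ones takes a handful of locks instead of ~1000. Below is a
minimal sketch of such a submission fast path. Every name in it
(sketch_bo, SKETCH_BO_LOCAL, sketch_reserve_for_cs) is invented for
illustration and is not the amdgpu implementation; amdgpu did later
gain a flag along these lines, AMDGPU_GEM_CREATE_VM_ALWAYS_VALID.

    /* Illustrative sketch only, not amdgpu code.  Real submission
     * paths also use a ww_acquire_ctx for multi-lock deadlock
     * avoidance; it is omitted here for brevity. */

    #include <linux/reservation.h>

    #define SKETCH_BO_LOCAL	(1 << 0)  /* never shared across processes */

    struct sketch_bo {
    	struct reservation_object *resv;  /* == the per-VM resv object
    					   * when SKETCH_BO_LOCAL is set */
    	unsigned int flags;
    };

    /* Reserve the BO list for one command submission.  The per-VM
     * reservation object is taken once up front and covers all local
     * BOs, so the loop only locks the few shareable ones. */
    static int sketch_reserve_for_cs(struct reservation_object *vm_resv,
    				     struct sketch_bo **bos,
    				     unsigned int count)
    {
    	unsigned int i;
    	int r;

    	r = reservation_object_lock(vm_resv, NULL);
    	if (r)
    		return r;

    	for (i = 0; i < count; i++) {
    		if (bos[i]->flags & SKETCH_BO_LOCAL)
    			continue;	/* already covered by vm_resv */

    		r = reservation_object_lock(bos[i]->resv, NULL);
    		if (r)
    			goto unwind;
    	}
    	return 0;

    unwind:
    	while (i--)
    		if (!(bos[i]->flags & SKETCH_BO_LOCAL))
    			reservation_object_unlock(bos[i]->resv);
    	reservation_object_unlock(vm_resv);
    	return r;
    }
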
>>>
>>>
>>> On 2017-06-22 07:54 PM, Marek Olšák wrote:
>>>>
>>>> That's all nice, but does it improve performance? Have you been
>>>> able to measure some performance difference with that code? Were
>>>> you targeting a specific inefficiency you had seen e.g. with a CPU
>>>> profiler?
>>>>
>>>> Marek
>>>>
>>>> On Thu, Jun 22, 2017 at 8:19 PM, axie <axie at amd.com> wrote:
>>>>>
>>>>> To clarify, the local IRQ is disabled by calling
>>>>> raw_local_irq_save(flags);
>>>>>
>>>>> The function __lock_acquire double-checks that the local IRQ is
>>>>> really disabled.
>>>>>
>>>>>
>>>>>
>>>>> On 2017-06-22 01:34 PM, axie wrote:
>>>>>>
>>>>>> Hi Marek,
>>>>>>
>>>>>> Spin lock and spin unlock are fast. But they are not so fast
>>>>>> compared with an atomic, which is a single CPU instruction on
>>>>>> x86.
>>>>>>
>>>>>>
>>>>>> 1. A spinlock does NOT allow preemption on the local CPU. Let us
>>>>>> have a look at how spin lock is implemented.
>>>>>>
>>>>>> static inline void __raw_spin_lock(raw_spinlock_t *lock)
>>>>>> {
>>>>>>         preempt_disable();  <<-- This is a memory barrier operation too.
>>>>>>         spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>>>>>>         LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>>>>>> }
>>>>>>
>>>>>> 2. The function __lock_acquire is called by spinlock. The
>>>>>> function is so long that I will not attach all of it here.
>>>>>>
>>>>>> There are atomic operations inside, 12 metadata updates and 14 if
>>>>>> statements, and it calls quite a few other functions.
>>>>>>
>>>>>> Note that it requires IRQs to be disabled...
>>>>>>
>>>>>> static int __lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>>>                           int trylock, int read, int check, int hardirqs_off,
>>>>>>                           struct lockdep_map *nest_lock, unsigned long ip,
>>>>>>                           int references, int pin_count)
>>>>>> {
>>>>>>         struct task_struct *curr = current;
>>>>>>         struct lock_class *class = NULL;
>>>>>>         struct held_lock *hlock;
>>>>>>         unsigned int depth;
>>>>>>         int chain_head = 0;
>>>>>>         int class_idx;
>>>>>>         u64 chain_key;
>>>>>>
>>>>>>         if (unlikely(!debug_locks))
>>>>>>                 return 0;
>>>>>>
>>>>>>         /*
>>>>>>          * Lockdep should run with IRQs disabled, otherwise we could
>>>>>>          * get an interrupt which would want to take locks, which would
>>>>>>          * end up in lockdep and have you got a head-ache already?
>>>>>>          */
>>>>>>         if (DEBUG_LOCKS_WARN_ON(!irqs_disabled()))  <<-- requires IRQs disabled
>>>>>>                 return 0;
>>>>>>
>>>>>> ....
>>>>>>
>>>>>> 3. Another function called by spinlock at a higher level:
>>>>>>
>>>>>> void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>>>                   int trylock, int read, int check,
>>>>>>                   struct lockdep_map *nest_lock, unsigned long ip)
>>>>>> {
>>>>>>         unsigned long flags;
>>>>>>
>>>>>>         if (unlikely(current->lockdep_recursion))
>>>>>>                 return;
>>>>>>
>>>>>>         raw_local_irq_save(flags);
>>>>>>         check_flags(flags);
>>>>>>
>>>>>>         current->lockdep_recursion = 1;
>>>>>>         trace_lock_acquire(lock, subclass, trylock, read, check,
>>>>>>                            nest_lock, ip);
>>>>>>         __lock_acquire(lock, subclass, trylock, read, check,
>>>>>>                        irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
>>>>>>         current->lockdep_recursion = 0;
>>>>>>         raw_local_irq_restore(flags);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Alex Bin
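
The comparison being made above boils down to a few lines. Here is a
minimal sketch with invented names (sketch_ctx, sketch_next_seq_*),
assuming a simple sequence counter as the protected state. One
clarification: the spin_acquire()/__lock_acquire() bookkeeping quoted
above is only compiled in when lockdep (CONFIG_DEBUG_LOCK_ALLOC) is
enabled, but preempt_disable()/preempt_enable() is paid on every
preemptible kernel regardless.

    #include <linux/atomic.h>
    #include <linux/spinlock.h>

    struct sketch_ctx {
    	spinlock_t	lock;
    	u64		seq;	/* protected by lock */
    	atomic64_t	aseq;	/* atomic alternative */
    };

    /* Spinlock version: disables preemption for the whole critical
     * section and, with lockdep enabled, runs the acquire/release
     * bookkeeping shown above on every call. */
    static u64 sketch_next_seq_locked(struct sketch_ctx *ctx)
    {
    	u64 seq;

    	spin_lock(&ctx->lock);
    	seq = ++ctx->seq;
    	spin_unlock(&ctx->lock);
    	return seq;
    }

    /* Atomic version: a single locked read-modify-write instruction
     * on x86, with no preemption toggle and no lockdep involvement. */
    static u64 sketch_next_seq_atomic(struct sketch_ctx *ctx)
    {
    	return atomic64_inc_return(&ctx->aseq);
    }
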
>>>>>>
>>>>>> On 2017-06-22 12:27 PM, Marek Olšák wrote:
>>>>>>>
>>>>>>> On Thu, Jun 22, 2017 at 5:33 PM, Xie, AlexBin <AlexBin.Xie at amd.com> wrote:
>>>>>>>>
>>>>>>>> Hi Christian,
>>>>>>>>
>>>>>>>>
>>>>>>>> In fact, the change from spinlock to atomic is quite painful.
>>>>>>>> When I started, I thought it was easy, but later I found there
>>>>>>>> might be race conditions here and there. Now I think the change
>>>>>>>> looks more robust. In the kernel source, several other drivers
>>>>>>>> use the same trick.
>>>>>>>>
>>>>>>>>
>>>>>>>> On the other hand, I think the logic itself might be optimized
>>>>>>>> considering the locking. But I have spent quite some effort to
>>>>>>>> maintain the original logic.
>>>>>>>
>>>>>>> It seems quite complicated and I don't know if there is any
>>>>>>> performance benefit. Spinlocks are nice because they allow
>>>>>>> preemption.
>>>>>>>
>>>>>>> It would be more interesting to merge the CS and BO_LIST ioctls
>>>>>>> into one.
>>>>>>>
>>>>>>> Marek
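
Marek's closing suggestion is worth spelling out: at this point
userspace first creates a BO list object with one ioctl and then
references it by handle from the CS ioctl; folding the handles into
the CS call itself removes an ioctl round trip and the list object's
locking. Below is a sketch of what such an inline chunk could look
like. The chunk id and both structs are invented for illustration and
are not the actual UAPI of this era; amdgpu did later grow a real
variant of the idea (AMDGPU_CHUNK_ID_BO_HANDLES).

    #include <linux/types.h>

    #define SKETCH_CHUNK_ID_BO_HANDLES	0x10	/* hypothetical chunk id */

    /* One entry per BO, matching what the BO_LIST ioctl carries. */
    struct sketch_bo_entry {
    	__u32 handle;		/* GEM handle */
    	__u32 priority;		/* placement priority */
    };

    /* Chunk payload: the CS ioctl would parse this in the same pass
     * that walks the IB chunks, building the in-kernel BO list
     * without a second ioctl. */
    struct sketch_bo_handles_chunk {
    	__u32 bo_count;		/* number of entries */
    	__u32 entry_size;	/* sizeof(struct sketch_bo_entry),
    				 * kept explicit for extensibility */
    	__u64 entries;		/* userspace pointer to the entry array */
    };
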