[PATCH 1/3] drm/amdgpu: fix a typo

Hi Marek,

I understand you spent time on your original logic too. I really don't 
understand why you talk about pain when somebody can improve it.

To reduce the pain, I am now seriously considering dropping this patch. 
But please read on before you draw a conclusion. Let us treat open 
source software development as fun.

The same trick as in this patch can be found in the open source Intel 
GPU driver and in xfs.

Talking about code size: you may be surprised if you actually calculate it.

For the function amdgpu_cs_get_threshold_for_moves:
New code:  2 more loops and 2 more ifs.
Old code:  2 spinlock inline functions. spin_lock can expand to 5 
function calls and one if; one of those calls takes 7 parameters. 
spin_unlock can expand to 4 function calls.

By the way, you can configure the Linux kernel to disable some of this 
spinlock macro expansion, but I don't think people really do that.

In the function amdgpu_cs_report_moved_bytes:
New code:  zero.
Old code:  2 spinlock inline functions.

In total:

New code:  2 more loops and 2 more ifs, and maybe one or two other tiny 
things.
Old code:  4 spinlock inline functions, which expand to 18 function 
calls. Among them, two calls each take 7 parameters.
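
To make this concrete, here is a rough sketch of the kind of conversion 
involved. The struct, field, and function names are illustrative, not 
the actual amdgpu code:

/* Old style: a counter protected by a spinlock. */
struct mm_stats {
        spinlock_t lock;
        s64 accum_us;   /* accumulated microseconds */
};

void report_moved(struct mm_stats *stats, s64 us)
{
        spin_lock(&stats->lock);        /* the expansion counted above */
        stats->accum_us += us;
        spin_unlock(&stats->lock);
}

/* New style: a lock-free atomic counter. The update becomes a single
 * atomic read-modify-write, with no lock and no preemption disabling. */
struct mm_stats_atomic {
        atomic64_t accum_us;
};

void report_moved_atomic(struct mm_stats_atomic *stats, s64 us)
{
        atomic64_add(us, &stats->accum_us);
}

If the patch follows this pattern, the lock/unlock pair in 
amdgpu_cs_report_moved_bytes collapses into a single atomic64_add, which 
matches the "zero" above.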

Please think about it. Are you still sure that the new code makes the 
code size bigger? Now, what is the next problem with the new change?

I agree that the new code is more difficult to understand. But once you 
get used to it, it is not so difficult indeed: just one loop to retry. 
Humans are much smarter than this logic.
Compared with how hardware engineers go to extremes to optimize logic, 
designing with optimization in mind from day one, my tiny new logic is 
really nothing.
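
Here is a minimal sketch of that retry loop, the usual atomic 
compare-and-swap pattern (again with illustrative names, not the exact 
code in the patch):

/* Read-modify-write without a lock: recompute and retry whenever
 * another CPU updated the counter between our read and our swap.
 * atomic64_cmpxchg() returns the old value; if it equals what we
 * read, our new value was stored and the loop exits. */
static void account_us(struct mm_stats_atomic *stats, s64 us, s64 max)
{
        s64 old, new;

        do {
                old = atomic64_read(&stats->accum_us);
                new = old + us;
                if (new > max)          /* an example update rule: clamp */
                        new = max;
        } while (atomic64_cmpxchg(&stats->accum_us, old, new) != old);
}

This is the kind of extra loop and if counted above; under contention it 
simply retries instead of spinning on a lock.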

I said that removing a lock can improve performance by 0.3% or even 
more for some drivers. I did not say it was AMDGPU. My tiny improvement 
may not be so visible in this big driver for the time being.

I will give you the privilege of making the final decision; for 
example, you can even defer it to the future if you don't want to make 
a decision now. Please be happy.

Thanks,
Alex Bin Xie



On 2017-06-23 07:37 AM, Marek Olšák wrote:
> I agree with you about the spinlock. You seem to be good at this.
>
> It's always good to do measurements to validate that a code change
> improves something, especially when the code size and code complexity
> have to be increased. A CPU profiler such as sysprof can show you
> improvements on the order of 1/10000th = 0.01% if you record enough
> samples. Sometimes you have to un-inline a function to make it visible
> there. If you see a function that takes 0.3% of CPU time and you
> optimize it down to 0.1% using the profiler as the measurement tool,
> you have evidence that the improvement is there and nobody can reject
> the idea anymore. It also proves that the code size increase is worth
> it. It's always "added code size and loss of simplicity" vs benefit.
> It's a transaction. You trade one for the other. You lose something to
> get something else. OK, we know the code complexity. Now, what's the
> benefit? Can you do some measurements? The accuracy of 1/10000th
> should be enough for anybody.
>
> I know the feeling when you spend many days working on something,
> adding 100s or 1000s of lines of code, solving many problems to get
> there and increasing code complexity significantly, and then you do
> the measurement and it doesn't improve anything. I know the feeling
> very well. It sucks. The frustration comes from the investment of time
> and getting no return on the investment. Many frustrations in life are
> like that.
>
> Marek
>
>
> On Fri, Jun 23, 2017 at 4:23 AM, axie <axie at amd.com> wrote:
>> Hi Marek,
>>
>>
>> So do you agree that spinlock disables CPU preemption, contrary to your
>> original idea?
>>
>>
>> If you have a new reason why this patch does not improve things, please
>> speak out.
>>
>>
>> Many patches in GPU drivers aim at improving performance and power
>> efficiency. Do most patches submitted to AMDGPU require benchmarking
>> first?
>>
>> If all developers are required to always answer your questions during code
>> review, I am afraid that most open source community developers could not
>> meet that requirement and would stop working on AMDGPU.
>>
>>
>> To improve performance, there are many bottlenecks to clear. When the last
>> several bottlenecks are cleared, the performance gains will show much more
>> significantly.
>>
>> My past profiling experience tells me that removing a lock can improve
>> performance for some drivers by anywhere from 0.3% to a much bigger
>> percentage. It depends on many factors, even on the application itself.
>>
>>
>> This is not the first bottleneck to be fixed. It is surely not the last one.
>>
>>
>> Thanks,
>>
>> Alex Bin
>>
>>
>>
>> On 2017-06-22 07:54 PM, Marek Olšák wrote:
>>> That's all nice, but does it improve performance? Have you been able
>>> to measure some performance difference with that code? Were you
>>> targeting a specific inefficiency you had seen e.g. with a CPU
>>> profiler?
>>>
>>> Marek
>>>
>>> On Thu, Jun 22, 2017 at 8:19 PM, axie <axie at amd.com> wrote:
>>>> To clarify, local IRQs are disabled by calling raw_local_irq_save(flags).
>>>>
>>>> The function __lock_acquire double-checks that local IRQs are really
>>>> disabled.
>>>>
>>>>
>>>>
>>>> On 2017-06-22 01:34 PM, axie wrote:
>>>>> Hi Marek,
>>>>>
>>>>> Spin lock and spin unlock are fast, but not so fast compared with an
>>>>> atomic operation, which is a single CPU instruction on x86.
>>>>>
>>>>>
>>>>> 1. spinlock does NOT allow preemption on the local CPU. Let us have a
>>>>> look at how spin_lock is implemented.
>>>>>
>>>>> static inline void __raw_spin_lock(raw_spinlock_t *lock)
>>>>> {
>>>>>       preempt_disable(); <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<-- This is a
>>>>> memory barrier operation too.
>>>>>       spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
>>>>>       LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
>>>>> }
>>>>>
>>>>> 2. The function __lock_acquire is called by spinlock. The function is
>>>>> so long that I will not attach all of it here.
>>>>>
>>>>> There are atomic operations inside, plus 12 metadata updates and 14 if
>>>>> statements, and it calls quite a few other functions.
>>>>>
>>>>> Note that it requires IRQs to be disabled...
>>>>>
>>>>> static int __lock_acquire(struct lockdep_map *lock, unsigned int
>>>>> subclass,
>>>>>                 int trylock, int read, int check, int hardirqs_off,
>>>>>                 struct lockdep_map *nest_lock, unsigned long ip,
>>>>>                 int references, int pin_count)
>>>>> {
>>>>>       struct task_struct *curr = current;
>>>>>       struct lock_class *class = NULL;
>>>>>       struct held_lock *hlock;
>>>>>       unsigned int depth;
>>>>>       int chain_head = 0;
>>>>>       int class_idx;
>>>>>       u64 chain_key;
>>>>>
>>>>>       if (unlikely(!debug_locks))
>>>>>           return 0;
>>>>>
>>>>>       /*
>>>>>        * Lockdep should run with IRQs disabled, otherwise we could
>>>>>        * get an interrupt which would want to take locks, which would
>>>>>        * end up in lockdep and have you got a head-ache already?
>>>>>        */
>>>>>       if (DEBUG_LOCKS_WARN_ON(!irqs_disabled())) <<<<<<<<<<<<<<< checks
>>>>> that IRQs are disabled
>>>>>           return 0;
>>>>>
>>>>> ....
>>>>>
>>>>> 3. Another function called by spinlock at a higher level:
>>>>>
>>>>> void lock_acquire(struct lockdep_map *lock, unsigned int subclass,
>>>>>                 int trylock, int read, int check,
>>>>>                 struct lockdep_map *nest_lock, unsigned long ip)
>>>>> {
>>>>>       unsigned long flags;
>>>>>
>>>>>       if (unlikely(current->lockdep_recursion))
>>>>>           return;
>>>>>
>>>>>       raw_local_irq_save(flags);
>>>>>       check_flags(flags);
>>>>>
>>>>>       current->lockdep_recursion = 1;
>>>>>       trace_lock_acquire(lock, subclass, trylock, read, check, nest_lock,
>>>>> ip);
>>>>>       __lock_acquire(lock, subclass, trylock, read, check,
>>>>>                  irqs_disabled_flags(flags), nest_lock, ip, 0, 0);
>>>>>       current->lockdep_recursion = 0;
>>>>>       raw_local_irq_restore(flags);
>>>>> }
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Alex Bin
>>>>>
>>>>>
>>>>> On 2017-06-22 12:27 PM, Marek Olšák wrote:
>>>>>> On Thu, Jun 22, 2017 at 5:33 PM, Xie, AlexBin <AlexBin.Xie at amd.com>
>>>>>> wrote:
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>>
>>>>>>> In fact, the change from spinlock to atomic is quite painful. When I
>>>>>>> started, I thought it was easy, but later I found there might be race
>>>>>>> conditions here and there. Now I think the change looks more robust. In
>>>>>>> the kernel source, several other drivers use the same trick.
>>>>>>>
>>>>>>>
>>>>>>> On the other hand, I think the logic itself might be optimized
>>>>>>> considering the locking. But I have spent quite some effort to maintain
>>>>>>> the original logic.
>>>>>> It seems quite complicated and I don't know if there is any
>>>>>> performance benefit. Spinlocks are nice because they allow preemption.
>>>>>>
>>>>>> It would be more interesting to merge the CS and BO_LIST ioctls into
>>>>>> one.
>>>>>>
>>>>>> Marek
>>>>>


