Re: Measuring lock contention

On Thu, Apr 13, 2017 at 4:54 PM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> On 04/13/2017 03:38 PM, Milosz Tanski wrote:
>>
>> On Thu, Apr 13, 2017 at 2:02 PM, Mohamad Gebai <mgebai@xxxxxxxx> wrote:
>>>
>>>
>>> On 04/13/2017 01:20 PM, Mark Nelson wrote:
>>>>
>>>>
>>>> Nice!  I will give it a try and see how it goes.  Specifically, I want to
>>>> compare it to what I ended up working on yesterday.  After the meeting I
>>>> did major surgery on an existing gdb-based wallclock profiler and
>>>> modified it to a) work, b) be thread-aware, and c) print inverse
>>>> call-graphs.  The code is still pretty rough, but you can see it here:
>>>>
>>>> https://github.com/markhpc/gdbprof
>>>
>>>
>>> Very neat, thank you for this.  Please let me know how it goes; I'm
>>> interested to see which tool ends up working best for this use case. Also,
>>> could you share some information about what you're trying to find out?
>>>
>>> If I'm not mistaken, there's work being done to add the call stack of a
>>> process within the context of LTTng events. That way we could have this
>>> information when a process blocks in sys_futex.
>>>
>>> PS: found it, it's still in RFC -
>>> https://lists.lttng.org/pipermail/lttng-dev/2017-March/026985.html
>>>
>>>
>>
>> You can also use perf's syscall tracepoints to capture the
>> syscalls:sys_enter_futex event.  This way you only see contended mutexes,
>> since uncontended locks never enter the kernel.  The nice thing about it
>> is that all the normal perf tools apply, so you can see source annotation,
>> frequency (hot spots), and also examine the callgraph of those hot spots.
>>
>
> Hi Milosz,
>
> Do you know of any examples showing this technique?  I've suspected there was a way to do this (and similar things) with perf, but I always ran into roadblocks that made it not work.  One of those roadblocks may well have been errors on my part. ;)
>
> Mark

Mark, here's a quick crash-course example.
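
bench.cpp itself isn't attached.  For reference, a hypothetical minimal
sketch of that kind of benchmark (all names and constants below are
illustrative, reconstructed only from the output and symbols further
down) might look like:

    // Hypothetical sketch of a mutex micro-benchmark like bench.cpp.
    #include <chrono>
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::mutex m;
    long counter;

    template <int NThreads>
    void bench_lock_guard() {
        counter = 0;
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> threads;
        for (int t = 0; t < NThreads; ++t)
            threads.emplace_back([] {
                for (int i = 0; i < 100000; ++i) {
                    std::lock_guard<std::mutex> g(m);  // contended when NThreads > 1
                    ++counter;
                }
            });
        for (auto &th : threads)
            th.join();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::cout << "lock_guard with " << NThreads << " threads throughput = "
                  << counter / (ms ? ms : 1) << std::endl;
    }

    int main() {
        bench_lock_guard<1>();
        bench_lock_guard<2>();
        bench_lock_guard<4>();
        // the real bench.cpp also has a bare-lock and an atomic variant
        return 0;
    }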

    # Build a mutex benchmark I stole from github
    mtanski@crunchy:~/tmp/bench$ g++ --std=c++11 -Og -g -pthread bench.cpp -o bench

    # I can only make the syscall probes work as root :(
    mtanski@crunchy:~/tmp/bench$ sudo -s

    root@crunchy:~/tmp/bench# /usr/bin/perf record -F 4096 -g --call-graph dwarf -e syscalls:sys_exit_futex ./bench
    lock with 1 threads throughput = 48333
    lock with 2 threads throughput = 19551
    lock with 4 threads throughput = 14307
    lock_guard with 1 threads throughput = 57499
    lock_guard with 2 threads throughput = 16919
    lock_guard with 4 threads throughput = 14460
    atomic with 1 threads throughput = 175000
    atomic with 2 threads throughput = 63888
    atomic with 4 threads throughput = 65833
    [ perf record: Woken up 296 times to write data ]
    [ perf record: Captured and wrote 76.438 MB perf.data (9448 samples) ]

    root@crunchy:~/tmp/bench# perf report --stdio
    ...
        59.77%    59.77%  0x0
                |
                 --59.75%--__clone
                           start_thread
                           0xb8c80
                           |
                           |--22.46%--_ZNSt6thread5_ImplISt12_Bind_simpleIFZ16bench_lock_guardILi4EEvvEUlvE_vEEE6_M_runEv
                           |          |
                           |          |--19.28%--pthread_mutex_unlock
                           |          |          __lll_unlock_wake
                           |          |
                           |           --3.17%--pthread_mutex_lock
                           |                     __lll_lock_wait
                           |
                           |--21.77%--_ZNSt6thread5_ImplISt12_Bind_simpleIFZ10bench_lockILi4EEvvEUlvE_vEEE6_M_runEv
                           |          |
                           |          |--19.29%--pthread_mutex_unlock
                           |          |          __lll_unlock_wake
                           |          |
                           |           --2.48%--pthread_mutex_lock
                           |                     __lll_lock_wait
    ...

Notes:
* Ubuntu's built-in perf is compiled without symbol demangling, and I
haven't rebuilt my own perf, so please excuse the mangled names above.
* If you're running on a newer system, you can use SDT events instead;
glibc/pthreads exposes its own probe, 'sdt_libc:lll_lock_wait', that you
can use (see the first sketch after this list).
* You're going to see pthread_mutex_(un)lock here because the futex
syscall is also made to wake up waiters, not just to wait on a lock.
That's why 'sdt_libc:lll_lock_wait' is a better event.
* Unless your glibc is built with frame pointers, you have to use dwarf
call graphs; otherwise you'll lose the call graph at the __lll_* symbols.
* If you use 'syscalls:sys_enter_futex', you also get the uaddr of the
futex, and you can work out which mutex it corresponds to (see the
second sketch below).
* On an application with a lot of mutex contention, play with the event
capture limits; I've generated 32GB perf.data files that perf report
chokes on.
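
For the SDT route, a rough sketch (assumptions: the libc path below is
distro-specific, your glibc was built with the systemtap SDT markers,
and older perf versions need the probe created explicitly first):

    # teach perf where libc's SDT markers live (path varies by distro)
    perf buildid-cache --add /lib/x86_64-linux-gnu/libc.so.6
    perf list 'sdt_libc:*'                # should show lll_lock_wait
    # perf probe sdt_libc:lll_lock_wait   # only needed on older perf
    perf record -e sdt_libc:lll_lock_wait --call-graph dwarf ./bench
    perf report --stdio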
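
And to get at the futex uaddr, something along these lines (sketch):

    # perf script prints the tracepoint args per sample, including
    # the futex's uaddr
    perf record -e syscalls:sys_enter_futex --call-graph dwarf ./bench
    perf script | head
    # then match the printed uaddr against the address of a suspect
    # mutex, e.g. by printing &my_mutex in gdb (my_mutex hypothetical)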


Similarly, you can use other perf tools, like live perf top with various lock events, e.g.:
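
    # a live, system-wide view of futex callers (sketch)
    perf top -e syscalls:sys_enter_futex --call-graph dwarf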

Best,
- Milosz

-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx