Since I cannot reproduce your issue, can you generate a perf CPU flame
graph of this run to figure out where the user time is being spent?
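Something along these lines should work (a sketch, assuming perf is
installed and you have a checkout of the FlameGraph scripts from
https://github.com/brendangregg/FlameGraph; pool and id are taken from
your mail below). First record stacks while the slow command runs:

    perf record -F 99 --call-graph dwarf -- \
        rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
        --rbd_concurrent_management_ops=1 --id xen_test

then fold the recorded stacks and render them as an SVG:

    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > rbd-ls.svg

"--call-graph dwarf" should give usable user-space stacks even if your
ceph packages were built without frame pointers; installing the ceph
debug symbols will make the function names readable.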
On Wed, Apr 25, 2018 at 11:25 AM, Marc Schöchlin <ms@xxxxxxxxxx> wrote:
> Hello Jason,
>
> According to the measurements below, latency between the client and the
> OSDs should not be the problem (and given the high amount of user time
> in the measurement above, network communication should not be the
> problem either).
>
> Finding the involved OSD:
>
> # ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c rbd_directory
> osdmap e7570 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c' (14)
>   object 'rbd_directory' -> pg 14.30a98c1c (14.1c)
>   -> up ([36,0,38], p36) acting ([36,0,38], p36)
>
> # ceph osd find osd.36
> {
>     "osd": 36,
>     "ip": "10.23.27.149:6826/7195",
>     "crush_location": {
>         "host": "ceph-ssd-s39",
>         "root": "default"
>     }
> }
>
> ssh ceph-ssd-s39
>
> # nuttcp -w1m ceph-mon-s43
> 11186.3391 MB / 10.00 sec = 9381.8890 Mbps 12 %TX 32 %RX 0 retrans 0.15 msRTT
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>     --rbd_concurrent_management_ops=1 --id xen_test
> NAME                                           SIZE   PARENT  FMT PROT LOCK
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81      20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE  2
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE  2  yes
> ...
> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3      20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE  2
> __srlock__                                         0  2
>
> real 0m23.667s
> user 0m15.949s
> sys  0m1.276s
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>     --rbd_concurrent_management_ops=1 --id xen_test
> ... (same listing as above) ...
>
> real 0m13.937s
> user 0m14.404s
> sys  0m1.089s
>
> Regards,
> Marc
>
> On 25.04.2018 at 16:38, Jason Dillaman wrote:
>> I'd check the latency between your client and your cluster. On my
>> development machine, with only a single OSD running and 200 clones,
>> each with one snapshot, "rbd ls -l" only takes a couple of seconds:
>>
>> $ time rbd ls -l --rbd_concurrent_management_ops=1 | wc -l
>> 403
>>
>> real 0m1.746s
>> user 0m1.136s
>> sys  0m0.169s
>>
>> Also, I have to ask: how often are you expecting to scrape the images
>> from the pool? The long directory listing involves opening each image
>> in the pool (which requires numerous round-trips to the OSDs) plus
>> iterating through each of its snapshots (which also involves
>> round-trips); see the shell sketch at the end of this message.
>>
>> On Wed, Apr 25, 2018 at 10:13 AM, Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>>> Hello Piotr,
>>>
>>> I updated the issue
>>> (https://tracker.ceph.com/issues/23853?next_issue_id=23852&prev_issue_id=23854):
>>>
>>> # time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>>>     --rbd_concurrent_management_ops=1
>>> NAME                                      SIZE   PARENT  FMT PROT LOCK
>>> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3 20480M
>>>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE  2
>>> __srlock__                                    0  2
>>> ...
>>>
>>> real 0m18.562s
>>> user 0m12.513s
>>> sys  0m0.793s
>>>
>>> I also attached a JSON dump of my pool structure.
>>>
>>> Regards,
>>> Marc
>>>
>>> On 25.04.2018 at 14:46, Piotr Dałek wrote:
>>>> On 18-04-25 02:29 PM, Marc Schöchlin wrote:
>>>>> Hello list,
>>>>>
>>>>> We are trying to integrate a storage repository into XenServer
>>>>> (I have also described the problem in an issue in the Ceph bug
>>>>> tracker: https://tracker.ceph.com/issues/23853).
>>>>>
>>>>> Summary:
>>>>>
>>>>> The slowness is a real pain for us because it prevents the Xen
>>>>> storage repository from working efficiently. Gathering this
>>>>> information for Xen pools with hundreds of virtual machines
>>>>> (using "--format json") would be a real pain. The high user-time
>>>>> consumption and the huge number of threads suggest that there is
>>>>> something really inefficient in the "rbd" utility.
>>>>>
>>>>> So what can I do to make "rbd ls -l" faster, or to get comparable
>>>>> information about the snapshot hierarchy?
>>>> Can you run this command with the extra argument
>>>> "--rbd_concurrent_management_ops=1" and share the timing of that?

-- 
Jason
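To illustrate the round-trips mentioned above: "rbd ls -l" does roughly
the equivalent of the following shell loop (a sketch only, reusing the
pool and client id from this thread), which is why its cost grows with
the number of images and snapshots rather than with raw network latency:

    pool=RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
    for img in $(rbd ls -p "$pool" --id xen_test); do
        rbd info -p "$pool" --id xen_test "$img"      # open the image header: several OSD round-trips
        rbd snap ls -p "$pool" --id xen_test "$img"   # list its snapshots: more round-trips per image
    done

With --rbd_concurrent_management_ops=1 those per-image steps run
serially, so even sub-millisecond RTTs add up across hundreds of clones.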