Since I cannot reproduce your issue, can you generate a perf CPU flame
graph of this run to figure out where the user time is being spent?
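Something along these lines should work (a sketch, assuming perf is
installed and you have a checkout of the FlameGraph scripts from
https://github.com/brendangregg/FlameGraph; pool and id are taken from
your mail below). First record stacks while the slow command runs:

    perf record -F 99 --call-graph dwarf -- \
        rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
        --rbd_concurrent_management_ops=1 --id xen_test

then fold the recorded stacks and render them as an SVG:

    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > rbd-ls.svg

"--call-graph dwarf" should give usable user-space stacks even if your
ceph packages were built without frame pointers; installing the ceph
debug symbols will make the function names readable.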
On Wed, Apr 25, 2018 at 11:25 AM, Marc Schöchlin <ms@xxxxxxxxxx> wrote:
> Hello Jason,
>
> According to the measurements below, latency between the client and the
> OSDs should not be the problem (and given the high amount of user time
> in the measurement above, network communication should not be the
> problem either).
>
> Finding the involved OSD:
>
> # ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c rbd_directory
> osdmap e7570 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c' (14)
>   object 'rbd_directory' -> pg 14.30a98c1c (14.1c)
>   -> up ([36,0,38], p36) acting ([36,0,38], p36)
>
> # ceph osd find osd.36
> {
>     "osd": 36,
>     "ip": "10.23.27.149:6826/7195",
>     "crush_location": {
>         "host": "ceph-ssd-s39",
>         "root": "default"
>     }
> }
>
> ssh ceph-ssd-s39
>
> # nuttcp -w1m ceph-mon-s43
> 11186.3391 MB / 10.00 sec = 9381.8890 Mbps 12 %TX 32 %RX 0 retrans 0.15 msRTT
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>     --rbd_concurrent_management_ops=1 --id xen_test
> NAME                                           SIZE   PARENT  FMT PROT LOCK
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81      20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE  2
> RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE  2  yes
> ...
> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3      20480M
>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE  2
> __srlock__                                         0  2
>
> real 0m23.667s
> user 0m15.949s
> sys  0m1.276s
>
> # time rbd ls -l -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>     --rbd_concurrent_management_ops=1 --id xen_test
> ... (same listing as above) ...
>
> real 0m13.937s
> user 0m14.404s
> sys  0m1.089s
>
> Regards,
> Marc
>
> On 25.04.2018 at 16:38, Jason Dillaman wrote:
>> I'd check the latency between your client and your cluster. On my
>> development machine, with only a single OSD running and 200 clones,
>> each with one snapshot, "rbd ls -l" only takes a couple of seconds:
>>
>> $ time rbd ls -l --rbd_concurrent_management_ops=1 | wc -l
>> 403
>>
>> real 0m1.746s
>> user 0m1.136s
>> sys  0m0.169s
>>
>> Also, I have to ask: how often are you expecting to scrape the images
>> from the pool? The long directory listing involves opening each image
>> in the pool (which requires numerous round-trips to the OSDs) plus
>> iterating through each of its snapshots (which also involves
>> round-trips); see the shell sketch at the end of this message.
>>
>> On Wed, Apr 25, 2018 at 10:13 AM, Marc Schöchlin <ms@xxxxxxxxxx> wrote:
>>> Hello Piotr,
>>>
>>> I updated the issue
>>> (https://tracker.ceph.com/issues/23853?next_issue_id=23852&prev_issue_id=23854):
>>>
>>> # time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c \
>>>     --rbd_concurrent_management_ops=1
>>> NAME                                      SIZE   PARENT  FMT PROT LOCK
>>> RBD-feb32ab0-a5ee-44e6-9089-486e91ee8af3 20480M
>>>   RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-bbbc2ce0-4ad3-44ae-a52f-e57df0441e27@BASE  2
>>> __srlock__                                    0  2
>>> ...
>>>
>>> real 0m18.562s
>>> user 0m12.513s
>>> sys  0m0.793s
>>>
>>> I also attached a JSON dump of my pool structure.
>>>
>>> Regards,
>>> Marc
>>>
>>> On 25.04.2018 at 14:46, Piotr Dałek wrote:
>>>> On 18-04-25 02:29 PM, Marc Schöchlin wrote:
>>>>> Hello list,
>>>>>
>>>>> We are trying to integrate a storage repository into XenServer
>>>>> (I have also described the problem in an issue in the Ceph bug
>>>>> tracker: https://tracker.ceph.com/issues/23853).
>>>>>
>>>>> Summary:
>>>>>
>>>>> The slowness is a real pain for us because it prevents the Xen
>>>>> storage repository from working efficiently. Gathering this
>>>>> information for Xen pools with hundreds of virtual machines
>>>>> (using "--format json") would be a real pain. The high user-time
>>>>> consumption and the huge number of threads suggest that there is
>>>>> something really inefficient in the "rbd" utility.
>>>>>
>>>>> So what can I do to make "rbd ls -l" faster, or to get comparable
>>>>> information about the snapshot hierarchy?
>>>> Can you run this command with the extra argument
>>>> "--rbd_concurrent_management_ops=1" and share the timing of that?

-- 
Jason
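To illustrate the round-trips mentioned above: "rbd ls -l" does roughly
the equivalent of the following shell loop (a sketch only, reusing the
pool and client id from this thread), which is why its cost grows with
the number of images and snapshots rather than with raw network latency:

    pool=RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
    for img in $(rbd ls -p "$pool" --id xen_test); do
        rbd info -p "$pool" --id xen_test "$img"      # open the image header: several OSD round-trips
        rbd snap ls -p "$pool" --id xen_test "$img"   # list its snapshots: more round-trips per image
    done

With --rbd_concurrent_management_ops=1 those per-image steps run
serially, so even sub-millisecond RTTs add up across hundreds of clones.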