Integrating XEN Server : Long query time for "rbd ls -l" queries

Hello list,

we are trying to integrate a Ceph-backed storage repository into XenServer.
(I have also described the problem as an issue in the Ceph bug tracker:
https://tracker.ceph.com/issues/23853)

Summary:

The slowness is a real pain for us, because it prevents the Xen storage
repository from working efficiently.
Gathering this information for Xen pools with hundreds of virtual machines
(using "--format json") would take far too long.
The high user-space CPU time and the huge number of threads suggest that
something in the "rbd" utility is very inefficient.

So what can I do to make "rbd ls -l" faster, or to obtain comparable
information about the snapshot hierarchy?
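
For reference, the per-image information I am after is roughly what the
following sketch gathers via the rbd/rados python bindings (a minimal,
untested sketch; the pool name is the one used below, the conffile path is
a placeholder, error handling is omitted, and I do not know whether this
avoids the overhead of "rbd ls -l", since it also has to open every image):

#!/usr/bin/env python
# Sketch: collect name, size, parent and snapshots per image with the
# python bindings instead of parsing "rbd ls -l" output.
import rados
import rbd

POOL = 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c'

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx(POOL)
    try:
        for name in rbd.RBD().list(ioctx):
            with rbd.Image(ioctx, name, read_only=True) as image:
                try:
                    parent = image.parent_info()  # (pool, image, snap)
                except rbd.ImageNotFound:
                    parent = None                 # image has no parent
                snaps = [s['name'] for s in image.list_snaps()]
                print(name, image.size(), parent, snaps)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()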

Details:

We have a strange problem when listing the images of an SSD pool.

# time rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
NAME                                            SIZE PARENT                                                                                              FMT PROT LOCK
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81      20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE    2
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81@BASE 20480M RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c/RBD-8b2cfe76-44b7-4393-b376-f675366831c3@BASE    2 yes
..
__srlock__                                         0                                                                                                      2

real    0m8.726s
user    0m8.506s
sys    0m0.591s

===> This is incredibly slow for outputting 105 lines.

Without "-l" ist pretty fast (unfortunately i need this information):

# time rbd ls  --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c
RBD-0192938e-cb4b-4ee1-9988-b8145704ac81
...
__srlock__

real    0m0.024s
user    0m0.015s
sys    0m0.008s

===> This is incredibly fast for outputting 71 lines.

The "@BASE" snapshots are created by the following procedure:

rbd snap create
rbd snap protect
rbd clone
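
(For reference, the same sequence expressed with the rbd python bindings;
a sketch only, with placeholder image/clone names and an already opened
rados io context for the pool:)

import rbd

def clone_from_base(ioctx, base_image, clone_name, snap_name='BASE'):
    # Sketch of the snap create / snap protect / clone sequence.
    with rbd.Image(ioctx, base_image) as img:
        if snap_name not in [s['name'] for s in img.list_snaps()]:
            img.create_snap(snap_name)       # rbd snap create
        if not img.is_protected_snap(snap_name):
            img.protect_snap(snap_name)      # rbd snap protect
    rbd.RBD().clone(ioctx, base_image, snap_name, ioctx, clone_name)  # rbd clone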

It seems that image lookups in an RBD pool are performed via an object named
"rbd_directory", which resides in the pool...

Querying this object, which holds 142 entries, takes only 0.024 seconds:

# time rados -p RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c listomapvals rbd_directory

id_12da262ae8944a
value (14 bytes) :
00000000  0a 00 00 00 5f 5f 73 72  6c 6f 63 6b 5f 5f        |....__srlock__|
0000000e

id_12f2943d1b58ba
value (44 bytes) :
00000000  28 00 00 00 52 42 44 2d  31 62 34 32 31 38 39 31  |(...RBD-1b421891|
00000010  2d 34 34 31 63 2d 34 35  33 30 2d 62 64 66 33 2d  |-441c-4530-bdf3-|
00000020  61 64 32 62 31 31 34 61  36 33 66 63              |ad2b114a63fc|
0000002c
...

real    0m0.024s
user    0m0.023s
sys    0m0.000s
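
The values look like a 32-bit little-endian length prefix followed by the
image name. A small decoding sketch (the example bytes are the "__srlock__"
value from the dump above):

import struct

def decode_rbd_directory_value(blob):
    # rbd_directory omap value: u32 little-endian length + image name
    (length,) = struct.unpack_from('<I', blob, 0)
    return blob[4:4 + length].decode('ascii')

blob = b'\x0a\x00\x00\x00__srlock__'      # 14 bytes, as shown above
print(decode_rbd_directory_value(blob))   # -> __srlock__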

I also analyzed the state of the OSD holding this object:

# ceph osd map RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c rbd_directory
osdmap e7400 pool 'RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c' (14) object 'rbd_directory' -> pg 14.30a98c1c (14.1c) -> up ([36,0,38], p36) acting ([36,0,38], p36)

Repeating the queries resulted in about 3.6% CPU usage, and the logs do not
provide any useful information.

Analyzing the command with strace suggests to me that something is wrong
with the implementation of the "rbd" command.

# strace -f -c rbd ls -l --pool RBD_XenStorage-07449252-bf96-4daa-b0a6-687b7f1c369c > /dev/null
strace: Process 50286 attached
strace: Process 50287 attached
strace: Process 50288 attached
strace: Process 50289 attached
strace: Process 50290 attached
strace: Process 50291 attached
strace: Process 50292 attached
....
!!! 2086 threads !!!
....

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 98.42  219.207328        2020    108534     20241 futex
  1.47    3.265517         162     20099           epoll_wait
  0.06    0.131473           3     46377     22894 read
  0.02    0.053740           4     14627           sendmsg
  0.01    0.017225           2     10020           write
  0.00    0.008975           4      2001           munmap
  0.00    0.007260           3      2170           epoll_ctl
  0.00    0.007171           3      2088           madvise
  0.00    0.003139           1      4166           rt_sigprocmask
  0.00    0.002670           1      2086           prctl
  0.00    0.002494           1      3381           mprotect
  0.00    0.002315           1      2087           mmap
  0.00    0.002120           1      2087           set_robust_list
  0.00    0.002098           1      2084           gettid
  0.00    0.001833           1      2086           clone
  0.00    0.001152           8       136        87 connect
  0.00    0.000739           7       102           close
  0.00    0.000623          13        49           shutdown
  0.00    0.000622           6       110           fcntl
  0.00    0.000469          10        49           socket
  0.00    0.000466           6        73        29 open
  0.00    0.000456           9        49           setsockopt
  0.00    0.000166           6        27        27 access
  0.00    0.000164           4        37           fstat
  0.00    0.000070           4        17           brk
  0.00    0.000047           5        10         5 stat
  0.00    0.000013           4         3           pipe
  0.00    0.000012           4         3           epoll_create
  0.00    0.000010           1        11           rt_sigaction
  0.00    0.000007           7         1           lseek
  0.00    0.000007           7         1           execve
  0.00    0.000006           2         3           uname
  0.00    0.000005           3         2           sysinfo
  0.00    0.000005           5         1           arch_prctl
  0.00    0.000005           5         1           membarrier
  0.00    0.000003           3         1         1 ioctl
  0.00    0.000003           3         1           getrlimit
  0.00    0.000003           3         1           set_tid_address
  0.00    0.000000           0         1           getuid
------ ----------- ----------- --------- --------- ----------------
100.00  222.720411                224582     43284 total
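
(For a rough cross-check of the thread behaviour, here is a sketch that
samples the number of live threads of a running "rbd ls -l" process via
/proc. Note that this counts threads alive at the moment of sampling, not
the cumulative clone() calls that strace reports. The PID is passed as a
command-line argument.)

import os
import sys
import time

def thread_count(pid):
    # Count live threads of a process by listing /proc/<pid>/task (Linux).
    return len(os.listdir('/proc/%d/task' % pid))

if __name__ == '__main__':
    pid = int(sys.argv[1])
    try:
        while True:
            print(thread_count(pid))
            time.sleep(0.5)
    except OSError:                  # process has exited
        pass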

Our System:

- Luminous/12.2.4
- Ubuntu 16.04
- 5 OSD nodes (24 * 8 TB HDD OSDs, 48 * 1 TB SSD OSDs, BlueStore, 6 GB cache
size per OSD, 192 GB RAM, 56 HT CPUs)
- 3 MONs (64 GB RAM, 200 GB SSD, 4 visible CPUs)
- 2 * 10 GBit SFP+, bonded (xmit_hash_policy layer3+4)
- the cluster is currently in a pre-production state; there is no
significant load

Regards

Marc
