On 06/27/2017 07:08 PM, Jason Dillaman wrote:
> Have you tried blktrace to determine if there are differences in the
> IO patterns to the rbd-backed virtio-scsi block device (direct vs
> indirect through loop)?

I tried today with the kernel's tracing features, and I'll give blktrace
a go if necessary. But I did find some important differences between the
two read modes already.

The main one is that in loop mode there are far fewer scsi_dispatch
calls (roughly 1/100th as many), and they have txlen=8192 (txlen is in
512-byte blocks, so that is a 4MB read, matching the rbd object size),
while in the direct case txlen is usually 8, i.e. 4KB.

direct mode:

 cp-1167  [000] ....  4790.125637: scsi_dispatch_cmd_start: host_no=2
 channel=0 id=0 lun=2 data_sgl=1 prot_sgl=0 prot_op=SCSI_PROT_NORMAL
 cmnd=(READ_10 lba=17540264 txlen=8 protect=0
 raw=28 00 01 0b a4 a8 00 00 08 00)

vs loop mode:

 loop0-1021  [000] ....  4645.976267: scsi_dispatch_cmd_start: host_no=2
 channel=0 id=0 lun=2 data_sgl=67 prot_sgl=0 prot_op=SCSI_PROT_NORMAL
 cmnd=(READ_10 lba=4705776 txlen=8192 protect=0
 raw=28 00 00 47 cd f0 00 20 00 00)

I also see a number of calls like

 loop0-1021  [000] ....  3319.709354: block_bio_backmerge: 8,16 R 10499064 + 2048 [loop0]
 loop0-1021  [000] ....  3319.709508: block_bio_backmerge: 8,16 R 10501112 + 2048 [loop0]
 loop0-1021  [000] ....  3319.709639: block_bio_backmerge: 8,16 R 10503160 + 2048 [loop0]

but only in the loop case.

The key to the performance penalty seems to be that in the direct case
the kernel reads in 4KB chunks most of the time instead of 4MB, and
setting readahead or min_io_size has failed to fix that. Any idea how to
get the direct path to issue large reads like the loop path does?
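For anyone following along, these are the kind of guest-side queue knobs
I have been poking at; a minimal sketch only, assuming the rbd-backed
virtio-scsi disk shows up as /dev/sdb inside the vm (as in the lsblk
output quoted below) -- adjust the device name for your setup:

# current settings
cat /sys/block/sdb/queue/read_ahead_kb       # readahead window, in KB
cat /sys/block/sdb/queue/max_sectors_kb      # largest request the queue will issue, in KB
cat /sys/block/sdb/queue/max_hw_sectors_kb   # driver/hardware ceiling for max_sectors_kb

# allow the block layer to build 4MB requests; max_sectors_kb cannot be
# raised above max_hw_sectors_kb, so the driver limit may still cap it
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
echo 4096 > /sys/block/sdb/queue/max_sectors_kb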
> On Tue, Jun 27, 2017 at 3:17 PM, Ruben Rodriguez <ruben@xxxxxxx> wrote:
>>
>> We are setting up a new set of servers to run the FSF/GNU
>> infrastructure, and we are seeing strange behavior. From a Qemu host,
>> reading small files from a mounted rbd image is very slow. The
>> "real-world" test that I use is to copy the linux source code from the
>> filesystem to /dev/shm. On the host server that takes ~10 seconds to
>> copy from a mapped rbd image, but on the vm it takes over a minute. The
>> same test also takes <20 seconds when the vm storage is local LVM.
>> Writing the files to the rbd-mounted disk also takes ~10 seconds.
>>
>> I suspect a problem with readahead and caching, so as a test I copied
>> those same files into a loop device inside the vm (stored in the same
>> rbd); reading them back takes ~10 seconds. I drop the caches before
>> each test.
>>
>> This is how I run that test:
>>
>> dd if=/dev/zero of=test bs=1G count=5
>> mkfs.xfs test
>> mount test /mnt
>> cp linux-src /mnt -a
>> echo 1 > /proc/sys/vm/drop_caches
>> time cp /mnt/linux-src /dev/shm -a
>>
>> I've tested many different parameters (readahead, partition alignment,
>> filesystem formatting, block queue settings, etc.) with little change
>> in performance. Wrapping the files in a loop device seems to change
>> things in a way that I cannot replicate on the upper layers otherwise.
>>
>> Is this expected or am I doing something wrong?
>>
>> Here are the specs:
>> Ceph 10.2.7 on an Ubuntu xenial derivative. Kernel 4.4, Qemu 2.5.
>> 2 Ceph servers running 6x 1TB SSD OSDs each.
>> 2 Qemu/kvm servers managed with libvirt.
>> All connected with 20GbE (bonding). Every server has 2x 16-core Opteron
>> cpus, 2GB of ram per OSD, and a bunch of ram on the KVM host servers.
>>
>> osd pool default size = 2
>> osd pool default min size = 2
>> osd pool default pg num = 512
>> osd pool default pgp num = 512
>>
>> lsblk -t
>> NAME  ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE RA WSAME
>> sdb           0    512      0     512     512    0 noop      128  0    2G
>> loop0         0    512      0     512     512    0           128  0    0B
>>
>> Some numbers:
>> rados bench -p libvirt-pool 10 write: avg MB/s 339.508, avg lat 0.186789
>> rados bench -p libvirt-pool 100 rand: avg MB/s 1111.42, avg lat 0.0534118
>> Random small-file read:
>> fio 4k random read inside the vm: avg=2246KB/s, avg lat 1708usec, 600 IOPS
>> Sequential small-file read with readahead:
>> fio 4k sequential read inside the vm: avg=308351KB/s, avg lat 11usec, 55k IOPS
>>
>> The rbd images are attached with virtio-scsi (no difference using
>> virtio) and the guest block devices have 4M readahead set (no difference
>> if disabled). Rbd cache is enabled on server and client (no difference
>> if disabled). Forcing rbd readahead makes no difference.
>>
>> Please advise!
>> --
>> Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
>> GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
>> https://fsf.org | https://gnu.org

--
Ruben Rodriguez | Senior Systems Administrator, Free Software Foundation
GPG Key: 05EF 1D2F FE61 747D 1FC8 27C3 7FAC 7D26 472F 4409
https://fsf.org | https://gnu.org
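P.S. If it does come to blktrace, this is roughly the comparison I have
in mind; a sketch only (the device name and trace window are
assumptions, and the output basenames are arbitrary):

# trace the rbd-backed disk while the direct cp test runs
blktrace -d /dev/sdb -o direct -w 60
blkparse -i direct | less    # the "+ N" field is the request size in 512-byte sectors

# same device again, this time while reading through the loop-mounted image
blktrace -d /dev/sdb -o vialoop -w 60
blkparse -i vialoop | less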