Ok, after some delays and the move to new network hardware I have an update. I'm still seeing the same low bandwidth and high retransmissions from iperf after moving to the Cisco 6001 (10Gb) and 2960 (1Gb). I've narrowed it down to transmissions from a 10Gb-connected host to a 1Gb-connected host. Taking a more targeted tcpdump, I discovered multiple duplicate ACKs triggering fast retransmissions between the two test hosts.

There are several websites/articles which suggest that mixing 10Gb and 1Gb hosts causes performance issues, but no concrete explanation of why that's the case, or whether it can be avoided without moving everything to 10Gb, e.g.:

http://blogs.technet.com/b/networking/archive/2011/05/16/tcp-dupacks-and-tcp-fast-retransmits.aspx
http://en.community.dell.com/dell-groups/dtcmedia/m/mediagallery/19856911/download.aspx [PDF]
http://packetpushers.net/flow-control-storm-%E2%80%93-ip-storage-performance-effects/

I verified that it's not a flow control storm (the pause frame counters along the network path are zero), so assuming it might be bandwidth related I installed trickle and used it to limit the bandwidth of iperf to 1Gb; no change. I further restricted it to 100Kbps and was *still* seeing high retransmissions. This seems to imply it's not purely bandwidth related.

After further research, I noticed a difference of about 0.1ms in the RTT between two 10Gb hosts (intra-switch) and between the 10Gb and 1Gb hosts (inter-switch). I theorized this may be affecting the retransmission timeout calculations, per:

http://sgros.blogspot.com/2012/02/calculating-tcp-rto.html

so I used ethtool to set the link plugged into the 10Gb 6001 to 1Gb; this immediately fixed the issue. After this change the difference in RTTs dropped to about 0.025ms. Plugging another host into the old 10Gb FEX, I have 10Gb-to-10Gb RTTs within 0.001ms of the 6001-to-2960 RTTs, and don't see the high retransmissions with iperf between those 10Gb hosts.
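For reference, here's a rough sketch of the standard RTO estimator (RFC 6298) that the blog post above walks through. The RTT samples are invented for illustration, and a real stack also clamps the RTO to a minimum (200ms on Linux), so this only shows the direction of the effect, not our actual timer values:

```python
# Sketch of the RFC 6298 RTO estimator: SRTT and RTTVAR are exponentially
# smoothed, and RTO = SRTT + 4 * RTTVAR. RTT samples here are invented;
# real stacks clamp the result to a minimum (200ms on Linux).

ALPHA = 1.0 / 8  # SRTT smoothing factor
BETA = 1.0 / 4   # RTTVAR smoothing factor
K = 4            # variance multiplier

def final_rto(rtt_samples):
    """Feed RTT samples (seconds) through the estimator; return the last RTO."""
    srtt = rttvar = rto = None
    for r in rtt_samples:
        if srtt is None:
            srtt = r      # first measurement initializes the estimator
            rttvar = r / 2
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - r)
            srtt = (1 - ALPHA) * srtt + ALPHA * r
        rto = srtt + K * rttvar
    return rto

# A steady 0.1ms RTT path vs. one whose RTT jitters by 0.1ms per sample:
rto_steady = final_rto([0.0001] * 20)
rto_jitter = final_rto([0.0001, 0.0002] * 10)
print(rto_steady, rto_jitter)  # the jittery path ends with the larger RTO
```

The point being that it's not the absolute RTT but its variance that inflates the RTO; a steady path converges toward SRTT, while a jittery one keeps RTTVAR (and therefore RTO) elevated.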
**** tldr ****

So, right now I don't see retransmissions between hosts on the same switch (even if speeds are mixed), but I do across switches when the hosts are mixed 10Gb/1Gb. I also wonder what difference between process-level bandwidth limiting and 1Gb link negotiation leads to the behavior observed. I checked the link per Mark's suggestion below, but all the values they increase in that old post are already lower than the defaults set on my hosts.

If anyone has any ideas or explanations, I'd appreciate it. Otherwise, I'll keep the list posted if I uncover a solution or make more progress. Thanks.

-Steve

On 07/28/2014 01:21 PM, Mark Nelson wrote:
> On 07/28/2014 11:28 AM, Steve Anthony wrote:
>> While searching for more information I happened across the following
>> post (http://dachary.org/?p=2961) which vaguely resembled the symptoms
>> I've been experiencing. I ran tcpdump and noticed what appeared to be a
>> high number of retransmissions on the host where the images are mounted
>> during a read from a Ceph rbd, so I ran iperf3 to get some concrete
>> numbers:
>
> Very interesting that you are seeing retransmissions.
>
>> Server: nas4 (where rbd images are mapped)
>> Client: ceph2 (currently not in the cluster, but configured
>> identically to the other nodes)
>>
>> Start the server on nas4:
>>
>> iperf3 -s
>>
>> On ceph2, connect to the server on nas4, send 4096MB of data, and
>> report on 1-second intervals. Add -R to reverse the client/server roles.
>> iperf3 -c nas4 -n 4096M -i 1
>>
>> Summary of traffic going out the 1Gb interface to a switch:
>>
>> [ ID] Interval           Transfer     Bandwidth       Retr
>> [  5]   0.00-36.53 sec  4.00 GBytes   941 Mbits/sec   15     sender
>> [  5]   0.00-36.53 sec  4.00 GBytes   940 Mbits/sec          receiver
>>
>> Reversed, summary of traffic going over the fabric extender:
>>
>> [ ID] Interval           Transfer     Bandwidth       Retr
>> [  5]   0.00-80.84 sec  4.00 GBytes   425 Mbits/sec   30756  sender
>> [  5]   0.00-80.84 sec  4.00 GBytes   425 Mbits/sec          receiver
>
> Definitely looks suspect!
>
>> It appears that the issue is related to the network topology employed.
>> The private cluster network and nas4's public interface are both
>> connected to a 10Gb Cisco Fabric Extender (FEX), in turn connected to a
>> Nexus 7000. This was meant as a temporary solution until our network
>> team could finalize their design and bring up the Nexus 6001 for the
>> cluster. From what our network guys have said, the FEX has been much
>> more limited than they anticipated, and they haven't been pleased with
>> it as a solution in general. The 6001 is supposed to be ready this
>> week, so once it's online I'll move the cluster to that switch and
>> re-test to see if this fixes the issues I've been experiencing.
>
> If it's not the hardware, one other thing you might want to test is to
> make sure it's not something similar to the autotuning issues we used
> to see. I don't think this should be an issue at this point given the
> code changes we made to address it, but it would be easy to test. It
> doesn't seem like it should be happening with simple iperf tests,
> though, so the hardware is maybe the better theory.
>
> http://www.spinics.net/lists/ceph-devel/msg05049.html
>
>> -Steve
>>
>> On 07/24/2014 05:59 PM, Steve Anthony wrote:
>>> Thanks for the information!
>>>
>>> Based on my reading of http://ceph.com/docs/next/rbd/rbd-config-ref I
>>> was under the impression that rbd cache options wouldn't apply, since
>>> presumably the kernel is handling the caching. I'll have to toggle
>>> some of those values and see if they make a difference in my setup.
>>>
>>> I did some additional testing today. If I limit the write benchmark
>>> to 1 concurrent operation I see a lower bandwidth number, as I
>>> expected. However, when writing to the XFS filesystem on an rbd image
>>> I see transfer rates closer to 400MB/s.
>>>
>>> # rados -p bench bench 300 write --no-cleanup -t 1
>>>
>>> Total time run:         300.105945
>>> Total writes made:      1992
>>> Write size:             4194304
>>> Bandwidth (MB/sec):     26.551
>>>
>>> Stddev Bandwidth:       5.69114
>>> Max bandwidth (MB/sec): 40
>>> Min bandwidth (MB/sec): 0
>>> Average Latency:        0.15065
>>> Stddev Latency:         0.0732024
>>> Max latency:            0.617945
>>> Min latency:            0.097339
>>>
>>> # time cp -a /mnt/local/climate /mnt/ceph_test1
>>>
>>> real    2m11.083s
>>> user    0m0.440s
>>> sys     1m11.632s
>>>
>>> # du -h --max-depth=1 /mnt/local
>>> 53G     /mnt/local/climate
>>>
>>> This seems to imply that there is more than one concurrent operation
>>> when writing into the filesystem on top of the rbd image. However,
>>> given that the filesystem read speeds and the rados benchmark read
>>> speeds are much closer in reported bandwidth, it's as if reads are
>>> occurring as a single operation.
>>>
>>> # time cp -a /mnt/ceph_test2/isos /mnt/local/
>>>
>>> real    36m2.129s
>>> user    0m1.572s
>>> sys     3m23.404s
>>>
>>> # du -h --max-depth=1 /mnt/ceph_test2/
>>> 68G     /mnt/ceph_test2/isos
>>>
>>> Is this apparent single-thread read and multi-thread write with the
>>> rbd kernel module the expected mode of operation? If so, could someone
>>> explain the reason for this limitation?
>>> Based on the information on data striping in
>>> http://ceph.com/docs/next/architecture/#data-striping I would assume
>>> that a format 1 image would stripe a file larger than the 4MB object
>>> size over multiple objects, and that those objects would be
>>> distributed over multiple OSDs. This would seem to indicate that
>>> reading a file back would be much faster, since even though Ceph is
>>> only reading the primary replica, the read is still distributed over
>>> multiple OSDs. At worst I would expect something near the read
>>> bandwidth of a single OSD, which would still be much higher than
>>> 30-40MB/s.
>>>
>>> -Steve
>>>
>>> On 07/24/2014 04:07 PM, Udo Lembke wrote:
>>>
>>>> Hi Steve,
>>>> I'm also looking for improvements in single-thread reads.
>>>>
>>>> Slightly higher values (twice?) should be possible with your config.
>>>> I have 5 nodes with 60 4-TB hdds and got the following:
>>>>
>>>> rados -p test bench -b 4194304 60 seq -t 1 --no-cleanup
>>>> Total time run:        60.066934
>>>> Total reads made:      863
>>>> Read size:             4194304
>>>> Bandwidth (MB/sec):    57.469
>>>> Average Latency:       0.0695964
>>>> Max latency:           0.434677
>>>> Min latency:           0.016444
>>>>
>>>> In my case I had some osds (xfs) with high fragmentation (20%).
>>>> Changing the mount options and defragmenting helped slightly.
>>>> Performance changes:
>>>>
>>>> [client]
>>>> rbd cache = true
>>>> rbd cache writethrough until flush = true
>>>>
>>>> [osd]
>>>> osd mount options xfs =
>>>> "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
>>>> osd_op_threads = 4
>>>> osd_disk_threads = 4
>>>>
>>>> But I would expect much more speed for a single thread...
>>>>
>>>> Udo
>>>>
>>>> On 23.07.2014 22:13, Steve Anthony wrote:
>>>>
>>>>> Ah, ok. That makes sense. With one concurrent operation I see
>>>>> numbers more in line with the read speeds I'm seeing from the
>>>>> filesystems on the rbd images.
>>>>>
>>>>> # rados -p bench bench 300 seq --no-cleanup -t 1
>>>>> Total time run:        300.114589
>>>>> Total reads made:      2795
>>>>> Read size:             4194304
>>>>> Bandwidth (MB/sec):    37.252
>>>>>
>>>>> Average Latency:       0.10737
>>>>> Max latency:           0.968115
>>>>> Min latency:           0.039754
>>>>>
>>>>> # rados -p bench bench 300 rand --no-cleanup -t 1
>>>>> Total time run:        300.164208
>>>>> Total reads made:      2996
>>>>> Read size:             4194304
>>>>> Bandwidth (MB/sec):    39.925
>>>>>
>>>>> Average Latency:       0.100183
>>>>> Max latency:           1.04772
>>>>> Min latency:           0.039584
>>>>>
>>>>> I really wish I could find my data on read speeds from a couple
>>>>> weeks ago. It's possible that they've always been in this range, but
>>>>> I remember one of my test users saturating his 1GbE link over NFS
>>>>> while copying from the rbd client to his workstation. Of course,
>>>>> it's also possible that the data set he was using was cached in RAM
>>>>> when he was testing, masking the lower rbd speeds.
>>>>>
>>>>> It just seems counterintuitive to me that read speeds would be so
>>>>> much slower than writes at the filesystem layer in practice. With
>>>>> images in the 10-100TB range, reading data at 20-60MB/s isn't going
>>>>> to be pleasant. Can you suggest any tunables or other approaches to
>>>>> investigate to improve these speeds, or are they in line with what
>>>>> you'd expect? Thanks for your help!
>>>>>
>>>>> -Steve
>>>
>>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma310 at lehigh.edu
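P.S. One unverified theory on the trickle vs. link-negotiation question: trickle caps average throughput at the application layer, but each burst still serializes onto the wire at the NIC's 10Gb/s line rate, so the switch at the 10Gb-to-1Gb step-down has to buffer roughly 90% of every burst; negotiating the link itself down to 1Gb removes that rate mismatch at the source. A quick back-of-the-envelope sketch (the 64KiB burst is just an assumed TSO-sized example):

```python
# Unverified theory, sketched: with a rate cap like trickle, bursts still
# arrive at the step-down switch at 10Gb/s line rate and drain at 1Gb/s,
# so most of each burst must sit in the switch buffer. The 64KiB burst
# size is an assumed TSO-sized example, not a measured value.

TEN_GB = 10e9  # bits/sec
ONE_GB = 1e9   # bits/sec

def queued_bytes(burst_bytes, in_rate_bps, out_rate_bps):
    """Bytes left queued at the step-down after one back-to-back burst."""
    recv_time = burst_bytes * 8 / in_rate_bps  # time to receive the burst
    drained = out_rate_bps * recv_time / 8     # bytes drained meanwhile
    return burst_bytes - drained

# One 64KiB burst at 10Gb/s into a 1Gb/s link:
q = queued_bytes(64 * 1024, TEN_GB, ONE_GB)
print(round(q))  # 90% of the burst (about 59KB) sits in the switch buffer
```

If that buffering overflows (or is shallow, as FEXes are reputed to be), you get drops and dup-ACKs regardless of the average rate, which would fit what I saw with trickle at 100Kbps.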