Re: Poor RBD performance as LIO iSCSI target

Nick Fisk <nick@xxxxxxxxxx> · Sat, 6 Dec 2014 16:18:53 -0000

Hi David,

Very strange, but I'm glad you finally managed to get the cluster working
normally. Thank you for posting the benchmark figures; it's interesting to
see the overhead of LIO over pure RBD performance.

I should have the hardware for our cluster up and running early next year, so
I will be in a better position to test the iSCSI performance then. I will
report back once I have some numbers.

Just out of interest, have you tried any of the other iSCSI implementations
to see if they show the same performance drop?

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
David Moreau Simard
Sent: 05 December 2014 16:03
To: Nick Fisk
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Poor RBD performance as LIO iSCSI target

I've flushed everything - data, pools, configs and reconfigured the whole
thing.

I was particularly careful with the cache tiering configuration (leaving
defaults almost everywhere possible) and it's not locking up anymore.
It looks like the cache tiering configuration I had was causing the problem,
though I can't put my finger on exactly what or why, and I don't have the
luxury of time to do this lengthy testing again.

Here's what I dumped as far as config goes before wiping:
========
# for var in size min_size pg_num pgp_num crush_ruleset \
    erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type \
    hit_set_period hit_set_count target_max_objects target_max_bytes \
    cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age \
    cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========

And now:
========
# for var in size min_size pg_num pgp_num crush_ruleset \
    erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type \
    hit_set_period hit_set_count target_max_objects target_max_bytes \
    cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age \
    cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 150000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========
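
For anyone comparing the two dumps by eye, here is a minimal Python sketch
that lists only the settings that differ. The values are copied by hand from
the output above; the dictionary layout and the diff_settings() helper are
illustrative names only, not anything Ceph provides.

```python
# Pool settings copied by hand from the "before" and "now" dumps above.
BEFORE = {
    "volumes": {
        "size": 5, "min_size": 2, "pg_num": 7200, "pgp_num": 7200,
        "crush_ruleset": 1,
    },
    "volumecache": {
        "size": 2, "min_size": 1, "pg_num": 7200, "pgp_num": 7200,
        "crush_ruleset": 4, "hit_set_type": "bloom", "hit_set_period": 3600,
        "hit_set_count": 1, "target_max_objects": 0,
        "target_max_bytes": 100000000000,
        "cache_target_dirty_ratio": 0.5, "cache_target_full_ratio": 0.8,
        "cache_min_flush_age": 600, "cache_min_evict_age": 1800,
    },
}

AFTER = {
    "volumes": {
        "size": 5, "min_size": 3, "pg_num": 2048, "pgp_num": 2048,
        "crush_ruleset": 1,
    },
    "volumecache": {
        "size": 2, "min_size": 1, "pg_num": 2048, "pgp_num": 2048,
        "crush_ruleset": 4, "hit_set_type": "bloom", "hit_set_period": 3600,
        "hit_set_count": 1, "target_max_objects": 0,
        "target_max_bytes": 150000000000,
        "cache_target_dirty_ratio": 0.5, "cache_target_full_ratio": 0.8,
        "cache_min_flush_age": 0, "cache_min_evict_age": 1800,
    },
}


def diff_settings(before, after):
    """Return (pool, key, old, new) for every setting whose value changed."""
    changes = []
    for pool, settings in before.items():
        for key, old in settings.items():
            new = after[pool][key]
            if old != new:
                changes.append((pool, key, old, new))
    return changes


for pool, key, old, new in diff_settings(BEFORE, AFTER):
    print(f"{pool}.{key}: {old} -> {new}")
```

Going by those values, the differences are min_size on the EC pool,
pg_num/pgp_num on both pools, and target_max_bytes and cache_min_flush_age
on the cache pool. The erasure code profile is identical in both dumps, so
it is left out of the comparison.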

Crush map hasn't really changed before and after.
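
As a rough sanity check on the cache pool numbers, the sketch below works
out where flushing and eviction would nominally begin, assuming the
thresholds are simply target_max_bytes multiplied by
cache_target_dirty_ratio and cache_target_full_ratio. That is a
simplification: as far as I understand, Ceph applies these targets per
placement group, so real behaviour will only approximate this.
tier_thresholds() is a made-up helper name.

```python
GB = 10**9  # decimal gigabytes, matching the round target_max_bytes values


def tier_thresholds(target_max_bytes, dirty_ratio, full_ratio):
    """Approximate byte counts at which flushing and eviction would begin.

    Assumption: pool-wide thresholds are target_max_bytes scaled by each
    ratio. Ceph's tiering agent works per placement group, so this is only
    a coarse estimate.
    """
    return {
        "flush_starts_at": target_max_bytes * dirty_ratio,
        "evict_starts_at": target_max_bytes * full_ratio,
    }


# target_max_bytes from the two volumecache dumps; ratios were 0.5 and 0.8
# in both.
for label, max_bytes in (("before", 100 * GB), ("after", 150 * GB)):
    t = tier_thresholds(max_bytes, dirty_ratio=0.5, full_ratio=0.8)
    print(
        f"{label}: flush at ~{t['flush_starts_at'] / GB:.0f} GB dirty, "
        f"evict at ~{t['evict_starts_at'] / GB:.0f} GB used"
    )
```

With the old settings that puts the nominal flush point at about 50 GB, and
with the new ones at about 75 GB. The old figure lines up with the
observation further down the thread that 50GB runs completed while 75GB runs
stalled, but that is only a correlation, not a confirmed cause.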

FWIW, the benchmarks I pulled out of the setup:
https://gist.github.com/dmsimard/2737832d077cfc5eff34
Definite overhead going from krbd to krbd + LIO...
--
David Moreau Simard

> On Nov 20, 2014, at 4:14 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> 
> Here you go:-
> 
> Erasure Profile
> k=2
> m=1
> plugin=jerasure
> ruleset-failure-domain=osd
> ruleset-root=hdd
> technique=reed_sol_van
> 
> Cache Settings
> hit_set_type: bloom
> hit_set_period: 3600
> hit_set_count: 1
> target_max_objects: 0
> target_max_bytes: 1000000000
> cache_target_dirty_ratio: 0.4
> cache_target_full_ratio: 0.8
> cache_min_flush_age: 0
> cache_min_evict_age: 0
> 
> Crush Dump
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> 
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> 
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
> 
> # buckets
> host ceph-test-hdd {
>        id -5           # do not change unnecessarily
>        # weight 2.730
>        alg straw
>        hash 0  # rjenkins1
>        item osd.1 weight 0.910
>        item osd.2 weight 0.910
>        item osd.0 weight 0.910
> }
> root hdd {
>        id -3           # do not change unnecessarily
>        # weight 2.730
>        alg straw
>        hash 0  # rjenkins1
>        item ceph-test-hdd weight 2.730
> }
> host ceph-test-ssd {
>        id -6           # do not change unnecessarily
>        # weight 1.000
>        alg straw
>        hash 0  # rjenkins1
>        item osd.3 weight 1.000
> }
> root ssd {
>        id -4           # do not change unnecessarily
>        # weight 1.000
>        alg straw
>        hash 0  # rjenkins1
>        item ceph-test-ssd weight 1.000
> }
> 
> # rules
> rule hdd {
>        ruleset 0
>        type replicated
>        min_size 0
>        max_size 10
>        step take hdd
>        step chooseleaf firstn 0 type osd
>        step emit
> }
> rule ssd {
>        ruleset 1
>        type replicated
>        min_size 0
>        max_size 4
>        step take ssd
>        step chooseleaf firstn 0 type osd
>        step emit
> }
> rule ecpool {
>        ruleset 2
>        type erasure
>        min_size 3
>        max_size 20
>        step set_chooseleaf_tries 5
>        step take hdd
>        step chooseleaf indep 0 type osd
>        step emit
> }
> 
> 
> 
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
> Of David Moreau Simard
> Sent: 20 November 2014 20:03
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Poor RBD performance as LIO iSCSI target
> 
> Nick,
> 
> Can you share more details on the configuration you are using? I'll
> try and duplicate those configurations in my environment and see what
> happens.
> I'm mostly interested in:
> - Erasure code profile (k, m, plugin, ruleset-failure-domain)
> - Cache tiering pool configuration (e.g. hit_set_type, hit_set_period,
>   hit_set_count, target_max_objects, target_max_bytes,
>   cache_target_dirty_ratio, cache_target_full_ratio,
>   cache_min_flush_age, cache_min_evict_age)
> 
> The crush rulesets would also be helpful.
> 
> Thanks,
> --
> David Moreau Simard
> 
>> On Nov 20, 2014, at 12:43 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> 
>> Hi David,
>> 
>> I've just finished running the 75GB fio test you posted a few days 
>> back on my new test cluster.
>> 
>> The cluster is as follows:-
>> 
>> Single server with 3x hdd and 1 ssd
>> Ubuntu 14.04 with 3.16.7 kernel
>> 2+1 EC pool on hdds below a 10G ssd cache pool. SSD is also partitioned
>> to provide journals for hdds.
>> 150G RBD mapped locally
>> 
>> The fio test seemed to run without any problems. I want to run a few 
>> more tests with different settings to see if I can reproduce your 
>> problem. I will let you know if I find anything.
>> 
>> If there is anything you would like me to try, please let me know.
>> 
>> Nick
>> 
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
>> Of David Moreau Simard
>> Sent: 19 November 2014 10:48
>> To: Ramakrishna Nishtala (rnishtal)
>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk
>> Subject: Re:  Poor RBD performance as LIO iSCSI target
>> 
>> Rama,
>> 
>> Thanks for your reply.
>> 
>> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block 
>> devices.
>> 
>> I was encountering issues with iSCSI which are explained in my 
>> previous emails.
>> I ended up being able to reproduce the problem at will on various
>> kernel and OS combinations, even on raw RBD devices, which rules out
>> iSCSI as the cause and points to Ceph instead.
>> I'm even running 0.88 now and the issue is still there.
>> 
>> I haven't isolated the issue just yet.
>> My next tests involve disabling the cache tiering.
>> 
>> I do have client krbd cache as well; I'll try to disable it too if
>> disabling cache tiering isn't enough.
>> --
>> David Moreau Simard
>> 
>> 
>>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal) <rnishtal@xxxxxxxxx> wrote:
>>> 
>>> Hi Dave
>>> Did you say iSCSI only? The tracker issue does not say so, though.
>>> I am on Giant, with both client and Ceph on RHEL 7, and it seems to work
>>> OK, unless I am missing something here. RBD on bare metal with kmod-rbd
>>> and caching disabled.
>>> 
>>> [root@compute4 ~]# time fio --name=writefile --size=100G 
>>> --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1
>>> --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>> --iodepth=200 --ioengine=libaio
>>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio,
>>> iodepth=200
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 
>>> iops] [eta 00m:00s] ...
>>> Disk stats (read/write):
>>> rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, 
>>> in_queue=16164942, util=99.98%
>>> 
>>> real    1m56.175s
>>> user    0m18.115s
>>> sys     0m10.430s
>>> 
>>> Regards,
>>> 
>>> Rama
>>> 
>>> 
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On 
>>> Behalf Of David Moreau Simard
>>> Sent: Tuesday, November 18, 2014 3:49 PM
>>> To: Nick Fisk
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re:  Poor RBD performance as LIO iSCSI target
>>> 
>>> Testing without the cache tiering is the next test I want to do when
>>> I have time.
>>> 
>>> When it's hanging, there is no activity at all on the cluster.
>>> Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>> 
>>> I'll provide an update when I have a chance to test without tiering.
>>> --
>>> David Moreau Simard
>>> 
>>> 
>>>> On Nov 18, 2014, at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>> 
>>>> Hi David,
>>>> 
>>>> Have you tried on a normal replicated pool with no cache? I've seen
>>>> a number of threads recently where caching is causing various
>>>> things to block/hang.
>>>> It would be interesting to see if this still happens without the
>>>> caching layer; at least it would rule it out.
>>>> 
>>>> Also, is there any sign that, as the test passes ~50GB, the cache
>>>> might start flushing to the backing pool, causing slow performance?
>>>> 
>>>> I am planning a deployment very similar to yours so I am following 
>>>> this with great interest. I'm hoping to build a single node test 
>>>> "cluster" shortly, so I might be in a position to work with you on 
>>>> this issue and hopefully get it resolved.
>>>> 
>>>> Nick
>>>> 
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On 
>>>> Behalf Of David Moreau Simard
>>>> Sent: 18 November 2014 19:58
>>>> To: Mike Christie
>>>> Cc: ceph-users@xxxxxxxxxxxxxx; Christopher Spearman
>>>> Subject: Re:  Poor RBD performance as LIO iSCSI target
>>>> 
>>>> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and 
>>>> chatted with "dis" on #ceph-devel.
>>>> 
>>>> I ran a LOT of tests on a LOT of combinations of kernels (sometimes
>>>> with tunables legacy). I haven't found a magical combination in 
>>>> which the following test does not hang:
>>>> fio --name=writefile --size=100G --filesize=100G
>>>> --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
>>>> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>>> --iodepth=200 --ioengine=libaio
>>>> 
>>>> Whether directly on a mapped rbd device, on a mounted filesystem
>>>> (over rbd), or exported through iSCSI: it hangs every time.
>>>> I guess that rules out a potential issue with iSCSI overhead.
>>>> 
>>>> Now, something I noticed out of pure luck is that I am unable to 
>>>> reproduce the issue if I drop the size of the test to 50GB. Tests 
>>>> will complete in under 2 minutes.
>>>> 75GB will hang right at the end and take more than 10 minutes.
>>>> 
>>>> TL;DR of tests:
>>>> - 3x fio --name=writefile --size=50G --filesize=50G
>>>> --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
>>>> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>>> --iodepth=200 --ioengine=libaio
>>>> -- 1m44s, 1m49s, 1m40s
>>>> 
>>>> - 3x fio --name=writefile --size=75G --filesize=75G
>>>> --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
>>>> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>>> --iodepth=200 --ioengine=libaio
>>>> -- 10m12s, 10m11s, 10m13s
>>>> 
>>>> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>>>> 
>>>> Does that ring a bell for you guys?
>>>> 
>>>> --
>>>> David Moreau Simard
>>>> 
>>>> 
>>>>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>>> 
>>>>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>>>>> Running into weird issues here as well in a test environment. I
>>>>>> don't have a solution either, but perhaps we can find some things in
>>>>>> common.
>>>>>> 
>>>>>> Setup in a nutshell:
>>>>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs
>>>>>>   with separate public/cluster network in 10 Gbps)
>>>>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7,
>>>>>>   Ceph 0.87-1 (10 Gbps)
>>>>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>>>>> 
>>>>>> Relevant cluster config: Writeback cache tiering with NVMe PCI-E
>>>>>> cards (2 replicas) in front of an erasure-coded pool (k=3,m=2) backed
>>>>>> by spindles.
>>>>>> 
>>>>>> I'm following the instructions here:
>>>>>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>>>>> No issues with creating and mapping a 100GB RBD image and then
>>>>>> creating the target.
>>>>>> 
>>>>>> I'm interested in finding out the overhead/performance impact of
>>>>>> re-exporting through iSCSI, so the idea is to run benchmarks.
>>>>>> Here's a fio test I'm trying to run on the client node on the
>>>>>> mounted iSCSI device:
>>>>>> fio --name=writefile --size=100G --filesize=100G 
>>>>>> --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0
>>>>>> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1
>>>>>> --iodepth=200 --ioengine=libaio
>>>>>> 
>>>>>> The benchmark will eventually hang towards the end of the test
>>>>>> for some long seconds before completing.
>>>>>> On the proxy node, the kernel complains with iscsi portal login
>>>>>> timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance 
>>>>>> errors in syslog: http://pastebin.com/AiRTWDwR
>>>>>> 
>>>>> 
>>>>> You are hitting a different issue. German Anders is most likely
>>>>> correct and you hit the rbd hang. That then caused the iscsi/scsi
>>>>> command to time out, which caused the scsi error handler to run. In
>>>>> your logs we see the LIO error handler has received a task abort
>>>>> from the initiator, and that timed out, which caused the escalation
>>>>> (iscsi portal login related messages).
>>>> 
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com