I've flushed everything - data, pools, configs - and reconfigured the whole thing. I was particularly careful with the cache tiering configuration (leaving defaults where possible) and it's not locking up anymore.

It looks like the cache tiering configuration I had was causing the problem? I can't put my finger on exactly what or why, and I don't have the luxury of time to go through that lengthy testing again.

Here's what I dumped as far as config goes before wiping:
========
# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========

And now:
========
# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 150000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========

The crush map hasn't really changed before and after.

FWIW, the benchmarks I pulled out of the setup: https://gist.github.com/dmsimard/2737832d077cfc5eff34
There's definite overhead going from krbd to krbd + LIO...
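For reference, re-attaching the cache tier with the near-default values above looks roughly like this - a sketch only, assuming the 'volumes' (EC) and 'volumecache' (replicated) pools already exist, using the stock ceph CLI rather than the exact commands I ran:

========
# Attach 'volumecache' as a writeback tier in front of 'volumes'
ceph osd tier add volumes volumecache
ceph osd tier cache-mode volumecache writeback
ceph osd tier set-overlay volumes volumecache

# Apply the "after" settings shown in the second dump above
ceph osd pool set volumecache hit_set_type bloom
ceph osd pool set volumecache hit_set_period 3600
ceph osd pool set volumecache hit_set_count 1
ceph osd pool set volumecache target_max_bytes 150000000000
ceph osd pool set volumecache cache_target_dirty_ratio 0.5
ceph osd pool set volumecache cache_target_full_ratio 0.8
ceph osd pool set volumecache cache_min_flush_age 0
ceph osd pool set volumecache cache_min_evict_age 1800
========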
--
David Moreau Simard

> On Nov 20, 2014, at 4:14 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Here you go:-
>
> Erasure Profile
> k=2
> m=1
> plugin=jerasure
> ruleset-failure-domain=osd
> ruleset-root=hdd
> technique=reed_sol_van
>
> Cache Settings
> hit_set_type: bloom
> hit_set_period: 3600
> hit_set_count: 1
> target_max_objects: 0
> target_max_bytes: 1000000000
> cache_target_dirty_ratio: 0.4
> cache_target_full_ratio: 0.8
> cache_min_flush_age: 0
> cache_min_evict_age: 0
>
> Crush Dump
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host ceph-test-hdd {
>         id -5           # do not change unnecessarily
>         # weight 2.730
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 0.910
>         item osd.2 weight 0.910
>         item osd.0 weight 0.910
> }
> root hdd {
>         id -3           # do not change unnecessarily
>         # weight 2.730
>         alg straw
>         hash 0  # rjenkins1
>         item ceph-test-hdd weight 2.730
> }
> host ceph-test-ssd {
>         id -6           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
> }
> root ssd {
>         id -4           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item ceph-test-ssd weight 1.000
> }
>
> # rules
> rule hdd {
>         ruleset 0
>         type replicated
>         min_size 0
>         max_size 10
>         step take hdd
>         step chooseleaf firstn 0 type osd
>         step emit
> }
> rule ssd {
>         ruleset 1
>         type replicated
>         min_size 0
>         max_size 4
>         step take ssd
>         step chooseleaf firstn 0 type osd
>         step emit
> }
> rule ecpool {
>         ruleset 2
>         type erasure
>         min_size 3
>         max_size 20
>         step set_chooseleaf_tries 5
>         step take hdd
>         step chooseleaf indep 0 type osd
>         step emit
> }
>
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
> Sent: 20 November 2014 20:03
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Poor RBD performance as LIO iSCSI target
>
> Nick,
>
> Can you share more details on the configuration you are using? I'll try to
> duplicate those configurations in my environment and see what happens.
> I'm mostly interested in:
> - Erasure code profile (k, m, plugin, ruleset-failure-domain)
> - Cache tiering pool configuration (ex: hit_set_type, hit_set_period,
>   hit_set_count, target_max_objects, target_max_bytes,
>   cache_target_dirty_ratio, cache_target_full_ratio, cache_min_flush_age,
>   cache_min_evict_age)
>
> The crush rulesets would also be helpful.
>
> Thanks,
> --
> David Moreau Simard
>
>> On Nov 20, 2014, at 12:43 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>
>> Hi David,
>>
>> I've just finished running the 75GB fio test you posted a few days
>> back on my new test cluster.
>>
>> The cluster is as follows:-
>>
>> Single server with 3x hdd and 1 ssd
>> Ubuntu 14.04 with 3.16.7 kernel
>> 2+1 EC pool on hdds below a 10G ssd cache pool. The SSD is also
>> partitioned to provide journals for the hdds.
>> 150G RBD mapped locally
>>
>> The fio test seemed to run without any problems. I want to run a few
>> more tests with different settings to see if I can reproduce your
>> problem. I will let you know if I find anything.
>>
>> If there is anything you would like me to try, please let me know.
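Recreating a setup along the lines of Nick's would look roughly like the commands below. Only the profile values and the hdd/ssd rulesets come from his dump above; the pool and image names, the PG counts and the 150G size are placeholders, not necessarily what he used:

========
# 2+1 erasure profile rooted on the hdd bucket, failure domain = osd
ceph osd erasure-code-profile set ec21hdd k=2 m=1 plugin=jerasure \
    technique=reed_sol_van ruleset-failure-domain=osd ruleset-root=hdd

# EC base pool on the hdds, replicated cache pool on the ssd ruleset
ceph osd pool create ecbase 128 128 erasure ec21hdd
ceph osd pool create ssdcache 128 128 replicated ssd

# Writeback cache tier in front of the EC pool
ceph osd tier add ecbase ssdcache
ceph osd tier cache-mode ssdcache writeback
ceph osd tier set-overlay ecbase ssdcache

# 150G image mapped locally (rbd create takes the size in MB here)
rbd create testimg --size 153600 --pool ecbase
rbd map ecbase/testimg
========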
>>
>> Nick
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>> Sent: 19 November 2014 10:48
>> To: Ramakrishna Nishtala (rnishtal)
>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk
>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>
>> Rama,
>>
>> Thanks for your reply.
>>
>> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block devices.
>>
>> I was encountering issues with iSCSI which are explained in my previous emails.
>> I ended up being able to reproduce the problem at will on various kernel and
>> OS combinations, even on raw RBD devices - ruling out the hypothesis that it
>> was a problem with iSCSI; it looks like a problem with Ceph itself.
>> I'm even running 0.88 now and the issue is still there.
>>
>> I haven't isolated the issue just yet.
>> My next tests involve disabling the cache tiering (see the sketch further down).
>>
>> I do have client krbd cache as well; I'll try to disable it too if disabling
>> cache tiering isn't enough.
>> --
>> David Moreau Simard
>>
>>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal) <rnishtal@xxxxxxxxx> wrote:
>>>
>>> Hi Dave,
>>> Did you say iSCSI only? The tracker issue does not say so, though.
>>> I am on giant, with both client and ceph on RHEL 7, and it seems to work ok,
>>> unless I am missing something here. RBD on bare metal with kmod-rbd and
>>> caching disabled.
>>>
>>> [root@compute4 ~]# time fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=200
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 iops] [eta 00m:00s]
>>> ...
>>> Disk stats (read/write):
>>>   rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, in_queue=16164942, util=99.98%
>>>
>>> real    1m56.175s
>>> user    0m18.115s
>>> sys     0m10.430s
>>>
>>> Regards,
>>>
>>> Rama
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>>> Sent: Tuesday, November 18, 2014 3:49 PM
>>> To: Nick Fisk
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>>
>>> Testing without the cache tiering is the next test I want to do when I have time.
>>>
>>> When it's hanging, there is no activity at all on the cluster.
>>> Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>>
>>> I'll provide an update when I have a chance to test without tiering.
>>> --
>>> David Moreau Simard
>>>
>>>> On Nov 18, 2014, at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> Have you tried on a normal replicated pool with no cache? I've seen a
>>>> number of threads recently where caching is causing various things to block/hang.
>>>> It would be interesting to see if this still happens without the
>>>> caching layer; at least it would rule it out.
>>>>
>>>> Also, is there any sign that as the test passes ~50GB the cache might
>>>> start flushing to the backing pool, causing slow performance?
>>>>
>>>> I am planning a deployment very similar to yours, so I am following
>>>> this with great interest. I'm hoping to build a single node test
>>>> "cluster" shortly, so I might be in a position to work with you on
>>>> this issue and hopefully get it resolved.
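On the "test without tiering" point above: detaching a cache tier is basically a flush-and-remove, roughly the commands below. The pool names are the volumes/volumecache pair from the dump at the top; note that an RBD image sitting on an erasure coded base pool needs the cache overlay to work at all, so a true no-cache comparison really means a plain replicated pool, as Nick suggests:

========
# Stop caching new writes, then flush and evict everything in the cache pool
ceph osd tier cache-mode volumecache forward
rados -p volumecache cache-flush-evict-all

# Detach the cache tier from the base pool
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumecache
========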
>>>>
>>>> Nick
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>>>> Sent: 18 November 2014 19:58
>>>> To: Mike Christie
>>>> Cc: ceph-users@xxxxxxxxxxxxxx; Christopher Spearman
>>>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>>>
>>>> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and
>>>> chatted with "dis" on #ceph-devel.
>>>>
>>>> I ran a LOT of tests on a LOT of combinations of kernels (sometimes
>>>> with tunables legacy). I haven't found a magical combination in
>>>> which the following test does not hang:
>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>
>>>> Either directly on a mapped rbd device, on a mounted filesystem (over
>>>> rbd), or exported through iSCSI... nothing helps.
>>>> I guess that rules out a potential issue with iSCSI overhead.
>>>>
>>>> Now, something I noticed out of pure luck is that I am unable to
>>>> reproduce the issue if I drop the size of the test to 50GB. Those
>>>> tests complete in under 2 minutes.
>>>> 75GB will hang right at the end and take more than 10 minutes.
>>>>
>>>> TL;DR of tests:
>>>> - 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>> -- 1m44s, 1m49s, 1m40s
>>>>
>>>> - 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>> -- 10m12s, 10m11s, 10m13s
>>>>
>>>> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>>>>
>>>> Does that ring a bell for you guys?
>>>>
>>>> --
>>>> David Moreau Simard
>>>>
>>>>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>>>>> Running into weird issues here as well in a test environment. I don't
>>>>>> have a solution either, but perhaps we can find some things in common.
>>>>>>
>>>>>> Setup in a nutshell:
>>>>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with
>>>>>>   separate public/cluster networks on 10 Gbps)
>>>>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (10 Gbps)
>>>>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>>>>>
>>>>>> Relevant cluster config: writeback cache tiering with NVME PCI-E cards
>>>>>> (2 replicas) in front of an erasure coded pool (k=3, m=2) backed by spindles.
>>>>>>
>>>>>> I'm following the instructions here:
>>>>>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>>>>> No issues with creating and mapping a 100GB RBD image and then creating the target.
>>>>>>
>>>>>> I'm interested in finding out the overhead/performance impact of
>>>>>> re-exporting through iSCSI, so the idea is to run benchmarks.
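For what it's worth, the LIO side of that setup is just a block (iblock) backstore on top of the mapped rbd device. From within the targetcli shell it would look roughly like the steps below - the IQNs and the backstore name here are made up for illustration:

========
# Expose the mapped RBD device as a block backstore
/backstores/block create name=rbd-image dev=/dev/rbd0

# Create the iSCSI target, attach the LUN and allow the client's initiator
/iscsi create iqn.2014-11.com.example:rbd-image
/iscsi/iqn.2014-11.com.example:rbd-image/tpg1/luns create /backstores/block/rbd-image
/iscsi/iqn.2014-11.com.example:rbd-image/tpg1/acls create iqn.2014-11.com.example:client1
saveconfig
========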
>>>>>> Here's a fio test I'm trying to run on the client node on the mounted iSCSI device:
>>>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>
>>>>>> The benchmark will eventually hang towards the end of the test for some
>>>>>> long seconds before completing.
>>>>>> On the proxy node, the kernel complains with iSCSI portal login timeouts:
>>>>>> http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog:
>>>>>> http://pastebin.com/AiRTWDwR
>>>>>>
>>>>>
>>>>> You are hitting a different issue. German Anders is most likely correct
>>>>> and you hit the rbd hang. That then caused the iSCSI/SCSI command to time
>>>>> out, which caused the SCSI error handler to run. In your logs we see that
>>>>> the LIO error handler received a task abort from the initiator and that
>>>>> timed out, which caused the escalation (the iscsi portal login related
>>>>> messages).

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com