Here you go:-

Erasure Profile
k=2
m=1
plugin=jerasure
ruleset-failure-domain=osd
ruleset-root=hdd
technique=reed_sol_van

Cache Settings
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1000000000
cache_target_dirty_ratio: 0.4
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 0

Crush Dump
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-test-hdd {
        id -5           # do not change unnecessarily
        # weight 2.730
        alg straw
        hash 0  # rjenkins1
        item osd.1 weight 0.910
        item osd.2 weight 0.910
        item osd.0 weight 0.910
}
root hdd {
        id -3           # do not change unnecessarily
        # weight 2.730
        alg straw
        hash 0  # rjenkins1
        item ceph-test-hdd weight 2.730
}
host ceph-test-ssd {
        id -6           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 1.000
}
root ssd {
        id -4           # do not change unnecessarily
        # weight 1.000
        alg straw
        hash 0  # rjenkins1
        item ceph-test-ssd weight 1.000
}

# rules
rule hdd {
        ruleset 0
        type replicated
        min_size 0
        max_size 10
        step take hdd
        step chooseleaf firstn 0 type osd
        step emit
}
rule ssd {
        ruleset 1
        type replicated
        min_size 0
        max_size 4
        step take ssd
        step chooseleaf firstn 0 type osd
        step emit
}
rule ecpool {
        ruleset 2
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take hdd
        step chooseleaf indep 0 type osd
        step emit
}
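For anyone trying to reproduce this, roughly the commands that would build an equivalent setup. Pool names, PG counts and the profile name below are placeholders rather than the exact ones used here; only the profile values and cache settings are taken from the dump above:

# EC profile and backing pool on the hdd root (values match the profile above)
ceph osd erasure-code-profile set ecprofile k=2 m=1 plugin=jerasure technique=reed_sol_van ruleset-failure-domain=osd ruleset-root=hdd
ceph osd pool create ecpool 128 128 erasure ecprofile

# replicated cache pool on the ssd crush rule, attached as a writeback tier
ceph osd pool create cachepool 128 128 replicated ssd
ceph osd tier add ecpool cachepool
ceph osd tier cache-mode cachepool writeback
ceph osd tier set-overlay ecpool cachepool

# cache tuning, matching the settings listed above
ceph osd pool set cachepool hit_set_type bloom
ceph osd pool set cachepool hit_set_period 3600
ceph osd pool set cachepool hit_set_count 1
ceph osd pool set cachepool target_max_bytes 1000000000
ceph osd pool set cachepool cache_target_dirty_ratio 0.4
ceph osd pool set cachepool cache_target_full_ratio 0.8

The crush map itself can be pulled, edited and reinjected the usual way with ceph osd getcrushmap, crushtool -d / -c and ceph osd setcrushmap.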
-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
Sent: 20 November 2014 20:03
To: Nick Fisk
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Poor RBD performance as LIO iSCSI target

Nick,

Can you share more details on the configuration you are using? I'll try to duplicate those configurations in my environment and see what happens.

I'm mostly interested in:
- Erasure code profile (k, m, plugin, ruleset-failure-domain)
- Cache tiering pool configuration (ex: hit_set_type, hit_set_period, hit_set_count, target_max_objects, target_max_bytes, cache_target_dirty_ratio, cache_target_full_ratio, cache_min_flush_age, cache_min_evict_age)

The crush rulesets would also be helpful.

Thanks,
--
David Moreau Simard

> On Nov 20, 2014, at 12:43 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Hi David,
>
> I've just finished running the 75GB fio test you posted a few days back on my new test cluster.
>
> The cluster is as follows:
>
> Single server with 3x HDD and 1 SSD
> Ubuntu 14.04 with the 3.16.7 kernel
> 2+1 EC pool on the HDDs below a 10G SSD cache pool. The SSD is also partitioned to provide journals for the HDDs.
> 150G RBD mapped locally
>
> The fio test seemed to run without any problems. I want to run a few more tests with different settings to see if I can reproduce your problem. I will let you know if I find anything.
>
> If there is anything you would like me to try, please let me know.
>
> Nick
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
> Sent: 19 November 2014 10:48
> To: Ramakrishna Nishtala (rnishtal)
> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk
> Subject: Re: Poor RBD performance as LIO iSCSI target
>
> Rama,
>
> Thanks for your reply.
>
> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block devices.
>
> I was encountering issues with iSCSI which are explained in my previous emails.
> I ended up being able to reproduce the problem at will on various kernel and OS combinations, even on raw RBD devices - thus ruling out the hypothesis that the problem was with iSCSI rather than with Ceph.
> I'm even running 0.88 now and the issue is still there.
>
> I haven't isolated the issue just yet.
> My next tests involve disabling the cache tiering.
>
> I do have client krbd cache as well; I'll try to disable it too if cache tiering isn't enough.
> --
> David Moreau Simard
>
>
>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal) <rnishtal@xxxxxxxxx> wrote:
>>
>> Hi Dave,
>> Did you say iSCSI only? The tracker issue does not say so, though.
>> I am on Giant, with both the client and Ceph on RHEL 7, and it seems to work OK, unless I am missing something here. RBD on bare metal with kmod-rbd and caching disabled.
>>
>> [root@compute4 ~]# time fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=200
>> fio-2.1.11
>> Starting 1 process
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 iops] [eta 00m:00s]
>> ...
>> Disk stats (read/write):
>>   rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, in_queue=16164942, util=99.98%
>>
>> real    1m56.175s
>> user    0m18.115s
>> sys     0m10.430s
>>
>> Regards,
>>
>> Rama
>>
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>> Sent: Tuesday, November 18, 2014 3:49 PM
>> To: Nick Fisk
>> Cc: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>
>> Testing without the cache tiering is the next test I want to do when I have time.
>>
>> When it's hanging, there is no activity at all on the cluster.
>> Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>
>> I'll provide an update when I have a chance to test without tiering.
>> --
>> David Moreau Simard
>>
>>
>>> On Nov 18, 2014, at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>
>>> Hi David,
>>>
>>> Have you tried on a normal replicated pool with no cache? I've seen a number of threads recently where caching is causing various things to block/hang.
>>> It would be interesting to see if this still happens without the caching layer; at least it would rule it out.
>>>
>>> Also, is there any sign that as the test passes ~50GB the cache might start flushing to the backing pool, causing slow performance?
>>>
>>> I am planning a deployment very similar to yours, so I am following this with great interest. I'm hoping to build a single-node test "cluster" shortly, so I might be in a position to work with you on this issue and hopefully get it resolved.
>>>
>>> Nick
>>>
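For reference, one way to take a writeback tier out of the picture for a test like this is roughly the following, using the placeholder pool names cachepool/ecpool from the sketch further up rather than the real ones:

# stop new writes landing in the cache and flush/evict whatever is dirty
ceph osd tier cache-mode cachepool forward
rados -p cachepool cache-flush-evict-all

# detach the tier so I/O goes straight to the backing pool
ceph osd tier remove-overlay ecpool
ceph osd tier remove ecpool cachepool

The flush/evict step can take a while if the tier is holding a lot of dirty data.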
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>>> Sent: 18 November 2014 19:58
>>> To: Mike Christie
>>> Cc: ceph-users@xxxxxxxxxxxxxx; Christopher Spearman
>>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>>
>>> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted with "dis" on #ceph-devel.
>>>
>>> I ran a LOT of tests on a LOT of kernel combinations (sometimes with tunables legacy). I haven't found a magical combination in which the following test does not hang:
>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>
>>> Either directly on a mapped rbd device, on a mounted filesystem (over rbd), or exported through iSCSI - nothing.
>>> I guess that rules out a potential issue with iSCSI overhead.
>>>
>>> Now, something I noticed out of pure luck is that I am unable to reproduce the issue if I drop the size of the test to 50GB. Tests will complete in under 2 minutes.
>>> 75GB will hang right at the end and take more than 10 minutes.
>>>
>>> TL;DR of tests:
>>> - 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>> -- 1m44s, 1m49s, 1m40s
>>>
>>> - 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>> -- 10m12s, 10m11s, 10m13s
>>>
>>> Details of the tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>>>
>>> Does that ring a bell for you guys?
>>>
>>> --
>>> David Moreau Simard
>>>
>>>
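One way to check whether the 50GB/75GB threshold lines up with the cache tier starting to flush is simply to watch the pools while fio runs, for example (again with a placeholder cache pool name):

# per-pool client and recovery I/O; flush traffic shows up against the backing pool
watch -n 1 'ceph osd pool stats'

# how full the cache pool is getting relative to its target_max_bytes/target_max_objects
watch -n 1 'ceph df'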
>>>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>>
>>>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>>>> Running into weird issues here as well in a test environment. I don't have a solution either, but perhaps we can find some things in common...
>>>>>
>>>>> Setup in a nutshell:
>>>>> - Ceph cluster: Ubuntu 14.04, kernel 3.16.7, Ceph 0.87-1 (OSDs with separate public/cluster networks in 10 Gbps)
>>>>> - iSCSI proxy node (targetcli/LIO): Ubuntu 14.04, kernel 3.16.7, Ceph 0.87-1 (10 Gbps)
>>>>> - Client node: Ubuntu 12.04, kernel 3.11 (10 Gbps)
>>>>>
>>>>> Relevant cluster config: writeback cache tiering with NVMe PCI-E cards (2 replicas) in front of an erasure coded pool (k=3, m=2) backed by spindles.
>>>>>
>>>>> I'm following the instructions here:
>>>>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>>>> No issues with creating and mapping a 100GB RBD image and then creating the target.
>>>>>
>>>>> I'm interested in finding out the overhead/performance impact of re-exporting through iSCSI, so the idea is to run benchmarks.
>>>>> Here's a fio test I'm trying to run on the client node on the mounted iSCSI device:
>>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>
>>>>> The benchmark will eventually hang towards the end of the test for several long seconds before completing.
>>>>> On the proxy node, the kernel complains with iSCSI portal login timeouts: http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog: http://pastebin.com/AiRTWDwR
>>>>>
>>>>
>>>> You are hitting a different issue. German Anders is most likely correct and you hit the rbd hang. That then caused the iSCSI/SCSI command to time out, which caused the SCSI error handler to run. In your logs we see that the LIO error handler received a task abort from the initiator, and that timed out, which caused the escalation (the iSCSI portal login related messages).
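A closing note on the initiator side: if the backing store can stall for tens of seconds while a cache tier flushes, lengthening the SCSI command timeout and the open-iscsi recovery timeouts can keep a slow flush from escalating into a task abort and session recovery. It does not fix the underlying rbd hang. A sketch, assuming the exported LUN shows up as /dev/sdu on the client as in the test above, and with example values only:

# raise the per-command SCSI timeout (in seconds) for the iSCSI-attached disk
echo 300 > /sys/block/sdu/device/timeout

# /etc/iscsi/iscsid.conf (open-iscsi): give error recovery more headroom
node.session.timeo.replacement_timeout = 180
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 30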