Re: Poor RBD performance as LIO iSCSI target

Nick Fisk <nick@xxxxxxxxxx> · Mon, 08 Dec 2014 19:13:17 +0000

Hi David,

This is a long shot, but have you checked the max queue depth on the
iSCSI side? I've got a feeling that LIO might be set to 32 by default.

This would definitely have an effect at the high queue depths you are  
testing with.
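
To illustrate why a target-side cap would matter, here is a minimal Python sketch based on Little's law (throughput is roughly outstanding I/Os divided by latency). The cap of 32 is only the guess mentioned above, not a verified LIO default, and the 20 ms per-request latency is a made-up figure that is assumed to stay constant regardless of depth.

```python
def effective_depth(requested_depth, target_cap):
    """In-flight I/Os are limited by the smaller of the two queue depths."""
    return min(requested_depth, target_cap)


def throughput_mib_s(depth, block_mib, latency_s):
    """Little's law: throughput = outstanding I/Os * block size / latency."""
    return depth * block_mib / latency_s


FIO_IODEPTH = 200        # --iodepth used in the fio tests in this thread
ASSUMED_CAP = 32         # assumption: the suspected LIO default, unverified
ASSUMED_LATENCY = 0.02   # assumption: 20 ms per 1 MiB write, illustrative only

depth = effective_depth(FIO_IODEPTH, ASSUMED_CAP)
print("effective queue depth:", depth)
print("throughput ceiling: %.0f MiB/s" % throughput_mib_s(depth, 1, ASSUMED_LATENCY))
```

If the cap applies, raising --iodepth beyond it on the initiator would not put more I/O in flight at the target, so results at iodepth=200 and iodepth=32 would look alike.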

On 8 Dec 2014 16:53, David Moreau Simard <dmsimard@xxxxxxxx> wrote:

Haven't tried other iSCSI implementations (yet).

LIO/targetcli makes it very easy to  
implement/integrate/wrap/automate around so I'm really trying to get  
this right.

A PCI-E SSD cache tier in front of a spindle-backed erasure coded pool,
with 10 Gbps across the board, yields results slightly better than or very
similar to two spindles in hardware RAID-0 with writeback caching.
With that in mind, the performance is not outright awful by any
means; there's just a lot of overhead we have to keep in mind.

What I'd like to further test but am unable to right now is to see  
what happens if you scale up the cluster. Right now I'm testing on  
only two nodes.
Does IOPS scale linearly with the number of OSDs/servers?
Or is it more of a capacity thing?

Perhaps if someone else can chime in, I'm really curious.
--
David Moreau Simard

On Dec 6, 2014, at 11:18 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi David,

Very strange, but I'm glad you finally managed to get the cluster working
normally. Thank you for posting the benchmark figures; it's interesting to
see the overhead of LIO over pure RBD performance.

I should have the hardware for our cluster up and running early next year,
and I will be in a better position to test the iSCSI performance then. I will
report back once I have some numbers.

Just out of interest, have you tried any of the other iSCSI implementations
to see if they show the same performance drop?

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
David Moreau Simard
Sent: 05 December 2014 16:03
To: Nick Fisk
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Poor RBD performance as LIO iSCSI target

I've flushed everything - data, pools, configs and reconfigured the whole
thing.

I was particularly careful with cache tiering configurations (almost leaving
defaults when possible) and it's not locking anymore.
It looks like the cache tiering configuration I had was causing the
problem? I can't put my finger on exactly what/why and I don't have the
luxury of time to do this lengthy testing again.

Here's what I dumped as far as config goes before wiping:
========
# for var in size min_size pg_num pgp_num crush_ruleset \
    erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type \
    hit_set_period hit_set_count target_max_objects target_max_bytes \
    cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age \
    cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========
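
As a worked illustration of what the k=3, m=2 profile above implies, here is a minimal Python sketch of the usual erasure-coding arithmetic. The function names are hypothetical helpers, not part of any Ceph API; the only inputs taken from the dump are k, m, size and min_size.

```python
def ec_summary(k, m):
    """Return (total_chunks, raw_per_usable, tolerated_losses) for a k+m profile."""
    total_chunks = k + m
    raw_per_usable = total_chunks / k
    return total_chunks, raw_per_usable, m


def can_reconstruct(available_chunks, k):
    """An erasure coded object needs at least k chunks to be rebuilt."""
    return available_chunks >= k


K, M = 3, 2  # from the ecvolumes profile shown above
total, ratio, losses = ec_summary(K, M)
print("chunks per object: %d" % total)             # corresponds to "size: 5"
print("raw bytes per usable byte: %.2f" % ratio)
print("chunk losses tolerated: %d" % losses)       # failure domain is osd here
print("min_size 2 enough to reconstruct:", can_reconstruct(2, K))
print("min_size 3 enough to reconstruct:", can_reconstruct(3, K))
```

Note that the dump above shows min_size: 2, which is below k=3. Whether that played any part in the hangs is not established anywhere in this thread.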

And now:
========
# for var in size min_size pg_num pgp_num crush_ruleset \
    erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type \
    hit_set_period hit_set_count target_max_objects target_max_bytes \
    cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age \
    cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 150000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========
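
For reference, the two volumecache dumps above translate into the following approximate flush and evict thresholds. This is a rough Python sketch: it assumes the ratios are applied to target_max_bytes as a whole (target_max_objects is 0 in both dumps), whereas Ceph evaluates them per placement group, so real behaviour will only approximate these numbers.

```python
def cache_tier_thresholds(target_max_bytes, dirty_ratio, full_ratio):
    """Return (flush_start_bytes, evict_start_bytes) for a cache pool."""
    return target_max_bytes * dirty_ratio, target_max_bytes * full_ratio


# Values copied from the "before" and "after" dumps above.
configs = {
    "before": (100000000000, 0.5, 0.8),
    "after": (150000000000, 0.5, 0.8),
}

for label, (max_bytes, dirty, full) in configs.items():
    flush_at, evict_at = cache_tier_thresholds(max_bytes, dirty, full)
    print("%s: flushing starts near %.0f GB dirty, eviction near %.0f GB used"
          % (label, flush_at / 1e9, evict_at / 1e9))
```

The "before" flush threshold works out to about 50 GB, which is close to the test size below which the hang could not be reproduced earlier in this thread. The "before" dump also had cache_min_flush_age: 600 where the "after" dump has 0. Neither is confirmed as the cause.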

Crush map hasn't really changed before and after.

FWIW, the benchmarks I pulled out of the setup:
https://gist.github.com/dmsimard/2737832d077cfc5eff34
Definite overhead going from krbd to krbd + LIO...
--
David Moreau Simard

On Nov 20, 2014, at 4:14 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Here you go:-

Erasure Profile
k=2
m=1
plugin=jerasure
ruleset-failure-domain=osd
ruleset-root=hdd
technique=reed_sol_van

Cache Settings
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 1000000000
cache_target_dirty_ratio: 0.4
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 0

Crush Dump
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ceph-test-hdd {
      id -5           # do not change unnecessarily
      # weight 2.730
      alg straw
      hash 0  # rjenkins1
      item osd.1 weight 0.910
      item osd.2 weight 0.910
      item osd.0 weight 0.910
}
root hdd {
      id -3           # do not change unnecessarily
      # weight 2.730
      alg straw
      hash 0  # rjenkins1
      item ceph-test-hdd weight 2.730
}
host ceph-test-ssd {
      id -6           # do not change unnecessarily
      # weight 1.000
      alg straw
      hash 0  # rjenkins1
      item osd.3 weight 1.000
}
root ssd {
      id -4           # do not change unnecessarily
      # weight 1.000
      alg straw
      hash 0  # rjenkins1
      item ceph-test-ssd weight 1.000
}

# rules
rule hdd {
      ruleset 0
      type replicated
      min_size 0
      max_size 10
      step take hdd
      step chooseleaf firstn 0 type osd
      step emit
}
rule ssd {
      ruleset 1
      type replicated
      min_size 0
      max_size 4
      step take ssd
      step chooseleaf firstn 0 type osd
      step emit
}
rule ecpool {
      ruleset 2
      type erasure
      min_size 3
      max_size 20
      step set_chooseleaf_tries 5
      step take hdd
      step chooseleaf indep 0 type osd
      step emit
}

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of David Moreau Simard
Sent: 20 November 2014 20:03
To: Nick Fisk
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Poor RBD performance as LIO iSCSI target

Nick,

Can you share more details on the configuration you are using? I'll
try and duplicate those configurations in my environment and see what
happens.
I'm mostly interested in:
- Erasure code profile (k, m, plugin, ruleset-failure-domain)
- Cache tiering pool configuration (ex: hit_set_type, hit_set_period,
hit_set_count, target_max_objects, target_max_bytes,
cache_target_dirty_ratio, cache_target_full_ratio,
cache_min_flush_age,
cache_min_evict_age)

The crush rulesets would also be helpful.

Thanks,
--
David Moreau Simard

On Nov 20, 2014, at 12:43 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi David,

I've just finished running the 75GB fio test you posted a few days
back on my new test cluster.

The cluster is as follows:-

Single server with 3x hdd and 1 ssd
Ubuntu 14.04 with 3.16.7 kernel
2+1 EC pool on hdds below a 10G ssd cache pool. SSD is also partitioned
to provide journals for hdds.
150G RBD mapped locally

The fio test seemed to run without any problems. I want to run a few
more tests with different settings to see if I can reproduce your
problem. I will let you know if I find anything.

If there is anything you would like me to try, please let me know.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of David Moreau Simard
Sent: 19 November 2014 10:48
To: Ramakrishna Nishtala (rnishtal)
Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk
Subject: Re:  Poor RBD performance as LIO iSCSI target

Rama,

Thanks for your reply.

My end goal is to use iSCSI (with LIO/targetcli) to export rbd block
devices.

I was encountering issues with iSCSI which are explained in my
previous emails.
I ended up being able to reproduce the problem at will on various
Kernel and OS combinations, even on raw RBD devices - thus ruling out
the hypothesis that it was a problem with iSCSI but rather with Ceph.
I'm even running 0.88 now and the issue is still there.

I haven't isolated the issue just yet.
My next tests involve disabling the cache tiering.

I do have client krbd cache as well, i'll try to disable it too if
cache tiering isn't enough.
--
David Moreau Simard

On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal)
<rnishtal@xxxxxxxxx> wrote:

Hi Dave
Did you say iSCSI only? The tracker issue does not say so, though.
I am on Giant, with both client and Ceph on RHEL 7, and it seems to work
OK, unless I am missing something here. RBD on bare metal with kmod-rbd
and caching disabled.

[root@compute4 ~]# time fio --name=writefile --size=100G \
    --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 \
    --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 \
    --iodepth=200 --ioengine=libaio
writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio,
iodepth=200
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0
iops] [eta 00m:00s] ...
Disk stats (read/write):
rbd0: ios=184/204800, merge=0/0, ticks=70/16164931,
in_queue=16164942, util=99.98%

real    1m56.175s
user    0m18.115s
sys     0m10.430s

Regards,

Rama

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
Behalf Of David Moreau Simard
Sent: Tuesday, November 18, 2014 3:49 PM
To: Nick Fisk
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Poor RBD performance as LIO iSCSI target

Testing without the cache tiering is the next test I want to do when
I have time.

When it's hanging, there is no activity at all on the cluster.
Nothing in "ceph -w", nothing in "ceph osd pool stats".

I'll provide an update when I have a chance to test without tiering.
--
David Moreau Simard

On Nov 18, 2014, at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:

Hi David,

Have you tried on a normal replicated pool with no cache? I've seen
a number of threads recently where caching is causing various
things to block/hang.
It would be interesting to see if this still happens without the
caching layer; at least it would rule it out.

Also, is there any sign that once the test passes ~50GB the cache
starts flushing to the backing pool, causing slow performance?

I am planning a deployment very similar to yours so I am following
this with great interest. I'm hoping to build a single node test
"cluster" shortly, so I might be in a position to work with you on
this issue and hopefully get it resolved.

Nick

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
Behalf Of David Moreau Simard
Sent: 18 November 2014 19:58
To: Mike Christie
Cc: ceph-users@xxxxxxxxxxxxxx; Christopher Spearman
Subject: Re:  Poor RBD performance as LIO iSCSI target

Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and
chatted with "dis" on #ceph-devel.

I ran a LOT of tests on a LOT of combinations of kernels (sometimes
with tunables legacy). I haven't found a magical combination in
which the following test does not hang:
fio --name=writefile --size=100G --filesize=100G \
    --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 \
    --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 \
    --iodepth=200 --ioengine=libaio

Whether directly on a mapped rbd device, on a mounted filesystem
(over rbd), or exported through iSCSI.. nothing.
I guess that rules out a potential issue with iSCSI overhead.

Now, something I noticed out of pure luck is that I am unable to
reproduce the issue if I drop the size of the test to 50GB. Tests
will complete in under 2 minutes.
75GB will hang right at the end and take more than 10 minutes.

TL;DR of tests:
- 3x fio --name=writefile --size=50G --filesize=50G
--filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
--randrepeat=0 --rw=write --refill_buffers --end_fsync=1
--iodepth=200 --ioengine=libaio
-- 1m44s, 1m49s, 1m40s

- 3x fio --name=writefile --size=75G --filesize=75G
--filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
--randrepeat=0 --rw=write --refill_buffers --end_fsync=1
--iodepth=200 --ioengine=libaio
-- 10m12s, 10m11s, 10m13s
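
To put the timings above in perspective, here is a small Python sketch that converts them into average throughput. It assumes fio's "G" suffix means GiB (1024^3 bytes); the helper names are made up for this example.

```python
import re


def parse_duration(text):
    """Convert a string such as '10m12s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", text)
    if match is None:
        raise ValueError("unsupported duration: %r" % text)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + int(seconds)


def mean_throughput_mib_s(size_gib, durations):
    """Average throughput in MiB/s over several runs of the same size."""
    total_seconds = sum(parse_duration(d) for d in durations)
    return size_gib * 1024 * len(durations) / total_seconds


# Timings copied from the TL;DR above.
runs = {
    50: ["1m44s", "1m49s", "1m40s"],
    75: ["10m12s", "10m11s", "10m13s"],
}

for size_gib, durations in runs.items():
    print("%d GiB: about %.0f MiB/s on average"
          % (size_gib, mean_throughput_mib_s(size_gib, durations)))
```

The 75G runs average roughly a quarter of the throughput of the 50G runs, which is consistent with most of the extra time being spent stalled rather than writing.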

Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP

Does that ring a bell for you guys?

--
David Moreau Simard

On Nov 13, 2014, at 3:31 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:

On 11/13/2014 10:17 AM, David Moreau Simard wrote:
Running into weird issues here as well in a test environment. I don't
have a solution either but perhaps we can find some things in common..

Setup in a nutshell:
- Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs
with separate public/cluster network in 10 Gbps)
- iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7,
Ceph 0.87-1 (10 Gbps)
- Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)

Relevant cluster config: Writeback cache tiering with NVME PCI-E
cards (2 replica) in front of an erasure coded pool (k=3,m=2) backed
by spindles.

I'm following the instructions here:
http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
No issues with creating and mapping a 100GB RBD image and then
creating the target.

I'm interested in finding out the overhead/performance impact of
re-exporting through iSCSI so the idea is to run benchmarks.
Here's a fio test I'm trying to run on the client node on the
mounted iscsi device:
fio --name=writefile --size=100G --filesize=100G \
    --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 \
    --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 \
    --iodepth=200 --ioengine=libaio

The benchmark will eventually hang towards the end of the test
for many seconds before completing.
On the proxy node, the kernel complains with iscsi portal login
timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance
errors in syslog: http://pastebin.com/AiRTWDwR

You are hitting a different issue. German Anders is most likely
correct and you hit the rbd hang. That then caused the iscsi/scsi
command to time out, which caused the scsi error handler to run. In
your logs we see that the LIO error handler received a task abort
from the initiator and that it timed out, which caused the escalation
(iscsi portal login related messages).

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com