I've flushed everything - data, pools, configs - and reconfigured the whole thing. I was particularly careful with the cache tiering configuration (leaving defaults where possible) and it's not locking up anymore.

It looks like the cache tiering configuration I had was causing the problem? I can't put my finger on exactly what or why, and I don't have the luxury of time to go through that lengthy testing again.

Here's what I dumped as far as config goes before wiping:
========
# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 2
pg_num: 7200
pgp_num: 7200
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 7200
pgp_num: 7200
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 100000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 600
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========

And now:
========
# for var in size min_size pg_num pgp_num crush_ruleset erasure_code_profile; do ceph osd pool get volumes $var; done
size: 5
min_size: 3
pg_num: 2048
pgp_num: 2048
crush_ruleset: 1
erasure_code_profile: ecvolumes

# for var in size min_size pg_num pgp_num crush_ruleset hit_set_type hit_set_period hit_set_count target_max_objects target_max_bytes cache_target_dirty_ratio cache_target_full_ratio cache_min_flush_age cache_min_evict_age; do ceph osd pool get volumecache $var; done
size: 2
min_size: 1
pg_num: 2048
pgp_num: 2048
crush_ruleset: 4
hit_set_type: bloom
hit_set_period: 3600
hit_set_count: 1
target_max_objects: 0
target_max_bytes: 150000000000
cache_target_dirty_ratio: 0.5
cache_target_full_ratio: 0.8
cache_min_flush_age: 0
cache_min_evict_age: 1800

# ceph osd erasure-code-profile get ecvolumes
directory=/usr/lib/ceph/erasure-code
k=3
m=2
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van
========

The crush map hasn't really changed before and after.

FWIW, the benchmarks I pulled out of the setup: https://gist.github.com/dmsimard/2737832d077cfc5eff34
There's definite overhead going from krbd to krbd + LIO...
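For reference, re-attaching the cache tier with the near-default values above looks roughly like this - a sketch only, assuming the 'volumes' (EC) and 'volumecache' (replicated) pools already exist, using the stock ceph CLI rather than the exact commands I ran:

========
# Attach 'volumecache' as a writeback tier in front of 'volumes'
ceph osd tier add volumes volumecache
ceph osd tier cache-mode volumecache writeback
ceph osd tier set-overlay volumes volumecache

# Apply the "after" settings shown in the second dump above
ceph osd pool set volumecache hit_set_type bloom
ceph osd pool set volumecache hit_set_period 3600
ceph osd pool set volumecache hit_set_count 1
ceph osd pool set volumecache target_max_bytes 150000000000
ceph osd pool set volumecache cache_target_dirty_ratio 0.5
ceph osd pool set volumecache cache_target_full_ratio 0.8
ceph osd pool set volumecache cache_min_flush_age 0
ceph osd pool set volumecache cache_min_evict_age 1800
========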
--
David Moreau Simard

> On Nov 20, 2014, at 4:14 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>
> Here you go:-
>
> Erasure Profile
> k=2
> m=1
> plugin=jerasure
> ruleset-failure-domain=osd
> ruleset-root=hdd
> technique=reed_sol_van
>
> Cache Settings
> hit_set_type: bloom
> hit_set_period: 3600
> hit_set_count: 1
> target_max_objects: 0
> target_max_bytes: 1000000000
> cache_target_dirty_ratio: 0.4
> cache_target_full_ratio: 0.8
> cache_min_flush_age: 0
> cache_min_evict_age: 0
>
> Crush Dump
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host ceph-test-hdd {
>         id -5           # do not change unnecessarily
>         # weight 2.730
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 0.910
>         item osd.2 weight 0.910
>         item osd.0 weight 0.910
> }
> root hdd {
>         id -3           # do not change unnecessarily
>         # weight 2.730
>         alg straw
>         hash 0  # rjenkins1
>         item ceph-test-hdd weight 2.730
> }
> host ceph-test-ssd {
>         id -6           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
> }
> root ssd {
>         id -4           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item ceph-test-ssd weight 1.000
> }
>
> # rules
> rule hdd {
>         ruleset 0
>         type replicated
>         min_size 0
>         max_size 10
>         step take hdd
>         step chooseleaf firstn 0 type osd
>         step emit
> }
> rule ssd {
>         ruleset 1
>         type replicated
>         min_size 0
>         max_size 4
>         step take ssd
>         step chooseleaf firstn 0 type osd
>         step emit
> }
> rule ecpool {
>         ruleset 2
>         type erasure
>         min_size 3
>         max_size 20
>         step set_chooseleaf_tries 5
>         step take hdd
>         step chooseleaf indep 0 type osd
>         step emit
> }
>
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
> Sent: 20 November 2014 20:03
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Poor RBD performance as LIO iSCSI target
>
> Nick,
>
> Can you share more details on the configuration you are using? I'll try to
> duplicate those configurations in my environment and see what happens.
> I'm mostly interested in:
> - Erasure code profile (k, m, plugin, ruleset-failure-domain)
> - Cache tiering pool configuration (ex: hit_set_type, hit_set_period,
>   hit_set_count, target_max_objects, target_max_bytes,
>   cache_target_dirty_ratio, cache_target_full_ratio, cache_min_flush_age,
>   cache_min_evict_age)
>
> The crush rulesets would also be helpful.
>
> Thanks,
> --
> David Moreau Simard
>
>> On Nov 20, 2014, at 12:43 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>
>> Hi David,
>>
>> I've just finished running the 75GB fio test you posted a few days
>> back on my new test cluster.
>>
>> The cluster is as follows:-
>>
>> Single server with 3x hdd and 1 ssd
>> Ubuntu 14.04 with 3.16.7 kernel
>> 2+1 EC pool on hdds below a 10G ssd cache pool. The SSD is also
>> partitioned to provide journals for the hdds.
>> 150G RBD mapped locally
>>
>> The fio test seemed to run without any problems. I want to run a few
>> more tests with different settings to see if I can reproduce your
>> problem. I will let you know if I find anything.
>>
>> If there is anything you would like me to try, please let me know.
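Recreating a setup along the lines of Nick's would look roughly like the commands below. Only the profile values and the hdd/ssd rulesets come from his dump above; the pool and image names, the PG counts and the 150G size are placeholders, not necessarily what he used:

========
# 2+1 erasure profile rooted on the hdd bucket, failure domain = osd
ceph osd erasure-code-profile set ec21hdd k=2 m=1 plugin=jerasure \
    technique=reed_sol_van ruleset-failure-domain=osd ruleset-root=hdd

# EC base pool on the hdds, replicated cache pool on the ssd ruleset
ceph osd pool create ecbase 128 128 erasure ec21hdd
ceph osd pool create ssdcache 128 128 replicated ssd

# Writeback cache tier in front of the EC pool
ceph osd tier add ecbase ssdcache
ceph osd tier cache-mode ssdcache writeback
ceph osd tier set-overlay ecbase ssdcache

# 150G image mapped locally (rbd create takes the size in MB here)
rbd create testimg --size 153600 --pool ecbase
rbd map ecbase/testimg
========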
>>
>> Nick
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>> Sent: 19 November 2014 10:48
>> To: Ramakrishna Nishtala (rnishtal)
>> Cc: ceph-users@xxxxxxxxxxxxxx; Nick Fisk
>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>
>> Rama,
>>
>> Thanks for your reply.
>>
>> My end goal is to use iSCSI (with LIO/targetcli) to export rbd block devices.
>>
>> I was encountering issues with iSCSI which are explained in my previous emails.
>> I ended up being able to reproduce the problem at will on various kernel and
>> OS combinations, even on raw RBD devices - ruling out the hypothesis that it
>> was a problem with iSCSI; it looks like a problem with Ceph itself.
>> I'm even running 0.88 now and the issue is still there.
>>
>> I haven't isolated the issue just yet.
>> My next tests involve disabling the cache tiering (see the sketch further down).
>>
>> I do have client krbd cache as well; I'll try to disable it too if disabling
>> cache tiering isn't enough.
>> --
>> David Moreau Simard
>>
>>> On Nov 18, 2014, at 8:10 PM, Ramakrishna Nishtala (rnishtal) <rnishtal@xxxxxxxxx> wrote:
>>>
>>> Hi Dave,
>>> Did you say iSCSI only? The tracker issue does not say so, though.
>>> I am on giant, with both client and ceph on RHEL 7, and it seems to work ok,
>>> unless I am missing something here. RBD on bare metal with kmod-rbd and
>>> caching disabled.
>>>
>>> [root@compute4 ~]# time fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>> writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=200
>>> fio-2.1.11
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 iops] [eta 00m:00s]
>>> ...
>>> Disk stats (read/write):
>>>   rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, in_queue=16164942, util=99.98%
>>>
>>> real    1m56.175s
>>> user    0m18.115s
>>> sys     0m10.430s
>>>
>>> Regards,
>>>
>>> Rama
>>>
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>>> Sent: Tuesday, November 18, 2014 3:49 PM
>>> To: Nick Fisk
>>> Cc: ceph-users@xxxxxxxxxxxxxx
>>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>>
>>> Testing without the cache tiering is the next test I want to do when I have time.
>>>
>>> When it's hanging, there is no activity at all on the cluster.
>>> Nothing in "ceph -w", nothing in "ceph osd pool stats".
>>>
>>> I'll provide an update when I have a chance to test without tiering.
>>> --
>>> David Moreau Simard
>>>
>>>> On Nov 18, 2014, at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>>>>
>>>> Hi David,
>>>>
>>>> Have you tried on a normal replicated pool with no cache? I've seen a
>>>> number of threads recently where caching is causing various things to block/hang.
>>>> It would be interesting to see if this still happens without the
>>>> caching layer; at least it would rule it out.
>>>>
>>>> Also, is there any sign that as the test passes ~50GB the cache might
>>>> start flushing to the backing pool, causing slow performance?
>>>>
>>>> I am planning a deployment very similar to yours, so I am following
>>>> this with great interest. I'm hoping to build a single node test
>>>> "cluster" shortly, so I might be in a position to work with you on
>>>> this issue and hopefully get it resolved.
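On the "test without tiering" point above: detaching a cache tier is basically a flush-and-remove, roughly the commands below. The pool names are the volumes/volumecache pair from the dump at the top; note that an RBD image sitting on an erasure coded base pool needs the cache overlay to work at all, so a true no-cache comparison really means a plain replicated pool, as Nick suggests:

========
# Stop caching new writes, then flush and evict everything in the cache pool
ceph osd tier cache-mode volumecache forward
rados -p volumecache cache-flush-evict-all

# Detach the cache tier from the base pool
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumecache
========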
>>>>
>>>> Nick
>>>>
>>>> -----Original Message-----
>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of David Moreau Simard
>>>> Sent: 18 November 2014 19:58
>>>> To: Mike Christie
>>>> Cc: ceph-users@xxxxxxxxxxxxxx; Christopher Spearman
>>>> Subject: Re: Poor RBD performance as LIO iSCSI target
>>>>
>>>> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and
>>>> chatted with "dis" on #ceph-devel.
>>>>
>>>> I ran a LOT of tests on a LOT of combinations of kernels (sometimes
>>>> with tunables legacy). I haven't found a magical combination in
>>>> which the following test does not hang:
>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>
>>>> Either directly on a mapped rbd device, on a mounted filesystem (over
>>>> rbd), or exported through iSCSI... nothing helps.
>>>> I guess that rules out a potential issue with iSCSI overhead.
>>>>
>>>> Now, something I noticed out of pure luck is that I am unable to
>>>> reproduce the issue if I drop the size of the test to 50GB. Those
>>>> tests complete in under 2 minutes.
>>>> 75GB will hang right at the end and take more than 10 minutes.
>>>>
>>>> TL;DR of tests:
>>>> - 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>> -- 1m44s, 1m49s, 1m40s
>>>>
>>>> - 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>> -- 10m12s, 10m11s, 10m13s
>>>>
>>>> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>>>>
>>>> Does that ring a bell for you guys?
>>>>
>>>> --
>>>> David Moreau Simard
>>>>
>>>>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>>>>> Running into weird issues here as well in a test environment. I don't
>>>>>> have a solution either, but perhaps we can find some things in common.
>>>>>>
>>>>>> Setup in a nutshell:
>>>>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with
>>>>>>   separate public/cluster networks on 10 Gbps)
>>>>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (10 Gbps)
>>>>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>>>>>
>>>>>> Relevant cluster config: writeback cache tiering with NVME PCI-E cards
>>>>>> (2 replicas) in front of an erasure coded pool (k=3, m=2) backed by spindles.
>>>>>>
>>>>>> I'm following the instructions here:
>>>>>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>>>>>> No issues with creating and mapping a 100GB RBD image and then creating the target.
>>>>>>
>>>>>> I'm interested in finding out the overhead/performance impact of
>>>>>> re-exporting through iSCSI, so the idea is to run benchmarks.
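For what it's worth, the LIO side of that setup is just a block (iblock) backstore on top of the mapped rbd device. From within the targetcli shell it would look roughly like the steps below - the IQNs and the backstore name here are made up for illustration:

========
# Expose the mapped RBD device as a block backstore
/backstores/block create name=rbd-image dev=/dev/rbd0

# Create the iSCSI target, attach the LUN and allow the client's initiator
/iscsi create iqn.2014-11.com.example:rbd-image
/iscsi/iqn.2014-11.com.example:rbd-image/tpg1/luns create /backstores/block/rbd-image
/iscsi/iqn.2014-11.com.example:rbd-image/tpg1/acls create iqn.2014-11.com.example:client1
saveconfig
========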
>>>>>> Here's a fio test I'm trying to run on the client node on the mounted iSCSI device:
>>>>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>>>>>
>>>>>> The benchmark will eventually hang towards the end of the test for some
>>>>>> long seconds before completing.
>>>>>> On the proxy node, the kernel complains with iSCSI portal login timeouts:
>>>>>> http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog:
>>>>>> http://pastebin.com/AiRTWDwR
>>>>>>
>>>>>
>>>>> You are hitting a different issue. German Anders is most likely correct
>>>>> and you hit the rbd hang. That then caused the iSCSI/SCSI command to time
>>>>> out, which caused the SCSI error handler to run. In your logs we see that
>>>>> the LIO error handler received a task abort from the initiator and that
>>>>> timed out, which caused the escalation (the iscsi portal login related
>>>>> messages).

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com