Re: very different performance on two volumes in the same pool #2

I had the same problem when doing benchmarks with small block sizes (<8k) to RBDs. These settings seemed to fix the problem for me.

sudo ceph tell osd.* injectargs '--filestore_merge_threshold 40'
sudo ceph tell osd.* injectargs '--filestore_split_multiple 8'

After you apply the settings, give it a few minutes to shuffle the data around.
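
If you want the new values to survive OSD restarts as well, you can also put them into ceph.conf on the OSD nodes (same values, just as a sketch):

  [osd]
  filestore merge threshold = 40
  filestore split multiple = 8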

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
Sent: Monday, May 11, 2015 3:21 AM
To: Nikola Ciprich
Cc: ceph-users; nik@xxxxxxxxxxx
Subject: Re:  very different performance on two volumes in the same pool #2

Nik,
If you increase num_jobs beyond 4, does it help further? Try 8 or so.
Yeah, libsoft* is definitely consuming some CPU cycles, but I don't know how to resolve that.
Also, acpi_processor_ffh_cstate_enter popped up and is consuming a lot of CPU. Try disabling C-states and running the CPU in maximum performance mode; this may give you some boost.
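
One way to do that, assuming an Intel box with the cpupower tool installed, is something like:

  # switch all cores to the performance governor
  sudo cpupower frequency-set -g performance

and, to keep the cores out of deep C-states, adding intel_idle.max_cstate=0 processor.max_cstate=1 to the kernel command line and rebooting.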

Thanks & Regards
Somnath

-----Original Message-----
From: Nikola Ciprich [mailto:nikola.ciprich@xxxxxxxxxxx] 
Sent: Sunday, May 10, 2015 11:32 PM
To: Somnath Roy
Cc: ceph-users; nik@xxxxxxxxxxx
Subject: Re:  very different performance on two volumes in the same pool #2

On Mon, May 11, 2015 at 06:07:21AM +0000, Somnath Roy wrote:
> Yes, you need to run fio clients on a separate box, it will take quite a bit of CPU.
> If you stop OSDs on other nodes, rebalancing will start. Have you waited for the cluster to reach the active+clean state? If you run the benchmark while rebalancing is going on, performance will be impacted.
I set noout, so there was no rebalancing; I forgot to mention that.
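
For completeness, that is just:

  ceph osd set noout      # before stopping the OSD
  ceph osd unset noout    # once it is back in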


> 
> ~110% CPU util seems pretty low. Try running fio-rbd with more num_jobs (say 3, 4 or more); iodepth=64 is fine. See if it improves performance or not.
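> 
> Something along these lines, i.e. your earlier command line with numjobs added:
> 
>   fio --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 --name=test \
>       --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k --iodepth=64 \
>       --readwrite=randread --numjobs=4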
ok, increasing num_jobs to 4 seems to squeeze a bit more out of the cluster, about 43.3K IOPS..

OSD CPU util jumps to ~300% on both remaining nodes, so there still seems to be a bit of headroom..

> Also, since you have 3 OSDs (3 nodes?), I would suggest tweaking the
> following settings:
> 
> osd_op_num_threads_per_shard
> osd_op_num_shards
> 
> Maybe try (1,10 / 1,15 / 2,10)?

tried all those combinations, but it makes almost no difference..

do you think I could get more than those 43k?

one more thing that makes me wonder a bit is this line I can see in perf:
  2.21%  libsoftokn3.so             [.] 0x000000000001ebb2

I suppose this has something to do with resolving; 2.2% seems like quite a lot to me..
Should I be worried about it? Does it make sense to enable kernel DNS resolving support in ceph?

thanks for your time Somnath!

nik



> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Nikola Ciprich [mailto:nikola.ciprich@xxxxxxxxxxx]
> Sent: Sunday, May 10, 2015 10:33 PM
> To: Somnath Roy
> Cc: ceph-users; nik@xxxxxxxxxxx
> Subject: Re:  very different performance on two volumes in 
> the same pool #2
> 
> 
> On Mon, May 11, 2015 at 05:20:25AM +0000, Somnath Roy wrote:
> > Two things..
> > 
> > 1. You should always precondition the SSD drives before benchmarking them.
> well, I don't really understand... ?
> 
> > 
> > 2. After creating and mapping an RBD LUN, you need to write data to it
> > first before reading it back, otherwise the fio output will be
> > misleading. In fact, I think you will see the IO is not even hitting
> > the cluster (check with ceph -s).
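> > A minimal way to do that fill pass with the same fio rbd engine (reusing
> > the pool/image names from your benchmark command, purely as a sketch)
> > would be something like:
> > 
> >   fio --ioengine=rbd --pool=ssd3r --rbdname=${rbdname} --direct=1 \
> >       --rw=write --bs=1M --iodepth=16 --name=fill
> > 
> > and then watch ceph -s to confirm the writes really hit the cluster.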
> yes, so this confirms my conjecture. ok.
> 
> 
> > 
> > Now, if you are saying it's a 3 OSD setup, yes, ~23K is pretty low. Check the following.
> > 
> > 1. Check whether the client or OSD node CPU is saturating or not.
> On the OSD nodes, I can see ceph-osd CPU utilisation of ~110%. On the client node (which is one of the OSD nodes as well), I can see fio eating quite a lot of CPU cycles.. I tried stopping ceph-osd on this node (so only two nodes are serving data) and performance got a bit higher, to ~33k IOPS. But I still think it's not very good..
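> (For reference, stopping it is just something like service ceph stop osd.<id> on that node, using the sysvinit scripts on CentOS 6.)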
> 
> 
> > 
> > 2. With 4K blocks, I hope the network BW is fine.
> I think it's ok..
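> (raw throughput between the nodes can be sanity-checked with iperf3, e.g. iperf3 -s on one node and iperf3 -c <other-node> from another)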
> 
> 
> > 
> > 3. Number of PGs/pool should be ~128 or so.
> I'm using pg_num 128
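> (this can be double-checked with ceph osd pool get <pool> pg_num and ceph osd pool get <pool> pgp_num)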
> 
> 
> > 
> > 4. If you are using krbd, you might want to try the latest krbd module, where the TCP_NODELAY problem is fixed. If you don't want that complexity, try with fio-rbd.
> I'm not using krbd (only for writing data to the volume); for benchmarking, I'm using fio-rbd.
> 
> anything else I could check?
> 
> 
> > 
> > Hope this helps,
> > 
> > Thanks & Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On 
> > Behalf Of Nikola Ciprich
> > Sent: Sunday, May 10, 2015 9:43 PM
> > To: ceph-users
> > Cc: nik@xxxxxxxxxxx
> > Subject:  very different performance on two volumes in 
> > the same pool #2
> > 
> > Hello ceph developers and users,
> > 
> > some time ago I posted a question here regarding very different performance of two volumes in one pool (backed by SSD drives).
> > 
> > After some examination, I probably got to the root of the problem..
> > 
> > When I create a fresh volume (i.e. rbd create --image-format 2 --size
> > 51200 ssd/test) and run a random IO fio benchmark
> > 
> > fio  --randrepeat=1 --ioengine=rbd --direct=1 --gtod_reduce=1 
> > --name=test --pool=ssd3r --rbdname=${rbdname} --invalidate=1 --bs=4k
> > --iodepth=64 --readwrite=randread
> > 
> > I get very nice performance of up to 200k IOPS. However, once the volume has been written to (i.e. when I map it using rbd map and dd the whole volume full of random data) and I repeat the benchmark, random read performance drops to ~23k IOPS.
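> > (The fill step is roughly: rbd map ssd/test, then something like
> > dd if=/dev/urandom of=/dev/rbd/ssd/test bs=1M oflag=direct; the exact
> > /dev/rbd/... device path depends on the udev rules in use.)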
> > 
> > This leads me to the conjecture that for unwritten (sparse) volumes, a read is just a no-op, simply returning zeroes without really having to read data from physical storage, and thus showing nice performance; but once the volume is written, performance drops due to the need to physically read the data. Right?
> > 
> > However, I'm a bit unhappy about the performance drop. The pool is backed by 3 SSD drives (each with random IO performance of 100k IOPS) on three nodes, and the pool size (replication) is set to 3. The cluster is completely idle; the nodes are quad-core Xeon E3-1220 v3 @ 3.10GHz with 32GB RAM each, running CentOS 6, kernel 3.18.12, ceph 0.94.1. I'm using libtcmalloc (I even tried upgrading gperftools-libs to 2.4). The nodes are connected with 10Gb Ethernet, with jumbo frames enabled.
> > 
> > 
> > I tried tuning the following values:
> > 
> > osd_op_threads = 5
> > filestore_op_threads = 4
> > osd_op_num_threads_per_shard = 1
> > osd_op_num_shards = 25
> > filestore_fd_cache_size = 64
> > filestore_fd_cache_shards = 32
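> > (these go into the [osd] section of ceph.conf; the values the OSDs are
> > actually running with can be verified over the admin socket, e.g.
> > ceph daemon osd.<id> config show | grep osd_op)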
> > 
> > I don't see anything special in perf:
> > 
> >   5.43%  [kernel]              [k] acpi_processor_ffh_cstate_enter
> >   2.93%  libtcmalloc.so.4.2.6  [.] 0x0000000000017d2c
> >   2.45%  libpthread-2.12.so    [.] pthread_mutex_lock
> >   2.37%  libpthread-2.12.so    [.] pthread_mutex_unlock
> >   2.33%  [kernel]              [k] do_raw_spin_lock
> >   2.00%  libsoftokn3.so        [.] 0x000000000001f455
> >   1.96%  [kernel]              [k] __switch_to
> >   1.32%  [kernel]              [k] __schedule
> >   1.24%  libstdc++.so.6.0.13   [.] std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char
> >   1.24%  libc-2.12.so          [.] memcpy
> >   1.19%  libtcmalloc.so.4.2.6  [.] operator delete(void*)
> >   1.16%  [kernel]              [k] __d_lookup_rcu
> >   1.09%  libstdc++.so.6.0.13   [.] 0x000000000007d6be
> >   0.93%  libstdc++.so.6.0.13   [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)
> >   0.93%  ceph-osd              [.] crush_hash32_3
> >   0.85%  libc-2.12.so          [.] vfprintf
> >   0.84%  libc-2.12.so          [.] __strlen_sse42
> >   0.80%  [kernel]              [k] get_futex_key_refs
> >   0.80%  libpthread-2.12.so    [.] pthread_mutex_trylock
> >   0.78%  libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
> >   0.71%  libstdc++.so.6.0.13   [.] std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)
> >   0.68%  ceph-osd              [.] ceph::log::Log::flush()
> >   0.66%  libtcmalloc.so.4.2.6  [.] tc_free
> >   0.63%  [kernel]              [k] resched_curr
> >   0.63%  [kernel]              [k] page_fault
> >   0.62%  libstdc++.so.6.0.13   [.] std::string::reserve(unsigned long)
> > 
> > I'm running the benchmark directly on one of the nodes, which I know is not optimal, but it's still able to give those 200k IOPS for the empty volume, so I guess it shouldn't be a problem..
> > 
> > Random write performance is another story; it's really poor, but I'd like to deal with read performance first..
> > 
> > 
> > so my question is: are those numbers normal? If not, what should I check?
> > 
> > I'll be very grateful for any hints I can get..
> > 
> > thanks a lot in advance
> > 
> > nik
> > 
> > 
> 

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



