Re: rbd cache on full ssd cluster

Hello,

On Tue, 15 Mar 2016 12:00:24 +0200 Yair Magnezi wrote:

> Thanks Christian .
> 
> Still
> 
> "So yes, your numbers are normal for single client, low depth reads, as
> many threads in this ML confirm."
> 
> we're facing very high latency (I expect much less latency from an ssd
> cluster):
> 
As I tried to explain, this has very little to do with your cluster being
SSD based. 

> clat percentiles (usec):
> |  1.00th=[  350],  5.00th=[  390], 10.00th=[  414], 20.00th=[  454], 
> | 30.00th=[  494], 40.00th=[  540], 50.00th=[  612], 60.00th=[  732], 
> | 70.00th=[ 1064], 80.00th=[10304], 90.00th=[37632], 95.00th=[38656], 
> | 99.00th=[40192], 99.50th=[41216], 99.90th=[43264], 99.95th=[43776],

Compared to mine:

clat percentiles (usec):
|  1.00th=[  149],  5.00th=[  175], 10.00th=[  197], 20.00th=[  772],
| 30.00th=[  884], 40.00th=[  956], 50.00th=[ 1012], 60.00th=[ 1064],
| 70.00th=[ 1128], 80.00th=[ 1192], 90.00th=[ 1288], 95.00th=[ 1400],
| 99.00th=[ 1800], 99.50th=[ 2224], 99.90th=[ 3600], 99.95th=[ 3952],
| 99.99th=[ 6880]

Your latency is (aside from things that may otherwise be wrong, especially
in your network) most likely down to the following factors, in order of
precedence as I see it:

1. Network (a network stack, switches, etc. are not a 30cm SATA cable).
This is one of the reasons I use Infiniband, especially with the future
RDMA support.

2. Your Ceph version. Firefly is not tuned for SSDs, in fact it was slower
in some ways than older versions. It would be worthwhile to wait for Jewel
if you can. The Ceph stack is definitely not a 30cm SATA cable and
introduces significant latencies.

3. Your CPUs, but they would be more involved with writes than reads.
OTOH, having them in performance mode instead of powersave will lower
latencies as the CPUs don't have to ramp up.

4. Tuning like disabling cephx, different memory allocators, etc. (see the
sketch below).
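
To illustrate 3., a minimal sketch (assuming the cpufreq sysfs interface is
exposed on your nodes; as for 4., your ceph.conf below already has the cephx
auth options set to none, so what remains there is mostly the memory
allocator side of things):
---
# Pin every core to the performance governor so the CPUs don't have
# to ramp up from powersave on each request.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# Verify it took:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
---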

Christian

> Thanks
> 
> 
> 
> 
> 
> 
> Yair Magnezi
> Storage & Data Protection TL // Kenshoo
> Office +972 7 32862423 // Mobile +972 50 575-2955
> 
> 
> 
> On Tue, Mar 15, 2016 at 2:28 AM, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> >
> > Hello,
> >
> > On Mon, 14 Mar 2016 15:51:11 +0200 Yair Magnezi wrote:
> >
> > > On Fri, Mar 11, 2016 at 2:01 AM, Christian Balzer <chibi@xxxxxxx>
> > > wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > As alway there are many similar threads in here, googling and
> > > > reading up stuff are good for you.
> > > >
> > > > On Thu, 10 Mar 2016 16:55:03 +0200 Yair Magnezi wrote:
> > > >
> > > > > Hello Cephers .
> > > > >
> > > > > I wonder if anyone has some experience with full ssd cluster .
> > > > > We're testing ceph ( "firefly" ) with 4 nodes ( supermicro
> > > > >  SYS-F628R3-R72BPT ) * 1TB  SSD , total of 12 osds .
> > > > > Our network is 10 gig .
> > > > Much more relevant detail is needed, from SW versions (kernel, OS,
> > > > Ceph) and configuration (replica size of your pool) to precise HW info.
> > > >
> > >
> > >     H/W  --> 4 nodes  supermicro ( SYS-F628R3-R72BPT ) , every node
> > > has 64 GB mem ,
> > >                   MegaRAID SAS 2208 : RAID0 , 4 * 1 TB ssd ( SAMSUNG
> > > MZ7KM960HAHP-00005 )
> > >
> >
> > SM863, they should be fine.
> > However, I've never seen any results for them with sync writes; if you
> > have the time, that is something to test.
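> >
> > The usual way to check that is a small fio run doing 4k sync writes
> > straight against the device (destructive, so only point it at a drive
> > or partition you can wipe); something along these lines:
> > ---
> > # 4k sequential writes with O_SYNC + O_DIRECT, similar to what a
> > # filestore journal does to its SSD.
> > fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
> >     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
> >     --group_reporting --name=journal-test
> > ---
> > Raising --numjobs to 4 or 8 shows how well the drive copes with
> > parallel sync writers.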
> >
> > >     Cluster --> 4 nodes , 12 OSDs , replica size = 2  , ubuntu
> > > 14.04.1 LTS ,
> > >
> > Otherwise similar to my cache pool against which I tested below,
> > 2 nodes with 4x 800GB Intel DC S3610 each, replica of 2, thus 8 OSDs.
> > 2 E5-2623 (3GHz base speed) per node.
> > Network is QDR Infiniband, IPoIB.
> >
> > Debian Jessie and Ceph Hammer, though.
> >
> > > >
> > > > In particular your SSDs, exact maker/version/size.
> > > > Where are your journals?
> > > >
> > >     SAMSUNG MZ7KM960HAHP-00005 , 893.752 GB
> > >     Journals on the same drives as the data ( all SSD as mentioned )
> > >
> > Again, should be fine but test these with sync writes.
> > And of course monitor their wearout over time.
> >
> > >
> > > > Also Firefly is EOL, Hammer and even more so the upcoming Jewel
> > > > have significant improvements with SSDs.
> > > >
> > > > > We used the ceph_deploy for installation with all defaults
> > > > > ( followed ceph documentation for integration with open-stack )
> > > > > As far as we understand, there is no need to enable the rbd
> > > > > cache as we're running on full ssd.
> > > > RBD cache as in the client side librbd cache is always very
> > > > helpful, fast backing storage or not.
> > > > It can significantly reduce the number of small writes, something
> > > > Ceph has to do a lot of heavy lifting for.
> > > >
> > > > > Benchmarking the cluster shows very poor performance, for writes
> > > > > but mostly for reads (clients are open-stack but also vmware
> > > > > instances).
> > > >
> > > > Benchmarking how (exact command line for fio for example) and with
> > > > what results?
> > > > You say poor, but that might be "normal" for your situation; we
> > > > can't really tell w/o hard data.
> > > >
> > >
> > >
> > >
> > >    fio --name=randread --ioengine=libaio --iodepth=1 --rw=randread
> > > --bs=4k --direct=1 --size=256M --numjobs=10 --runtime=120
> > > --group_reporting --directory=/ceph_test2
> > >
> >
> > Just to make sure, this is run inside your VM?
> >
> > >    root@open-compute1:~# fio --name=randread --ioengine=libaio
> > > --iodepth=1 --rw=randread --bs=4k --direct=1 --size=256M --numjobs=10
> > > --runtime=120 --group_reporting --directory=/ceph_test2
> > > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> > > iodepth=1
> > > ...
> > > randread: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
> > > iodepth=1
> > > fio-2.1.3
> > > Starting 10 processes
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > randread: Laying out IO file(s) (1 file(s) / 256MB)
> > > Jobs: 10 (f=10): [rrrrrrrrrr] [100.0% done] [4616KB/0KB/0KB /s]
> > > [1154/0/0 iops] [eta 00m:00s]
> > > randread: (groupid=0, jobs=10): err= 0: pid=25393: Mon Mar 14 09:17:24 2016
> > >   read : io=597360KB, bw=4976.5KB/s, iops=1244, runt=120038msec
> > >     slat (usec): min=4, max=497, avg=22.91, stdev=14.70
> > >     clat (usec): min=154, max=57106, avg=8007.97, stdev=14477.89
> > >      lat (usec): min=276, max=57125, avg=8031.36, stdev=14477.36
> > >     clat percentiles (usec):
> > >      |  1.00th=[  350],  5.00th=[  390], 10.00th=[  414], 20.00th=[  454],
> > >      | 30.00th=[  494], 40.00th=[  540], 50.00th=[  612], 60.00th=[  732],
> > >      | 70.00th=[ 1064], 80.00th=[10304], 90.00th=[37632], 95.00th=[38656],
> > >      | 99.00th=[40192], 99.50th=[41216], 99.90th=[43264], 99.95th=[43776],
> > >      | 99.99th=[44800]
> > >     bw (KB  /s): min=  314, max=  967, per=10.01%, avg=498.08, stdev=83.91
> > >     lat (usec) : 250=0.01%, 500=31.64%, 750=29.32%, 1000=8.21%
> > >     lat (msec) : 2=5.22%, 4=3.35%, 10=2.22%, 20=0.46%, 50=19.56%
> > >     lat (msec) : 100=0.01%
> > >   cpu          : usr=0.14%, sys=0.41%, ctx=153613, majf=0, minf=78
> > >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> > > >=64=0.0%
> > >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > > >=64=0.0%
> > >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > > >=64=0.0%
> > >      issued    : total=r=149340/w=0/d=0, short=r=0/w=0/d=0
> > >
> > > Run status group 0 (all jobs):
> > >    READ: io=597360KB, aggrb=4976KB/s, minb=4976KB/s, maxb=4976KB/s,
> > > mint=120038msec, maxt=120038msec
> > >
> > > Disk stats (read/write):
> > >   rbd0: ios=149207/3, merge=0/3, ticks=1194356/0, in_queue=1194452,
> > > util=100.00%
> > >
> >
> > Here is the result of a functionally identical fio run inside one of my
> > VMs (entirely against the cache pool/nodes):
> > ---
> > root@tvm-03:~# fio --size=128MB --ioengine=libaio --invalidate=1
> > --direct=1 --numjobs=1 --rw=randread --name=fiojob --blocksize=4k
> > --iodepth=1
> > fiojob: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
> > 2.0.8
> > Starting 1 process
> > fiojob: (groupid=0, jobs=1): err= 0: pid=18099
> >   read : io=131072KB, bw=4128.1KB/s, iops=1032 , runt= 31745msec
> >     slat (usec): min=19 , max=393 , avg=35.58, stdev=11.76
> >     clat (usec): min=117 , max=7786 , avg=924.96, stdev=422.77
> >      lat (usec): min=146 , max=7835 , avg=961.91, stdev=423.99
> >     clat percentiles (usec):
> >      |  1.00th=[  149],  5.00th=[  175], 10.00th=[  197], 20.00th=[  772],
> >      | 30.00th=[  884], 40.00th=[  956], 50.00th=[ 1012], 60.00th=[ 1064],
> >      | 70.00th=[ 1128], 80.00th=[ 1192], 90.00th=[ 1288], 95.00th=[ 1400],
> >      | 99.00th=[ 1800], 99.50th=[ 2224], 99.90th=[ 3600], 99.95th=[ 3952],
> >      | 99.99th=[ 6880]
> >     bw (KB/s)  : min= 3440, max= 8120, per=100.00%, avg=4135.22, stdev=572.73
> >     lat (usec) : 250=17.95%, 500=0.79%, 750=0.60%, 1000=28.67%
> >     lat (msec) : 2=51.34%, 4=0.60%, 10=0.05%
> >   cpu          : usr=1.30%, sys=5.61%, ctx=32985, majf=0, minf=23
> >   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
> > >=64=0.0%
> >      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> >      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
> > >=64=0.0%
> >      issued    : total=r=32768/w=0/d=0, short=r=0/w=0/d=0
> >
> > Run status group 0 (all jobs):
> >    READ: io=131072KB, aggrb=4128KB/s, minb=4128KB/s, maxb=4128KB/s,
> > mint=31745msec, maxt=31745msec
> >
> > Disk stats (read/write):
> >   vda: ios=32718/0, merge=0/0, ticks=29312/0, in_queue=29228,
> > util=92.06% ---
> >
> > So, same ballpark bandwidth-wise, but note the much lower clat/lat times.
> > Note that I use "block/vda/queue/read_ahead_kb = 2048" in sysfs.conf
> > instead of Ceph client configurations.
> > But then again that's basically identical, supposedly only helping with
> > sequential reads and not much of a help with SSDs (and fast CPUs).
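> >
> > For reference, this is roughly how that gets applied (vda and the use
> > of sysfs.conf from the sysfsutils package are just how my VMs happen to
> > be set up, adjust to your devices):
> > ---
> > # takes effect immediately, lost on reboot
> > echo 2048 > /sys/block/vda/queue/read_ahead_kb
> > # persistent across reboots (sysfsutils)
> > echo 'block/vda/queue/read_ahead_kb = 2048' >> /etc/sysfs.conf
> > ---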
> >
> > So yes, your numbers are normal for single client, low depth reads, as
> > many threads in this ML confirm.
> >
> > Cranking up iodepth to 32 gives me up to 7200 IOPS per client (VM),
> > with the storage nodes still very much bored.
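> >
> > That is simply the fio invocation above with a deeper queue, i.e.
> > something like:
> > ---
> > fio --size=128MB --ioengine=libaio --invalidate=1 --direct=1 \
> >     --numjobs=1 --rw=randread --name=fiojob --blocksize=4k --iodepth=32
> > ---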
> >
> > Christian
> > >
> > >   conf file ( client side ) -->
> > >
> > >   [global]
> > > fsid = 609317d9-c8ee-462f-a82f-f5c28c6c561b
> > > mon_initial_members = open-ceph1,open-ceph2,open-ceph3
> > > mon_host = 10.63.4.101,10.63.4.102,10.63.4.103
> > > auth_cluster_required = none
> > > auth_service_required = none
> > > auth_client_required = none
> > > filestore_xattr_use_omap = true
> > > public_network = 10.63.4.0/23
> > >
> > > filestore_flusher = false
> > >
> > > [client]
> > > rbd cache = true
> > > cache writethrough until flush = true
> > > rbd_readahead_trigger_requests = 50
> > > rbd_readahead_max_bytes = 4096
> > > rbd_readahead_disable_after_bytes = 0
> > > admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> > > log file = /var/log/ceph/
> > > rbd concurrent management ops = 20
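> > >
> > > With the admin socket enabled like that, the settings the client
> > > actually runs with can be verified from the hypervisor, roughly (the
> > > socket name depends on $pid and $cctid, so adjust the path):
> > > ---
> > > ceph --admin-daemon /var/run/ceph/ceph-client.admin.<pid>.<cctid>.asok \
> > >     config show | grep rbd_cache
> > > ---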
> > >
> > >
> > >
> > > > "Poor" write performance would indicative of SSDs that are
> > > > unsuitable for Ceph.
> > > >
> > > > > any input is much appreciated ( especially want to know which
> > > > > parameter is crucial for read performance in full ssd cluster )
> > > > >
> > > >
> > > > read_ahead in your clients can improve things, but I guess your
> > > > cluster has more fundamental problems than this.
> > > >
> > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/028552.html
> > > >
> > > >
> > > > Thanks
> > >
> > >
> > >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > >
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


