Re: osd_agent_max_ops relating to number of OSDs in the cache pool

Hi David,

I'm also using Ceph to provide block devices to ESXi datastores.

Currently using tgt with the RBD backend to provide iSCSI.

I've also tried SCST, LIO and NFS; here's my take on them.

TGT
Pros: Very stable, talks to RBD directly, easy to set up, Pacemaker agents available, OK performance
Cons: Can't do graceful failover in Pacemaker, can't hot-extend disks, can't add Linux block caches (e.g. flashcache), not really maintained anymore, can't see stats in iostat, doesn't support VAAI

LIO
Pros: Good performance, active/passive ALUA, maintained
Cons: Very unstable

SCST
Pros: Good performance, stable
Cons: PITA to compile after every kernel update

NFS
Pros: Stable, maintained, page cache acts as a read cache
Cons: Limited support in ESXi 5.5 (better in 6), poor performance, not using VMFS (is this a pro or a con?)

Just to touch on a few points: currently LIO has a problem with Ceph RBDs. If Ceph fails to complete an IO within ~10 seconds, both ESXi and LIO enter a never-ending spiral of trying to abort each other. This is being actively worked on, along with active/active ALUA support, so LIO will probably become the best solution down the line.

I ended up choosing TGT as it was the only one I could use in a production setting. It's not ideal when you look at the list of cons, but after having several dropouts with LIO, it's amazing what you will sacrifice for stability.

SCST is a good middle ground between tgt and LIO, but after hitting numerous kernel RBD bugs (see below) and having to keep trying different kernels, recompiling SCST gets old fast.

NFS is actually a really nice solution, but write latency is nearly double that of iSCSI. All ESXi writes over NFS are sync writes, so you effectively end up waiting for two Ceph IOs for each ESXi write: first the actual data write, then the journal write of the filesystem backing the NFS share. I was never able to get more than about 100-150 write IOPS out of NFS.
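As a rough back-of-the-envelope sketch (illustrative only; the ~4 ms figure is the typical Ceph write latency I mention below, not a measured NFS number):

  # Rough estimate of sync-write IOPS over NFS when each ESXi write
  # has to wait for two Ceph IOs (data write + FS journal write).
  ceph_write_latency = 0.004      # seconds per Ceph write IO (assumed, from the 2-4 ms range below)
  ios_per_esxi_write = 2          # data write + journal write of the FS backing the NFS share
  effective_latency = ceph_write_latency * ios_per_esxi_write
  print(1 / effective_latency)    # ~125 IOPS, roughly the 100-150 I was seeing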

Which brings me onto my next point. 

Sync write latency. 

I think a lot of enterprise applications were designed with traditional enterprise storage in mind, which can service write IOs in times measured in microseconds, whereas Ceph write IOs tend to take around 2-4 ms. This normally isn't too much of a problem; however, ESXi operations like snapshot consolidation, Storage vMotion and cloning are done with 64 KB IOs.

1 / 0.004 s = 250 IOPS
64 KB * 250 IOPS = 16 MB/s

It's not a hard limit, but you will tend to top out around there, which is not fun when you try to copy a 2 TB VM. Of course, IO from VMs is passed through to Ceph at whatever size it is submitted.
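If it helps, here is the same arithmetic as a tiny Python sketch, extended to show why copying a 2 TB VM at that rate is painful (illustrative numbers only, taken from the text above):

  # Throughput ceiling for serialised 64 KB sync writes, each waiting ~4 ms for Ceph.
  io_size_bytes = 64 * 1024                        # ESXi clone/vMotion IO size
  per_io_latency_s = 0.004                         # assumed Ceph write latency
  throughput = io_size_bytes / per_io_latency_s    # ~16 MB/s
  vm_size_bytes = 2 * 1024**4                      # a 2 TB VM
  hours = vm_size_bytes / throughput / 3600
  print(round(throughput / 1e6, 1), "MB/s")        # ~16.4 MB/s
  print(round(hours), "hours to copy 2 TB")        # ~37 hours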

I have been testing out something like flashcache to act as a traditional writeback cache, which boosts ESXi performance up to traditional-array-like levels. However....

1. You need SAS SSDs in your iSCSI nodes if you want HA
2. It's an extra layer for something to go wrong
3. It's an extra layer to manage with Pacemaker
4. You can't use it with TGT's RBD backend, which is probably the biggest blocker for me right now
5. The kernel RBD client tends to lag behind librbd

Also, please be aware that in older kernels there is a bug which sets the TCP options wrongly for kernel RBD (fixed in 3.19), and in more recent kernels (3.19+, I think) maximum IO sizes and the maximum queue depth are limited by another bug. These are both fixed in 4.2, I think. Now that there is an RC of 4.2, I can finally start testing RBD to iSCSI/NFS again.

Hope that’s helpful
Nick

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Nick Fisk
> Sent: 20 July 2015 15:51
> To: 'David Casier' <david.casier@xxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  osd_agent_max_ops relating to number of OSDs in
> the cache pool
> 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of David Casier
> > Sent: 20 July 2015 00:27
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  osd_agent_max_ops relating to number of OSDs
> > in the cache pool
> >
> > Nick Fisk <nick@...> writes:
> >
> > >
> > > Hi All,
> > >
> > > I’m doing some testing on the new High/Low speed cache tiering
> > > flushing and I’m trying to get my head round the effect that
> > > changing these 2 settings have on the flushing speed. When setting
> > > the osd_agent_max_ops to 1, I can get up to 20% improvement before
> > > the osd_agent_max_high_ops value kicks in for high speed flushing.
> > > Which is great for bursty workloads.
> > >
> > > As I understand it, these settings loosely effect the number of
> > > concurrent operations the cache pool OSD’s will flush down to the
> > > base pool.
> > >
> > > I may have got completely the wrong idea in my head but I can’t
> > > understand how a static default setting will work with different
> > > cache/base ratios. For example if I had a relatively small number of
> > > very fast cache tier OSD’s (PCI-E SSD perhaps) and a much larger
> > > number of base tier OSD’s, would the value need to be increased to
> > > ensure sufficient utilisation of the base tier and make sure that
> > > the cache tier doesn’t fill up too fast?
> > >
> > > Alternatively where the cache tier is based on spinning disks or
> > > where the base tier is not as comparatively large, this value may
> > > need to be reduced to stop it saturating the disks.
> > >
> > > Any Thoughts?
> > >
> > > Nick
> > >
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users <at> lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> >
> >
> > Hi Nick,
> > The best way is that the working space does not exceed the volume of
> > the tier pool.
> > If the workspace does not fit in the tier pool, the average rates
> > should not exceed the performance of base pool.
> 
> Hi David,
> 
> Thanks for your response, I know in an ideal scenario your working set should
> fit in the tier, however often you will be copying in new data or running some
> sort of workload which causes a dramatic change in the cache contents. Part
> of this work is trying to minimise the impact of cache flushing.
> 
> Nick
> 
> >
> > Cordialement,
> > David Casier.
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



