Re: Old vs New pool on same OSDs - Performance Difference

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Somnath Roy
> Sent: 04 June 2015 22:41
> To: Nick Fisk; 'Gregory Farnum'
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re:  Old vs New pool on same OSDs - Performance
> Difference
> 
> Nick,
> I noticed that dropping the page cache sometimes helps, as I was hitting an
> Ubuntu page cache compaction issue (I shared that with the community some
> time back). Perf top should show a compaction-related stack trace in that
> case. Setting the sysctl vm option min_free_kbytes to a large value (like
> 5-10 GB in my 64 GB RAM setup) may help. But if it is the same issue, you
> will hit it again over time unless you set the above option properly.

Thanks for this; I will look into finding a suitable number and applying it.
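
For anyone else hitting this, my reading of the workaround (an untested sketch; the 5 GB figure is just from Somnath's 64 GB setup and will need tuning for mine) is roughly:

    # reserve ~5 GB for the kernel, to reduce page cache compaction stalls
    sysctl -w vm.min_free_kbytes=5242880

    # drop clean page cache (and dentries/inodes) before re-testing
    echo 3 > /proc/sys/vm/drop_caches

I'll experiment with the value before making it permanent in /etc/sysctl.conf.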

> 
> Regarding your second problem:
> If you enable the optracker, there are a bunch of counters you can dump via
> the admin socket. But if, as you say, performance improves when reads are
> served from the page cache, the problem is unlikely to be within the OSD
> itself. Then again, the same disks serving other RBDs are giving you good
> numbers (maybe a particular part of the disk is causing the problem?).
> BTW, do you see anything wrong in the logs if you raise the OSD and
> filestore debug levels to, say, 20?
> If you can identify which PGs are slowing things down (via logs or
> counters), you can run similar fio reads directly against the drives
> holding the primary OSD for those PGs.
> 

I can't seem to find much info regarding the optracker. Do I just enable it by injecting "debug_optracker"? And once it's enabled, where do I find the counters?
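
From poking around, my best guess (untested, and assuming the admin socket is at its default location) is something like the following, so corrections are welcome:

    # make sure op tracking is enabled on the OSD
    ceph tell osd.0 injectargs '--osd_enable_op_tracker=true'

    # then dump the op counters via the admin socket
    ceph daemon osd.0 dump_ops_in_flight
    ceph daemon osd.0 dump_historic_ops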

I turned up the debugging and checked a handful of OSD logs, but couldn't see anything obvious that would indicate why it was running slow.
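
For reference, I bumped the levels at runtime with something along these lines (injected, so it doesn't survive an OSD restart):

    ceph tell osd.* injectargs '--debug_osd 20 --debug_filestore 20'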

I have also today restarted the OSDs to wipe the stats and then ran the fio benchmark again against an old RBD. The op_r_latency from the OSD perf dump matches what I am seeing from fio (40-60ms), so something is definitely not right. If I then run a fio benchmark against one of the RBDs which I have recently written to, the average returns to what I would expect. Actual disk latencies via iostat are in the normal range for a 7.2k disk.
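
In case it helps, this is roughly how I'm pulling that counter after each run (assuming the default admin socket path):

    ceph daemon osd.0 perf dump | python -m json.tool | grep -A 3 op_r_latency

Dividing sum by avgcount gives the mean read latency since the stats were last reset.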

There's something funny going on that seems to relate to reading objects which haven't been written to in a while, either in the OSD or in the XFS file system. Interestingly, I have one OSD which is using EXT4, and its op_r_latency is about half that of the XFS ones after resetting the stats. This could just be a one-off anomaly, but I wonder if this whole problem is related to XFS?
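
For anyone wanting to reproduce this, the fio job I'm running looks roughly like the below (the pool and image names are placeholders for my test setup):

    [global]
    ioengine=rbd
    clientname=admin
    pool=rbd
    rbdname=test-rbd
    rw=randread
    bs=64k
    iodepth=1
    runtime=60
    time_based

    [rbd-randread-test]

And to follow Somnath's suggestion of testing the raw drives, "ceph osd map <pool> <object-name>" should show which PG and primary OSD a given object maps to, so a similar fio read can then be pointed at the disk underneath.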

> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Thursday, June 04, 2015 2:12 PM
> To: 'Gregory Farnum'; Somnath Roy
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: RE:  Old vs New pool on same OSDs - Performance
> Difference
> 
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> > Of Gregory Farnum
> > Sent: 04 June 2015 21:22
> > To: Nick Fisk
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re:  Old vs New pool on same OSDs - Performance
> > Difference
> >
> > On Thu, Jun 4, 2015 at 6:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > >
> > > Hi All,
> > >
> > > I have 2 pools, both on the same set of OSDs. The first is the default
> > > rbd pool created at installation 3 months ago; the other has just
> > > recently been created, to verify performance problems.
> > >
> > > As mentioned, both pools are on the same set of OSDs with the same
> > > crush ruleset, and the RBDs on both are identical in size, version and
> > > order. The only real difference that I can think of is that the
> > > existing pool has around 5 million objects on it.
> > >
> > > Testing using RBD-enabled fio, I see the newly created pool get the
> > > expected random read performance of around 60 IOPS. The existing pool
> > > only gets around half of this: new pool latency is ~15ms, old pool
> > > latency is ~35ms for random reads.
> > >
> > > There is no other IO going on in the cluster at the point of running
> > > these tests.
> > >
> > > XFS fragmentation is low, somewhere around 1-2% on most of the disks.
> > > The only other difference I can think of is that the existing pool has
> > > data on it, whereas the new one is empty apart from the test RBD.
> > > Should this make a difference?
> > >
> > > Any ideas?
> > >
> > > Any hints on what I can check to see why latency is so high for the
> > > existing pool?
> > >
> > > Nick
> >
> > Apart from what Somnath said, depending on your PG counts and
> > configuration setup you might also have put enough objects into the
> > cluster that you have a multi-level PG folder hierarchy in the old
> > pool. I wouldn't expect that to make a difference, because those
> > folders should be cached in RAM, but if somehow they're not, that would
> > require more disk accesses.
> >
> > But more likely it's as Somnath suggests: since most of the objects
> > don't exist for images in the new pool, it can return ENOENT on
> > accesses much more quickly.
> > -Greg
> 
> Thanks for the replies, guys.
> 
> I had previously written to both test RBDs completely, until full. Strangely, I
> have just written to them both again and then dropped caches on all OSD
> nodes. Now both seem to perform the same, but at the speed of the faster
> pool.
> 
> I have then pointed fio at another existing RBD on the old pool and the
> results are awful, averaging under 10 IOPS for 64k random reads at QD=1.
> Unfortunately this RBD has live data on it, so I can't overwrite it.
> 
> But something seems up with RBDs (or the underlying objects) that have
> had data written to them a while back. If I make sure the data is in the
> page cache, then I get really great performance, so it must be something to
> do with reading the data off the disk, but I'm lost as to what it might be.
> 
> Iostat doesn't really show anything interesting, but I'm guessing a single-
> threaded read over 40 disks wouldn't anyway. Are there any counters I could
> look at that might help break down the steps the OSD goes through to do
> the read, to determine where the slowdown comes from?




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




