Thanks for your response Somnath, responses inline.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> Sent: 20 June 2015 17:52
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Old vs New pool on same OSDs - Performance Difference
>
> Nick,
> I think the only IO operation between 'reached_pg' and 'started' is fetching the xattrs. Check the following.
>
> 1. Enable debug level 20 logging (for OSD and filestore) and record the time taken by the functions find_object_context/get_object_context/get_snapset_context/getattr (in filestore).

I've had to kick our replication jobs back into gear, but I will try and find a quiet moment in the next 24 hours to try this.

> 2. If there is nothing suspicious there and you have the luxury of building the code, I would suggest adding some logs in ReplicatedPG::do_request and ReplicatedPG::do_op.

I will have to see how things progress; it's a production cluster, so I have to be careful about doing things like this.

> 3. If point 1 is the culprit, check within getattr whether it is going to omap to fetch the attribute or not.

Interesting you say this: I have "filestore_xattr_use_omap" set, as I have a couple of EXT4 OSDs. Could this be the cause of the slowdown? I guess I will know more once I have the debug logs.

>
> Hope this helps,
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Saturday, June 20, 2015 8:10 AM
> To: Somnath Roy
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Old vs New pool on same OSDs - Performance Difference
>
> Sorry to dig up this old thread, but I've finally had time to schedule some quiet time on the cluster and perform some more testing.
>
> Just to recap, the problem is not related to pools; it appears to be with RBDs which haven't been accessed in a while. Re-writing over blocks of an old RBD seems to restore performance to what you would expect. If the objects are in the page cache then performance also seems OK; however, the disks are practically idle during the test, suggesting the bottleneck is somewhere between the filesystem and the OSD.
>
> I have cleared the performance counters and then run a fio test against an old RBD; the fio test exhibits slow performance of around 12 IOPS. The fio run is 64 KB random reads.
>
> I then dump the historic operations and there are several like the one below, where the duration seems to match up with the latency seen in fio:
>
>     "description": "osd_op(client.2626544.0:316 rb.0.1ba70.238e1f29.0000000098e8 [] 0.e55cb2d7 ack+read+known_if_redirected e20117)",
>     "initiated_at": "2015-06-20 15:39:05.596187",
>     "age": 289.245112,
>     "duration": 0.083428,
>     "type_data": [
>         "started",
>         {
>             "client": "client.2626544",
>             "tid": 316
>         },
>         [
>             {
>                 "time": "2015-06-20 15:39:05.596187",
>                 "event": "initiated"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.596431",
>                 "event": "reached_pg"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679303",
>                 "event": "started"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679615",
>                 "event": "done"
>             }
>         ]
>     ]
>
> The main delay seems to happen between reached_pg and started, any idea what would be happening in this period?
>
> Nick
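For reference, the debug logging, historic-ops dump and xattr/omap checks discussed above map roughly onto commands like these; the OSD id (0), admin socket path and object path are illustrative assumptions, not values from this cluster:

    # Raise OSD and filestore logging to 20 on a running OSD (Somnath's point 1)
    ceph tell osd.0 injectargs '--debug_osd 20/20 --debug_filestore 20/20'

    # Dump slow ops with their per-event timestamps (initiated, reached_pg, started, done)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops > /tmp/osd0_historic_ops.json

    # Check whether xattrs are being pushed to omap (point 3)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_xattr_use_omap

    # Inspect the xattrs stored on an object file directly in the OSD's filestore
    # (the path below is a made-up example; substitute a real PG directory and object file)
    getfattr -d -m ".*" -e hex /var/lib/ceph/osd/ceph-0/current/0.7_head/some_object_file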
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > Sent: 05 June 2015 19:00
> > To: Nick Fisk; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> >
> > You don't need to enable debug_optracker.
> > Basically, I was talking about the admin socket perf dump, which you seem to be dumping already. I meant to say that in recent versions there is an optracker enable/disable flag, and if it is disabled the perf dump will not give you proper data.
> > Hopefully no scrubbing is going on in that pool.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Friday, June 05, 2015 9:41 AM
> > To: Somnath Roy; 'Nick Fisk'; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > Sent: 04 June 2015 22:41
> > > To: Nick Fisk; 'Gregory Farnum'
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > >
> > > Nick,
> > > I noticed that dropping the page cache sometimes helps, as I was hitting an Ubuntu page cache compaction issue (I shared that with the community some time back).
> > > perf top should show compaction-related stack traces in that case. Setting the sysctl vm option min_free_kbytes to a big number (like 5-10 GB in my 64 GB RAM setup) may help. But if it is the same issue, you will hit it again over time if you don't set the above option properly.
> >
> > Thanks for this, I will look into finding a suitable number and applying it.
> >
> > > Regarding your second problem:
> > > If you enable the optracker, there are a bunch of counters you can dump with the admin socket. But if you are saying that performance improves when it is served from the page cache, it is unlikely the problem is within the OSD.
> > > But, again, the same disks serving other RBDs are giving you good numbers (maybe part of the disk is causing the problem?).
> > > BTW, are you seeing anything wrong in the logs after raising the OSD and filestore debug level to, say, 20?
> > > If you can identify which PGs are slowing things down (by log or counters), you can run similar fio reads directly against the drives holding the primary OSD for those PGs.
> >
> > I can't seem to find much info regarding the optracker. Do I just enable it by injecting "debug_optracker"? And once it's enabled, where do I find the counters?
> >
> > I turned up the debugging and checked a handful of OSD logs, but couldn't see anything obvious which would indicate why it was running slow.
> >
> > I have also today restarted the OSDs to wipe the stats and then run the fio benchmark again against an old RBD. The op_r_latency from the OSD perf dump matches up with what I am seeing from fio (40-60ms), so something is definitely not right. If I then run a fio benchmark against one of the RBDs which I have recently written to, the average returns to what I would expect.
> > Actual disk latencies via iostat are in the normal range of what I would expect for a 7.2k disk.
> >
> > There's something funny going on, which seems to relate to reading objects that haven't been written to in a while, either in the OSD or the XFS filesystem. Interestingly, I have 1 OSD which is using EXT4 and its op_r_latency is about half that of the XFS ones after resetting the stats. This could just be a single anomaly, but I wonder if this whole problem is related to XFS?
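A possible way to check the op tracker flag and pull the read-latency counter mentioned above, plus Somnath's suggested sysctl mitigation; OSD id 0, the socket path, the option name osd_enable_op_tracker (present in recent releases of that era, as far as I know) and the 5 GB value are all assumptions to adapt:

    # Confirm the op tracker is enabled, otherwise dump_historic_ops and related counters will be incomplete
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_enable_op_tracker

    # Pull the per-OSD read latency counter (op_r_latency) from the admin socket perf dump
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | python -mjson.tool | grep -A 3 op_r_latency

    # Raise min_free_kbytes to mitigate page cache compaction stalls (example value: 5 GB)
    sysctl -w vm.min_free_kbytes=5242880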
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > Sent: Thursday, June 04, 2015 2:12 PM
> > > To: 'Gregory Farnum'; Somnath Roy
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
> > > > Sent: 04 June 2015 21:22
> > > > To: Nick Fisk
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > > >
> > > > On Thu, Jun 4, 2015 at 6:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I have 2 pools, both on the same set of OSDs. The first is the default rbd pool created at installation 3 months ago; the other has just recently been created, to verify performance problems.
> > > > >
> > > > > As mentioned, both pools are on the same set of OSDs with the same crush ruleset, and the RBDs on both are identical in size, version and order. The only real difference that I can think of is that the existing pool has around 5 million objects on it.
> > > > >
> > > > > Testing using RBD-enabled fio, I see the newly created pool get an expected random read performance of around 60 IOPS. The existing pool only gets around half of this. New pool latency = ~15ms, old pool latency = ~35ms for random reads.
> > > > >
> > > > > There is no other IO going on in the cluster at the point of running these tests.
> > > > >
> > > > > XFS fragmentation is low, somewhere around 1-2% on most of the disks. The only difference I can think of is that the existing pool has data on it whereas the new one is empty apart from the testing RBD; should this make a difference?
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Any hints on what I can check to see why latency is so high for the existing pool?
> > > > >
> > > > > Nick
> > > >
> > > > Apart from what Somnath said, depending on your PG counts and configuration setup you might also have put enough objects into the cluster that you have a multi-level PG folder hierarchy in the old pool. I wouldn't expect that to make a difference because those folders should be cached in RAM, but if somehow they're not, that would require more disk accesses.
> > > >
> > > > But more likely it's as Somnath suggests, and since most of the objects don't exist for images in the new pool it's able to return ENOENT on accesses much more quickly.
> > > > -Greg
> > >
> > > Thanks for the replies guys.
> > >
> > > I had previously completely written to both test RBDs until full. Strangely, I have just written to them both again and then dropped caches on all OSD nodes. Now both seem to perform the same, but at the speed of the faster pool.
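For completeness, the cache drop and the XFS fragmentation check being described look roughly like the following when run on an OSD node; the device name is a placeholder:

    # Flush dirty data, then drop the page cache, dentries and inodes on an OSD node
    sync; echo 3 > /proc/sys/vm/drop_caches

    # Check XFS fragmentation on an OSD data partition (example device, read-only check)
    xfs_db -c frag -r /dev/sdb1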
> > >
> > > I have then pointed fio at another existing RBD on the old pool, and the results are awful, averaging under 10 IOPS for 64 KB random reads at QD=1. Unfortunately this RBD has live data on it, so I can't overwrite it.
> > >
> > > But something seems up with RBDs (or the underlying objects) that have had data written to them a while back. If I make sure the data is in the page cache, then I get really great performance, so it must be something to do with reading data off the disk, but I'm lost as to what it might be.
> > >
> > > iostat doesn't really show anything interesting, but I'm guessing a single-threaded read spread over 40 disks wouldn't anyway. Are there any counters I could look at that might help break down the steps the OSD goes through to do the read, to determine where the slowdown comes from?
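The fio test described above (64 KB random reads at queue depth 1 through librbd) would look something like this, assuming fio is built with rbd support; the pool and image names are placeholders:

    # 64 KB random reads, queue depth 1, via fio's rbd engine against an existing image
    fio --name=rbd-randread-64k --ioengine=rbd --clientname=admin \
        --pool=rbd --rbdname=old-test-image \
        --rw=randread --bs=64k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based

    # Watch per-disk utilisation and latency on the OSD nodes while the test runs
    iostat -x 1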
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com