Thanks for your response Somnath, responses inline.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> Sent: 20 June 2015 17:52
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Old vs New pool on same OSDs - Performance Difference
>
> Nick,
> I think the only IO operation between 'reached_pg' and 'started' is fetching the xattrs. Check the following.
>
> 1. Enable debug level 20 logging (for OSD and filestore) and record the time taken by the functions find_object_context/get_object_context/get_snapset_context/getattr (in filestore).

I've had to kick our replication jobs back into gear, but I will try and find a quiet moment in the next 24 hours to try this.

> 2. If there is nothing suspicious there and you have the luxury of building the code, I would suggest adding some logs in ReplicatedPG::do_request and ReplicatedPG::do_op.

I will have to see how things progress; it's a production cluster, so I have to be careful about doing things like this.

> 3. If point 1 is the culprit, check within getattr whether it is going to omap to fetch the attribute or not.

Interesting you say this: I have "filestore_xattr_use_omap" set, as I have a couple of EXT4 OSDs. Could this be the cause of the slowdown? I guess I will know more once I have the debug logs.

>
> Hope this helps,
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Saturday, June 20, 2015 8:10 AM
> To: Somnath Roy
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Old vs New pool on same OSDs - Performance Difference
>
> Sorry to dig up this old thread, but I've finally had time to schedule some quiet time on the cluster and perform some more testing.
>
> Just to recap, the problem is not related to pools; it appears to be with RBDs which haven't been accessed in a while. Re-writing over blocks of an old RBD seems to restore performance to what you would expect. If the objects are in the page cache then performance also seems OK; however, the disks are practically idle during the test, suggesting the bottleneck is somewhere between the filesystem and the OSD.
>
> I have cleared the performance counters and then run a fio test against an old RBD; the fio test exhibits slow performance of around 12 IOPS. The fio run is 64 KB random reads.
>
> I then dump the historic operations and there are several like the one below, where the duration seems to match up with the latency seen in fio:
>
>     "description": "osd_op(client.2626544.0:316 rb.0.1ba70.238e1f29.0000000098e8 [] 0.e55cb2d7 ack+read+known_if_redirected e20117)",
>     "initiated_at": "2015-06-20 15:39:05.596187",
>     "age": 289.245112,
>     "duration": 0.083428,
>     "type_data": [
>         "started",
>         {
>             "client": "client.2626544",
>             "tid": 316
>         },
>         [
>             {
>                 "time": "2015-06-20 15:39:05.596187",
>                 "event": "initiated"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.596431",
>                 "event": "reached_pg"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679303",
>                 "event": "started"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679615",
>                 "event": "done"
>             }
>         ]
>     ]
>
> The main delay seems to happen between reached_pg and started, any idea what would be happening in this period?
>
> Nick
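For reference, the debug logging, historic-ops dump and xattr/omap checks discussed above map roughly onto commands like these; the OSD id (0), admin socket path and object path are illustrative assumptions, not values from this cluster:

    # Raise OSD and filestore logging to 20 on a running OSD (Somnath's point 1)
    ceph tell osd.0 injectargs '--debug_osd 20/20 --debug_filestore 20/20'

    # Dump slow ops with their per-event timestamps (initiated, reached_pg, started, done)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops > /tmp/osd0_historic_ops.json

    # Check whether xattrs are being pushed to omap (point 3)
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep filestore_xattr_use_omap

    # Inspect the xattrs stored on an object file directly in the OSD's filestore
    # (the path below is a made-up example; substitute a real PG directory and object file)
    getfattr -d -m ".*" -e hex /var/lib/ceph/osd/ceph-0/current/0.7_head/some_object_file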
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > Sent: 05 June 2015 19:00
> > To: Nick Fisk; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> >
> > You don't need to enable debug_optracker.
> > Basically, I was talking about the admin socket perf dump, which you seem to be dumping already. I meant to say that in recent versions there is an optracker enable/disable flag, and if it is disabled the perf dump will not give you proper data.
> > Hopefully no scrubbing is going on in that pool.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Friday, June 05, 2015 9:41 AM
> > To: Somnath Roy; 'Nick Fisk'; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > Sent: 04 June 2015 22:41
> > > To: Nick Fisk; 'Gregory Farnum'
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > >
> > > Nick,
> > > I noticed that dropping the page cache sometimes helps, as I was hitting an Ubuntu page cache compaction issue (I shared that with the community some time back).
> > > perf top should show compaction-related stack traces in that case. Setting the sysctl vm option min_free_kbytes to a big number (like 5-10 GB in my 64 GB RAM setup) may help. But if it is the same issue, you will hit it again over time if you don't set the above option properly.
> >
> > Thanks for this, I will look into finding a suitable number and applying it.
> >
> > > Regarding your second problem:
> > > If you enable the optracker, there are a bunch of counters you can dump with the admin socket. But if you are saying that performance improves when it is served from the page cache, it is unlikely the problem is within the OSD.
> > > But, again, the same disks serving other RBDs are giving you good numbers (maybe part of the disk is causing the problem?).
> > > BTW, are you seeing anything wrong in the logs after raising the OSD and filestore debug level to, say, 20?
> > > If you can identify which PGs are slowing things down (by log or counters), you can run similar fio reads directly against the drives holding the primary OSD for those PGs.
> >
> > I can't seem to find much info regarding the optracker. Do I just enable it by injecting "debug_optracker"? And once it's enabled, where do I find the counters?
> >
> > I turned up the debugging and checked a handful of OSD logs, but couldn't see anything obvious which would indicate why it was running slow.
> >
> > I have also today restarted the OSDs to wipe the stats and then run the fio benchmark again against an old RBD. The op_r_latency from the OSD perf dump matches up with what I am seeing from fio (40-60ms), so something is definitely not right. If I then run a fio benchmark against one of the RBDs which I have recently written to, the average returns to what I would expect.
> > Actual disk latencies via iostat are in the normal range of what I would expect for a 7.2k disk.
> >
> > There's something funny going on, which seems to relate to reading objects that haven't been written to in a while, either in the OSD or the XFS filesystem. Interestingly, I have 1 OSD which is using EXT4 and its op_r_latency is about half that of the XFS ones after resetting the stats. This could just be a single anomaly, but I wonder if this whole problem is related to XFS?
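A possible way to check the op tracker flag and pull the read-latency counter mentioned above, plus Somnath's suggested sysctl mitigation; OSD id 0, the socket path, the option name osd_enable_op_tracker (present in recent releases of that era, as far as I know) and the 5 GB value are all assumptions to adapt:

    # Confirm the op tracker is enabled, otherwise dump_historic_ops and related counters will be incomplete
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_enable_op_tracker

    # Pull the per-OSD read latency counter (op_r_latency) from the admin socket perf dump
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump | python -mjson.tool | grep -A 3 op_r_latency

    # Raise min_free_kbytes to mitigate page cache compaction stalls (example value: 5 GB)
    sysctl -w vm.min_free_kbytes=5242880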
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > Sent: Thursday, June 04, 2015 2:12 PM
> > > To: 'Gregory Farnum'; Somnath Roy
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
> > > > Sent: 04 June 2015 21:22
> > > > To: Nick Fisk
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > > >
> > > > On Thu, Jun 4, 2015 at 6:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I have 2 pools, both on the same set of OSDs. The first is the default rbd pool created at installation 3 months ago; the other has just recently been created, to verify performance problems.
> > > > >
> > > > > As mentioned, both pools are on the same set of OSDs with the same crush ruleset, and the RBDs on both are identical in size, version and order. The only real difference that I can think of is that the existing pool has around 5 million objects on it.
> > > > >
> > > > > Testing using RBD-enabled fio, I see the newly created pool get an expected random read performance of around 60 IOPS. The existing pool only gets around half of this. New pool latency = ~15ms, old pool latency = ~35ms for random reads.
> > > > >
> > > > > There is no other IO going on in the cluster at the point of running these tests.
> > > > >
> > > > > XFS fragmentation is low, somewhere around 1-2% on most of the disks. The only difference I can think of is that the existing pool has data on it whereas the new one is empty apart from the testing RBD; should this make a difference?
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Any hints on what I can check to see why latency is so high for the existing pool?
> > > > >
> > > > > Nick
> > > >
> > > > Apart from what Somnath said, depending on your PG counts and configuration setup you might also have put enough objects into the cluster that you have a multi-level PG folder hierarchy in the old pool. I wouldn't expect that to make a difference because those folders should be cached in RAM, but if somehow they're not, that would require more disk accesses.
> > > >
> > > > But more likely it's as Somnath suggests, and since most of the objects don't exist for images in the new pool it's able to return ENOENT on accesses much more quickly.
> > > > -Greg
> > >
> > > Thanks for the replies guys.
> > >
> > > I had previously completely written to both test RBDs until full. Strangely, I have just written to them both again and then dropped caches on all OSD nodes. Now both seem to perform the same, but at the speed of the faster pool.
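For completeness, the cache drop and the XFS fragmentation check being described look roughly like the following when run on an OSD node; the device name is a placeholder:

    # Flush dirty data, then drop the page cache, dentries and inodes on an OSD node
    sync; echo 3 > /proc/sys/vm/drop_caches

    # Check XFS fragmentation on an OSD data partition (example device, read-only check)
    xfs_db -c frag -r /dev/sdb1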
> > >
> > > I have then pointed fio at another existing RBD on the old pool, and the results are awful, averaging under 10 IOPS for 64 KB random reads at QD=1. Unfortunately this RBD has live data on it, so I can't overwrite it.
> > >
> > > But something seems up with RBDs (or the underlying objects) that have had data written to them a while back. If I make sure the data is in the page cache, then I get really great performance, so it must be something to do with reading data off the disk, but I'm lost as to what it might be.
> > >
> > > iostat doesn't really show anything interesting, but I'm guessing a single-threaded read spread over 40 disks wouldn't anyway. Are there any counters I could look at that might help break down the steps the OSD goes through to do the read, to determine where the slowdown comes from?
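The fio test described above (64 KB random reads at queue depth 1 through librbd) would look something like this, assuming fio is built with rbd support; the pool and image names are placeholders:

    # 64 KB random reads, queue depth 1, via fio's rbd engine against an existing image
    fio --name=rbd-randread-64k --ioengine=rbd --clientname=admin \
        --pool=rbd --rbdname=old-test-image \
        --rw=randread --bs=64k --iodepth=1 --numjobs=1 \
        --runtime=60 --time_based

    # Watch per-disk utilisation and latency on the OSD nodes while the test runs
    iostat -x 1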
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com