What release are you using? filestore_xattr_use_omap was deprecated quite a while back.

Thanks & Regards
Somnath

-----Original Message-----
From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: Saturday, June 20, 2015 1:51 PM
To: Somnath Roy
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: RE: Old vs New pool on same OSDs - Performance Difference

Thanks for your response Somnath, responses inline.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> Sent: 20 June 2015 17:52
> To: Nick Fisk
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Old vs New pool on same OSDs - Performance Difference
>
> Nick,
> I think the only IO operation between 'reached_pg' and 'started' is fetching the xattrs. Check the following.
>
> 1. Enable debug log 20 (for OSD and filestore) and record the time taken by the functions find_object_context/get_object_context/get_snapset_context/getattr (in the filestore).

I've had to kick our replication jobs back into gear, but I will try to find a quiet moment in the next 24 hours to try this.

> 2. If nothing suspicious shows up there and you have the luxury of building the code, I would suggest adding some logging in ReplicatedPG::do_request and ReplicatedPG::do_op.

I will have to see how things progress; it's a production cluster, so I have to be careful about doing things like this.

> 3. If point 1 is the culprit, check within getattr whether or not it is going to omap to fetch the attribute.

Interesting you say this, I have "filestore_xattr_use_omap" set, as I have a couple of EXT4 OSDs. Could this be the cause of the slowdown? I guess I will know more once I have the debug logs.

> Hope this helps,
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> Sent: Saturday, June 20, 2015 8:10 AM
> To: Somnath Roy
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: RE: Old vs New pool on same OSDs - Performance Difference
>
> Sorry to dig up this old thread, but I've finally had time to schedule some quiet time on the cluster and perform some more testing.
>
> Just to recap, the problem is not related to pools; it appears to be with RBDs which haven't been accessed in a while. Re-writing over blocks of an old RBD seems to restore performance to what you would expect. If the objects are in the page cache then performance also seems OK, however the disks are practically idle during the test, suggesting the bottleneck is somewhere between the filesystem and the OSD.
>
> I have cleared the performance counters and then run a fio test against an old RBD; the fio test exhibits slow performance of around 12 iops. The fio run is 64k random reads.
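>
> For reference, the job file is roughly along these lines (pool and image names here are just placeholders, queue depth 1):
>
>   [old-rbd-randread]
>   ioengine=rbd
>   clientname=admin
>   pool=rbd
>   rbdname=old-test-image
>   rw=randread
>   bs=64k
>   iodepth=1
>   numjobs=1
>   runtime=60
>   time_based=1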
>
> I then dump the historic operations and there are several like the one below, where the duration seems to match up with the latency seen in fio:
>
>     "description": "osd_op(client.2626544.0:316 rb.0.1ba70.238e1f29.0000000098e8 [] 0.e55cb2d7 ack+read+known_if_redirected e20117)",
>     "initiated_at": "2015-06-20 15:39:05.596187",
>     "age": 289.245112,
>     "duration": 0.083428,
>     "type_data": [
>         "started",
>         {
>             "client": "client.2626544",
>             "tid": 316
>         },
>         [
>             {
>                 "time": "2015-06-20 15:39:05.596187",
>                 "event": "initiated"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.596431",
>                 "event": "reached_pg"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679303",
>                 "event": "started"
>             },
>             {
>                 "time": "2015-06-20 15:39:05.679615",
>                 "event": "done"
>             }
>         ]
>     ]
>
> The main delay seems to happen between reached_pg and started. Any idea what would be happening in this period?
>
> Nick
>
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > Sent: 05 June 2015 19:00
> > To: Nick Fisk; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> >
> > You don't need to enable debug_optracker.
> > Basically, I was talking about the admin socket perf dump only, which you seem to be dumping. I meant to say that with recent versions there is an optracker enable/disable flag, and if it is disabled the perf dump will not give you proper data.
> > Hopefully there is no scrubbing going on on that pool.
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > Sent: Friday, June 05, 2015 9:41 AM
> > To: Somnath Roy; 'Nick Fisk'; 'Gregory Farnum'
> > Cc: ceph-users@xxxxxxxxxxxxxx
> > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> >
> > > -----Original Message-----
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Somnath Roy
> > > Sent: 04 June 2015 22:41
> > > To: Nick Fisk; 'Gregory Farnum'
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > >
> > > Nick,
> > > I noticed that dropping the page cache sometimes helps, as I was hitting an Ubuntu page cache compaction issue (I shared that with the community some time back). Perf top should show a compaction-related stack trace in that case. Setting the sysctl vm option min_free_kbytes to a big number (like 5/10 GB in my 64 GB RAM setup) may help. But if it is the same issue, you will hit it again over time if you don't set the above option properly.
> >
> > Thanks for this, I will look into finding a suitable number and applying it.
> >
> > > Regarding your second problem:
> > > If you enable the optracker, there are a bunch of counters you can dump with the admin socket. But if you are saying that performance improves when it is served from the page cache, it is unlikely the problem is within the OSD. Then again, the same disks serving other RBDs are giving you good numbers (maybe a part of the disk is causing the problem?)!
> > > BTW, are you seeing anything wrong in the log after setting the OSD and filestore debug levels to, say, 20?
> > > If you can identify which PGs are slowing things down (by log or counters), you can run similar fio reads directly on the drives holding the primary OSD for those PGs.
> >
> > I can't seem to find much info regarding the optracker. Do I just enable it by injecting "debug_optracker"? And once it's enabled, where do I find the counters?
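> >
> > My guess is that it is something like the following, but please correct me if I'm wrong (osd.0 below is just an example):
> >
> >   ceph tell osd.* injectargs '--osd_enable_op_tracker true'
> >   ceph daemon osd.0 dump_ops_in_flight
> >   ceph daemon osd.0 dump_historic_ops
> >   ceph daemon osd.0 perf dump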
> >
> > I turned up the debugging and checked a handful of OSD logs, but couldn't see anything obvious which would indicate why it was running slow.
> >
> > I have also today restarted the OSDs to wipe the stats and then run the fio benchmark again against an old RBD. The op_r_latency from the OSD perf dump matches up with what I am seeing from fio (40-60ms), so something is definitely not right. If I then run a fio benchmark against one of the RBDs which I have recently written to, the average returns to what I would expect. Actual disk latencies via iostat are in the normal range of what I would expect for a 7.2k disk.
> >
> > There's something funny going on, which seems to relate to reading objects that haven't been written to in a while, either in the OSD or the XFS filesystem. Interestingly, I have one OSD which is using EXT4, and its op_r_latency is about half that of the XFS ones after resetting the stats. This could just be a single anomaly, but I wonder if this whole problem is related to XFS?
> >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -----Original Message-----
> > > From: Nick Fisk [mailto:nick@xxxxxxxxxx]
> > > Sent: Thursday, June 04, 2015 2:12 PM
> > > To: 'Gregory Farnum'; Somnath Roy
> > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > Subject: RE: Old vs New pool on same OSDs - Performance Difference
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
> > > > Sent: 04 June 2015 21:22
> > > > To: Nick Fisk
> > > > Cc: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Re: Old vs New pool on same OSDs - Performance Difference
> > > >
> > > > On Thu, Jun 4, 2015 at 6:31 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I have 2 pools, both on the same set of OSDs. The 1st is the default rbd pool created at installation 3 months ago; the other has just recently been created to verify performance problems.
> > > > >
> > > > > As mentioned, both pools are on the same set of OSDs and use the same crush ruleset, and the RBDs on both are identical in size, version and order. The only real difference that I can think of is that the existing pool has around 5 million objects on it.
> > > > >
> > > > > Testing using RBD-enabled fio, I see the newly created pool get an expected random read IO performance of around 60 iops. The existing pool only gets around half of this. New pool latency = ~15ms, old pool latency = ~35ms for random reads.
> > > > >
> > > > > There is no other IO going on in the cluster at the point of running these tests.
> > > > >
> > > > > XFS fragmentation is low, somewhere around 1-2% on most of the disks. The only other difference I can think of is that the existing pool has data on it, whereas the new one is empty apart from the testing RBD; should this make a difference?
> > > > >
> > > > > Any ideas?
> > > > >
> > > > > Any hints on what I can check to see why latency is so high for the existing pool?
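> > > > >
> > > > > (For what it's worth, the fragmentation figure above comes from checking each OSD's data partition with something along the lines of the following, /dev/sdb1 being just an example device:
> > > > >
> > > > >   xfs_db -r -c frag /dev/sdb1
> > > > >
> > > > > which reports actual/ideal extent counts and a fragmentation factor.)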
> > > > > Nick
> > > >
> > > > Apart from what Somnath said, depending on your PG counts and configuration you might also have put enough objects into the cluster that you have a multi-level PG folder hierarchy in the old pool. I wouldn't expect that to make a difference, because those folders should be cached in RAM, but if somehow they're not, that would require more disk accesses.
> > > >
> > > > But more likely it's as Somnath suggests, and since most of the objects don't exist for images in the new pool it's able to return ENOENT on accesses much more quickly.
> > > > -Greg
> > >
> > > Thanks for the replies guys.
> > >
> > > I had previously completely written to both test RBDs until full. Strangely, I have just written to them both again and then dropped caches on all OSD nodes. Now both seem to perform the same, but at the speed of the faster pool.
> > >
> > > I have then pointed fio at another existing RBD on the old pool and the results are awful, averaging under 10 iops for 64k random reads at QD=1. Unfortunately this RBD has live data on it, so I can't overwrite it.
> > >
> > > But something seems up with RBDs (or the underlying objects) that had data written to them a while back. If I make sure the data is in the page cache, then I get really great performance, so it must be something to do with reading the data off the disk, but I'm lost as to what it might be.
> > >
> > > Iostat doesn't really show anything interesting, but I'm guessing a single-threaded read over 40 disks wouldn't anyway. Are there any counters I could look at that might help to break down the steps the OSD goes through to do the read, to determine where the slowdown comes from?
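> > >
> > > I'm thinking of something along the lines of the latency counters in the admin socket perf dump, e.g. op_r_latency and op_r_process_latency in the osd section, and apply_latency / journal_latency in the filestore section, though I'm not sure those are the right ones to watch (osd.0 is just an example):
> > >
> > >   ceph daemon osd.0 perf dump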
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com