Hi Jaroslaw,

several things spring to mind. I'm assuming the cluster is otherwise healthy (apart from the slow requests), right?

From the (little) information you sent it looks like the pools are replicated with size 3, is that correct?

Are there any long-running delete processes? They usually have a negative impact on performance, especially as they don't really show up in the IOPS statistics.

I've also seen something like this happen when there's a slow disk/OSD. You can check with "ceph osd perf" and look for unusually high numbers. If that's the issue, restarting that OSD usually brings the cluster back to life. If nothing shows up, try "ceph tell osd.* version"; a misbehaving OSD usually doesn't respond to that command (it's slow or even times out).

You also don't say how many scrub/deep-scrub processes are running. If not handled properly, they are a performance killer as well.

Last, but by far not least, have you ever thought about creating an SSD pool (even a small one) and moving every pool except .rgw.buckets there? The other pools are small enough, and they would benefit from having their own "reserved" OSDs.

I've put a few rough command sketches for all of the above right below, before your quoted message.
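These are untested sketches from memory, so treat them as a starting point rather than gospel: osd.26 is only picked because it shows up in your log, the restart command depends on your init system, and <ssd_ruleset_id> is a placeholder for whatever SSD ruleset you would add to your CRUSH map.

    # look for OSDs with unusually high commit/apply latency
    ceph osd perf | sort -nk3 | tail

    # a wedged OSD often won't answer this at all (slow or timing out)
    for i in $(ceph osd ls); do
        timeout 10 ceph tell osd.$i version >/dev/null || echo "osd.$i not responding"
    done

    # how many PGs are scrubbing right now; temporarily disabling scrubs
    # is a quick way to rule them out (remember to unset the flags afterwards)
    ceph pg dump pgs_brief 2>/dev/null | grep -c scrubbing
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # rough idea of how much delete/GC work radosgw still has queued up
    radosgw-admin gc list --include-all | head

    # restart a suspect OSD, e.g. osd.26 from your log
    /etc/init.d/ceph restart osd.26     # sysvinit
    systemctl restart ceph-osd@26       # systemd

    # once an SSD ruleset exists in the CRUSH map, moving a small pool
    # onto it is just (hammer still calls it crush_ruleset):
    ceph osd pool set .rgw crush_ruleset <ssd_ruleset_id>

Double-check the syntax against your ceph and radosgw-admin versions before running any of this on a production cluster.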
On Thu, Jul 14, 2016 at 1:59 PM, Jaroslaw Owsiewski
<jaroslaw.owsiewski@xxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> we have a problem with performance slowing down drastically on a cluster. We use
> radosgw with the S3 protocol. Our configuration:
>
> 153 OSDs, SAS 1.2TB, with journals on SSD disks (ratio 4:1)
> - no problems with networking, no hardware issues, etc.
>
> Output from "ceph df":
>
> GLOBAL:
>     SIZE     AVAIL     RAW USED     %RAW USED
>     166T     129T      38347G           22.44
> POOLS:
>     NAME                   ID     USED       %USED     MAX AVAIL       OBJECTS
>     .rgw                    9     70330k         0        39879G        393178
>     .rgw.root              10        848         0        39879G             3
>     .rgw.control           11          0         0        39879G             8
>     .rgw.gc                12          0         0        39879G            32
>     .rgw.buckets           13     10007G      5.86        39879G     331079052
>     .rgw.buckets.index     14          0         0        39879G       2994652
>     .rgw.buckets.extra     15          0         0        39879G             2
>     .log                   16       475M         0        39879G           408
>     .intent-log            17          0         0        39879G             0
>     .users                 19        729         0        39879G            49
>     .users.email           20        414         0        39879G            26
>     .users.swift           21          0         0        39879G             0
>     .users.uid             22      17170         0        39879G            89
>
> Problems began last Saturday.
> Throughput was 400k requests per hour - mostly PUTs and HEADs, ~100kb each.
>
> Ceph version is hammer.
>
> We have two clusters with similar configuration and both experienced the same
> problems at once.
>
> Any hints?
>
> Latest output from "ceph -w":
>
> 2016-07-14 14:43:16.197131 osd.26 [WRN] 17 slow requests, 16 included below; oldest blocked for > 34.766976 secs
> 2016-07-14 14:43:16.197138 osd.26 [WRN] slow request 32.555599 seconds old, received at 2016-07-14 14:42:43.641440: osd_op(client.75866283.0:20130084 .dir.default.75866283.65796.3 [delete] 14.122252f4 ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197145 osd.26 [WRN] slow request 32.536551 seconds old, received at 2016-07-14 14:42:43.660487: osd_op(client.75866283.0:20130121 .dir.default.75866283.65799.6 [delete] 14.d2dc1672 ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197153 osd.26 [WRN] slow request 30.971549 seconds old, received at 2016-07-14 14:42:45.225490: osd_op(client.75866283.0:20132345 gc.12 [call rgw.gc_set_entry] 12.a45046b8 ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197158 osd.26 [WRN] slow request 30.967568 seconds old, received at 2016-07-14 14:42:45.229471: osd_op(client.76495939.0:20147494 gc.12 [call rgw.gc_set_entry] 12.a45046b8 ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197162 osd.26 [WRN] slow request 32.253169 seconds old, received at 2016-07-14 14:42:43.943870: osd_op(client.75866283.0:20130663 .dir.default.75866283.65805.7 [delete] 14.2b5a1672 ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:17.197429 osd.26 [WRN] 3 slow requests, 2 included below; oldest blocked for > 31.967882 secs
> 2016-07-14 14:43:17.197434 osd.26 [WRN] slow request 31.579897 seconds old, received at 2016-07-14 14:42:45.617456: osd_op(client.76495939.0:20147877 gc.12 [call rgw.gc_set_entry] 12.a45046b8 ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:17.197439 osd.26 [WRN] slow request 30.897873 seconds old, received at 2016-07-14 14:42:46.299480: osd_op(client.76495939.0:20148668 gc.12 [call rgw.gc_set_entry] 12.a45046b8 ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
>
> Regards
> --
> Jarosław Owsiewski
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com