Re: Slow requests on cluster.

Hi Jaroslaw,

Several things spring to mind. I'm assuming the cluster is otherwise
healthy (apart from the slow requests), right?

From the (little) information you sent it seems the pools are
replicated with size 3, is that correct?

Are there any long-running delete processes? They usually have a
negative impact on performance, especially as they don't really show up
in the IOPS statistics.
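
In an RGW setup the garbage collector is the usual suspect for that; a
quick way to check (just a sketch, adjust for your zone setup):

    # list pending GC entries; a huge backlog means deletes are still being worked off
    radosgw-admin gc list --include-all | head
    # you can also trigger a GC run manually during a quiet period
    radosgw-admin gc process

Your log excerpt below shows many rgw.gc_set_entry calls queuing up on
the same GC object (12.a45046b8, "waiting for rw locks"), which points
in that direction.
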
I've also seen something like this happen when there's a slow disk/OSD.
You can check with "ceph osd perf" and look for unusually high numbers.
Restarting that OSD usually brings the cluster back to life, if that's
the issue.
If nothing shows, try "ceph tell osd.* version"; a misbehaving OSD
usually doesn't respond to that command (or responds slowly, or even
times out).
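
For example (a rough sketch; osd.26 is only picked because it shows up
in your log, and the restart command depends on your init system):

    # per-OSD commit/apply latencies; look for one OSD far worse than the rest
    ceph osd perf
    # a hung OSD often fails to answer this at all
    ceph tell osd.* version
    # restart the suspect OSD on its host, e.g. on a sysvinit-based Hammer node:
    /etc/init.d/ceph restart osd.26
    # or, where systemd is in use:
    systemctl restart ceph-osd@26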

You also don't say how many scrub/deep-scrub processes are running. If
not properly throttled they are a performance killer, too.
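
To see what is scrubbing right now and to rule it out (a sketch;
remember to re-enable scrubbing afterwards):

    # count PGs currently scrubbing / deep-scrubbing
    ceph pg dump | grep -c scrubbing
    # temporarily disable scrubs to see whether the slow requests disappear
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... and later
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub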

Last, but by no means least, have you ever thought of creating an SSD
pool (even a small one) and moving all pools except .rgw.buckets there?
The other pools are small enough, and they benefit from having their own
"reserved" OSDs...
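
Roughly, something like this (only a sketch; it assumes you have already
added your SSD OSDs under a separate CRUSH root named "ssd", and on
Hammer the pool setting is still called crush_ruleset):

    # create a rule that places data only on the SSD root
    ceph osd crush rule create-simple ssd-rule ssd host
    # look up its rule id
    ceph osd crush rule dump ssd-rule
    # point the small RGW pools at it, e.g. the bucket index
    ceph osd pool set .rgw.buckets.index crush_ruleset <rule_id>

The bucket index pool in particular tends to benefit the most.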



On Thu, Jul 14, 2016 at 1:59 PM, Jaroslaw Owsiewski
<jaroslaw.owsiewski@xxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> we have a problem with drastically degraded performance on a cluster. We use
> radosgw with the S3 protocol. Our configuration:
>
> 153 OSDs on 1.2TB SAS disks with journals on SSDs (ratio 4:1)
> - no problems with networking, no hardware issues, etc.
>
> Output from "ceph df":
>
> GLOBAL:
>     SIZE     AVAIL     RAW USED     %RAW USED
>     166T      129T       38347G         22.44
> POOLS:
>     NAME                       ID     USED       %USED     MAX AVAIL       OBJECTS
>     .rgw                       9      70330k         0        39879G        393178
>     .rgw.root                  10        848         0        39879G             3
>     .rgw.control               11          0         0        39879G             8
>     .rgw.gc                    12          0         0        39879G            32
>     .rgw.buckets               13     10007G      5.86        39879G     331079052
>     .rgw.buckets.index         14          0         0        39879G       2994652
>     .rgw.buckets.extra         15          0         0        39879G             2
>     .log                       16       475M         0        39879G           408
>     .intent-log                17          0         0        39879G             0
>     .users                     19        729         0        39879G            49
>     .users.email               20        414         0        39879G            26
>     .users.swift               21          0         0        39879G             0
>     .users.uid                 22      17170         0        39879G            89
>
> Problems began last Saturday.
> Throughput was 400k requests per hour - mostly PUTs and HEADs, ~100kB each.
>
> Ceph version is Hammer.
>
>
> We have two clusters with a similar configuration and both experienced the
> same problems at once.
>
> Any hints?
>
>
> Latest output from "ceph -w":
>
> 2016-07-14 14:43:16.197131 osd.26 [WRN] 17 slow requests, 16 included below;
> oldest blocked for > 34.766976 secs
> 2016-07-14 14:43:16.197138 osd.26 [WRN] slow request 32.555599 seconds old,
> received at 2016-07-14 14:42:43.641440: osd_op(client.75866283.0:20130084
> .dir.default.75866283.65796.3 [delete] 14.122252f4
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197145 osd.26 [WRN] slow request 32.536551 seconds old,
> received at 2016-07-14 14:42:43.660487: osd_op(client.75866283.0:20130121
> .dir.default.75866283.65799.6 [delete] 14.d2dc1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:16.197153 osd.26 [WRN] slow request 30.971549 seconds old,
> received at 2016-07-14 14:42:45.225490: osd_op(client.75866283.0:20132345
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197158 osd.26 [WRN] slow request 30.967568 seconds old,
> received at 2016-07-14 14:42:45.229471: osd_op(client.76495939.0:20147494
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:16.197162 osd.26 [WRN] slow request 32.253169 seconds old,
> received at 2016-07-14 14:42:43.943870: osd_op(client.75866283.0:20130663
> .dir.default.75866283.65805.7 [delete] 14.2b5a1672
> ondisk+write+known_if_redirected e18788) currently commit_sent
> 2016-07-14 14:43:17.197429 osd.26 [WRN] 3 slow requests, 2 included below;
> oldest blocked for > 31.967882 secs
> 2016-07-14 14:43:17.197434 osd.26 [WRN] slow request 31.579897 seconds old,
> received at 2016-07-14 14:42:45.617456: osd_op(client.76495939.0:20147877
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
> 2016-07-14 14:43:17.197439 osd.26 [WRN] slow request 30.897873 seconds old,
> received at 2016-07-14 14:42:46.299480: osd_op(client.76495939.0:20148668
> gc.12 [call rgw.gc_set_entry] 12.a45046b8
> ack+ondisk+write+known_if_redirected e18788) currently waiting for rw locks
>
>
> Regards
> --
> Jarosław Owsiewski
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



