Based on the book 'Learning Ceph' (https://www.packtpub.com/application-development/learning-ceph), chapter on performance tuning, we swapped the values of osd_recovery_op_priority and osd_client_op_priority to 60 and 40:

"... osd recovery op priority: This is the priority set for recovery operation. Lower the number, higher the recovery priority. Higher recovery priority might cause performance degradation until recovery completes."

So with osd_recovery_op_priority set higher than osd_client_op_priority, client requests should have higher priority than recovery requests. Since setting the parameters to these values (Giant 0.87.2), our client performance has stayed fine when OSDs are removed from or added to the cluster (e.g. adding 25 OSDs to a cluster that had 20 OSDs until then). Increasing osd_max_backfills above 2 reduces the effect again.
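In ceph.conf terms that amounts to roughly the following in the [osd] section; the commented injectargs line is only a sketch of how the same values could be applied at runtime without restarting the OSDs, so double-check the option names against your release:

    [osd]
    osd recovery op priority = 60
    osd client op priority   = 40
    # we keep osd_max_backfills at 2 or lower; higher values bring the impact back
    osd max backfills = 2

    # roughly, to apply the same values at runtime:
    # ceph tell osd.* injectargs '--osd_recovery_op_priority 60 --osd_client_op_priority 40'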
Maybe this helps.

Regards
Steffen

>>> Robert LeBlanc <robert@xxxxxxxxxxxxx> 10.09.2015 22:56 >>>
We are trying to add some additional OSDs to our cluster, but the impact of the backfilling has been very disruptive to client I/O, and we have been trying to figure out how to reduce it. We have seen some client I/O blocked for more than 60 seconds. There has been CPU and RAM headroom on the OSD nodes, the network has been fine, and the disks have been busy but not terribly so.

11 OSD servers: 10x 4 TB disks with two Intel S3500 SSDs for journals (10 GB), dual 40Gb Ethernet, 64 GB RAM, single CPU (E5-2640), Quanta S51G-1UL. Clients are QEMU VMs.

[ulhglive-root@ceph5 current]# ceph --version
ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)

Some nodes are 0.94.3.

[ulhglive-root@ceph5 current]# ceph status
    cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
     health HEALTH_WARN
            3 pgs backfill
            1 pgs backfilling
            4 pgs stuck unclean
            recovery 2382/33044847 objects degraded (0.007%)
            recovery 50872/33044847 objects misplaced (0.154%)
            noscrub,nodeep-scrub flag(s) set
     monmap e2: 3 mons at {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
            election epoch 180, quorum 0,1,2 mon1,mon2,mon3
     osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
            flags noscrub,nodeep-scrub
      pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
            128 TB used, 322 TB / 450 TB avail
            2382/33044847 objects degraded (0.007%)
            50872/33044847 objects misplaced (0.154%)
                2300 active+clean
                   3 active+remapped+wait_backfill
                   1 active+remapped+backfilling
  recovery io 70401 kB/s, 16 objects/s
  client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s

Each pool is size 4 with min_size 2. One problem we have is that the requirements of the cluster changed after we set up our pools, so our PGs are really out of whack. Our most active pool has only 256 PGs, and each PG is about 120 GB in size. We are trying to clear out a pool that has way too many PGs so that we can then split the PGs in the busy pool. I think these large PGs are part of our issues.

Things I've tried:

* Lowered nr_requests on the spindles from 1000 to 100. This reduced the max latency, which sometimes reached 3000 ms, down to a max of 500-700 ms. It has also reduced the huge swings in latency, but has reduced throughput somewhat.
* Changed the scheduler from deadline to CFQ. I'm not sure whether the OSD process gives the recovery threads a different disk priority, or whether changing the scheduler without restarting the OSD even lets the OSD use disk priorities.
* Reduced osd_max_backfills from 2 to 1.
* Tried setting noin to give the new OSDs time to get the PG map and peer before starting the backfill. This caused more problems than it solved, as we had blocked I/O (over 200 seconds) until we marked the new OSDs in.

Even adding one OSD disk to the cluster causes these slow I/O messages. We still have 5 more disks to add from this server and four more servers to add.

In addition to trying to minimize these impacts, would it be better to split the PGs and then add the rest of the servers, or add the servers and then do the PG split? I'm thinking splitting first would be better, but I'd like to get other opinions.

No spindle stays at high utilization for long, and the await usually drops below 20 ms within 10 seconds, so I/O should be serviced "pretty quick". My next guess is that the journals are getting full and blocking while waiting for flushes, but I'm not exactly sure how to identify that. We are using the defaults for the journal except for size (10G). We'd like the journals to be large enough to handle bursts, but if they are getting filled with backfill traffic, that may be counterproductive. Can/does backfill/recovery bypass the journal?

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

--
Klinik-Service Neubrandenburg GmbH
Allendestr. 30, 17036 Neubrandenburg
Amtsgericht Neubrandenburg, HRB 2457
Geschaeftsfuehrerin: Gudrun Kappich

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com