Re: Hammer reduce recovery impact

Christian Balzer <chibi@xxxxxxx> · Fri, 11 Sep 2015 10:41:01 +0900

Hello,

On Thu, 10 Sep 2015 16:16:10 -0600 Robert LeBlanc wrote:

> Do the recovery options kick in when there is only backfill going on?
>
Aside from having these set just in case as your cluster (and one of mine)
is clearly at the limits of its abilities, that's a good question.
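
For reference, here is a minimal sketch of how the throttles Somnath lists
further down could be turned into a single runtime command. It only builds
the command string and does not talk to a cluster. The helper name
injectargs_command is made up for illustration, and I have not verified that
every one of these options takes effect without an OSD restart (the thread
count in particular may not), so treat that part as an assumption.

```python
# Recovery/backfill throttles as quoted later in this thread (Somnath's list).
THROTTLES = {
    "osd recovery max active": 1,
    "osd max backfills": 1,
    "osd recovery threads": 1,
    "osd recovery op priority": 1,
}


def injectargs_command(settings, target="osd.*"):
    """Build a `ceph tell <target> injectargs` line from ceph.conf-style names.

    Hypothetical helper: it only formats a string, nothing is executed.
    """
    args = " ".join(
        "--{} {}".format(name.replace(" ", "-"), value)
        for name, value in settings.items()
    )
    return "ceph tell {} injectargs '{}'".format(target, args)


command = injectargs_command(THROTTLES)
print(command)
```

The same names can of course go into the [osd] section of ceph.conf instead,
which is what you want anyway if they are to survive a restart.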

The line between recovery and backfill is a bit blurry, and both can clearly
happen at the same time, judging by my logs from yesterday, when I was
testing ways to ease new OSDs into my test cluster.

It would be nice if somebody in the know aka Devs would pipe up here.

What happens in the following scenarios?

1. OSD fails, is set out, etc. PGs get moved around. -> Recovery
2. Same OSD is brought back in. PGs move to their original OSDs. Recovery
or backfill?
3. New bucket (host or OSD) is added to the crush map, causing minor PG
reshuffles. Recovery or backfill?
4. The same OSD added in 3 is set "in", started. Backfill, one would
assume.

But this is a log entry from a situation like 4:
---
2015-09-10 15:53:30.084063 mon.0 203.216.0.33:6789/0 6254 : [INF] pgmap v791755: 896 pgs: 45 active+remapped+wait_backfill, 2 active+remapped+backfilling, 10 active+recovery_wait, 839 active+clean; 69546 MB data, 303 GB used, 5323 GB / 5665 GB avail; 2925/54958 objects degraded (5.322%); 15638 kB/s, 3 objects/s recovering
---

I read that as both backfilling and recovery going on at the same time.
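
To make that reading explicit, here is a small sketch that tallies the PG
states from the pgmap summary in the log entry above. The function names are
my own, and matching on the substrings "backfill" and "recover" is just a
convenient assumption for this one line, not a statement about how Ceph
itself classifies PGs.

```python
import re

# PG state summary copied from the log entry above.
PGMAP = (
    "896 pgs: 45 active+remapped+wait_backfill, "
    "2 active+remapped+backfilling, "
    "10 active+recovery_wait, 839 active+clean"
)


def pg_state_counts(summary):
    """Return {state_string: pg_count} for an 'N pgs: n state, n state' summary."""
    _, _, states = summary.partition("pgs:")
    return {
        state: int(count)
        for count, state in re.findall(r"(\d+)\s+([a-z_+]+)", states)
    }


def pgs_with(counts, keyword):
    """Sum the PGs whose state string contains the given keyword."""
    return sum(n for state, n in counts.items() if keyword in state)


counts = pg_state_counts(PGMAP)
backfill_pgs = pgs_with(counts, "backfill")  # wait_backfill + backfilling
recovery_pgs = pgs_with(counts, "recover")   # recovery_wait

# Degraded ratio from the same log line: 2925 of 54958 objects.
degraded_pct = 100.0 * 2925 / 54958

print(backfill_pgs, recovery_pgs, round(degraded_pct, 3))
```

So 47 PGs are in some backfill state and 10 are waiting for recovery, all in
the same pgmap version.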

Christian
> - ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> 
> 
> On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> > Try all these..
> >
> > osd recovery max active = 1
> > osd max backfills = 1
> > osd recovery threads = 1
> > osd recovery op priority = 1
> >
> > Thanks & Regards
> > Somnath
> >
> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Robert LeBlanc
> > Sent: Thursday, September 10, 2015 1:56 PM
> > To: ceph-users@xxxxxxxxxxxxxx
> > Subject:  Hammer reduce recovery impact
> >
> > We are trying to add some additional OSDs to our cluster, but the
> > impact of the backfilling has been very disruptive to client I/O and
> > we have been trying to figure out how to reduce the impact. We have
> > seen some client I/O blocked for more than 60 seconds. There has been
> > CPU and RAM head room on the OSD nodes, network has been fine, disks
> > have been busy, but not terrible.
> >
> > 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals
> > (10GB), dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta
> > S51G-1UL.
> >
> > Clients are QEMU VMs.
> >
> > [ulhglive-root@ceph5 current]# ceph --version
> > ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
> >
> > Some nodes are 0.94.3
> >
> > [ulhglive-root@ceph5 current]# ceph status
> >     cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
> >      health HEALTH_WARN
> >             3 pgs backfill
> >             1 pgs backfilling
> >             4 pgs stuck unclean
> >             recovery 2382/33044847 objects degraded (0.007%)
> >             recovery 50872/33044847 objects misplaced (0.154%)
> >             noscrub,nodeep-scrub flag(s) set
> >      monmap e2: 3 mons at
> > {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> >             election epoch 180, quorum 0,1,2 mon1,mon2,mon3
> >      osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> >             flags noscrub,nodeep-scrub
> >       pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> >             128 TB used, 322 TB / 450 TB avail
> >             2382/33044847 objects degraded (0.007%)
> >             50872/33044847 objects misplaced (0.154%)
> >                 2300 active+clean
> >                    3 active+remapped+wait_backfill
> >                    1 active+remapped+backfilling
> > recovery io 70401 kB/s, 16 objects/s
> >   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
> >
> > Each pool is size 4 with min_size 2.
> >
> > One problem we have is that the requirements of the cluster changed
> > after setting up our pools, so our PGs are really out of whack. Our
> > most active pool has only 256 PGs and each PG is about 120 GB in size.
> > We are trying to clear out a pool that has way too many PGs so that we
> > can split the PGs in that pool. I think these large PGs are part of our
> > issues.
> >
> > Things I've tried:
> >
> > * Lowered nr_requests on the spindles from 1000 to 100. This reduced
> > the max latency from sometimes up to 3000 ms down to a max of 500-700 ms.
> > It has also reduced the huge swings in latency, but has also reduced
> > throughput somewhat.
> > * Changed the scheduler from deadline to CFQ. I'm not sure if the
> > OSD process gives the recovery threads a different disk priority or if
> > changing the scheduler without restarting the OSD allows the OSD to
> > use disk priorities.
> > * Reduced osd_max_backfills from 2 to 1.
> > * Tried setting noin to give the new OSDs time to get the PG map and
> > peer before starting the backfill. This caused more problems than it
> > solved, as we had blocked I/O (over 200 seconds) until we set the new
> > OSDs to in.
> >
> > Even adding one OSD disk into the cluster is causing these slow I/O
> > messages. We still have 5 more disks to add from this server and four
> > more servers to add.
> >
> > In addition to trying to minimize these impacts, would it be better to
> > split the PGs and then add the rest of the servers, or add the servers
> > and then do the PG split? I'm thinking splitting first would be better,
> > but I'd like to get other opinions.
> >
> > No spindle stays at high utilization for long and the await drops
> > below 20 ms usually within 10 seconds so I/O should be serviced
> > "pretty quick". My next guess is that the journals are getting full
> > and blocking while waiting for flushes, but I'm not exactly sure how
> > to identify that. We are using the defaults for the journal except for
> > size (10G). We'd like to have large journals to handle bursts, but if
> > they are getting filled with backfill traffic, it may be
> > counterproductive. Can/does backfill/recovery bypass the journal?
> >
> > Thanks,
> >
> > - ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/