What I've seen is that when an OSD starts up in a busy cluster, as soon as it
is "in" (it could have been "out" before) it starts getting client traffic.
However, it has to be "in" to start catching up and peering with the other
OSDs in the cluster. The OSD is not ready to service requests for that PG
yet, but it keeps the op queued until it is ready. On a busy cluster it can
take an OSD a long time to become ready, especially if it is servicing client
requests at the same time.

If someone isn't able to look into the code to resolve this by the time I'm
finished with the queue optimizations I'm doing (hopefully in a week or two),
I plan on looking into this to see if there is something that can be done to
prevent the ops from being accepted until the OSD is ready for them.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk wrote:
> I wonder if Christian is hitting some performance issue when the OSD, or a
> number of OSDs, all start up at once? Or maybe the OSD is still doing some
> internal startup procedure, and when the IO hits it on a very busy cluster
> it causes it to become overloaded for a few seconds?
>
> I've seen similar things in the past where, if I did not have enough min
> free KBs configured, PGs would take a long time to peer/activate and cause
> slow ops.
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Steve Taylor
>> Sent: 12 February 2016 16:32
>> To: Nick Fisk ; 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
>>
>> Nick is right. Setting noout is the right move in this scenario.
>> Restarting an OSD shouldn't block I/O unless nodown is also set, however.
>> The exception to this would be a case where min_size can't be achieved
>> because of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting.
>> That would certainly block writes. Otherwise the cluster will recognize
>> down OSDs as down (without nodown set), redirect I/O requests to OSDs
>> that are up, and backfill as necessary when things are back to normal.
>>
>> You can set min_size to something lower if you don't have enough OSDs to
>> allow you to restart one without blocking writes. If this isn't the case,
>> something deeper is going on with your cluster. You shouldn't get slow
>> requests due to restarting a single OSD with only noout set and idle
>> disks on the remaining OSDs. I've done this many, many times.
>>
>> Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2799 | Fax: 801.545.4705
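
For reference, the checks and knobs Steve is describing look roughly like
the following. This is only a sketch: "rbd" stands in for whichever pool is
affected, and a size=3 replicated pool is assumed.

    ceph osd pool get rbd size          # replica count, e.g. 3
    ceph osd pool get rbd min_size      # a PG stops serving I/O with fewer copies up than this
    ceph osd pool set rbd min_size 2    # temporarily tolerate one replica being down
    ceph osd set noout                  # don't mark the restarting OSD out (no rebalancing)
    # ... restart the OSD and wait for the cluster to return to active+clean ...
    ceph osd unset noout
    ceph osd pool set rbd min_size 3    # restore the original value
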
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Nick Fisk
>> Sent: Friday, February 12, 2016 9:07 AM
>> To: 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> > Of Christian Balzer
>> > Sent: 12 February 2016 15:38
>> > To: ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)
>> >
>> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
>> >
>> > > Hi,
>> > >
>> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
>> > > > Hello,
>> > > >
>> > > > Yesterday I upgraded our most busy (in other words lethally
>> > > > overloaded) production cluster to the latest Firefly in
>> > > > preparation for a Hammer upgrade and then phasing in of a cache tier.
>> > > >
>> > > > When restarting the OSDs it took 3 minutes (1 minute on a
>> > > > consecutive repeat to test the impact of primed caches) during
>> > > > which the cluster crawled to a near stand-still and the dreaded
>> > > > slow requests piled up, causing applications in the VMs to fail.
>> > > >
>> > > > I had of course set things to "noout" beforehand, in hopes of
>> > > > staving off this kind of scenario.
>> > > >
>> > > > Note that the other OSDs and their backing storage were NOT
>> > > > overloaded during that time; only the backing storage of the OSD
>> > > > being restarted was under duress.
>> > > >
>> > > > I was under the (wishful thinking?) impression that with noout set
>> > > > and a controlled OSD shutdown/restart, operations would be
>> > > > redirected to the new primary for the duration.
>> > > > The strain on the restarted OSDs when recovering those operations
>> > > > (which I also saw) I was prepared for; the near screeching halt,
>> > > > not so much.
>> > > >
>> > > > Any thoughts on how to mitigate this further, or is this the
>> > > > expected behavior?
>> > >
>> > > I wouldn't use noout in this scenario. It keeps the cluster from
>> > > recognizing that an OSD is not available; other OSDs will still try
>> > > to write to that OSD. This is probably the cause of the blocked
>> > > requests. Redirecting only works if the cluster is able to detect a
>> > > PG as being degraded.
>> > >
>> > Oh well, that makes sense of course, but I found some article stating
>> > that it would also redirect things, and the recovery activity I saw
>> > afterwards suggests it did so at some point.
>>
>> Doesn't noout just stop the crushmap from being modified, and hence data
>> shuffling, while nodown controls whether or not the OSD is available for IO?
>>
>> Maybe try the reverse: set noup so that OSDs don't participate in IO, and
>> then bring them in manually?
>>
>> >
>> > > If the cluster is aware of the OSD being missing, it could handle
>> > > the write requests more gracefully. To prevent it from backfilling
>> > > etc., I prefer to use nobackfill and norecover. That blocks backfill
>> > > at the cluster level, but allows requests to be carried out (at
>> > > least in my understanding of these flags).
>> > >
>> > Yes, I concur and was thinking of that as well. Will give it a spin
>> > with the upgrade to Hammer.
>> >
>> > > 'noout' is fine for large-scale cluster maintenance, since it keeps
>> > > the cluster from backfilling. I've used it when I had to power down
>> > > our complete cluster.
>> > >
>> > Guess with my other, less busy clusters, this never showed up on my radar.
>> >
>> > Regards,
>> >
>> > Christian
>> >
>> > > Regards,
>> > > Burkhard
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
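
To make the flag combinations discussed above concrete, the sequences look
roughly like this. Again only a sketch: osd.12 stands in for whichever OSD
is being restarted, and the daemon command has to be run on that OSD's host.

    # Burkhard's suggestion: let the cluster mark the OSD down (so I/O is
    # redirected), but block backfill/recovery at the cluster level
    ceph osd set nobackfill
    ceph osd set norecover
    # ... restart the OSD and wait for its PGs to finish peering ...
    ceph osd unset norecover
    ceph osd unset nobackfill

    # Nick's alternative: keep restarted OSDs from being marked up (and thus
    # from taking client I/O) until the flag is cleared manually
    ceph osd set noup
    # ... restart the OSD ...
    ceph osd unset noup    # the OSD should then be marked up and start peering

    # While the OSD rejoins, these show whether it is still peering or holding queued ops
    ceph -s
    ceph health detail
    ceph daemon osd.12 dump_ops_in_flight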