Re: Reducing the impact of OSD restarts (noout ain't uptosnuff)

I could be wrong, but I didn't think a PG would have to peer when an OSD is restarted with noout set. If I'm wrong, then this peering would definitely block I/O. I just did a quick test on a non-busy cluster and didn't see any peering when my OSD went down or came back up, but I'm not sure how good a test that is. The OSD should also stay "in" throughout the restart with noout set, so there's no out-to-in transition to trigger peering when it comes back.
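
If anyone wants to repeat that quick test, something along these lines should do it (osd.12 is just an example id, and the restart command depends on your init system):

    ceph osd set noout
    sudo systemctl restart ceph-osd@12     # or: restart ceph-osd id=12 on upstart
    watch -n 1 "ceph pg dump_stuck inactive; ceph -s | grep -E 'peering|degraded'"
    ceph osd unset noout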

I do know that OSDs don't mark themselves "up" until they're caught up on OSD maps, and they won't accept any op requests until they're "up," so they shouldn't have any catching up left to do by the time they start taking op requests. In theory they're ready to handle I/O by the time they start handling I/O. At least that's my understanding.
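
That catch-up is visible through the admin socket; the epochs the daemon reports should reach the cluster's current osdmap epoch before it goes active and gets marked "up" (osd.12 is again just an example, and the daemon command runs on the OSD's host):

    ceph osd stat                  # current osdmap epoch for the cluster
    ceph daemon osd.12 status      # shows state, oldest_map, newest_map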

It would be interesting to see what this cluster looks like as far as OSD count, journal configuration, network, CPU, RAM, etc. Something is obviously amiss. Even in a semi-decent configuration one should be able to restart a single OSD with noout set, under light load, without causing blocked op requests.
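
For anyone trying to characterize a cluster like this, a snapshot along these lines usually tells most of the story (osd.0 is just an example; the daemon commands need to run on that OSD's host):

    ceph -s
    ceph osd tree
    ceph osd dump | grep -E '^pool|flags'
    ceph daemon osd.0 config show | grep -E 'osd_journal|filestore_max_sync'
    nproc; free -m                 # CPU and RAM on the OSD host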

Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2799 | Fax: 801.545.4705

If you are not the intended recipient of this message, be advised that any dissemination or copying of this message is prohibited.
If you received this message erroneously, please notify the sender and delete it, together with any attachments.


-----Original Message-----
From: Robert LeBlanc [mailto:robert@xxxxxxxxxxxxx] 
Sent: Friday, February 12, 2016 1:30 PM
To: Nick Fisk <nick@xxxxxxxxxx>
Cc: Steve Taylor <steve.taylor@xxxxxxxxxxxxxxxx>; Christian Balzer <chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re:  Reducing the impact of OSD restarts (noout ain't uptosnuff)


What I've seen is that when an OSD starts up in a busy cluster, as soon as it is "in" (it could have been "out" before) it starts getting client traffic. However, it has to be "in" to start catching up and peering with the other OSDs in the cluster. The OSD is not yet ready to service requests for its PGs, but it queues the incoming ops until it is.
On a busy cluster it can take an OSD a long time to become ready, especially if it is servicing client requests at the same time.

If no one is able to look into the code to resolve this by the time I'm finished with the queue optimizations I'm doing (hopefully in a week or two), I plan on looking into it to see if there is something that can be done to prevent ops from being accepted until the OSD is ready for them.
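
In the meantime, when this happens it would be worth confirming where the blocked ops are actually sitting; if they are all queued on the restarted OSD, that supports this theory (osd.12 is an example id, and the daemon commands need to run on its host):

    ceph health detail | grep -i 'slow\|blocked'
    ceph daemon osd.12 dump_ops_in_flight      # ops currently queued or in progress
    ceph daemon osd.12 dump_historic_ops       # recent ops, including the slowest ones
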
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Feb 12, 2016 at 9:42 AM, Nick Fisk  wrote:
> I wonder if Christian is hitting some performance issue when an OSD, or a
> number of OSDs, all start up at once? Or maybe the OSD is still doing some
> internal startup procedure, and when the IO hits it on a very busy cluster
> it becomes overloaded for a few seconds?
>
> I've seen similar things in the past where, if I did not have enough
> min free KBs configured, PGs would take a long time to peer/activate
> and cause slow ops.
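
If "min free KBs" here means the kernel's vm.min_free_kbytes, it can be checked and raised roughly like this (the 2 GB figure is only an example):

    sysctl vm.min_free_kbytes                    # current reservation, in KB
    sudo sysctl -w vm.min_free_kbytes=2097152    # example: reserve roughly 2 GB
    # persist via /etc/sysctl.conf or /etc/sysctl.d/ if it turns out to help
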
>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
>> Of Steve Taylor
>> Sent: 12 February 2016 16:32
>> To: Nick Fisk ; 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  Reducing the impact of OSD restarts (noout ain't uptosnuff)
>>
>> Nick is right. Setting noout is the right move in this scenario. Restarting
>> an OSD shouldn't block I/O unless nodown is also set, however. The
>> exception to this would be a case where min_size can't be achieved because
>> of the down OSD, i.e. min_size=3 and 1 of 3 OSDs is restarting. That would
>> certainly block writes. Otherwise the cluster will recognize down OSDs as
>> down (without nodown set), redirect I/O requests to OSDs that are up, and
>> backfill as necessary when things are back to normal.
>>
>> You can set min_size to something lower if you don't have enough OSDs to
>> allow you to restart one without blocking writes. If this isn't the case,
>> something deeper is going on with your cluster. You shouldn't get slow
>> requests due to restarting a single OSD with only noout set and idle disks
>> on the remaining OSDs. I've done this many, many times.
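
For reference, min_size is a per-pool setting; "rbd" below is just an example pool name:

    ceph osd pool get rbd min_size
    ceph osd pool get rbd size            # min_size must stay <= size
    ceph osd pool set rbd min_size 1      # e.g. temporarily allow writes with one replica up
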
>>
>> Steve Taylor | Senior Software Engineer | StorageCraft Technology 
>> Corporation
>> 380 Data Drive Suite 300 | Draper | Utah | 84020
>> Office: 801.871.2799 | Fax: 801.545.4705
>>
>> If you are not the intended recipient of this message, be advised 
>> that any dissemination or copying of this message is prohibited.
>> If you received this message erroneously, please notify the sender 
>> and delete it, together with any attachments.
>>
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf 
>> Of Nick Fisk
>> Sent: Friday, February 12, 2016 9:07 AM
>> To: 'Christian Balzer' ; ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  Reducing the impact of OSD restarts (noout ain't uptosnuff)
>>
>>
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On 
>> > Behalf Of Christian Balzer
>> > Sent: 12 February 2016 15:38
>> > To: ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re:  Reducing the impact of OSD restarts (noout ain't uptosnuff)
>> >
>> > On Fri, 12 Feb 2016 15:56:31 +0100 Burkhard Linke wrote:
>> >
>> > > Hi,
>> > >
>> > > On 02/12/2016 03:47 PM, Christian Balzer wrote:
>> > > > Hello,
>> > > >
>> > > > yesterday I upgraded our most busy (in other words lethally
>> > > > overloaded) production cluster to the latest Firefly in preparation
>> > > > for a Hammer upgrade and then phasing in of a cache tier.
>> > > >
>> > > > When restarting the OSDs it took 3 minutes (1 minute in a
>> > > > consecutive repeat to test the impact of primed caches) during 
>> > > > which the cluster crawled to a near stand-still and the dreaded 
>> > > > slow requests piled up, causing applications in the VMs to fail.
>> > > >
>> > > > I had of course set things to "noout" beforehand, in hopes of 
>> > > > staving off this kind of scenario.
>> > > >
>> > > > Note that the other OSDs and their backing storage were NOT 
>> > > > overloaded during that time, only the backing storage of the 
>> > > > OSD being restarted was under duress.
>> > > >
>> > > > I was under the (wishful thinking?) impression that with noout
>> > > > set and a controlled OSD shutdown/restart, operations would be
>> > > > redirected to the new primary for the duration.
>> > > > I was prepared for the strain on the restarted OSDs when
>> > > > recovering those operations (which I also saw); the near
>> > > > screeching halt, not so much.
>> > > >
>> > > > Any thoughts on how to mitigate this further or is this the 
>> > > > expected behavior?
>> > >
>> > > I wouldn't use noout in this scenario. It keeps the cluster from
>> > > recognizing that an OSD is not available; other OSDs will still try
>> > > to write to that OSD. This is probably the cause of the blocked
>> > > requests. Redirecting only works if the cluster is able to detect a
>> > > PG as being degraded.
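
One way to check which of those behaviours you're actually getting: with noout set and one OSD restarting, that OSD should be reported as down but still in (osd.12 below is just an example id):

    ceph osd dump | grep flags          # confirm noout is among the set flags
    ceph osd dump | grep '^osd\.12 '    # shows the down/up and in/out state for that OSD
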
>> > >
>> > Oh well, that of course makes sense, but I found some article stating
>> > that it also would redirect things, and the recovery activity I saw
>> > afterwards suggests it did so at some point.
>>
>> Doesn't noout just stop the crushmap from being modified, and hence data
>> shuffling? Nodown controls whether or not the OSD is available for IO?
>>
>> Maybe try the reverse: set noup so that OSDs don't participate in IO and
>> then bring them in manually?
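
If anyone tries that, the sequence would look roughly like the following; osd.12 and the systemd unit are only examples, and the OSD won't be marked up until noup is cleared:

    ceph osd set noup
    sudo systemctl restart ceph-osd@12     # or the upstart/sysvinit equivalent
    ceph daemon osd.12 status              # wait until newest_map matches 'ceph osd stat'
    ceph osd unset noup
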
>>
>> >
>> > > If the cluster is aware of the OSD being missing, it could handle
>> > > the write requests more gracefully. To prevent it from backfilling
>> > > etc., I prefer to use nobackfill and norecover. They block backfill
>> > > and recovery at the cluster level, but allow requests to be carried
>> > > out (at least in my understanding of these flags).
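
Concretely that would be something like the following, unsetting the flags again once the restarted OSD is back up:

    ceph osd set nobackfill
    ceph osd set norecover
    # ... restart the OSD(s), wait for them to come back up ...
    ceph osd unset norecover
    ceph osd unset nobackfill
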
>> > >
>> > Yes, I concur and was thinking of that as well. Will give it a spin
>> > with the upgrade to Hammer.
>> >
>> > > 'noout' is fine for large-scale cluster maintenance, since it keeps
>> > > the cluster from backfilling. I've used it when I had to power down
>> > > our complete cluster.
>> > >
>> > I guess that with my other, less busy clusters this never showed up
>> > on my radar.
>> >
>> > Regards,
>> >
>> > Christian
>> > > Regards,
>> > > Burkhard
>> >
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> > chibi@xxxxxxx       Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



