RE: Recovery question

Sage,
I don't understand how osd.1 gets mapped back to the same pg when it comes back up.
Does the peering process read the old pg id from osd.1's pg log and map it accordingly ? Is this an optimization, so that if an osd is mapped back to the same pg we can resume from its last_backfill position ?
Is it only when a completely new osd joins the cluster that the mapping happens through CRUSH ?
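
Or is it just that CRUSH is deterministic -- the same pg id and the same osdmap always produce the same mapping, so once osd.1 is marked up+in again the old mapping simply falls out of CRUSH, with nothing remembered per OSD ?  Something like this rough sketch (illustrative Python only; toy_crush is a made-up stand-in for the real CRUSH calculation):

import hashlib

def toy_crush(pgid, in_osds, size=3):
    # Toy stand-in for CRUSH (illustrative only): a deterministic
    # pseudo-random pick of `size` OSDs from the set currently "in".
    # Real CRUSH is far more involved, but the point is the same:
    # identical inputs always give the identical mapping.
    ranked = sorted(in_osds,
                    key=lambda osd: hashlib.sha1(f"{pgid}:{osd}".encode()).hexdigest())
    return ranked[:size]

print(toy_crush("3.5", {1, 2, 3, 4}))   # some 3 of the 4 "in" OSDs
print(toy_crush("3.5", {2, 3, 4}))      # osd.1 marked out -> re-pick
print(toy_crush("3.5", {1, 2, 3, 4}))   # identical to the first call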

Sorry for bombarding you with so many questions :-)

Thanks & Regards
Somnath


-----Original Message-----
From: Sage Weil [mailto:sweil@xxxxxxxxxx] 
Sent: Monday, February 23, 2015 1:43 PM
To: Somnath Roy
Cc: Samuel Just (sam.just@xxxxxxxxxxx); Ceph Development
Subject: RE: Recovery question

On Mon, 23 Feb 2015, Somnath Roy wrote:
> Got it, thanks !
> 
> << We'll serve reads and writes with just [2,3] and the pg will show up as 'degraded'
> So, the moment osd.1 is down and the map becomes [2,3], 2 will be designated as primary ? My understanding is that reads/writes won't be served from a replica OSD, right ?

The moment the mapping becomes [2,3], 2 is now the primary, and it can serve IO.
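
Roughly (an illustrative sketch only, ignoring primary affinity and the erasure-coded case):

def primary_of(acting):
    # Illustrative only: for a replicated pool the primary is simply the
    # first OSD in the acting set (primary affinity etc. ignored here).
    return acting[0] if acting else None

print(primary_of([1, 2, 3]))   # 1
print(primary_of([2, 3]))      # osd.1 dropped from the acting set -> 2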

The slow part is here:

> If we had say
> 
>  4: [2,3,4]
> 
> then we'll get
> 
>  5: [1,2,3]
> 
> the osd will realize osd.1 cannot be log-recovered and will install a
> pg_temp so that

We can't serve IO with [1,2,3] because 1 is out of date, so there is a lag until it installs the pg_temp record.  There is a pull request that will preemptively calculate new mappings and set up pg_temp records, which should mitigate this issue, but it needs some testing, and I think there is still room for improvement (for example, serving IO with a usable but non-optimal pg_temp record while we are waiting for it to be removed).
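
The mechanics, very roughly (an illustrative sketch with made-up field names, not the actual OSDMap code):

from types import SimpleNamespace

# Made-up structures for illustration only.
osdmap = SimpleNamespace(
    crush_mapping={"3.5": [1, 2, 3]},   # what CRUSH says (the "up" set)
    pg_temp={"3.5": [2, 3, 4]},         # temporary override while osd.1 backfills
)

def acting_set(pgid, m):
    # The acting set is the CRUSH result unless a pg_temp entry overrides it.
    up = m.crush_mapping[pgid]
    return m.pg_temp.get(pgid, up)

print(acting_set("3.5", osdmap))   # [2, 3, 4] -> these OSDs serve IO
# Until that pg_temp entry is installed, acting is [1, 2, 3] with an
# out-of-date osd.1 as primary, and the pg can't go active -- that gap
# is the lag described above; the pull request below precomputes it.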

See
	https://github.com/ceph/ceph/pull/3429

There's a bunch of similar stuff we can do to improve peering latency that we'll want to spend some time on for infernalis.

sage

> 
>  6: [2,3,4]
> 
> and the PG will go active (serve IO).  When backfill completes, it will
> remove the pg_temp and
> 
>  7: [1,2,3]
> 
> > 3. Will the flow be similar if one of the replica OSDs goes down
> > instead of the primary in the step '2' I mentioned earlier ?  Say, osd.2
> > went down instead of osd.1 ?
> 
> Yeah, basically the same.  Who is primary doesn't really matter.
> 
> sage
> 
> 
> 
> 
> > 
> > Regards
> > Somnath
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Monday, February 23, 2015 1:03 PM
> > To: Somnath Roy
> > Cc: Samuel Just (sam.just@xxxxxxxxxxx); Ceph Development
> > Subject: Re: Recovery question
> > 
> > On Mon, 23 Feb 2015, Somnath Roy wrote:
> > > Hi,
> > > Can anyone help me understand what will happen in the following scenarios ?
> > >
> > > 1. Current PG map : 3.5 -> OSD[1,2,3]
> > >
> > > 2. 1 is down and new map : 3.5 -> OSD[2,3,4]
> > 
> > More likely it's:
> > 
> >  1: 3.5 -> [1,2,3]
> >  2: 3.5 -> [2,3]   (osd.1 is down)
> >  3: 3.5 -> [2,3,4] (osd.1 is marked out)
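
(In other words -- a rough illustrative sketch, not the real OSDMap code -- the acting set just drops the down osd from whatever CRUSH picked, and CRUSH only selects a replacement once the osd is actually marked out:)

# Made-up values for illustration.
up_osds = {2, 3, 4}                 # osd.1 has failed ("down") but is still "in"
crush_result = [1, 2, 3]            # CRUSH still includes osd.1 while it is "in"

acting = [osd for osd in crush_result if osd in up_osds]
print(acting)                       # [2, 3]  -- degraded, but still serving IO

# Once osd.1 is marked "out" (its CRUSH weight drops to 0), CRUSH
# re-selects and osd.4 takes its place: the [2,3,4] mapping in step 3.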
> > 
> > > 3. Need to do backfill recovery for 4 and it has started
> > 
> > If log recovery will work, we'll do that and it's nice and quick.  
> > If backfill is needed, we will do
> > 
> >  4: 3.5 -> [2,3]  (up=[2,3,4]) (pg_temp record added to map to 
> > log-recoverable OSDs)
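
(The "log recovery will work" test is roughly: can the out-of-date OSD still be caught up from the pg log the primary has, or has it fallen behind the log's tail ?  A crude sketch with made-up field names standing in for the real pg_info/pg_log values:)

def can_log_recover(primary_log_tail, peer_last_update):
    # A peer can be recovered from the pg log only if the primary's log
    # still reaches back at least as far as the last update the peer saw;
    # otherwise we have to backfill (scan and copy the objects).
    return peer_last_update >= primary_log_tail

print(can_log_recover(primary_log_tail=900, peer_last_update=0))    # False: osd.4 is brand new -> backfill
print(can_log_recover(primary_log_tail=900, peer_last_update=950))  # True: osd.1 was only briefly down -> log recovery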
> > 
> > > 4. Meanwhile OSD 1 came up; it was down for a short amount of time
> > 
> >  5: 3.5 -> [1,2,3] (osd.1 is back up and in)
> > 
> > > 5. Will the pg 3.5 mapping change, considering OSD 1's recovery could
> > > be log based ?
> > 
> > It will change immediately when osd.1 is back up, regardless of what 
> > data is where.  If it's log recoverable, then no mapping changes 
> > will be needed.  If it's not, then
> > 
> >  6: 3.5 -> [2,3,4]  (up=[1,2,3]) (add pg_temp mapping while we 
> > backfill osd.1)
> >  7: 3.5 -> [1,2,3]  (pg_temp entry removed when backfill completes)
> > 
> > > 6. Also, if OSD 4's recovery could be log based, will there be any
> > > change in the pg map if OSD 1 comes up during the recovery ?
> > 
> > See above
> > 
> > Hope that helps!
> > sage
> > 
> 