On Thu, 2 Dec 2010, Henry C Chang wrote:
> Hi Sage,
>
> I have some questions about the pg membership.
>
> Suppose I have a ceph cluster of 3 osds (osd0, osd1, osd2) and the
> replication number is 2. I observed that if osd2 failed and became
> down and out, one pg's acting osds would change from [0,2] to [1,0].
>
> Does it mean that the primary of the pg changed from osd0 to osd1?

Yeah. Normally that shouldn't happen, though. If you know the osdmap
epochs during which you saw those two mappings, can you send the map
files and the pgid for me to look at?

You can dump a specific epoch with

	ceph osd dump <epoch> -o /tmp/foo

and calculate a pg mapping using a specific map file with

	osdmaptool --test-map-pg 1.123 /tmp/foo

You can also see the current pg mapping with

	ceph pg map 1.123

but unfortunately (iirc) that command doesn't let you specify an older
epoch (it probably should!).

> If the client wants to access the object in the pg at that time,
> would the client's request be blocked until osd1 has acquired the
> missing object from osd0?
>
> If so, is there any way (e.g. a crush rule?) to choose osd0 as the
> new primary since it was the replica and had the object already?
> Then, it should be able to shorten the time the client has to wait.

The client's request is blocked in a couple of ways. First, when the
pg mapping changes, the client resends the request to the (same or
different) primary osd. This is primarily for simplicity; it used to
be smarter about not resending if the primary didn't change, but that
was complicating the code at a time when we were trying to make things
really stable. It can be fixed later with a protocol feature bit.

Second, on reads, the primary has to have the object in question; on
writes, all replicas need to have the object before it is modified. In
either case, the request waits on the primary while recovery of that
object happens (it is started immediately). There's not much way
around that, unfortunately, although the write case could probably be
optimized somewhat: it currently waits for things to flush to the fs,
and that may be avoidable in certain cases.

The primary changing nodes is the bigger concern, though; the rest is
small optimizations. Let's figure out why the mapping is changing like
that!

sage
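
P.S. To make the map-checking workflow above concrete, here is the
sequence I'd run. The epoch numbers (47 and 48) and the pgid (1.123)
are made up; substitute the epochs and pgid you actually observed:

	# grab the maps from just before and just after osd2 went out
	ceph osd dump 47 -o /tmp/osdmap.47
	ceph osd dump 48 -o /tmp/osdmap.48

	# see where the pg maps under each epoch's map
	osdmaptool --test-map-pg 1.123 /tmp/osdmap.47
	osdmaptool --test-map-pg 1.123 /tmp/osdmap.48

If the acting set really does flip from [0,2] to [1,0] between the two
epochs, those two map files are exactly what I'd want to look at.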
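
You can also watch the pg go through peering and recovery while a
request is blocked. Iirc something like

	ceph -w

will stream cluster state changes as they happen, and

	ceph pg stat

gives a summary of pg states, so you can see when the pg settles back
into active+clean.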