Re: Degraded data redundancy: NUM pgs undersized

Hello Lothar,

Thanks for your reply.

On 04.09.2018 at 11:20, Lothar Gesslein wrote:
By pure chance 15 pgs are now actually replicated to all 3 osds, so they have enough copies (clean). But the placement is "wrong"; it would like to move the data to different osds (remapped) if possible.

That seems to be correct. I've added a third bucket of type datacenter and moved one host bucket so that each datacenter has one host with one osd. The PGs were rebalanced (if that is the correct term) and the status changed to HEALTH_OK with all PGs active+clean.
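(For reference, the commands for this are roughly of the form below; dc3 and host3 are placeholder names:)

    # create the new datacenter bucket and hang it under the root
    ceph osd crush add-bucket dc3 datacenter
    ceph osd crush move dc3 root=default
    # move one host bucket (with its osd) under the new datacenter;
    # moving a host elsewhere later and dropping an empty bucket works
    # the same way with "ceph osd crush move" / "ceph osd crush remove"
    ceph osd crush move host3 datacenter=dc3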

Then I moved the host in dc2 to another datacenter and removed dc2 from the CRUSH map. Now all PGs are active+clean+remapped, so your next statement applies:

It replicated to 2 osds in the initial placement but wasn't able to find a suitable third osd. Then, when pgp_num was increased, it recalculated the placement, again selected two osds and moved the data there. It won't remove the data from the "wrong" osd until it has a new place for it, so you end up with three copies, but remapped pgs.

Ok, I think I got this.


  3. What's wrong here and what do I have to do to get the cluster back
to active+clean, again?

I guess you want to have "two copies in dc1, one copy in dc2"?

If you stay with only 3 osds, that is the only way to distribute the 3 copies anyway, so you don't need any special crush rule.

What your crush rule is currently expressing is

"in the default root, select n buckets (where n is the pool size, 3 in
this case) of type datacenter, select one leaf (meaning osd) in each
datacenter". You only have 2 datacenter buckets, so that will only ever
select 2 osds.
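
(For illustration, a rule expressing exactly that looks roughly like this in the decompiled CRUSH map; the rule name and the min/max values are placeholders:)

    rule replicated_per_dc {
            id 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type datacenter
            step emit
    }

Here "step chooseleaf firstn 0 type datacenter" means: take as many datacenter buckets as the pool has replicas and pick one osd underneath each.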


If your cluster is going to grow to at least 2 osds in each dc, you can
go with

http://cephnotes.ksperis.com/blog/2017/01/23/crushmap-for-2-dc/

I would translate this crush rule as

"in the default root, select 2 buckets of type datacenter, select n-1
(where n is the pool size, so here 3-1 = 2) leafs in each datacenter"

You will need at least two osds in each dc for this, because it is random (with respect to the weights) in which dc the 2 copies will be placed and which dc gets the remaining copy.
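
(A rule matching that translation would look roughly like this; the name and size limits are placeholders, and I'm assuming host is the bucket type below each datacenter:)

    rule replicated_2dc {
            id 1
            type replicated
            min_size 2
            max_size 3
            step take default
            step choose firstn 2 type datacenter
            step chooseleaf firstn -1 type host
            step emit
    }

The negative number in "chooseleaf firstn -1" is what encodes "pool size minus one": with size 3 it picks 2 osds per datacenter, and the first 3 of the resulting candidates are used.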

I don't get why I need at least two osds in each dc, because I thought that with only three osds it is implicitly clear where to write the two copies.

In case I have two osds in each dc, I would never know on which side the two copies of my three replicas end up.

Let's try an example to check whether my understanding of the matter is correct:

I have two datacenters, dcA and dcB, with two osds in each. Due to the random placement, two copies of object A are written to dcA and one to dcB. Of the next object B, two copies are written to dcB and one to dcA.

In case I have two osds in dcA and only one in dcB, the two copies of an object are written to dcA every time and only one copy to dcB.

Did I get it right?

Best regards,
Joerg

