Hi all,
We had a situation where a drive failed at the same time as a node. This
caused some files in CephFS to be unreadable and 'ceph status' to report
the error "pgs not active".
Our pools are either 3-replica or roughly equivalent EC (k=2, m=2).
Eventually all the PGs became active again and no data was lost.
So the question is why were the PGs not active?
I'm guessing these PGs were not active because the number of surviving
replicas was less than "min_size"? The min_size for the 3-replica pool is
two, and with both the down node and the down drive, some PGs would have
been missing two replicas. That makes sense. (For the EC pool min_size
is 3; same story, but a little sadder.)
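To make sure I have the logic right, here is a simplified model of what I think is going on (my own sketch, not Ceph's actual code; the size/min_size values are from our pools):

```python
# Simplified model of when a PG can serve I/O (my sketch, not Ceph internals).
def pg_active(available_shards: int, min_size: int) -> bool:
    """A PG stays active only while at least min_size replicas/shards are up."""
    return available_shards >= min_size

# Replicated pool: size=3, min_size=2.
assert pg_active(3, 2)       # healthy
assert pg_active(2, 2)       # one replica down: still active
assert not pg_active(1, 2)   # node + drive take out two replicas: inactive

# EC k=2 m=2 pool: 4 shards, min_size=3.
# Losing two shards drops below min_size even though k=2 shards
# would still be enough to reconstruct the data.
assert pg_active(3, 3)
assert not pg_active(2, 3)
```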
It would be great to get those PGs active sooner. Besides buying faster
hardware, any suggestions?
I also have a couple questions:
If fewer replicas than min_size are available, is reading from those PGs
allowed? That seems like it would be a safe operation, but the min_size
definition says "Sets the minimum number of replicas required for I/O",
which implies it covers reads, not just writes.
Does Ceph prioritize repairing PGs that a client is trying to access?
Most of the files on our cluster are not heavily accessed, so if a
client tried to read a specific file, it shouldn't take long to repair
just the affected PGs if the other repairs were paused.
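I did come across commands for manually bumping recovery priority on specific PGs (Luminous and later, if I'm reading the docs right); the PG id below is just a placeholder:

```shell
# Ask Ceph to recover/backfill a specific PG ahead of the others
# (2.1f is a placeholder PG id, not from our cluster).
ceph pg force-recovery 2.1f
ceph pg force-backfill 2.1f
```

But that still requires an operator to notice and intervene, so the question about automatic client-driven prioritization stands.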
Thanks for your help!
C.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx