Hi,
I'm running a 3-node cluster with 6 OSDs in each node. I'm using two
types of pools, one with size 3 and one with size 2, both with min_size 1,
and with the node as the failure domain. I stopped every OSD on a single
node to do some maintenance, which left the cluster in a degraded but
operational state.
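For reference, the pools are configured roughly like this (the pool names
are just placeholders, not the real ones, and the noout line is only shown
as the usual step before this kind of maintenance):

    ceph osd pool set pool-repl3 size 3
    ceph osd pool set pool-repl3 min_size 1
    ceph osd pool set pool-repl2 size 2
    ceph osd pool set pool-repl2 min_size 1
    # CRUSH rules use host as the failure domain
    ceph osd set noout    # keep the stopped OSDs from being marked out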
For reasons still unknown, one of the OSDs that was still running restarted,
and a few PGs became down+peering. I guess this is normal: those PGs were in
the size-2 pool and their replica must have been on the node that was in
maintenance, so the restarting OSD couldn't find a peer to check the contents
against and wasn't elected as primary. But the interesting thing was that
every request hitting that OSD was blocked, even for PGs which had peers on
the third, fully operational node. In the OSD's logs I saw blocked requests
rising in number and delay. After bringing the node back from maintenance
the restarted OSD found its peers and everything went back to normal.
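If anyone wants to look at the same picture, something like the following
should show the state I'm describing (osd.12 is just an example id):

    ceph health detail                       # down+peering PGs and slow request warnings
    ceph pg dump_stuck inactive              # PGs stuck in down/peering
    ceph pg <pgid> query                     # peering state of a single PG
    ceph daemon osd.12 dump_ops_in_flight    # ops currently held up on that OSD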
My question is: is it normal for an OSD to block every request if a few of
its PGs are down+peering? I thought only the requests that hit the downed
PGs would be blocked. By the way, I'm running 0.87.
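To be clear about what I mean by "requests that hit the downed PGs": I'd
expect something like this to tell me whether a given object lands on one
of them (pool and object names are just examples):

    ceph osd map pool-repl2 someobject    # PG and acting set for that object
    ceph pg map <pgid>                    # up/acting OSDs for a given PG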
Best regards,
Mate