Hi,
I'm running a 3-node cluster with 6 OSDs in each node. I'm using two
types of pools, one with size 3 and one with size 2, both with min_size 1,
and with the node as the failure domain. I stopped every OSD on a single
node to do some maintenance, which left the cluster in a degraded but
operational state.
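For reference, the pools are configured roughly like this (the pool names
are just placeholders, not the real ones, and the noout line is only shown
as the usual step before this kind of maintenance):

    ceph osd pool set pool-repl3 size 3
    ceph osd pool set pool-repl3 min_size 1
    ceph osd pool set pool-repl2 size 2
    ceph osd pool set pool-repl2 min_size 1
    # CRUSH rules use host as the failure domain
    ceph osd set noout    # keep the stopped OSDs from being marked out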
For reasons still unknown, one of the OSDs that was still running restarted,
and a few PGs became down+peering. I guess this is normal: those PGs were in
the size-2 pool and their replica must have been on the node that was in
maintenance, so the restarting OSD couldn't find a peer to check the contents
against and wasn't elected as primary. But the interesting thing was that
every request hitting that OSD was blocked, even for PGs which had peers on
the third, fully operational node. In the OSD's logs I saw blocked requests
rising in number and delay. After bringing the node back from maintenance
the restarted OSD found its peers and everything went back to normal.
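If anyone wants to look at the same picture, something like the following
should show the state I'm describing (osd.12 is just an example id):

    ceph health detail                       # down+peering PGs and slow request warnings
    ceph pg dump_stuck inactive              # PGs stuck in down/peering
    ceph pg <pgid> query                     # peering state of a single PG
    ceph daemon osd.12 dump_ops_in_flight    # ops currently held up on that OSD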
My question is: is it normal for an OSD to block every request if a few of
its PGs are down+peering? I thought only the requests that hit the downed
PGs would be blocked. By the way, I'm running 0.87.
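To be clear about what I mean by "requests that hit the downed PGs": I'd
expect something like this to tell me whether a given object lands on one
of them (pool and object names are just examples):

    ceph osd map pool-repl2 someobject    # PG and acting set for that object
    ceph pg map <pgid>                    # up/acting OSDs for a given PG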
Best regards,
Mate