Re: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

For me, it was the .rgw.meta pool that had very dense placement groups. The OSDs would fail to start and would then hit the suicide timeout while trying to scan the PGs. We had to remove all references to those placement groups just to get the OSDs to start. It wasn't pretty.
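
Roughly, that kind of removal is done with ceph-objectstore-tool against the stopped OSD. A sketch of the sequence (the data path and pgid below are placeholders, not our exact commands, and flags can differ slightly between releases; take an export first, since the remove permanently discards that OSD's copy of the PG):

    # With the OSD stopped, list the PGs it holds:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --op list-pgs

    # Keep a copy of the offending PG before touching it:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 7.1a --op export --file /root/pg-7.1a.export

    # Remove it so the OSD can start without scanning it:
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 \
        --pgid 7.1a --op remove --force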


On Mon, Aug 19, 2019, 2:09 AM Troy Ablan <tablan@xxxxxxxxx> wrote:
Yes, it's possible that they do, but since all of the affected OSDs are
still down and the monitors have been restarted since then, all of those
pools have PGs in an unknown state and don't return anything in
ceph pg ls.

There weren't that many placement groups for the SSDs, but also I don't
know that there were that many objects.  There were of course a ton of
omap key/values.

-Troy

On 8/18/19 10:57 PM, Brett Chancellor wrote:
> This sounds familiar. Do any of these pools on the SSD have fairly dense
> placement group to object ratios? Like more than 500k objects per pg?
> (ceph pg ls)
>
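
(For anyone else wanting to run the check described above: the ratio can be pulled out of ceph pg ls. A rough sketch, assuming the OBJECTS count is the second column of the plain output, which may vary by release:)

    # Show the 20 PGs with the most objects (field number is an assumption;
    # adjust it to match the OBJECTS column in your release's ceph pg ls output):
    ceph pg ls | awk 'NR>1 {print $2, $1}' | sort -rn | head -20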
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
