Why do you create so many PGs? The goal is roughly 100 per OSD; with your numbers (the ~8000 existing PGs plus two new 20000-PG pools, times a replica size of 3) you have
3 * 48000 / 140 ~= 1000 PG replicas per OSD.
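Spelled out as a quick sketch in Python, since the arithmetic is easy to get wrong (the 3x replica factor is the assumption implied by the calculation above, not something stated explicitly in the thread):

    # rough PG-replicas-per-OSD estimate from the numbers in this thread
    existing_pgs = 8000        # PGs already in the (idle) cluster
    new_pgs = 2 * 20000        # two new pools with 20000 PGs each
    replica_size = 3           # assumed 3x replication, as in the "3 *" above
    num_osds = 140

    total_pgs = existing_pgs + new_pgs               # 48000
    per_osd = replica_size * total_pgs // num_osds   # ~1028
    print(per_osd, "PG replicas per OSD, against a target of ~100")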
-- Dan van der Ster || Data & Storage Services || CERN IT Department --
On 13 Mar 2014 at 11:11:16, Kasper Dieter (dieter.kasper@xxxxxxxxxxxxxx) wrote:
We have observed a very similar behavior.
In a 140-OSD cluster (newly created and idle) ~8000 PGs are available.
After adding two new pools (each with 20000 PGs),
100 out of 140 OSDs go down + out.
The cluster never recovers.
This problem can be reproduced every time with v0.67 and v0.72.
With v0.61 this problem does not show up.
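(The pools were presumably created with the usual "ceph osd pool create <poolname> 20000 20000" invocation; the exact commands are not in the report above, so take that as an illustration only.)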
-Dieter
On Thu, Mar 13, 2014 at 10:46:05AM +0100, Gandalf Corvotempesta wrote:
> 2014-03-13 9:02 GMT+01:00 Andrey Korolyov <andrey@xxxxxxx>:
> > Yes, if you have an essentially high amount of committed data in the cluster
> > and/or a large number of PGs (tens of thousands).
>
> I've increased from 64 to 8192 PGs
>
> > If you have room to
> > experiment with this transition from scratch, you may want to play with the
> > numbers in the OSD queues, since they cause deadlock-like behaviour on
> > operations like increasing the PG count or deleting a large pool. If the cluster
> > has no I/O at all at the moment, such behaviour is definitely not expected.
>
> My cluster was totally idle; it's a test cluster deployed from the ceph-ansible repository and nobody
> was using it.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com