Version?
-Sam

On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann
> <christian.eichelmann@xxxxxxxx> wrote:
>> Hi all,
>>
>> I want to understand what Ceph does if several OSDs are down. First of all,
>> a few words about our setup:
>>
>> We have 5 monitors and 12 OSD servers, each with 60x 2TB disks. These
>> servers are spread across 4 racks in our datacenter; every rack holds 3
>> OSD servers. We have a replication factor of 4 and a CRUSH rule applied
>> that says "step chooseleaf firstn 0 type rack". So, in my opinion, every
>> rack should hold a copy of all the data in our Ceph cluster. Is that more
>> or less correct?
>>
>> Our cluster is in state HEALTH_OK and I am rebooting one of our OSD
>> servers. That means 60 of 720 OSDs are going down. Since this hardware
>> takes quite some time to boot up, we are using "mon osd down out subtree
>> limit = host" to avoid rebalancing when a whole server goes down. Ceph
>> shows this output of "ceph -s" while the OSDs are down:
>>
>>     health HEALTH_WARN 22227 pgs degraded; 1 pgs peering; 22227 pgs stuck
>>       degraded; 1 pgs stuck inactive; 22228 pgs stuck unclean; 22227 pgs
>>       stuck undersized; 22227 pgs undersized; recovery 623/7420 objects
>>       degraded (8.396%); 60/720 in osds are down
>>     monmap e5: 5 mons at
>>       {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
>>       election epoch 228, quorum 0,1,2,3,4
>>       mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>     osdmap e60390: 720 osds: 660 up, 720 in
>>     pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>>           3948 GB used, 1304 TB / 1308 TB avail
>>           623/7420 objects degraded (8.396%)
>>              45356 active+clean
>>                  1 peering
>>              22227 active+undersized+degraded
>>
>> The pgs that are degraded and undersized are not a problem, since this
>> behaviour is expected. I am worried about the peering pg (it stays in this
>> state until all OSDs are up again), since this would cause I/O to hang if
>> I am not mistaken.
>>
>> After the host is back up and all OSDs are up and running again, I see this:
>>
>>     health HEALTH_WARN 2 pgs stuck unclean
>>     monmap e5: 5 mons at
>>       {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
>>       election epoch 228, quorum 0,1,2,3,4
>>       mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>     osdmap e60461: 720 osds: 720 up, 720 in
>>     pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>>           3972 GB used, 1304 TB / 1308 TB avail
>>                  2 inactive
>>              67582 active+clean
>>
>> Without any interaction, it will stay in this state. I guess these two
>> inactive pgs will also cause I/O to hang? Some more information:
>>
>>     ceph health detail
>>     HEALTH_WARN 2 pgs stuck unclean
>>     pg 9.f765 is stuck unclean for 858.298811, current state inactive,
>>       last acting [91,362,484,553]
>>     pg 9.ea0f is stuck unclean for 963.441117, current state inactive,
>>       last acting [91,233,485,524]
>>
>> I tried to give osd.91 a kick with "ceph osd down 91".
>>
>> After the OSD is back in the cluster:
>>
>>     health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck
>>       unclean
>>
>> So even worse. I decided to take the OSD out. The cluster goes back to
>> HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing,
>> ending with the cluster in an OK state again.
>>
>> That actually happens every time some OSDs go down. I don't understand
>> why the cluster is not able to get back to a healthy state without admin
>> interaction. In a setup with several hundred OSDs it is normal business
>> that some of them go down from time to time. Are there any ideas why this
>> is happening? Right now, we do not have much data in our cluster, so I
>> can do some tests. Any suggestions would be appreciated.
>
> Have you done any digging into the state of the PGs reported as
> peering or inactive or whatever when this pops up? Running pg_query,
> looking at their calculated and acting sets, etc.
>
> I suspect it's more likely you're exposing a reporting bug with stale
> data, rather than actually stuck PGs, but it would take more
> information to check that out.
> -Greg
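A rough sketch of the digging Greg suggests, using the PG ids from Christian's
"ceph health detail" output above. The jq filters are only illustrative: they
assume jq is installed and that the pg query JSON exposes the usual
state/up/acting/recovery_state fields; adjust to whatever the running release
actually emits.

    # Full peering/recovery state for one of the stuck PGs
    ceph pg 9.f765 query > /tmp/pg_9.f765.json

    # The interesting parts: current state, up set, acting set, and the
    # recovery_state section, which records what peering is waiting on
    jq '.state, .up, .acting, .recovery_state' /tmp/pg_9.f765.json

    # Everything the monitors consider stuck, with their acting sets
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # Sam's question: the locally installed version, and what osd.91
    # itself reports
    ceph -v
    ceph tell osd.91 version

Comparing the up and acting sets from pg query against what "ceph health
detail" claims is the quickest way to tell a genuinely stuck PG from the
stale-reporting case Greg suspects.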
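To sanity-check the rack-level replication Christian describes (one copy per
rack via "step chooseleaf firstn 0 type rack"), the CRUSH map can also be
decompiled and the rule test-mapped offline. The rule id 1 below is only a
placeholder; the real id comes from "ceph osd crush rule dump".

    # Dump and decompile the CRUSH map for inspection
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

    # Simulate placements for 4 replicas; each mapping should pick
    # OSDs from four different racks
    crushtool --test -i /tmp/crushmap --rule 1 --num-rep 4 --show-mappings | head

    # Cross-check a specific PG's up/acting set against the OSD tree
    ceph pg map 9.ea0f
    ceph osd tree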