On Wed, Nov 28, 2012 at 1:30 PM, Cláudio Martins <ctpm@xxxxxxxxxx> wrote:
>
> On Wed, 28 Nov 2012 13:08:08 -0800 Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>> Can you post the output of ceph -s?
>
> 'ceph -s' right now gives
>
>    health HEALTH_WARN 923 pgs degraded; 8666 pgs down; 9606 pgs peering; 7 pgs recovering; 406 pgs recovery_wait; 3769 pgs stale; 9606 pgs stuck inactive; 3769 pgs stuck stale; 11052 pgs stuck unclean; recovery 121068/902868 degraded (13.409%); 4824/300956 unfound (1.603%); 2/18 in osds are down
>    monmap e1: 1 mons at {0=193.136.128.202:6789/0}, election epoch 1, quorum 0 0
>    osdmap e7669: 62 osds: 16 up, 18 in
>    pgmap v47643: 12480 pgs: 35 active, 1223 active+clean, 129 stale+active, 321 active+recovery_wait, 198 stale+active+clean, 236 peering, 2 active+remapped, 2 stale+active+recovery_wait, 6126 down+peering, 249 active+degraded, 2 stale+active+recovering+degraded, 598 stale+peering, 7 active+clean+scrubbing, 29 active+recovery_wait+remapped, 2067 stale+down+peering, 618 stale+active+degraded, 52 active+recovery_wait+degraded, 61 remapped+peering, 365 down+remapped+peering, 2 stale+active+recovery_wait+degraded, 45 stale+remapped+peering, 108 stale+down+remapped+peering, 5 active+recovering; 1175 GB data, 1794 GB used, 25969 GB / 27764 GB avail; 121068/902868 degraded (13.409%); 4824/300956 unfound (1.603%)
>    mdsmap e1: 0/0/1 up
>
> The cluster has been in this state since the last attempt to get it
> going. I added about 100GB of swap on each machine to avoid the OOM
> killer. Running like this resulted in the machines thrashing wildly and
> reaching a load average of ~2000, and after a while the OSDs started
> dying/committing suicide, but *not* from OOM. Some of the few that remain
> have bloated to around 1.9GB of memory usage.
>
> If you want, I can try to restart the whole thing tomorrow and collect
> fresh log output from the dying OSDs, or any other action or debug info
> that you might find useful.

Have you made any progress here?

Given your RAM limitations, I'd suggest doing an incremental startup of
the OSDs: start maybe 4 per node, let them stabilize as much as they can,
then add another couple per node, let those stabilize, and so on.
-Greg
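For what it's worth, a rough sketch of that kind of staged startup on one node
might look something like the script below. This is only an illustration of
Greg's suggestion, not anything from the original thread: it assumes
sysvinit-style "service ceph start osd.N" scripts (on other setups this might
be "/etc/init.d/ceph start osd.N" or your init system's equivalent), and the
OSD_IDS and BATCH variables are hypothetical placeholders you would fill in
with the OSD IDs actually hosted on that node.

  #!/bin/sh
  # Sketch: bring OSDs up a few at a time per node and wait for peering and
  # recovery to settle before starting the next batch (per Greg's advice).

  OSD_IDS="0 1 2 3 4 5 6 7"   # OSDs hosted on this node (example values)
  BATCH=4                      # how many OSDs to start per round

  set -- $OSD_IDS
  while [ $# -gt 0 ]; do
      count=0
      # Start up to $BATCH OSDs from the remaining list.
      while [ $# -gt 0 ] && [ "$count" -lt "$BATCH" ]; do
          echo "starting osd.$1"
          service ceph start "osd.$1"   # adjust to your init system
          shift
          count=$((count + 1))
      done
      # Pause here: watch 'ceph -s' until pg states stop changing and OSD
      # memory usage looks stable, then continue with the next batch.
      echo "Batch started; check 'ceph -s', then press Enter for the next batch."
      read dummy
  done

The manual pause is deliberate: with this little RAM headroom, it's probably
safer to eyeball 'ceph -s' and per-OSD memory between batches than to automate
the wait.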