I've been through your post many times (Google likes it ;). I've been trying all of the noout/nodown/noup flags. But I will look into the XFS issue you are talking about, and read the whole post one more time..
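For reference, the flags I've been toggling are just the standard cluster-wide ones (dumpling syntax), along the lines of:

    ceph osd set noout     # don't mark stopped OSDs "out", so no rebalancing kicks in
    ceph osd set nodown    # don't mark unresponsive OSDs "down"
    ceph osd set noup      # keep booting OSDs from being marked "up"
    ceph osd unset noout   # ...and "ceph osd unset <flag>" to clear each one again

So far none of that has changed the 15-minute pattern on osd.58.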
/C

On Wed, Sep 17, 2014 at 12:01 AM, Craig Lewis <clewis at centraldesktop.com> wrote:

> I ran into a similar issue before. I was having a lot of OSD crashes
> caused by XFS memory allocation deadlocks. My OSDs crashed so many times
> that they couldn't replay the OSD map before they would be marked
> unresponsive.
>
> See if this sounds familiar:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040002.html
>
> If so, Sage's procedure to apply the osdmaps fixed my cluster:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-May/040176.html
>
>
> On Tue, Sep 16, 2014 at 2:51 PM, Christopher Thorjussen
> <christopher.thorjussen at onlinebackupcompany.com> wrote:
>
>> I've got several OSDs that are spinning at 100%.
>>
>> I've retained some professional services to have a look. It's out of my
>> newbie reach..
>>
>> /Christopher
>>
>> On Tue, Sep 16, 2014 at 11:23 PM, Craig Lewis <clewis at centraldesktop.com>
>> wrote:
>>
>>> Is it using any CPU or disk I/O during the 15 minutes?
>>>
>>> On Sun, Sep 14, 2014 at 11:34 AM, Christopher Thorjussen
>>> <christopher.thorjussen at onlinebackupcompany.com> wrote:
>>>
>>>> I'm waiting for my cluster to recover from a crashed disk and a second
>>>> OSD that has been taken out (crushmap, rm, stopped).
>>>>
>>>> Now I'm stuck looking at this output ('ceph -w') while my osd.58
>>>> goes down every 15 minutes:
>>>>
>>>> 2014-09-14 20:08:56.535688 mon.0 [INF] pgmap v31056972: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:57.549302 mon.0 [INF] pgmap v31056973: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:58.562771 mon.0 [INF] pgmap v31056974: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>> 2014-09-14 20:08:59.569851 mon.0 [INF] pgmap v31056975: 24192 pgs: 1
>>>> active, 23888 active+clean, 2 active+remapped+backfilling, 301
>>>> active+degraded; 36677 GB data, 93360 GB used, 250 TB / 341 TB avail;
>>>> 148288/25473878 degraded (0.582%)
>>>>
>>>> Here is a log from when I restarted osd.58 through to the next restart
>>>> 15 minutes later: http://pastebin.com/rt64vx9M
>>>> In short, it just waits for 15 minutes not doing anything and then goes
>>>> down, putting lots of lines like this in the log for that OSD:
>>>>
>>>> 2014-09-14 20:02:08.517727 7fbd3909a700 0 -- 10.47.18.33:6812/27234
>>>> >> 10.47.18.32:6824/21269 pipe(0x35c12280 sd=117 :38289 s=2 pgs=159
>>>> cs=1 l=0 c=0x35bcf1e0).fault with nothing to send, going to standby
>>>> 2014-09-14 20:02:08.519312 7fbd37b85700 0 -- 10.47.18.33:6812/27234
>>>> >> 10.47.18.34:6808/5278 pipe(0x36c64500 sd=130 :44909 s=2 pgs=16370
>>>> cs=1 l=0 c=0x36cc4f20).fault with nothing to send, going to standby
>>>>
>>>> Then I have to restart it. And it repeats.
>>>>
>>>> What should/can I do? Take it out?
>>>>
>>>> I've got 4 servers with 24 disks each.
>>>> Details about servers: http://pastebin.com/XQeSh8gJ
>>>> Running dumpling - 0.67.10
>>>>
>>>> Cheers,
>>>> Christopher Thorjussen
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
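(Bottom-posting one note for the archives: this is my rough reading of the map catch-up approach from Sage's thread linked above, not a quote of his procedure, and the admin socket path below is the default one and may differ on other setups or versions.)

    # keep OSDs from being marked "up" while they chew through old osdmaps
    ceph osd set noup

    # on the node hosting osd.58, see how far behind it is on maps;
    # "oldest_map" / "newest_map" in the output should climb toward the
    # cluster's current epoch (compare with "ceph osd stat")
    ceph --admin-daemon /var/run/ceph/ceph-osd.58.asok status

    # once it has caught up, let it join again
    ceph osd unset noup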