Hello,

On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:

> Hello everybody,
>
> I’m running since 4 months a ceph cluster configured with two monitors :
>
> 1 host : 16GB RAM - 12x 4TB disks - 12 OSD - 1 monitor - RAID-1 for system
> 1 host : 16GB RAM - 12x 4TB disks - 12 OSD - 1 monitor - RAID-1 for system
>
Too little RAM, just 2 monitors, just 2 nodes...

> This night I’ve encountered an issue with the crash of the first host.
>
> My first question is why with 1 host down, all my cluster was down
> (unable to do ceph status — hang command) and all my rbd was stuck
> without possibility to R/W.

Re-read the documentation: you need at least 3 monitors to survive the
loss of one (monitor) node.
Your osd_pool_default_min_size would have left you in a usable situation,
but 2 nodes is really a minimal case.

> I rebooted the first host, and 2 hours later
> the second go down with the same issue (all rbd down and ceph hang).
>
> After reboot, here is ceph status :
>
> # ceph status
>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
>      health HEALTH_ERR
>             3 pgs inconsistent
>             1 pgs peering
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             36 requests are blocked > 32 sec
>             928 scrub errors
>             clock skew detected on mon.drt-becks
>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
>             election epoch 26, quorum 0,1 drt-marco,drt-becks
>      osdmap e961: 24 osds: 24 up, 24 in
>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
>             1039 GB used, 88092 GB / 89177 GB avail
>                  393 active+clean
>                    3 active+clean+scrubbing+deep
>                    3 active+clean+inconsistent
>                    1 peering
>   client io 57290 B/s wr, 7 op/s
>
You will want to:
a) fix your NTP to get rid of the clock skew,
b) check your logs for the cause of the scrub errors,
c) do the same for the stuck (blocked) requests.

> Also I found this error on DMESG about the crash :
>
> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s!
> [btrfs-cleaner:32713]
>
> All my volumes are on BTRFS, maybe it was not a good idea ?
>
Depending on your OS and kernel version, most definitely.
There are plenty of BTRFS problems to be found in the ML archives.

Christian
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
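
A note on the monitor count discussed above: Ceph monitors need a strict
majority of the monitors in the monmap to form a quorum, which is why 2
monitors cannot survive the loss of either one. The arithmetic can be
sketched roughly as below. This is only an illustration in Python; the
function names are made up for this example and are not part of any Ceph
tool or API.

```python
def quorum_size(num_mons: int) -> int:
    """Smallest strict majority of num_mons monitors (hypothetical helper)."""
    return num_mons // 2 + 1


def tolerated_failures(num_mons: int) -> int:
    """How many monitors can be lost while a majority remains (hypothetical helper)."""
    return num_mons - quorum_size(num_mons)


if __name__ == "__main__":
    # Compare a few common monitor counts.
    for n in (1, 2, 3, 5):
        print(f"{n} mons: quorum needs {quorum_size(n)}, "
              f"can lose {tolerated_failures(n)}")
```

With 2 monitors the majority is 2, so no monitor may fail; with 3 the
majority is still 2, so one may fail. Going from 2 to 3 monitors is what
buys the first tolerated failure.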