Hello,

On Thu, 30 Jul 2015 11:39:29 +0200 Khalid Ahsein wrote:

> Good morning Christian,
>
> Thank you for your quick response.
> So I need to upgrade to 64 GB or 96 GB to be more secure?
>
32GB would be sufficient; 64GB will give you read performance benefits
with hot objects (large pagecache).

> And sorry, I thought that 2 monitors was the minimum. We will work to
> add a new host quickly.
>
Good, though I can't really help you with your key problems.

> About osd_pool_default_min_size, should I change something for the
> future?
>
It's fine for your setup; 2 is the norm with a replication of 3.

Christian

> Thank you again,
> K
>
>
> On 30 Jul 2015, at 11:12, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:
> >
> >> Hello everybody,
> >>
> >> I have been running a Ceph cluster for 4 months, configured with two
> >> monitors:
> >>
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for
> >> system
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for
> >> system
> >>
> > Too little RAM, just 2 monitors, just 2 nodes...
> >
> >> Last night I encountered an issue when the first host crashed.
> >>
> >> My first question is why, with 1 host down, my whole cluster was down
> >> (ceph status hung) and all my RBDs were stuck with no possibility to
> >> read or write.
> >
> > Re-read the documentation; you need at least 3 monitors to survive the
> > loss of one (monitor) node.
> >
> > Your osd_pool_default_min_size would have left you in a usable
> > situation; 2 nodes is really a minimal case.
> >
> >> I rebooted the first host, and 2 hours later the second went down
> >> with the same issue (all RBDs down and ceph hanging).
> >>
> >> After the reboot, here is ceph status:
> >>
> >> # ceph status
> >>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
> >>      health HEALTH_ERR
> >>             3 pgs inconsistent
> >>             1 pgs peering
> >>             1 pgs stuck inactive
> >>             1 pgs stuck unclean
> >>             36 requests are blocked > 32 sec
> >>             928 scrub errors
> >>             clock skew detected on mon.drt-becks
> >>      monmap e1: 2 mons at
> >> {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
> >>             election epoch 26, quorum 0,1 drt-marco,drt-becks
> >>      osdmap e961: 24 osds: 24 up, 24 in
> >>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
> >>             1039 GB used, 88092 GB / 89177 GB avail
> >>                  393 active+clean
> >>                    3 active+clean+scrubbing+deep
> >>                    3 active+clean+inconsistent
> >>                    1 peering
> >>   client io 57290 B/s wr, 7 op/s
> >>
> > You will want to:
> > a) fix your NTP / clock skew
> > b) check your logs about the scrub errors
> > c) same for the stuck requests
> >
> >> Also, I found this error in dmesg about the crash:
> >>
> >> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
> >> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s!
> >> [btrfs-cleaner:32713]
> >>
> >> All my volumes are on BTRFS; maybe it was not a good idea?
> >>
> > Depending on your OS and kernel version, most definitely.
> > Plenty of BTRFS problems are to be found in the ML archives.
> >
> > Christian
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
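
To make the min_size advice above concrete: osd_pool_default_min_size in
ceph.conf only applies to newly created pools, while existing pools carry
their own min_size. A minimal sketch of checking and setting it with the
standard ceph CLI follows; the pool name "rbd" is an assumption (the status
output only shows "1 pools"), so substitute the real pool name.

    # List pools and inspect the current replication settings
    # (pool name "rbd" is assumed here).
    ceph osd lspools
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # With a replication (size) of 3, min_size 2 keeps the pool writable
    # while one replica is missing, but refuses I/O with only one copy left.
    ceph osd pool set rbd min_size 2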
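
For the a)/b)/c) checklist in the quoted reply, a few commands that may help
narrow things down (a sketch only; osd.0, the log path and the <pgid>
placeholder are examples, adjust them to the hosts and PGs that actually
report problems):

    # a) Clock skew: compare NTP peers and offsets on both monitor hosts.
    ntpq -p

    # b) Scrub errors: find the inconsistent PGs and the matching OSD log
    #    lines, and only repair once the cause (e.g. a bad disk) is understood.
    ceph health detail | grep inconsistent
    grep -i scrub /var/log/ceph/ceph-osd.*.log
    ceph pg repair <pgid>        # pgid as shown by "ceph health detail"

    # c) Blocked requests: see which OSDs hold the slow requests.
    ceph health detail | grep blocked
    ceph daemon osd.0 dump_ops_in_flight   # run on the host that owns osd.0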
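
On the monitor count: with two monitors the quorum needs both of them (a
majority of 2 is 2), so losing either host stops the whole cluster, which
matches what was observed. The commands below show the current monitor map
and quorum; the last line is a hypothetical example of adding a third monitor
with ceph-deploy, assuming the cluster was deployed with it and using a
made-up hostname.

    # Show the monitor map and the current quorum members.
    ceph mon stat
    ceph quorum_status --format json-pretty

    # Hypothetical: add a third monitor on a new host (hostname made up),
    # assuming the cluster is managed with ceph-deploy.
    ceph-deploy mon add drt-third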
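
As for the BTRFS soft lockup, when searching or posting to the ML archives it
helps to include the exact kernel and btrfs-progs versions plus the relevant
dmesg lines, for example:

    uname -r
    btrfs --version
    dmesg | grep -iE 'btrfs|soft lockup'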