Hello,

On Thu, 30 Jul 2015 11:39:29 +0200 Khalid Ahsein wrote:

> Good morning Christian,
>
> Thank you for your quick response.
> So I need to upgrade to 64 GB or 96 GB to be more secure?
>
32GB would be sufficient; 64GB will give you read performance benefits
with hot objects (large pagecache).

> And sorry, I thought that 2 monitors was the minimum. We will work to
> add a new host quickly.
>
Good, though I can't really help you with your key problems.

> About osd_pool_default_min_size, should I change something for the
> future?
>
It's fine for your setup; 2 is the norm with a replication of 3.

Christian

> Thank you again,
> K
>
>
> On 30 Jul 2015, at 11:12, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> >
> > Hello,
> >
> > On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:
> >
> >> Hello everybody,
> >>
> >> I have been running a Ceph cluster for 4 months, configured with two
> >> monitors:
> >>
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for
> >> system
> >> 1 host: 16GB RAM - 12x 4TB disks - 12 OSDs - 1 monitor - RAID-1 for
> >> system
> >>
> > Too little RAM, just 2 monitors, just 2 nodes...
> >
> >> Last night I encountered an issue when the first host crashed.
> >>
> >> My first question is why, with 1 host down, my whole cluster was down
> >> (ceph status hung) and all my RBDs were stuck with no possibility to
> >> read or write.
> >
> > Re-read the documentation; you need at least 3 monitors to survive the
> > loss of one (monitor) node.
> >
> > Your osd_pool_default_min_size would have left you in a usable
> > situation; 2 nodes is really a minimal case.
> >
> >> I rebooted the first host, and 2 hours later the second went down
> >> with the same issue (all RBDs down and ceph hanging).
> >>
> >> After the reboot, here is ceph status:
> >>
> >> # ceph status
> >>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
> >>      health HEALTH_ERR
> >>             3 pgs inconsistent
> >>             1 pgs peering
> >>             1 pgs stuck inactive
> >>             1 pgs stuck unclean
> >>             36 requests are blocked > 32 sec
> >>             928 scrub errors
> >>             clock skew detected on mon.drt-becks
> >>      monmap e1: 2 mons at
> >> {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
> >>             election epoch 26, quorum 0,1 drt-marco,drt-becks
> >>      osdmap e961: 24 osds: 24 up, 24 in
> >>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
> >>             1039 GB used, 88092 GB / 89177 GB avail
> >>                  393 active+clean
> >>                    3 active+clean+scrubbing+deep
> >>                    3 active+clean+inconsistent
> >>                    1 peering
> >>   client io 57290 B/s wr, 7 op/s
> >>
> > You will want to:
> > a) fix your NTP / clock skew
> > b) check your logs about the scrub errors
> > c) same for the stuck requests
> >
> >> Also, I found this error in dmesg about the crash:
> >>
> >> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
> >> kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s!
> >> [btrfs-cleaner:32713]
> >>
> >> All my volumes are on BTRFS; maybe it was not a good idea?
> >>
> > Depending on your OS and kernel version, most definitely.
> > Plenty of BTRFS problems are to be found in the ML archives.
> >
> > Christian
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
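
To make the min_size advice above concrete: osd_pool_default_min_size in
ceph.conf only applies to newly created pools, while existing pools carry
their own min_size. A minimal sketch of checking and setting it with the
standard ceph CLI follows; the pool name "rbd" is an assumption (the status
output only shows "1 pools"), so substitute the real pool name.

    # List pools and inspect the current replication settings
    # (pool name "rbd" is assumed here).
    ceph osd lspools
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # With a replication (size) of 3, min_size 2 keeps the pool writable
    # while one replica is missing, but refuses I/O with only one copy left.
    ceph osd pool set rbd min_size 2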
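
For the a)/b)/c) checklist in the quoted reply, a few commands that may help
narrow things down (a sketch only; osd.0, the log path and the <pgid>
placeholder are examples, adjust them to the hosts and PGs that actually
report problems):

    # a) Clock skew: compare NTP peers and offsets on both monitor hosts.
    ntpq -p

    # b) Scrub errors: find the inconsistent PGs and the matching OSD log
    #    lines, and only repair once the cause (e.g. a bad disk) is understood.
    ceph health detail | grep inconsistent
    grep -i scrub /var/log/ceph/ceph-osd.*.log
    ceph pg repair <pgid>        # pgid as shown by "ceph health detail"

    # c) Blocked requests: see which OSDs hold the slow requests.
    ceph health detail | grep blocked
    ceph daemon osd.0 dump_ops_in_flight   # run on the host that owns osd.0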
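
On the monitor count: with two monitors the quorum needs both of them (a
majority of 2 is 2), so losing either host stops the whole cluster, which
matches what was observed. The commands below show the current monitor map
and quorum; the last line is a hypothetical example of adding a third monitor
with ceph-deploy, assuming the cluster was deployed with it and using a
made-up hostname.

    # Show the monitor map and the current quorum members.
    ceph mon stat
    ceph quorum_status --format json-pretty

    # Hypothetical: add a third monitor on a new host (hostname made up),
    # assuming the cluster is managed with ceph-deploy.
    ceph-deploy mon add drt-third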
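
As for the BTRFS soft lockup, when searching or posting to the ML archives it
helps to include the exact kernel and btrfs-progs versions plus the relevant
dmesg lines, for example:

    uname -r
    btrfs --version
    dmesg | grep -iE 'btrfs|soft lockup'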