Re: Crash and question

Christian Balzer <chibi@xxxxxxx> · Thu, 30 Jul 2015 18:12:01 +0900

Hello,

On Thu, 30 Jul 2015 10:55:30 +0200 Khalid Ahsein wrote:

> Hello everybody,
> 
> I have been running a ceph cluster for 4 months, configured with two monitors:
> 
> 1 host : 16GB RAM - 12x 4TB disks - 12 OSD - 1 monitor - RAID-1 for system
> 1 host : 16GB RAM - 12x 4TB disks - 12 OSD - 1 monitor - RAID-1 for system
> 
Too little RAM, just 2 monitors, just 2 nodes...

> This night I’ve encountered an issue with the crash of the first host.
> 
> My first question is why, with 1 host down, my whole cluster was down
> (unable to run ceph status, the command hangs) and all my rbd volumes
> were stuck with no way to read or write.

Re-read the documentation: you need at least 3 monitors to survive the
loss of one (monitor) node, because the monitors must keep a majority
quorum.
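
To make the monitor arithmetic concrete, here is a minimal sketch of the
majority rule. The function names are made up for illustration and are not
part of any Ceph tool; the only assumption is that a quorum requires a
strict majority of the monitors in the monmap.

```python
def quorum_size(num_monitors: int) -> int:
    """Smallest strict majority of num_monitors."""
    return num_monitors // 2 + 1


def tolerated_failures(num_monitors: int) -> int:
    """How many monitors can be lost while a majority still remains."""
    return num_monitors - quorum_size(num_monitors)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"{n} monitor(s): quorum needs {quorum_size(n)}, "
              f"survives {tolerated_failures(n)} failure(s)")
```

With 2 monitors the quorum is 2, so losing either one stops the cluster;
with 3 the quorum is still 2, which is what lets you lose one.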

Your osd_pool_default_min_size would have left you in a usable situation,
but 2 nodes is really the bare minimum.
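
As a rough illustration of the min_size side of this, the sketch below
checks whether a placement group would still accept I/O after one of two
hosts is lost. The pool values (size 2, min_size 1) are assumptions for the
example, since the actual settings are not given in this thread, and the
helper name is hypothetical.

```python
def pg_accepts_io(replicas_up: int, min_size: int) -> bool:
    """A PG serves client I/O only while at least min_size replicas are up."""
    return replicas_up >= min_size


# Assumed pool settings, NOT taken from the original post.
POOL_SIZE = 2      # one replica per host on a two-host cluster
POOL_MIN_SIZE = 1  # assumed value of osd_pool_default_min_size

if __name__ == "__main__":
    replicas_after_host_loss = POOL_SIZE - 1
    print("PGs still accept I/O:",
          pg_accepts_io(replicas_after_host_loss, POOL_MIN_SIZE))
```

Under these assumptions the data would have stayed available on the OSD
side; this check says nothing about the monitors, which are a separate
condition.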

> I rebooted the first host, and 2 hours later the second went down
> with the same issue (all rbd down and ceph hanging).
> 
> After reboot, here is ceph status :
> 
> # ceph status
>     cluster 9c29f469-7bad-4b64-97bf-3fbb1bbc0c5f
>      health HEALTH_ERR
>             3 pgs inconsistent
>             1 pgs peering
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             36 requests are blocked > 32 sec
>             928 scrub errors
>             clock skew detected on mon.drt-becks
>      monmap e1: 2 mons at {drt-becks=172.16.21.6:6789/0,drt-marco=172.16.21.4:6789/0}
>             election epoch 26, quorum 0,1 drt-marco,drt-becks
>      osdmap e961: 24 osds: 24 up, 24 in
>       pgmap v2532968: 400 pgs, 1 pools, 512 GB data, 130 kobjects
>             1039 GB used, 88092 GB / 89177 GB avail
>                  393 active+clean
>                    3 active+clean+scrubbing+deep
>                    3 active+clean+inconsistent
>                    1 peering
>   client io 57290 B/s wr, 7 op/s
> 
You will want to:
a) fix your NTP setup to get rid of the clock skew.
b) check your logs for the cause of the scrub errors.
c) do the same for the blocked requests.

> Also I found this error on DMESG about the crash :
> 
> Message from syslogd@drt-marco at Jul 30 04:03:57 ...
>  kernel:[4876519.657178] BUG: soft lockup - CPU#7 stuck for 22s! [btrfs-cleaner:32713]
> 
> All my volumes are on BTRFS, maybe it was not a good idea ?
> 
Depending on your OS and kernel version, most definitely.
There are plenty of BTRFS problems to be found in the ML archives.

Christian

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/