One Mon log huge and this Mon down often

joao.luis@xxxxxxxxxxx (Joao Eduardo Luis) · Sun, 24 Aug 2014 12:58:05 +0100

On 08/24/2014 01:57 AM, debian Only wrote:
> This happens when I use ceph-deploy create ceph01-vm ceph02-vm ceph04-vm
> to create 3 monitors.
> Now, about every 10 hours, one monitor goes down.  Every time it shows
> this error, even though sometimes the hard disk still has space left,
> such as 30G.
>
> When I deployed Ceph before, I created only one monitor in the first
> step (ceph-deploy create ceph01-vm) and then ran ceph-deploy mon add
> ceph02-vm, and I did not hit this problem.
>
> I do not know why.

Your monitor shut down because the disk it is sitting on has dropped to
(or below) 5% of available disk space.  This is meant to prevent the
monitor from running out of disk space and being unable to store
critical cluster information.  5% is a rough estimate: it may be
adequate for some disks, but may be too small for small disks or too
large for large disks.  You can adjust this value if you feel you need
to, using the 'mon_data_avail_crit' option (which defaults to 5, as in
5%).
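
As a rough illustration of that check, here is a minimal Python sketch.
It is not the monitor's actual code.  It takes the numbers from the
'update_stats' log line quoted below and assumes the percentage is
truncated to a whole number, which matches the "avail 5%" printed in
that line.  The function names are made up for the example.

```python
import re

# Numbers copied from the 'update_stats' log line quoted in this thread.
LOG_LINE = "update_stats avail 5% total 15798272 used 12941508 avail 926268"

def avail_percent(total, avail):
    """Available space as a whole-number percentage.

    Assumption: the fractional part is dropped (5.86% becomes 5%).
    """
    return avail * 100 // total

def is_critical(total, avail, crit_percent=5):
    """True when available space is at or below the critical threshold.

    crit_percent plays the role of 'mon_data_avail_crit' (default 5).
    """
    return avail_percent(total, avail) <= crit_percent

match = re.search(r"total (\d+) used (\d+) avail (\d+)", LOG_LINE)
total, used, avail = map(int, match.groups())

print(avail_percent(total, avail))   # whole-number percent available
print(is_critical(total, avail))     # would the default threshold trip?
```

With the values from your log the result is 5, which is at the default
threshold, so the shutdown is the expected behaviour.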

The big problem here however seems to be that you're running out of 
space due to huge monitor logs. Is that it?

If so, I would ask you to run the following commands on each monitor
host, replacing mon.* with that monitor's name (for example
mon.ceph01-vm), and share the results:

ceph daemon mon.* config get debug_mon
ceph daemon mon.* config get debug_ms
ceph daemon mon.* config get debug_paxos
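
If you want to check the results yourself, a small Python sketch along
these lines could compare them against the default levels from my
earlier reply (quoted below).  It assumes each command prints a
one-entry JSON object such as {"debug_mon": "1/5"}; the sample values
are invented for the example and are not from your cluster.

```python
import json

# Default levels, taken from the injectargs command quoted in this thread.
DEFAULTS = {"debug_mon": "1/5", "debug_ms": "0/5", "debug_paxos": "1/5"}

def non_default(outputs):
    """Return {option: value} for options whose level differs from DEFAULTS.

    `outputs` is a list of JSON strings, one per 'config get' call
    (assumed output format).
    """
    found = {}
    for raw in outputs:
        for option, value in json.loads(raw).items():
            # Options not listed in DEFAULTS are ignored.
            if DEFAULTS.get(option, value) != value:
                found[option] = value
    return found

# Hypothetical sample output, one string per command.
sample = ['{"debug_mon": "10/10"}', '{"debug_ms": "0/5"}', '{"debug_paxos": "1/5"}']
print(non_default(sample))
```

Any option this reports is a candidate for the debug settings to remove
from ceph.conf and reset with injectargs.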

   -Joao

>
> 2014-08-23 10:19:43.910650 7f3c0028c700  0 mon.ceph01-vm@1(peon).data_health(56) update_stats avail 5% total 15798272 used 12941508 avail 926268
> 2014-08-23 10:19:43.910806 7f3c0028c700 -1 mon.ceph01-vm@1(peon).data_health(56) reached critical levels of available space on local monitor storage -- shutdown!
> 2014-08-23 10:19:43.910811 7f3c0028c700  0 ** Shutdown via Data Health Service **
> 2014-08-23 10:19:43.931427 7f3bffa8b700  1 mon.ceph01-vm@1(peon).paxos(paxos active c 15814..16493) is_readable now=2014-08-23 10:19:43.931433 lease_expire=2014-08-23 10:19:45.989585 has v0 lc 16493
> 2014-08-23 10:19:43.931486 7f3bfe887700 -1 mon.ceph01-vm@1(peon) e2 *** Got Signal Interrupt ***
> 2014-08-23 10:19:43.931515 7f3bfe887700  1 mon.ceph01-vm@1(peon) e2 shutdown
> 2014-08-23 10:19:43.931725 7f3bfe887700  0 quorum service shutdown
> 2014-08-23 10:19:43.931730 7f3bfe887700  0 mon.ceph01-vm@1(shutdown).health(56) HealthMonitor::service_shutdown 1 services
> 2014-08-23 10:19:43.931735 7f3bfe887700  0 quorum service shutdown
>
>
>
> 2014-08-22 21:31 GMT+07:00 debian Only <onlydebian at gmail.com>:
>
>     This time ceph01-vm went down, with no big log, and the other 2
>     are OK.  I do not know the reason.  This is not my first time
>     installing Ceph, but it is the first time I have seen a mon go
>     down again and again.
>
>     ceph.conf on each OSD and MON:
>     [global]
>     fsid = 075f1aae-48de-412e-b024-b0f014dbc8cf
>     mon_initial_members = ceph01-vm, ceph02-vm, ceph04-vm
>     mon_host = 192.168.123.251,192.168.123.252,192.168.123.250
>     auth_cluster_required = cephx
>     auth_service_required = cephx
>     auth_client_required = cephx
>     filestore_xattr_use_omap = true
>
>     rgw print continue = false
>     rgw dns name = ceph-radosgw
>     osd pool default pg num = 128
>     osd pool default pgp num = 128
>
>
>     [client.radosgw.gateway]
>     host = ceph-radosgw
>     keyring = /etc/ceph/ceph.client.radosgw.keyring
>     rgw socket path = /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock
>     log file = /var/log/ceph/client.radosgw.gateway.log
>
>
>     2014-08-22 18:15 GMT+07:00 Joao Eduardo Luis <joao.luis at inktank.com>:
>
>         On 08/22/2014 10:21 AM, debian Only wrote:
>
>             I have 3 mons in Ceph 0.80.5 on Wheezy, and one RadosGW.
>
>             When this happened the first time, I increased the size of
>             the mon log device.  This time mon.ceph02-vm went down; only
>             this mon is down, the other 2 are OK.
>
>             Please, could someone give me some guidance?
>
>                27M Aug 22 02:11 ceph-mon.ceph04-vm.log
>                43G Aug 22 02:11 ceph-mon.ceph02-vm.log
>                2G Aug 22 02:11 ceph-mon.ceph01-vm.log
>
>
>         Depending on the debug level you set, and on which subsystems
>         you set a higher debug level for, the monitor can spit out A
>         LOT of information in a short period of time.  43GB is nothing
>         compared to some 100+ GB logs I've had to churn through in the
>         past.
>
>         However, I'm not grasping what kind of help you need.  According
>         to your 'ceph -s' below the monitors seem okay -- all are in,
>         health is OK.
>
>         If your issue is with that one monitor spitting out humongous
>         amounts of debug info, here's what you need to do:
>
>         - If you added one or more 'debug <something> = X' to that
>         monitor's ceph.conf, you will want to remove them so that in a
>         future restart the monitor doesn't start with non-default debug
>         levels.
>
>         - You will want to inject default debug levels into that one
>         monitor.
>
>         Depending on what debug levels you have increased, you will want
>         to run a version of "ceph tell mon.ceph02-vm injectargs
>         '--debug-mon 1/5 --debug-ms 0/5 --debug-paxos 1/5'"
>
>            -Joao
>
>
>             # ceph -s
>                 cluster 075f1aae-48de-412e-b024-b0f014dbc8cf
>                  health HEALTH_OK
>                  monmap e2: 3 mons at {ceph01-vm=192.168.123.251:6789/0,ceph02-vm=192.168.123.252:6789/0,ceph04-vm=192.168.123.250:6789/0}, election epoch 44, quorum 0,1,2 ceph04-vm,ceph01-vm,ceph02-vm
>                  mdsmap e10: 1/1/1 up {0=ceph06-vm=up:active}
>                  osdmap e145: 10 osds: 10 up, 10 in
>                   pgmap v4394: 2392 pgs, 21 pools, 4503 MB data, 1250 objects
>                         13657 MB used, 4908 GB / 4930 GB avail
>                             2392 active+clean
>
>
>             2014-08-22 02:06:34.738828 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable now=2014-08-22 02:06:34.738830 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>             2014-08-22 02:06:36.618805 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable now=2014-08-22 02:06:36.618807 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>             2014-08-22 02:06:36.620019 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable now=2014-08-22 02:06:36.620021 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>             2014-08-22 02:06:36.620975 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable now=2014-08-22 02:06:36.620977 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>             2014-08-22 02:06:36.629362 7ff2b9557700  0 mon.ceph02-vm@2(peon) e2 handle_command mon_command({"prefix": "mon_status", "format": "json"} v 0) v1
>             2014-08-22 02:06:36.633007 7ff2b9557700  0 mon.ceph02-vm@2(peon) e2 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
>             2014-08-22 02:06:36.637002 7ff2b9557700  0 mon.ceph02-vm@2(peon) e2 handle_command mon_command({"prefix": "health", "detail": "", "format": "json"} v 0) v1
>             2014-08-22 02:06:36.640971 7ff2b9557700  0 mon.ceph02-vm@2(peon) e2 handle_command mon_command({"dumpcontents": ["pgs_brief"], "prefix": "pg dump", "format": "json"} v 0) v1
>             2014-08-22 02:06:36.641014 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9756) is_readable now=2014-08-22 02:06:36.641016 lease_expire=2014-08-22 02:06:39.701305 has v0 lc 9756
>             2014-08-22 02:06:37.520387 7ff2b9557700  1 mon.ceph02-vm@2(peon).paxos(paxos active c 9037..9757) is_readable now=2014-08-22 02:06:37.520388 lease_expire=2014-08-22 02:06:42.501572 has v0 lc 9757
>
>
>
>
>
>
>         --
>         Joao Eduardo Luis
>         Software Engineer | http://inktank.com | http://ceph.com
>
>
>

-- 
Joao Eduardo Luis
Software Engineer | http://inktank.com | http://ceph.com