Re: ceph-mon terminated with status 28

Brad Hubbard <bhubbard@xxxxxxxxxx> · Mon, 14 Dec 2015 16:46:52 -0500 (EST)

----- Original Message -----
> From: "Tom Deneau" <tom.deneau@xxxxxxx>
> To: "Brad Hubbard" <bhubbard@xxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Sent: Tuesday, 15 December, 2015 3:21:27 AM
> Subject: RE: ceph-mon terminated with status 28
> 
> Thanks, Brad.  That was the problem.

Np.

> 
> Is there a reason why we don't log more descriptive info for this kind of
> failure?

I guess it may not have been anticipated that init would swallow these types of
errors early in the process and just report the return code.
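
For anyone else who only has the bare status from init to go on: in this code
path the status is the errno value that ceph-mon passed to exit(), so it can be
decoded by hand. A minimal sketch, assuming a POSIX system;
describe_exit_status is a hypothetical helper, not something in the Ceph tree:

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper (not part of Ceph): turn an exit status that was
// produced by exit(errno_value) into a human-readable description.
std::string describe_exit_status(int status)
{
  return std::string(std::strerror(status));
}
```

On Linux, describe_exit_status(28) returns the ENOSPC text, "No space left on
device".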

If you wouldn't mind opening a tracker for "Fatal errors at start-up are not
logged", or something similar, I can take a look at getting some meaningful log
entries reported during these early failures.
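
As a rough illustration of what that could look like (a sketch only, not the
actual fix), the fatal message can be sent to syslog as well as stderr, so it
survives even when init discards the daemon's stderr. The names avail_percent,
report_fatal and check_data_space are hypothetical, statvfs() stands in for
Ceph's get_fs_stats(), and the "ceph-mon" syslog ident is just an example label:

```cpp
#include <sys/statvfs.h>
#include <syslog.h>

#include <cerrno>
#include <iostream>
#include <string>

// Hypothetical helper: percentage of the filesystem holding `path` that is
// still available to unprivileged users, or -errno on failure.
int avail_percent(const std::string &path)
{
  struct statvfs vfs;
  if (statvfs(path.c_str(), &vfs) < 0)
    return -errno;
  if (vfs.f_blocks == 0)
    return 0;
  return static_cast<int>((100.0 * vfs.f_bavail) / vfs.f_blocks);
}

// Hypothetical helper: report a fatal start-up error on stderr and in syslog.
void report_fatal(const std::string &msg)
{
  std::cerr << msg << std::endl;
  openlog("ceph-mon", LOG_PID, LOG_DAEMON);  // ident chosen for illustration
  syslog(LOG_CRIT, "%s", msg.c_str());
  closelog();
}

// Returns 0 if start-up may continue, otherwise the exit status to use.
int check_data_space(const std::string &path, int crit_percent)
{
  int pct = avail_percent(path);
  if (pct < 0) {
    report_fatal("error checking fs stats for " + path);
    return -pct;
  }
  if (pct <= crit_percent) {
    report_fatal("error: " + path + " has only " + std::to_string(pct) +
                 "% space available");
    return ENOSPC;
  }
  return 0;
}
```

A caller would exit with the returned value when it is non-zero, which keeps
the status init sees unchanged while leaving a readable message in syslog.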

Let me know the tracker number.

Cheers,
Brad

> 
> -- Tom
> 
> > -----Original Message-----
> > From: Brad Hubbard [mailto:bhubbard@xxxxxxxxxx]
> > Sent: Sunday, December 13, 2015 4:19 PM
> > To: Deneau, Tom
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: Re: ceph-mon terminated with status 28
> > 
> > ----- Original Message -----
> > > From: "Tom Deneau" <tom.deneau@xxxxxxx>
> > > To: ceph-devel@xxxxxxxxxxxxxxx
> > > Sent: Sunday, 13 December, 2015 11:49:16 PM
> > > Subject: ceph-mon terminated with status 28
> > >
> > > I am trying to understand the following failure:
> > >
> > > A small cluster was running fine, and then was left unused for a while.
> > > When I went to try to use it again, the mon socket wasn't there and I
> > > could see that ceph-mon was not running.  I saw the lines below at the
> > > end of dmesg output.
> > > When I tried to restart ceph-mon using sudo start ceph-mon id=monhost,
> > > I got the same set of errors newly appended to dmesg output.
> > >
> > > I don't see anything more descriptive in /var/log/ceph/ceph-mon.log,
> > > just the recording of new mon processes starting.
> > >
> > > In this particular small cluster, the mon process was running on the
> > > > same node with 7 osd processes.  sudo initctl list shows that the osd
> > > > procs are still up, although they are logging that they can't
> > > > communicate with the mon socket.
> > >
> > > Is there someplace else I should look for more details as to why mon
> > > is down and can't be restarted?
> > >
> > > -- Tom Deneau
> > >
> > > dmesg output:
> > > --------------
> > > >  init: ceph-mon (ceph/monhost) main process (16538) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16227) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16546) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16548) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16556) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16558) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16566) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) respawning too fast, stopped
> > > >  init: ceph-create-keys main process (16568) killed by TERM signal
> > 
> > It looks like it's complaining about a lack of disk space? Status 28 matches
> > ENOSPC, which is what the check below exits with.
> > 
> > src/ceph_mon.cc:
> > 
> > int main(int argc, const char **argv)
> > {
> > ----8<----
> >   {
> >     // check fs stats. don't start if it's critically close to full.
> >     ceph_data_stats_t stats;
> >     int err = get_fs_stats(stats, g_conf->mon_data.c_str());
> >     if (err < 0) {
> >       cerr << "error checking monitor data's fs stats: "
> >            << cpp_strerror(err) << std::endl;
> >       exit(-err);
> >     }
> >     if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
> >       cerr << "error: monitor data filesystem reached concerning levels of"
> >            << " available storage space (available: "
> >            << stats.avail_percent << "% "
> >            << prettybyte_t(stats.byte_avail)
> >            << ")\nyou may adjust 'mon data avail crit' to a lower value"
> >            << " to make this go away (default: "
> >            << g_conf->mon_data_avail_crit << "%)\n" << std::endl;
> >       exit(ENOSPC);
> >     }
> > 
> > #define ENOSPC          28      /* No space left on device */
> > 
> > Try starting ceph-mon from the command line and see if you get the above
> > message.
> > 
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo
> > > info at  http://vger.kernel.org/majordomo-info.html
> > >