RE: ceph-mon terminated with status 28

"Deneau, Tom" <tom.deneau@xxxxxxx> · Tue, 15 Dec 2015 20:11:52 +0000

Brad --

The issue is in the tracker now:
http://tracker.ceph.com/issues/14088

-- Tom
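
The problem discussed in the quoted thread below is that a message written only to stderr is lost when init starts the daemon, leaving just the exit status in dmesg. One possible shape for a fix, offered as a sketch only and not as the change made for the tracker issue, is to send the same text to syslog before exiting. The helper name report_fatal and the "ceph-mon-sketch" ident are hypothetical; openlog, syslog and closelog are the standard POSIX calls.

```cpp
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <syslog.h>

// Hypothetical helper (not part of Ceph): report a fatal start-up error on
// stderr and via syslog, then hand back the errno value so the caller can
// use it as the process exit status.
int report_fatal(const std::string& what, int err) {
  const std::string msg = what + ": " + std::strerror(err);

  // stderr is still useful when the daemon is started by hand.
  std::cerr << msg << std::endl;

  // syslog keeps a record even when init discards stderr.
  openlog("ceph-mon-sketch", LOG_PID, LOG_DAEMON);
  syslog(LOG_ERR, "%s", msg.c_str());
  closelog();

  return err;
}
```

A caller could then write exit(report_fatal("monitor data filesystem is too full", ENOSPC)); so the status seen by init is unchanged while the reason also reaches the system log.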

> -----Original Message-----
> From: Brad Hubbard [mailto:bhubbard@xxxxxxxxxx]
> Sent: Monday, December 14, 2015 3:47 PM
> To: Deneau, Tom
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: ceph-mon terminated with status 28
> 
> ----- Original Message -----
> > From: "Tom Deneau" <tom.deneau@xxxxxxx>
> > To: "Brad Hubbard" <bhubbard@xxxxxxxxxx>
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Sent: Tuesday, 15 December, 2015 3:21:27 AM
> > Subject: RE: ceph-mon terminated with status 28
> >
> > Thanks, Brad.  That was the problem.
> 
> Np.
> 
> >
> > Is there a reason why we don't log more descriptive info for this kind of
> > failure?
> 
> I guess it may not have been anticipated that init would swallow these types of
> errors early in the process and just report the return code.
> 
> If you wouldn't mind opening a tracker for "Fatal errors at start-up are not
> logged", or something similar, I can take a look at getting some meaningful log
> entries reported during these early failures.
> 
> Let me know the tracker number.
> 
> Cheers,
> Brad
> 
> >
> > -- Tom
> >
> > > -----Original Message-----
> > > From: Brad Hubbard [mailto:bhubbard@xxxxxxxxxx]
> > > Sent: Sunday, December 13, 2015 4:19 PM
> > > To: Deneau, Tom
> > > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: Re: ceph-mon terminated with status 28
> > >
> > > ----- Original Message -----
> > > > From: "Tom Deneau" <tom.deneau@xxxxxxx>
> > > > To: ceph-devel@xxxxxxxxxxxxxxx
> > > > Sent: Sunday, 13 December, 2015 11:49:16 PM
> > > > Subject: ceph-mon terminated with status 28
> > > >
> > > > I am trying to understand the following failure:
> > > >
> > > > A small cluster was running fine, and then was left unused for a while.
> > > > When I went to try to use it again, the mon socket wasn't there and I
> > > > could see that ceph-mon was not running.  I saw the lines below at the
> > > > end of dmesg output.
> > > > When I tried to restart ceph-mon using sudo start ceph-mon id=monhost,
> > > > I got the same set of errors newly appended to dmesg output.
> > > >
> > > > I don't see anything more descriptive in /var/log/ceph/ceph-mon.log,
> > > > just the recording of new mon processes starting.
> > > >
> > > > In this particular small cluster, the mon process was running on the
> > > > same node with 7 osd processes.  sudo initctl list shows that the osd
> > > > procs are still up, although logging the fact that they can't
> > > > communicate with the mon socket.
> > > >
> > > > Is there someplace else I should look for more details as to why mon
> > > > is down and can't be restarted?
> > > >
> > > > -- Tom Deneau
> > > >
> > > > dmesg output:
> > > > --------------
> > > >  init: ceph-mon (ceph/monhost) main process (16538) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16227) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16546) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16548) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16556) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) main process ended, respawning
> > > >  init: ceph-create-keys main process (16558) killed by TERM signal
> > > >  init: ceph-mon (ceph/monhost) main process (16566) terminated with status 28
> > > >  init: ceph-mon (ceph/monhost) respawning too fast, stopped
> > > >  init: ceph-create-keys main process (16568) killed by TERM signal
> > >
> > > It looks like it's complaining about lack of space?
> > >
> > > src/ceph_mon.cc:
> > >
> > > int main(int argc, const char **argv)
> > > {
> > > ----8<----
> > >   {
> > >     // check fs stats. don't start if it's critically close to full.
> > >     ceph_data_stats_t stats;
> > >     int err = get_fs_stats(stats, g_conf->mon_data.c_str());
> > >     if (err < 0) {
> > >       cerr << "error checking monitor data's fs stats: " << cpp_strerror(err)
> > >            << std::endl;
> > >       exit(-err);
> > >     }
> > >     if (stats.avail_percent <= g_conf->mon_data_avail_crit) {
> > >       cerr << "error: monitor data filesystem reached concerning levels of"
> > >            << " available storage space (available: "
> > >            << stats.avail_percent << "% " << prettybyte_t(stats.byte_avail)
> > >            << ")\nyou may adjust 'mon data avail crit' to a lower value"
> > >            << " to make this go away (default: " << g_conf->mon_data_avail_crit
> > >            << "%)\n" << std::endl;
> > >       exit(ENOSPC);
> > >     }
> > >
> > > #define ENOSPC          28      /* No space left on device */
> > >
> > > Try starting ceph-mon from the command line and see if you get the above
> > > message.
> > >
> > > >
> >
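
The check quoted above from src/ceph_mon.cc can be approximated outside of Ceph to see how close a monitor's data filesystem is to the threshold. The following is a minimal sketch, not Ceph's implementation: it uses std::filesystem::space from C++17 in place of get_fs_stats, and the names avail_percent and check_mon_data are hypothetical, as is the choice of EIO for a failed query.

```cpp
#include <cerrno>
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>

// Percentage of the filesystem holding `path` that is available to
// unprivileged users, or -1 if the filesystem cannot be queried.
int avail_percent(const std::string& path) {
  std::error_code ec;
  const std::filesystem::space_info si = std::filesystem::space(path, ec);
  if (ec || si.capacity == 0 || si.available > si.capacity) {
    return -1;
  }
  return static_cast<int>(100.0 * static_cast<double>(si.available) /
                          static_cast<double>(si.capacity));
}

// Follows the shape of the quoted check: returns 0 when there is enough
// room, ENOSPC when available space is at or below `crit_percent`, and
// EIO (an assumption of this sketch) when the query itself fails.
int check_mon_data(const std::string& path, int crit_percent) {
  const int pct = avail_percent(path);
  if (pct < 0) {
    std::cerr << "error checking fs stats for " << path << std::endl;
    return EIO;
  }
  if (pct <= crit_percent) {
    std::cerr << "error: " << path << " has only " << pct
              << "% available (critical threshold: " << crit_percent << "%)"
              << std::endl;
    return ENOSPC;
  }
  return 0;
}
```

Calling check_mon_data with the monitor's data directory and the configured 'mon data avail crit' value, and using the result as the exit status, reproduces the behaviour seen in the thread: a nearly full filesystem yields ENOSPC, which init reports as status 28 on Linux.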