Re: [PATCH v2] libceph: print fsid and client gid with mon id and osd id


 



On Fri, Mar 25, 2022 at 3:54 AM Satoru Takeuchi
<satoru.takeuchi@xxxxxxxxx> wrote:
>
> Hi Ilya
>
> > >>> Hi Daichi,
> > >>>
> > >>> I would suggest two things:
> > >>>
> > >>> 1) Leave dout messages alone for now.  They aren't shown by default and
> > >>>      are there for developers to do debugging.  In that setting, multiple
> > >>>      clusters or client instances should be rare.
> > >>
> > >> Sure, I'll leave dout messages alone.
> > >>
> > >>> 2) For pr_info/pr_warn/etc messages, make the format consistent and
> > >>>      more grepable, e.g.
> > >>>
> > >>>        libceph (<fsid> <gid>): <message>
> > >>>
> > >>>        libceph (ef1ab157-688c-483b-a94d-0aeec9ca44e0 4181): osd10 down
> > >>>
> > >>>      as I suggested earlier.  Sometimes printing just the fsid, sometimes
> > >>>      the fsid and the gid and sometimes none is undesirable.
> > >>
> > >> Let me confirm two points:
> > >>
> > >> - For consistency, should all pr_info/pr_warn/etc messages in libceph
> > >>     use the format with fsid and gid? Or does your suggestion mean that
> > >>     only messages with an osd or mon id (i.e. the messages I edited in
> > >>     this patch) should have a consistent format?
> > >
> > > Definitely not all messages.  I'd start with those that are most common
> > > and that _you_ think are important to distinguish between.  Whether mon
> > > or osd id is included is probably irrelevant.
> > >
> > > The reason I'm deferring to you here is that I haven't seen this come
> > > up as an issue.  99% of users would connect to a single Ceph cluster
> > > (which means a single fsid) with a single libceph instance (which means
> > > a single gid).
> > >
> > > But consistency is very important, so IMO a particular message should
> > > either be not touched at all or be converted to a consistent format.
> > >
> > >>
> > >> - What should be displayed if the fsid or gid cannot be obtained? For
> > >>     example, we may not know the fsid yet when establishing a session with
> > >>     a mon. Also, decode_new_up_state_weight() outputs messages like
> > >>     "osd10 down", but it does not seem easy to get the client gid within
> > >>     this function. This is why my patch sometimes prints only the fsid,
> > >>     sometimes only the gid and sometimes both.
> > >
> > > For when the mon session is being established, do nothing (i.e.
> > > leave existing messages as is).  For after the fsid and gid become
> > > known, either convert to a consistent format with both fsid and gid or
> > > do nothing if the message isn't important.  If this causes too much
> > > code churn because of additional parameters being passed around, we may
> > > need to reconsider whether this change is worth it at all.
> >
> > Thank you for your kind comments. They are very helpful for me. I'll think
> > again about when fsid and gid in the logs are useful.
> >
> > Daichi
>
> There are multiple Ceph clusters in our system. One cluster provides RBD
> on top of HDD and another provides RBD on top of SSD. All Ceph (more
> precisely, Rook/Ceph) clusters are in one Kubernetes cluster. Many types
> of applications (Pods) can co-exist on each node, and it is not unusual
> for an application consuming RBD(SSD) and another application consuming
> RBD(HDD) to be on the same node. In this case, it's very useful to be
> able to tell which cluster each kernel message comes from.
>
> We've already encountered trouble that could have been solved earlier
> if this fix had been applied. We encountered slow I/Os in an application
> using RBD(HDD) and found that there were some "OSD n down/up" messages.
> It then took us some time to confirm whether these messages were related
> to the problem, because there was another application using RBD(SSD) on
> the same node and it was hard to know whether these messages were about
> the RBD(HDD) cluster.
>
> What we really need for now is to make clear which cluster the "OSD n
> down/up" messages come from. So, how about adding the
> "libceph (<fsid> <gid>):" prefix to these two messages and not touching
> any other messages? Does that make sense?

Hi Satoru,

Sure, starting with just these two messages is fine with me, but if
I were you I would also cover the other three similar messages that
may be reported while processing new osdmaps:

    osd<id> down
    osd<id> up

    +

    osd<id> weight ...
    osd<id> primary-affinity ...
    osd<id> does not exist
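
To illustrate the target format for these five messages, here is
a minimal userspace sketch, not the actual libceph change.  The helper
name format_osd_event() and its parameters are hypothetical; it takes
the fsid as an already formatted string and the gid as a plain integer,
whereas real kernel code would go through pr_info()/pr_warn() and the
client's own fsid and gid fields.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of libceph): builds
 *   "libceph (<fsid> <gid>): osd<id> <event>"
 * into buf.  Returns the snprintf() result, i.e. the number of
 * characters that would have been written, excluding the NUL.
 */
int format_osd_event(char *buf, size_t len, const char *fsid,
                     uint64_t gid, int osd, const char *event)
{
    return snprintf(buf, len, "libceph (%s %" PRIu64 "): osd%d %s",
                    fsid, gid, osd, event);
}
```

Called with the fsid and gid from the example earlier in the thread,
osd 10 and the event string "down", this would produce:

    libceph (ef1ab157-688c-483b-a94d-0aeec9ca44e0 4181): osd10 down

The same prefix would apply to the "up", "weight ...",
"primary-affinity ..." and "does not exist" variants.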

Thanks,

                Ilya


