Hi Ilya

> >>> Hi Daichi,
> >>>
> >>> I would suggest two things:
> >>>
> >>> 1) Leave dout messages alone for now. They aren't shown by default and
> >>> are there for developers to do debugging. In that setting, multiple
> >>> clusters or client instances should be rare.
> >>
> >> Sure, I'll leave dout messages alone.
> >>
> >>> 2) For pr_info/pr_warn/etc messages, make the format consistent and
> >>> more grepable, e.g.
> >>>
> >>> libceph (<fsid> <gid>): <message>
> >>>
> >>> libceph (ef1ab157-688c-483b-a94d-0aeec9ca44e0 4181): osd10 down
> >>>
> >>> as I suggested earlier. Sometimes printing just the fsid, sometimes
> >>> the fsid and the gid and sometimes none is undesirable.
> >>
> >> Let me confirm two points:
> >>
> >> - For consistency, should all pr_info/pr_warn/etc messages in libceph
> >> uses format with fsid and gid? Or does your suggestion means messages
> >> with osd or mon id (i.e. messages which I had edited in this patch)
> >> should have consistent format?
> >
> > Definitely not all messages. I'd start with those that are most common
> > and that _you_ think are important to distinguish between. Whether mon
> > or osd id is included is probably irrelevant.
> >
> > The reason I'm deferring to you here is that I haven't seen this come
> > up as an issue. 99% of users would connect to a single Ceph cluster
> > (which means a single fsid) with a single libceph instance (which means
> > a single gid).
> >
> > But consistency is very important, so IMO a particular message should
> > either be not touched at all or be converted to a consistent format.
> >
> >>
> >> - What should be displayed if fsid or gid cannot be obtained? For example,
> >> we may not know fsid yet when establishing session with mon. Also,
> >> decode_new_up_state_weight() outputs message like "osd10 down" etc, but
> >> it seems not easy to get client gid within this function. This is why
> >> there are sometimes fsid only and sometimes gid only and sometimes both
> >> in my patch.
> >
> > For when the mon session is being established, do nothing (i.e.
> > leave existing messages as is). For after the fsid and gid become
> > known, either convert to a consistent format with both fsid and gid or
> > do nothing if the message isn't important. If this causes too much
> > code churn because of additional parameters being passed around, we may
> > need to reconsider whether this change is worth it at all.
>
> Thank you for your kind comments. They are very helpful for me. I'll think
> again about when fsid and gid in the logs are useful.
>
> Daichi

There are multiple Ceph clusters in our system. One cluster provides RBD
on top of HDDs and another provides RBD on top of SSDs. All of the Ceph
(more precisely, Rook/Ceph) clusters run in a single Kubernetes cluster.
Many kinds of applications (Pods) can coexist on each node, and it is not
unusual for an application consuming RBD (SSD) and another consuming
RBD (HDD) to run on the same node.

In this case, it is very useful to be able to tell which cluster each
kernel message comes from. We have already hit a problem that could have
been resolved sooner if this fix had been applied. We saw slow I/O in an
application using RBD (HDD) and found some "OSD n down/up" messages. It
then took us some time to confirm whether these messages were related to
the problem, because another application on the same node was using
RBD (SSD) and it was hard to tell whether the messages were about the
RBD (HDD) cluster.

What we really need for now is to make clear which cluster the
"OSD n down/up" messages come from. So, how about adding the
"libceph (<fsid> <gid>):" prefix to these two messages and not touching
any other messages? Does that make sense?

Best,
Satoru
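
P.S. To make the proposed format concrete, here is a rough user-space
sketch of the prefix applied to an "osd N down/up" message. This is not
the actual patch: struct client_ident and format_osd_msg() are
hypothetical names made up for illustration, and a real kernel change
would presumably go through pr_info() with the fsid and gid taken from
the libceph client state rather than from a struct like this.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space stand-in for per-client identity; in the kernel
 * the fsid and global id would come from the libceph client state. */
struct client_ident {
    char fsid[37];   /* UUID in text form: 36 characters plus NUL */
    uint64_t gid;    /* client global id */
};

/* Build "libceph (<fsid> <gid>): osd<N> <state>" into buf.
 * Returns the snprintf() result, i.e. the length the full message needs
 * (which is >= len if the output was truncated). */
static int format_osd_msg(char *buf, size_t len,
                          const struct client_ident *id,
                          int osd, const char *state)
{
    assert(buf != NULL && id != NULL && state != NULL);
    return snprintf(buf, len, "libceph (%s %llu): osd%d %s",
                    id->fsid, (unsigned long long)id->gid, osd, state);
}
```

With fsid ef1ab157-688c-483b-a94d-0aeec9ca44e0 and gid 4181, calling
format_osd_msg(buf, sizeof(buf), &id, 10, "down") produces the example
line from Ilya's suggestion, so messages from the HDD and SSD clusters
can be separated with a plain grep on the fsid.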