Re: Log reading/how do I tell what an OSD is trying to connect to

It's a little strange, but with just the one-sided log it looks as
though the OSD is setting up a bunch of connections and then
deliberately tearing them down again within a second or two (i.e., this
is not a direct messenger bug, but it might be an OSD one, or it might
be something else).
Is it possible that you have some firewalls set up that are allowing
through some traffic but not others? The OSDs use a bunch of ports and
it looks like maybe there are at least intermittent issues with them
heartbeating.
-Greg
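
One way to test the firewall theory is to try a plain TCP connect to each port an OSD is listening on, from the host that logs the errors. The following is a minimal sketch, not anything from Ceph itself: `port_is_open` and `blocked_ports` are hypothetical helper names, and the ports to probe have to come from your own logs or OSD map. A failed connect only shows that nothing answered; it cannot tell a firewall drop apart from a daemon that is not listening.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connect to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def blocked_ports(host, ports, timeout=1.0):
    """Return the subset of ports on host that did not accept a connection."""
    return [p for p in ports if not port_is_open(host, p, timeout)]
```

For example, `blocked_ports("10.2.0.36", [6819, 6830])` probes the two listening ports that appear in the log excerpt quoted below. Running the same check from inside and outside the container may show whether the two see the network differently.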

On Wed, Nov 12, 2014 at 11:32 AM, Scott Laird <scott@xxxxxxxxxxx> wrote:
> Here are the first 33k lines or so:
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt
>
> This is a different (but more or less identical) machine from the past set
> of logs.  This system doesn't have quite as many drives in it, so I couldn't
> spot a same-host error burst, but it's logging tons of the same errors while
> trying to talk to 10.2.0.34.
>
> On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird <scott@xxxxxxxxxxx> wrote:
>> > I'm having a problem with my cluster.  It's running 0.87 right now, but I
>> > saw the same behavior with 0.80.5 and 0.80.7.
>> >
>> > The problem is that my logs are filling up with "replacing existing (lossy)
>> > channel" log lines (see below), to the point where I'm filling drives to
>> > 100% almost daily just with logs.
>> >
>> > It doesn't appear to be network related, because it happens even when
>> > talking to other OSDs on the same host.
>>
>> Well, that means it's probably not physical network related, but there
>> can still be plenty wrong with the networking stack... ;)
>>
>> > The logs pretty much all point to port 0 on the remote end.  Is this an
>> > indicator that it's failing to resolve port numbers somehow, or is this
>> > normal at this point in connection setup?
>>
>> That's definitely unusual, but I'd need to see a little more to be
>> sure if it's bad. My guess is that these pipes are connections from
>> the other OSD's Objecter, which is treated as a regular client and
>> doesn't bind to a socket for incoming connections.
>>
>> The repetitive channel replacements are concerning, though — they can
>> be harmless in some circumstances but this looks more like the
>> connection is simply failing to establish and so it's retrying over
>> and over again. Can you restart the OSDs with "debug ms = 10" in their
>> config file and post the logs somewhere? (There is not really any
>> documentation available on what they mean, but the deeper detail ones
>> might also be more understandable to you.)
>> -Greg
>>
>> >
>> > The systems that are causing this problem are somewhat unusual; they're
>> > running OSDs in Docker containers, but they *should* be configured to run
>> > as root and have full access to the host's network stack.  They manage to
>> > work, mostly, but things are still really flaky.
>> >
>> > Also, is there documentation on what the various fields mean, short of
>> > digging through the source?  And how does Ceph resolve OSD numbers into
>> > host/port addresses?
>> >
>> >
>> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1 c=0x1e070580).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >> 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1 c=0x1f3db2e0).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1 c=0x1e070420).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >> 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1 c=0x1f3d8420).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1 c=0x1e070840).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1 c=0x1b2d6260).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >> 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1 c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >> 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1 c=0x1f3d9600).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> >
>> >
>> > _______________________________________________
>> > ceph-users mailing list
>> > ceph-users@xxxxxxxxxxxxxx
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
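
To see which peers an OSD is churning connections with, it can help to tally the messenger lines by local and remote address. The following is a rough sketch: the regular expression is inferred only from the sample lines quoted above, not from any format specification, and `parse_line`, `split_addr` and `count_pairs` are hypothetical names. It assumes each log entry sits on a single line, as it does in the log file itself, and treats addresses as `ip:port/suffix` (I believe Ceph calls the suffix a nonce, but that is not confirmed in this thread).

```python
import re
from collections import Counter

# Layout inferred from the sample lines in this thread (an assumption).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<thread>[0-9a-f]+)\s+(?P<level>-?\d+) -- "
    r"(?P<local>\S+) >> (?P<peer>\S+) "
    r"pipe\((?P<pipe>0x[0-9a-f]+) (?P<fields>[^)]*)\)\."
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Parse one messenger log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # Keep only key=value tokens (sd=, s=, pgs=, cs=, l=, c=); values stay strings.
    rec["fields"] = dict(
        tok.split("=", 1) for tok in rec["fields"].split() if "=" in tok
    )
    return rec


def split_addr(addr):
    """Split 'ip:port/suffix' into (ip, port, suffix)."""
    hostport, _, suffix = addr.partition("/")
    ip, _, port = hostport.rpartition(":")
    return ip, int(port), suffix


def count_pairs(lines):
    """Count matching log lines per (local address, peer address) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["local"], rec["peer"])] += 1
    return counts
```

Passing an open log file to `count_pairs` and calling `.most_common(10)` on the result lists the busiest address pairs. Peers whose port is 0 are the ones discussed above as probably being client-style connections that never bind a listening socket.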