Re: Possible bug in krbd (4.4.0)

Please disregard my comment about "the same ceph_connection struct": the log actually shows two distinct connection pointers, ffff880748f58030 for osd19 (10.80.16.74) and ffff880748f59830 for osd27 (10.80.16.78), one per connection.
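
For context, the callback I grepped below resolves its connection through sk_user_data, so each socket maps back to exactly one ceph_connection. Condensed from the 4.4-era net/ceph/messenger.c (from memory, so treat it as a sketch rather than the verbatim source; locking and some flag handling elided):

/* socket state changed: called from softirq context */
static void ceph_sock_state_change(struct sock *sk)
{
	/* each socket points back at exactly one ceph_connection */
	struct ceph_connection *con = sk->sk_user_data;

	dout("%s %p state = %lu sk_state = %u\n", __func__,
	     con, con->state, (u32)sk->sk_state);

	switch (sk->sk_state) {
	case TCP_CLOSE:
	case TCP_CLOSE_WAIT:
		/* peer closed: flag the socket closed and kick the
		 * worker, which reports the fault and drops the socket */
		con_sock_state_closing(con);
		con_flag_set(con, CON_FLAG_SOCK_CLOSED);
		queue_con(con);
		break;
	case TCP_ESTABLISHED:
		/* connect() completed: worker resumes the handshake */
		con_sock_state_connected(con);
		queue_con(con);
		break;
	default:
		/* everything else is uninteresting */
		break;
	}
}

So each CLOSE_WAIT -> LAST_ACK -> CLOSE -> ESTABLISHED run in the grep output below is one full teardown and reconnect cycle, and each cycle stays on its own con pointer.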

On Tue, Jan 3, 2017 at 4:13 PM, Max Yehorov <myehorov@xxxxxxxxxx> wrote:
> Hi,
>
> I have encountered a weird possible bug. There is an rbd image mapped
> and mounted on a client machine, and it is not possible to umount it.
> Neither lsof nor fuser shows any reference to the device or the
> mountpoint. The image is not exported via the nfs kernel server, so it
> is unlikely to be blocked by the kernel.
>
> There is an odd pattern in syslog: two OSDs are constantly losing
> their connections. A wild guess is that umount tries to contact the
> primary OSD and fails?
>
> After I enabled kernel debugging I saw the following:
>
> [9586733.605792] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9586733.623876] libceph:  connect 10.80.16.74:6812
> [9586733.625091] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9586756.681246] libceph:  con_keepalive ffff881057d082b8
> [9586767.713067] libceph:  fault ffff880748f59830 state 5 to peer 10.80.16.78:6812
> [9586767.713593] libceph: osd27 10.80.16.78:6812 socket closed (con state OPEN)
> [9586767.721145] libceph:  con_close ffff880748f59830 peer 10.80.16.78:6812
> [9586767.724440] libceph:  con_open ffff880748f59830 10.80.16.78:6812
> [9586767.742487] libceph:  connect 10.80.16.78:6812
> [9586767.743696] libceph:  connect 10.80.16.78:6812 EINPROGRESS sk_state = 2
> [9587346.956812] libceph:  try_read start on ffff881057d082b8 state 5
> [9587466.968125] libceph:  try_write start ffff881057d082b8 state 5
> [9587634.021257] libceph:  fault ffff880748f58030 state 5 to peer 10.80.16.74:6812
> [9587634.021781] libceph: osd19 10.80.16.74:6812 socket closed (con state OPEN)
> [9587634.029336] libceph:  con_close ffff880748f58030 peer 10.80.16.74:6812
> [9587634.032628] libceph:  con_open ffff880748f58030 10.80.16.74:6812
> [9587634.050677] libceph:  connect 10.80.16.74:6812
> [9587634.051888] libceph:  connect 10.80.16.74:6812 EINPROGRESS sk_state = 2
> [9587668.124746] libceph:  fault ffff880748f59830 state 5 to peer 10.80.16.78:6812
>
> A grep for ceph_sock_state_change shows:
> kernel: [9585833.117190] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9585833.121912] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9585833.122467] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
> kernel: [9585833.151589] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state = TCP_ESTABLISHED
> kernel: [9586733.591304] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9586733.596020] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9586733.596573] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
> kernel: [9586733.625709] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_CONNECTING(3) sk_state = TCP_ESTABLISHED
> kernel: [9587634.018152] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE_WAIT
> kernel: [9587634.022853] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_LAST_ACK
> kernel: [9587634.023406] libceph:  ceph_sock_state_change ffff880748f58030 state = CON_STATE_OPEN(5) sk_state = TCP_CLOSE
>
> A couple of observations:
> the two OSDs in question share the same port (6812) but have
> different IPs (10.80.16.74 and 10.80.16.78). What is more interesting
> is that they appear to have the same ceph_connection struct; note the
> ffff880748f59830 in the log snippet above. So it seems that because
> two "struct sock *sk" share the same "ceph_connection *con =
> sk->sk_user_data", they enter an endless loop of establishing and
> closing the connection.
>
> Does it sound plausible?
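
FWIW, the fault/close/open churn in the first excerpt looks like the messenger's normal recovery path rather than anything exotic. A condensed sketch of the fault side (again from memory, not the verbatim 4.4 source):

static void con_fault(struct ceph_connection *con)
{
	/* produces the "fault <con> state 5 to peer <addr>" line */
	dout("fault %p state %lu to peer %s\n",
	     con, con->state, ceph_pr_addr(&con->peer_addr.in_addr));

	/* a ratelimited "osdNN a.b.c.d:6812 socket closed (con state
	 * OPEN)" warning is printed, then the socket is torn down */
	con_close_socket(con);

	/* unacked messages are requeued; the osd_client's fault
	 * handler then resets the OSD session by calling
	 * ceph_con_close() and ceph_con_open() back to back, which is
	 * the repeating con_close / con_open / connect ...
	 * EINPROGRESS pattern in the log */
}

Note also that the teardowns repeat roughly every 900 seconds in the timestamps above (9585833 -> 9586733 -> 9587634), so the open question is why the OSD side keeps closing these TCP connections on that cadence.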