Re: [PATCH v3 1/3] libceph: fail the sparse-read if there still has data in socket

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 12/16/23 01:06, Jeff Layton wrote:
On Fri, 2023-12-15 at 08:20 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

Once this happens that means there have bugs.

URL: https://tracker.ceph.com/issues/63586
Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
  net/ceph/osd_client.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5753036d1957..848ef19055a0 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -5912,10 +5912,12 @@ static int osd_sparse_read(struct ceph_connection *con,
  		fallthrough;
  	case CEPH_SPARSE_READ_DATA:
  		if (sr->sr_index >= count) {
-			if (sr->sr_datalen && count)
+			if (sr->sr_datalen) {
  				pr_warn_ratelimited("sr_datalen %u sr_index %d count %u\n",
  						    sr->sr_datalen, sr->sr_index,
  						    count);
+				return -EREMOTEIO;
+			}
sr->sr_state = CEPH_SPARSE_READ_HDR;
  			goto next_op;
Do you really need to fail the read in this case? Would it not be better
to just advance past the extra junk? Or is this problem more indicative
of a malformed frame?

It'd be nice to have some specific explanation of the problem this is
fixing and how it was triggered. If this due to a misbehaving server?
Bad hw?

Hi Jeff,

Why I added this check before was that when I was debugging the sparse-read issue last time I found even when the extent array 'count == 0', sometimes the 'sr_datalen != 0'. I just thought the ceph side's logic allowed it.

But this time from going through and debugging more in the ceph side code, I didn't get where was doing this. So I just suspected it should be an issue somewhere in kclient side and it also possibly would trigger the sparse-read bug.  So I just removed the 'count' check and finally found the root case for both these two issues.

IMO normally this shouldn't happen anyway. Once this happens in kclient side it will 100% corrupt the following msgs and reset the connection from my tests. Actually when the 'count ==0' the 'sr_datalen' will be a random number, sometimes very large, so just advancing past the extra junk makes no any sense.

Thanks

- Xiubo






[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux