Re: bad crc in data caused by a race condition in write_partial_message_data

caifeng.zhu@xxxxxxxxxxx · Mon, 16 Jan 2017 11:09:36 +0800

On Sun, Jan 15, 2017 at 10:26:38AM -0600, Alex Elder wrote:
> On 01/15/2017 01:45 AM, caifeng.zhu@xxxxxxxxxxx wrote:
> > Hi, all
> > 
> > Let's look at the problem first. We have a lot of 'bad crc in data' 
> > warnings at OSDs, like below:
> >     2017-01-14 23:25:54.671599 7f67201b3700  0 bad crc in data 1480547403 != exp 3751318843
> >     2017-01-14 23:25:54.681146 7f67201b3700  0 bad crc in data 3044715775 != exp 3018112170
> >     2017-01-14 23:25:54.681822 7f67201b3700  0 bad crc in data 2815383560 != exp 1455746011
> >     2017-01-14 23:25:54.686106 7f67205da700  0 bad crc in data 1781929234 != exp 498105391
> >     2017-01-14 23:25:54.688092 7f67205da700  0 bad crc in data 1845054835 != exp 3337474350
> >     2017-01-14 23:25:54.693225 7f67205da700  0 bad crc in data 1518733907 != exp 3781627678
> >     2017-01-14 23:25:54.755653 7f6724115700  0 bad crc in data 1173337243 != exp 3759627242
> >     ...
> > This problem occurs when we test (with fio) an NFS client whose NFS server is
> > built on an XFS + RBD combination. The bad effect is that the OSD closes
> > the connection on a CRC error and drops all reply messages sent through that connection.
> > But the kernel RBD client holds the requests and waits for the already dropped
> > replies, which will never come. A deadlock occurs.
> 
> The first problem is the reports of bad CRCs.  And the OSD reporting
> this for messages sent by the RBD kernel client makes sense, given
> what you say below.
> 
> Your statement that a deadlock occurs after that doesn't sound right.
> Are these services (OSDs, RBD client, NFS client) running on different
> machines?
> 

Yes, all these services are running on different machines.

> > After some analysis, we suspect write_partial_message_data may have a race condition.
> > (The code below is taken from GitHub.)
> >     1562		page = ceph_msg_data_next(cursor, &page_offset, &length,
> >     1563					  &last_piece);
> >     1564		ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >     1565					length, !last_piece);
> >     ...
> >     1572		if (do_datacrc && cursor->need_crc)
> >     1573			crc = ceph_crc32c_page(crc, page, page_offset, length);
> > At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out over TCP
> > and compute the CRC. But simultaneously, at the VFS/XFS level, another thread may be
> > writing to the file position cached by the page being sent. If a data write lands between
> > sending the page and computing the CRC, the receiving OSD will complain about a bad CRC.
> 
> This should not be happening.  Data supplied to the Ceph messenger
> for sending can no longer be subject to modification.
> 

I don't understand why pages at the ceph messenger can NOT be modified.
As far as I know, pages caching file data are unlocked before the bio and
can be obtained for further writing.

Could you please explain why?
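
To make the interleaving we suspect concrete, here is a minimal user-space
sketch. It is only a model, not kernel code: crc32c_sw() is a plain bitwise
CRC32C standing in for the kernel's crc32c(), simulate_send() is a
hypothetical name, and the memcpy() into `wire` stands in for
ceph_tcp_sendpage() (the real call is zero-copy, so the exact moment the
bytes are read may differ). The "writer" is a single byte flip placed in
the window between the send and the CRC computation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Stand-in for the kernel's crc32c(); the seed is passed through as-is. */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* Hypothetical model of one iteration of the send loop.
 * If writer_races is nonzero, a "VFS writer" dirties the page after it
 * was sent but before the sender computes its CRC.
 * Returns 1 if the receiver would see a CRC mismatch, 0 otherwise. */
int simulate_send(int writer_races)
{
	unsigned char page[PAGE_SZ];
	unsigned char wire[PAGE_SZ];

	memset(page, 0xAA, sizeof(page));

	/* Step 1: send analogue, the bytes go out as they are right now. */
	memcpy(wire, page, sizeof(page));

	/* Step 2: the racing write lands inside the window. */
	if (writer_races)
		page[100] ^= 0xFF;

	/* Step 3: sender CRC, computed over the page after the send. */
	uint32_t sender_crc = crc32c_sw(0, page, sizeof(page));

	/* The receiver checksums what actually arrived. */
	uint32_t receiver_crc = crc32c_sw(0, wire, sizeof(wire));

	return sender_crc != receiver_crc;
}
```

Calling simulate_send(0) models the undisturbed case and simulate_send(1)
models the race; only the latter makes the two checksums disagree, which
is the situation the OSD would log as 'bad crc in data'.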

> > To verify our suspicion, we added the debug patch below:
> 
> Your patch does indeed seem to show that a page is modified
> during the call to ceph_tcp_sendpage().  But again, that should
> not be happening.
> 
> I hope I'm not missing something obvious here...
> 
> > (The code below is based on our Linux version.)
> > @@ -1527,9 +1527,14 @@ static int write_partial_message_data(st
> >                 bool last_piece;
> >                 bool need_crc;
> >                 int ret;
> > +               u32 crc2 = 0;
> > 
> >                 page = ceph_msg_data_next(&msg->cursor, &page_offset, &length,
> >                                                         &last_piece);
> > +
> > +               if (do_datacrc && cursor->need_crc)
> > +                       crc2 = ceph_crc32c_page(crc, page, page_offset, length);
> > +
> >                 ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> >                                       length, last_piece);
> >                 if (ret <= 0) {
> > @@ -1538,8 +1543,12 @@ static int write_partial_message_data(st
> > 
> >                         return ret;
> >                 }
> > -               if (do_datacrc && cursor->need_crc)
> > +               if (do_datacrc && cursor->need_crc) {
> >                         crc = ceph_crc32c_page(crc, page, page_offset, length);
> > +                       if (crc2 != crc)
> > +                               pr_warn("tampered page %p: "
> > +                                   "before 0x%x, current 0x%x\n", page, crc2, crc);
> > +               }
> >                 need_crc = ceph_msg_data_advance(&msg->cursor, (size_t)ret);
> >         }
> > And we get the warning messages below:
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002a8fb140: before 0x1aa3b794, current 0x5fe707d6
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002122f680: before 0xec71a744, current 0x9b7d382f
> >     [Sun Jan 15 14:11:29 2017] libceph: tampered page ffffea002372b740: before 0xa1849173, current 0x888a93c1
> >     [Sun Jan 15 14:11:30 2017] libceph: tampered page ffffea0027a5a500: before 0x6fcb56ac, current 0x1b9aeced
> > 
> > A possible solution: if CRC checking is enabled, copy the page and use the copy
> > for both sending and CRC computation. Is that OK?
> 
> Sending a copied page would ensure the page you're sending doesn't
> change.  But if someone else is modifying the original, there's no
> way of knowing whether the copied page is really intact either.
> 
> Sending a copy would treat a symptom and not the underlying problem.
> So I'd say "no, that's not OK."

If pages at the ceph messenger should not be modified, then copying the page
only hides the real problem; it is not a solution.
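
For reference, the copy idea under discussion can be sketched in user space
as below. This is a model under assumptions, not a proposed kernel patch:
crc32c_sw() is a plain bitwise CRC32C standing in for the kernel's crc32c(),
simulate_send_with_copy() and the `bounce` buffer are hypothetical names,
and memcpy() into `wire` stands in for the socket send. The sketch shows
only that the CRC and the sent bytes agree once both come from the same
private snapshot; it says nothing about whether the snapshot itself holds
the data the writer intended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096

/* Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
 * Stand-in for the kernel's crc32c(); the seed is passed through as-is. */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *buf, size_t len)
{
	while (len--) {
		crc ^= *buf++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return crc;
}

/* Hypothetical bounce-buffer variant of the send step.
 * The page is snapshotted first; both the send and the CRC use only the
 * snapshot, so a later write to the original page cannot split them.
 * Returns 1 if the receiver would see a CRC mismatch, 0 otherwise. */
int simulate_send_with_copy(int writer_races)
{
	unsigned char page[PAGE_SZ];
	unsigned char bounce[PAGE_SZ];
	unsigned char wire[PAGE_SZ];

	memset(page, 0xAA, sizeof(page));

	/* Take a private snapshot of the page. */
	memcpy(bounce, page, sizeof(page));

	/* Send from the snapshot, not from the live page. */
	memcpy(wire, bounce, sizeof(bounce));

	/* The racing write changes the original; the snapshot is untouched. */
	if (writer_races)
		page[100] ^= 0xFF;

	/* Sender CRC over the snapshot, receiver CRC over what arrived. */
	uint32_t sender_crc = crc32c_sw(0, bounce, sizeof(bounce));
	uint32_t receiver_crc = crc32c_sw(0, wire, sizeof(wire));

	return sender_crc != receiver_crc;
}
```

In this model the function reports no mismatch whether or not the writer
races, at the cost of one extra page copy per send. If the write lands
while the snapshot is being taken, the copy can still hold a mix of old
and new data, which is why this only masks the modification.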

> 
> 					-Alex
> 
> > 
> > Best Regards.
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 