On Sun, Jan 15, 2017 at 8:45 AM, <caifeng.zhu@xxxxxxxxxxx> wrote:
> Hi, all
>
> Let's look at the problem first. We see a lot of 'bad crc in data'
> warnings at the OSDs, like below:
> 2017-01-14 23:25:54.671599 7f67201b3700 0 bad crc in data 1480547403 != exp 3751318843
> 2017-01-14 23:25:54.681146 7f67201b3700 0 bad crc in data 3044715775 != exp 3018112170
> 2017-01-14 23:25:54.681822 7f67201b3700 0 bad crc in data 2815383560 != exp 1455746011
> 2017-01-14 23:25:54.686106 7f67205da700 0 bad crc in data 1781929234 != exp 498105391
> 2017-01-14 23:25:54.688092 7f67205da700 0 bad crc in data 1845054835 != exp 3337474350
> 2017-01-14 23:25:54.693225 7f67205da700 0 bad crc in data 1518733907 != exp 3781627678
> 2017-01-14 23:25:54.755653 7f6724115700 0 bad crc in data 1173337243 != exp 3759627242
> ...
> The problem occurs when we are testing (with fio) an NFS client whose NFS server
> is built on an XFS + RBD combination. The bad effect is that the OSD closes the
> connection on the CRC error and drops all reply messages sent through that
> connection, while the kernel rbd client holds the requests and waits for the
> already-dropped replies, which will never come. A deadlock occurs.
>
> After some analysis, we suspect write_partial_message_data may have a race condition.
> (Code below is from GitHub.)
> 1562                 page = ceph_msg_data_next(cursor, &page_offset, &length,
> 1563                                           &last_piece);
> 1564                 ret = ceph_tcp_sendpage(con->sock, page, page_offset,
> 1565                                         length, !last_piece);
> ...
> 1572                 if (do_datacrc && cursor->need_crc)
> 1573                         crc = ceph_crc32c_page(crc, page, page_offset, length);
> At lines 1564 ~ 1572, a worker thread of the libceph workqueue may send the page out
> over TCP and compute the CRC. But simultaneously, at the VFS/XFS level, another thread
> may be writing to the file position cached by the page being sent. If the page send and
> the CRC computation are interleaved with such a write, the receiving OSD will complain
> about a bad CRC.
>
> To verify our suspicion, we added the debug patch below:
> (Code below is based on our linux version.)

... which is based on?

This should be fixed in 4.3+ and all recent stable kernels.

Thanks,

                Ilya
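
[Editor's note: the race described above can be reproduced outside the kernel. The
sketch below is a hypothetical, self-contained userspace program (not kernel code and
not the poster's debug patch): a "sender" thread snapshots a shared buffer (standing in
for the page handed to ceph_tcp_sendpage()) and then checksums the live buffer (as
ceph_crc32c_page() does), while a "writer" thread keeps rewriting the buffer the way a
filesystem writer can dirty a page under writeback. A trivial polynomial checksum stands
in for crc32c; all names are illustrative. Build with: gcc -O2 -pthread race_sketch.c]

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 4096
#define ROUNDS   100000

static unsigned char page[BUF_SIZE];   /* the shared "page" being sent */
static volatile int stop;

/* stand-in for crc32c: any content-dependent checksum exposes the race */
static uint32_t checksum(const unsigned char *p, size_t len)
{
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++)
                sum = sum * 31 + p[i];
        return sum;
}

/* the "VFS/XFS writer": keeps rewriting the page while it is being sent */
static void *writer(void *arg)
{
        unsigned char v = 0;

        while (!stop)
                memset(page, v++, BUF_SIZE);
        return NULL;
}

int main(void)
{
        pthread_t tid;
        long mismatches = 0;

        pthread_create(&tid, NULL, writer, NULL);

        for (long i = 0; i < ROUNDS; i++) {
                unsigned char sent[BUF_SIZE];

                /* "send" the page: this is what actually goes on the wire */
                memcpy(sent, page, BUF_SIZE);

                /* checksum the live page afterwards, mirroring the
                 * sendpage-then-crc ordering in write_partial_message_data() */
                if (checksum(page, BUF_SIZE) != checksum(sent, BUF_SIZE))
                        mismatches++;
        }

        stop = 1;
        pthread_join(tid, NULL);
        printf("%ld of %d rounds had a checksum mismatch\n", mismatches, ROUNDS);
        return 0;
}

On most machines a large fraction of the rounds mismatch, which is exactly the
"crc computed over bytes that differ from the bytes sent" situation the receiving
OSD reports as 'bad crc in data'.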