Re: [PATCH] mark rbd requiring stable pages

Ilya Dryomov <idryomov@xxxxxxxxx> · Wed, 21 Oct 2015 22:57:20 +0200

On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
>> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
>> so enabling this unconditionally is probably too harsh.  OTOH we are
>> talking to the network, which means all sorts of delays, retransmission
>> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
>> added to an skb - you can't write anything to a page until networking
>> is fully done with it and expect it to work.  It's particularly
>> alarming that you've seen corruptions.
>>
>> Currently the only users of this flag are block integrity stuff and
>> md-raid5, which makes me wonder what iscsi, nfs and others do in this
>> area.  There's an old ticket on this topic somewhere on the tracker, so
>> I'll need to research this.  Thanks for bringing this up!
>
> Hi Mike,
>
> I was hoping to grab you for a few minutes, but you weren't there...
>
> I spent a better part of today reading code and mailing lists on this
> topic.  It is of course a bug that we use sendpage() which inlines
> pages into an skb and do nothing to keep those pages stable.  We have
> csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
> case is an obvious fix.
>
> I looked at drbd and iscsi and I think iscsi could do the same - ditch
> the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
> of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
> rather than having a roll-your-own implementation which doesn't close
> the race but only narrows it sounds like a win, unless copying through
> sendmsg() is for some reason cheaper than stable-waiting?
>
> drbd still needs the non-zero-copy version for its async protocol for
> when they free the pages before the NIC has chance to put them on the
> wire.  md-raid5 it turns out has an option to essentially disable most
> of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
> if that option is enabled.
>
> What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
> convince myself that mucking with sendpage()ed pages while they sit in
> the TCP queue (or anywhere in the networking stack, really), is safe -
> there is nothing to prevent pages from being modified after sendpage()
> returned and Ronny reports data corruptions that pretty much went away
> with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
> this, starting to confuse fs with block, though.  How does that work in
> iscsi land?
>
> (There was/is also this [1] bug, which is kind of related and probably
> worth looking into at some point later.  ceph shouldn't be that easily
> affected - we've got state, but there is a ticket for it.)
>
> [1] http://www.spinics.net/lists/linux-nfs/msg34913.html

And now with Mike on the CC and a mention that at least one scenario of
[1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption
issue when retransmitting RPC calls").

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html