Re: [PATCH] mark rbd requiring stable pages

Ilya Dryomov <idryomov@xxxxxxxxx> · Wed, 21 Oct 2015 22:51:05 +0200

On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote:
> Hmm...  On the one hand, yes, we do compute CRCs, but that's optional,
> so enabling this unconditionally is probably too harsh.  OTOH we are
> talking to the network, which means all sorts of delays, retransmission
> issues, etc, so I wonder how exactly "unstable" pages behave when, say,
> added to an skb - you can't write anything to a page until networking
> is fully done with it and expect it to work.  It's particularly
> alarming that you've seen corruptions.
>
> Currently the only users of this flag are block integrity stuff and
> md-raid5, which makes me wonder what iscsi, nfs and others do in this
> area.  There's an old ticket on this topic somewhere on the tracker, so
> I'll need to research this.  Thanks for bringing this up!

Hi Mike,

I was hoping to grab you for a few minutes, but you weren't there...

I spent a better part of today reading code and mailing lists on this
topic.  It is of course a bug that we use sendpage() which inlines
pages into an skb and do nothing to keep those pages stable.  We have
csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc
case is an obvious fix.

I looked at drbd and iscsi and I think iscsi could do the same - ditch
the fallback to sock_no_sendpage() in the datadgst_en case (and get rid
of iscsi_sw_tcp_conn::sendpage member while at it).  Using stable pages
rather than having a roll-your-own implementation which doesn't close
the race but only narrows it sounds like a win, unless copying through
sendmsg() is for some reason cheaper than stable-waiting?

drbd still needs the non-zero-copy version for its async protocol for
when they free the pages before the NIC has chance to put them on the
wire.  md-raid5 it turns out has an option to essentially disable most
of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate
if that option is enabled.

What I'm worried about is the !crc (!datadgst_en) case.  I'm failing to
convince myself that mucking with sendpage()ed pages while they sit in
the TCP queue (or anywhere in the networking stack, really), is safe -
there is nothing to prevent pages from being modified after sendpage()
returned and Ronny reports data corruptions that pretty much went away
with BDI_CAP_STABLE_WRITES set.  I may be, after prolonged staring at
this, starting to confuse fs with block, though.  How does that work in
iscsi land?

(There was/is also this [1] bug, which is kind of related and probably
worth looking into at some point later.  ceph shouldn't be that easily
affected - we've got state, but there is a ticket for it.)

[1] http://www.spinics.net/lists/linux-nfs/msg34913.html

Thanks,

                Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html