On Wed, Oct 21, 2015 at 10:51 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: > On Fri, Oct 16, 2015 at 1:09 PM, Ilya Dryomov <idryomov@xxxxxxxxx> wrote: >> Hmm... On the one hand, yes, we do compute CRCs, but that's optional, >> so enabling this unconditionally is probably too harsh. OTOH we are >> talking to the network, which means all sorts of delays, retransmission >> issues, etc, so I wonder how exactly "unstable" pages behave when, say, >> added to an skb - you can't write anything to a page until networking >> is fully done with it and expect it to work. It's particularly >> alarming that you've seen corruptions. >> >> Currently the only users of this flag are block integrity stuff and >> md-raid5, which makes me wonder what iscsi, nfs and others do in this >> area. There's an old ticket on this topic somewhere on the tracker, so >> I'll need to research this. Thanks for bringing this up! > > Hi Mike, > > I was hoping to grab you for a few minutes, but you weren't there... > > I spent a better part of today reading code and mailing lists on this > topic. It is of course a bug that we use sendpage() which inlines > pages into an skb and do nothing to keep those pages stable. We have > csums enabled by default, so setting BDI_CAP_STABLE_WRITES in the crc > case is an obvious fix. > > I looked at drbd and iscsi and I think iscsi could do the same - ditch > the fallback to sock_no_sendpage() in the datadgst_en case (and get rid > of iscsi_sw_tcp_conn::sendpage member while at it). Using stable pages > rather than having a roll-your-own implementation which doesn't close > the race but only narrows it sounds like a win, unless copying through > sendmsg() is for some reason cheaper than stable-waiting? > > drbd still needs the non-zero-copy version for its async protocol for > when they free the pages before the NIC has chance to put them on the > wire. md-raid5 it turns out has an option to essentially disable most > of its stripe cache and so it sets BDI_CAP_STABLE_WRITES to compensate > if that option is enabled. > > What I'm worried about is the !crc (!datadgst_en) case. I'm failing to > convince myself that mucking with sendpage()ed pages while they sit in > the TCP queue (or anywhere in the networking stack, really), is safe - > there is nothing to prevent pages from being modified after sendpage() > returned and Ronny reports data corruptions that pretty much went away > with BDI_CAP_STABLE_WRITES set. I may be, after prolonged staring at > this, starting to confuse fs with block, though. How does that work in > iscsi land? > > (There was/is also this [1] bug, which is kind of related and probably > worth looking into at some point later. ceph shouldn't be that easily > affected - we've got state, but there is a ticket for it.) > > [1] http://www.spinics.net/lists/linux-nfs/msg34913.html And now with Mike on the CC and a mention that at least one scenario of [1] got fixed in NFS by a6b31d18b02f ("SUNRPC: Fix a data corruption issue when retransmitting RPC calls"). Thanks, Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html