On Tue, Feb 1, 2022 at 2:24 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > On Mon, 2022-01-31 at 16:58 +0100, Ilya Dryomov wrote: > > Both msgr1 and msgr2 in crc mode are zero copy in the sense that > > message data is read from the socket directly into the destination > > buffer. We assume that the destination buffer is stable (i.e. remains > > unchanged while it is being read to) though. Otherwise, CRC errors > > ensue: > > > > libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706 > > libceph: osd1 (1)192.168.122.1:6843 bad crc/signature > > > > libceph: bad data crc, calculated 57958023, expected 1805382778 > > libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc > > > > Introduce rxbounce option to enable use of a bounce buffer when > > receiving message data. In particular this is needed if a mapped > > image is a Windows VM disk, passed to QEMU. Windows has a system-wide > > "dummy" page that may be mapped into the destination buffer (potentially > > more than once into the same buffer) by the Windows Memory Manager in > > an effort to generate a single large I/O [1][2]. QEMU makes a point of > > preserving overlap relationships when cloning I/O vectors, so krbd gets > > exposed to this behaviour. > > > > [1] "What Is Really in That MDL?" > > https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85) > > [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/ > > > > URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317 > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx> > > --- > > include/linux/ceph/libceph.h | 1 + > > include/linux/ceph/messenger.h | 1 + > > net/ceph/ceph_common.c | 7 ++++ > > net/ceph/messenger.c | 4 +++ > > net/ceph/messenger_v1.c | 54 +++++++++++++++++++++++++++---- > > net/ceph/messenger_v2.c | 58 ++++++++++++++++++++++++++-------- > > 6 files changed, 105 insertions(+), 20 deletions(-) > > > > diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h > > index 6a89ea410e43..edf62eaa6285 100644 > > --- a/include/linux/ceph/libceph.h > > +++ b/include/linux/ceph/libceph.h > > @@ -35,6 +35,7 @@ > > #define CEPH_OPT_TCP_NODELAY (1<<4) /* TCP_NODELAY on TCP sockets */ > > #define CEPH_OPT_NOMSGSIGN (1<<5) /* don't sign msgs (msgr1) */ > > #define CEPH_OPT_ABORT_ON_FULL (1<<6) /* abort w/ ENOSPC when full */ > > +#define CEPH_OPT_RXBOUNCE (1<<7) /* double-buffer read data */ > > > > #define CEPH_OPT_DEFAULT (CEPH_OPT_TCP_NODELAY) > > > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h > > index 6c6b6ea52bb8..e7f2fb2fc207 100644 > > --- a/include/linux/ceph/messenger.h > > +++ b/include/linux/ceph/messenger.h > > @@ -461,6 +461,7 @@ struct ceph_connection { > > struct ceph_msg *out_msg; /* sending message (== tail of > > out_sent) */ > > > > + struct page *bounce_page; > > u32 in_front_crc, in_middle_crc, in_data_crc; /* calculated crc */ > > > > struct timespec64 last_keepalive_ack; /* keepalive2 ack stamp */ > > diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c > > index ecc400a0b7bb..4c6441536d55 100644 > > --- a/net/ceph/ceph_common.c > > +++ b/net/ceph/ceph_common.c > > @@ -246,6 +246,7 @@ enum { > > Opt_cephx_sign_messages, > > Opt_tcp_nodelay, > > Opt_abort_on_full, > > + Opt_rxbounce, > > }; > > > > enum { > > @@ -295,6 +296,7 @@ static const struct fs_parameter_spec ceph_parameters[] = { > > fsparam_u32 ("osdkeepalive", Opt_osdkeepalivetimeout), > > fsparam_enum ("read_from_replica", Opt_read_from_replica, > > ceph_param_read_from_replica), > > + fsparam_flag ("rxbounce", Opt_rxbounce), > > Yuck. > > It sure would be nice to automagically detect when this was needed > somehow. The option is fine once you know you need it, but getting to > that point may be painful. I'm updating the man page in [1]. I should probably expand rxbounce paragraph with the exact error strings -- that should make it easy to find for anyone interested. > > Maybe we should we make the warnings about failing crc messages suggest > rxbounce? We could also consider making it so that when you fail a crc > check and the connection is reset, that the new connection enables > rxbounce automatically? Do you envision that happening after just a single checksum mismatch? For that particular connection or globally? If we decide to do that, I think it should be global (just because currently all connection settings are global) but then some sort of threshold is going to be needed... [1] https://github.com/ceph/ceph/pull/44842/files#diff-90064d9b1388dcb61bb57ff61f60443b15ef3d74f31fef7f63e43b5ae3a03f37 Thanks, Ilya