On Tue, 2022-02-01 at 16:46 +0100, Ilya Dryomov wrote: > On Tue, Feb 1, 2022 at 2:24 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > > > On Mon, 2022-01-31 at 16:58 +0100, Ilya Dryomov wrote: > > > Both msgr1 and msgr2 in crc mode are zero copy in the sense that > > > message data is read from the socket directly into the destination > > > buffer. We assume that the destination buffer is stable (i.e. remains > > > unchanged while it is being read to) though. Otherwise, CRC errors > > > ensue: > > > > > > libceph: read_partial_message 0000000048edf8ad data crc 1063286393 != exp. 228122706 > > > libceph: osd1 (1)192.168.122.1:6843 bad crc/signature > > > > > > libceph: bad data crc, calculated 57958023, expected 1805382778 > > > libceph: osd2 (2)192.168.122.1:6876 integrity error, bad crc > > > > > > Introduce rxbounce option to enable use of a bounce buffer when > > > receiving message data. In particular this is needed if a mapped > > > image is a Windows VM disk, passed to QEMU. Windows has a system-wide > > > "dummy" page that may be mapped into the destination buffer (potentially > > > more than once into the same buffer) by the Windows Memory Manager in > > > an effort to generate a single large I/O [1][2]. QEMU makes a point of > > > preserving overlap relationships when cloning I/O vectors, so krbd gets > > > exposed to this behaviour. > > > > > > [1] "What Is Really in That MDL?" > > > https://docs.microsoft.com/en-us/previous-versions/windows/hardware/design/dn614012(v=vs.85) > > > [2] https://blogs.msmvps.com/kernelmustard/2005/05/04/dummy-pages/ > > > > > > URL: https://bugzilla.redhat.com/show_bug.cgi?id=1973317 > > > Signed-off-by: Ilya Dryomov <idryomov@xxxxxxxxx> > > > --- > > > include/linux/ceph/libceph.h | 1 + > > > include/linux/ceph/messenger.h | 1 + > > > net/ceph/ceph_common.c | 7 ++++ > > > net/ceph/messenger.c | 4 +++ > > > net/ceph/messenger_v1.c | 54 +++++++++++++++++++++++++++---- > > > net/ceph/messenger_v2.c | 58 ++++++++++++++++++++++++++-------- > > > 6 files changed, 105 insertions(+), 20 deletions(-) > > > > > > diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h > > > index 6a89ea410e43..edf62eaa6285 100644 > > > --- a/include/linux/ceph/libceph.h > > > +++ b/include/linux/ceph/libceph.h > > > @@ -35,6 +35,7 @@ > > > #define CEPH_OPT_TCP_NODELAY (1<<4) /* TCP_NODELAY on TCP sockets */ > > > #define CEPH_OPT_NOMSGSIGN (1<<5) /* don't sign msgs (msgr1) */ > > > #define CEPH_OPT_ABORT_ON_FULL (1<<6) /* abort w/ ENOSPC when full */ > > > +#define CEPH_OPT_RXBOUNCE (1<<7) /* double-buffer read data */ > > > > > > #define CEPH_OPT_DEFAULT (CEPH_OPT_TCP_NODELAY) > > > > > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h > > > index 6c6b6ea52bb8..e7f2fb2fc207 100644 > > > --- a/include/linux/ceph/messenger.h > > > +++ b/include/linux/ceph/messenger.h > > > @@ -461,6 +461,7 @@ struct ceph_connection { > > > struct ceph_msg *out_msg; /* sending message (== tail of > > > out_sent) */ > > > > > > + struct page *bounce_page; > > > u32 in_front_crc, in_middle_crc, in_data_crc; /* calculated crc */ > > > > > > struct timespec64 last_keepalive_ack; /* keepalive2 ack stamp */ > > > diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c > > > index ecc400a0b7bb..4c6441536d55 100644 > > > --- a/net/ceph/ceph_common.c > > > +++ b/net/ceph/ceph_common.c > > > @@ -246,6 +246,7 @@ enum { > > > Opt_cephx_sign_messages, > > > Opt_tcp_nodelay, > > > Opt_abort_on_full, > > > + Opt_rxbounce, > > > }; > > > > > > enum { > > > @@ -295,6 +296,7 @@ static const struct fs_parameter_spec ceph_parameters[] = { > > > fsparam_u32 ("osdkeepalive", Opt_osdkeepalivetimeout), > > > fsparam_enum ("read_from_replica", Opt_read_from_replica, > > > ceph_param_read_from_replica), > > > + fsparam_flag ("rxbounce", Opt_rxbounce), > > > > Yuck. > > > > It sure would be nice to automagically detect when this was needed > > somehow. The option is fine once you know you need it, but getting to > > that point may be painful. > > I'm updating the man page in [1]. I should probably expand rxbounce > paragraph with the exact error strings -- that should make it easy to > find for anyone interested. > Fair enough. That seems like the reasonable place to start. > > > > Maybe we should we make the warnings about failing crc messages suggest > > rxbounce? We could also consider making it so that when you fail a crc > > check and the connection is reset, that the new connection enables > > rxbounce automatically? > > Do you envision that happening after just a single checksum mismatch? > For that particular connection or globally? If we decide to do that, > I think it should be global (just because currently all connection > settings are global) but then some sort of threshold is going to be > needed... > I don't know. I didn't really think through all of the potential issues. Probably we should just wait to see how big a problem this is before we do anything automatic. > [1] https://github.com/ceph/ceph/pull/44842/files#diff-90064d9b1388dcb61bb57ff61f60443b15ef3d74f31fef7f63e43b5ae3a03f37 > > Thanks, > > Ilya -- Jeff Layton <jlayton@xxxxxxxxxx>