I still do not understand why this should give any noticeable performance advantage. Usually omitting invalidations gives a healthy bump. Also, RDMA WRITE is generally faster than RDMA READ at the HW level, for several reasons.
Yes, but this should be essentially identical to running nvme-rdma with 512KB of immediate data (the nvme term is in-capsule data). In the upstream nvme target we have the inline_data_size port attribute, which is tunable for exactly that (it defaults to PAGE_SIZE).
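For anyone wanting to try the in-capsule comparison, here is a minimal sketch of setting that port attribute through configfs. It assumes the nvmet configfs layout /sys/kernel/config/nvmet/ports/<port-id>/ and that the attribute file is named param_inline_data_size; port "1" and the helper names are hypothetical. The transport may also cap the value below 512KB, so check what it actually accepts.

```python
from pathlib import Path

# Assumption: nvmet ports live under this configfs directory; the
# port id used with it (e.g. "1") is hypothetical.
NVMET_PORTS = Path("/sys/kernel/config/nvmet/ports")

# Assumption: the tunable is exposed as this file in each port directory.
INLINE_ATTR = "param_inline_data_size"


def set_inline_data_size(port_dir: Path, size_bytes: int) -> None:
    """Write the in-capsule (inline) data size for one nvmet port.

    Hypothetical helper. As far as I know the port must not be enabled
    (linked to a subsystem) yet, otherwise the write is rejected.
    """
    if size_bytes < 0:
        raise ValueError("inline data size must be non-negative")
    (port_dir / INLINE_ATTR).write_text(f"{size_bytes}\n")


def get_inline_data_size(port_dir: Path) -> int:
    """Read back the configured inline data size in bytes."""
    return int((port_dir / INLINE_ATTR).read_text())
```

Usage would be something like set_inline_data_size(NVMET_PORTS / "1", 512 * 1024) before enabling the port, then reading it back with get_inline_data_size to see what the target kept.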