On 2020/7/28 01:25, Sagi Grimberg wrote: > > > On 7/26/20 6:52 AM, Coly Li wrote: >> Currently nvme_tcp_try_send_data() doesn't use kernel_sendpage() to >> send slab pages. But for pages allocated by __get_free_pages() without >> __GFP_COMP, which also have refcount as 0, they are still sent by >> kernel_sendpage() to remote end, this is problematic. >> >> When bcache uses a remote NVMe SSD via nvme-over-tcp as its cache >> device, writing meta data e.g. cache_set->disk_buckets to remote SSD may >> trigger a kernel panic due to the above problem. Bcause the meta data >> pages for cache_set->disk_buckets are allocated by __get_free_pages() >> without __GFP_COMP. >> >> This problem should be fixed both in upper layer driver (bcache) and >> nvme-over-tcp code. This patch fixes the nvme-over-tcp code by checking >> whether the page refcount is 0, if yes then don't use kernel_sendpage() >> and call sock_no_sendpage() to send the page into network stack. >> >> Such check is done by macro sendpage_ok() in this patch, which is defined >> in include/linux/net.h as, >> (!PageSlab(page) && page_count(page) >= 1) >> If sendpage_ok() returns false, sock_no_sendpage() will handle the page >> other than kernel_sendpage(). >> >> The code comments in this patch is copied and modified from drbd where >> the similar problem already gets solved by Philipp Reisner. This is the >> best code comment including my own version. >> >> Signed-off-by: Coly Li <colyli@xxxxxxx> >> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@xxxxxxx> >> Cc: Christoph Hellwig <hch@xxxxxx> >> Cc: Hannes Reinecke <hare@xxxxxxx> >> Cc: Jan Kara <jack@xxxxxxxx> >> Cc: Jens Axboe <axboe@xxxxxxxxx> >> Cc: Mikhail Skorzhinskii <mskorzhinskiy@xxxxxxxxxxxxxx> >> Cc: Philipp Reisner <philipp.reisner@xxxxxxxxxx> >> Cc: Sagi Grimberg <sagi@xxxxxxxxxxx> >> Cc: Vlastimil Babka <vbabka@xxxxxxxx> >> Cc: stable@xxxxxxxxxxxxxxx >> --- >> Changelog: >> v3: introduce a more common name sendpage_ok() for the open coded check >> v2: fix typo in patch subject. >> v1: the initial version. >> >> drivers/nvme/host/tcp.c | 13 +++++++++++-- >> include/linux/net.h | 2 ++ >> 2 files changed, 13 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c >> index 79ef2b8e2b3c..f9952f6d94b9 100644 >> --- a/drivers/nvme/host/tcp.c >> +++ b/drivers/nvme/host/tcp.c >> @@ -887,8 +887,17 @@ static int nvme_tcp_try_send_data(struct >> nvme_tcp_request *req) >> else >> flags |= MSG_MORE | MSG_SENDPAGE_NOTLAST; >> - /* can't zcopy slab pages */ >> - if (unlikely(PageSlab(page))) { >> + /* >> + * e.g. XFS meta- & log-data is in slab pages, or bcache meta >> + * data pages, or other high order pages allocated by >> + * __get_free_pages() without __GFP_COMP, which have a >> page_count >> + * of 0 and/or have PageSlab() set. We cannot use send_page for >> + * those, as that does get_page(); put_page(); and would cause >> + * either a VM_BUG directly, or __page_cache_release a page that >> + * would actually still be referenced by someone, leading to >> some >> + * obscure delayed Oops somewhere else. >> + */ > > I was hoping that this comment would move to the helper as well. > Sure, I will do that. > Agree with Christoph comment as well. I will move the inline sendpage_ok() to a separated patch. Coly Li