hi,

I have some questions about how to map block device io requests' pages to
user space, and would appreciate your help. Thanks in advance.
Let me first give a brief introduction. One of our customers uses tcm_loop &
tcmu to export a virtual block device to user space; tcm_loop and tcmu belong
to the scsi/target subsystem. This virtual block device has a user-space
backend, which accesses a remote distributed filesystem to complete io
requests. The data flow is as follows:
1) A client app issues an io request to this virtual block device.
2) tcm_loop & tcmu are kernel modules; they handle the io request.
3) tcmu maintains an internal data area, which is really an xarray managing
   kernel pages. tcmu allocates kernel pages for the data area and copies the
   io request's sg pages into them (see the sketch below this list).
4) tcmu maps the data area's kernel pages to user space, so the tcmu
   user-space backend can read or fill the mmapped area.
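To make the overhead concrete, the copy in step 3 amounts to roughly the
following (a simplified illustration, not the actual tcmu code;
sg_copy_to_buffer is only used here for brevity):

/*
 * Simplified illustration of step 3: the request payload is copied from the
 * io request's sg pages into the data area's kernel pages before the
 * user-space backend can see it through the mmap.
 */
static void data_area_copy(struct scatterlist *sgl, unsigned int nents,
			   void *data_area, size_t len)
{
	sg_copy_to_buffer(sgl, nents, data_area, len);
}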
But this solution has obvious overhead: allocating tcmu data area pages plus
one extra copy, which makes tcmu throughput hit a bottleneck. So I am trying
to map block device io requests' sg pages to user space directly, which I
believe can improve tcmu throughput. Currently I have implemented two
prototypes:
Solution 1: use vm_insert_pages, similar to what tcp
getsockopt(TCP_ZEROCOPY_RECEIVE) does.
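For reference, the mapping path in this prototype looks roughly like the
following (a minimal sketch with error handling omitted; tcmu_map_sg_pages is
just a placeholder name, and it assumes one page per sg entry, at most 128
pages per request, and that the caller holds mmap_lock, as the TCP zerocopy
receive path does):

/*
 * Gather the io request's sg pages and insert them into the user VMA in one
 * batched call, instead of copying them into the tcmu data area.
 */
static int tcmu_map_sg_pages(struct vm_area_struct *vma, unsigned long uaddr,
			     struct scatterlist *sgl, unsigned int nents)
{
	struct page *pages[128];	/* enough for a 512KB request */
	unsigned long num = 0;
	struct scatterlist *sg;
	unsigned int i;

	for_each_sg(sgl, sg, nents, i)
		pages[num++] = sg_page(sg);

	/* batched pte setup, like tcp_zerocopy_receive() */
	return vm_insert_pages(vma, uaddr, pages, &num);
}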
But there are two restrictions:
1. Anonymous pages can not be mmapped to user space:
==> vm_insert_pages
====> insert_pages
======> insert_page_in_batch_locked
========> validate_page_before_insert
validate_page_before_insert() shows that anonymous pages can not be mapped to
user space, and we know that when direct io is issued to a block device, the
io request's sg pages may be anonymous pages:
	if (PageAnon(page) || PageSlab(page) || page_has_type(page))
		return -EINVAL;
I wonder why there is such a restriction? Is it for safety reasons?
2. A WARN_ON is triggered in __folio_mark_dirty:
When the tcmu user-space backend does zap_page_range after an io completes,
a WARN_ON is triggered in __folio_mark_dirty():
	if (folio->mapping) {	/* Race with truncate? */
		WARN_ON_ONCE(warn && !folio_test_uptodate(folio));
I'm not familiar with folios yet, but I think the reason is that when a
buffered read is issued to the block device, it is page cache pages that get
mapped to user space, and initially they are newly allocated, so the uptodate
flag is not yet set. In zap_pte_range() there is this code:
	if (!PageAnon(page)) {
		if (pte_dirty(ptent)) {
			force_flush = 1;
			set_page_dirty(page);
		}
So this WARN_ON is reasonable.
Indeed what I want is just to map the io request's sg pages to the tcmu
user-space backend, so the backend can read or write data in the mapped
area. I don't want to care about the pages or their mapping status, so I
chose to use remap_pfn_range.
That leads to solution 2: use remap_pfn_range.
remap_pfn_range works well, but it has fairly obvious overhead. A 512KB io
request has 128 pages, and usually these 128 pages' pfns are not consecutive,
so in the worst case I'd need to issue 128 remap_pfn_range calls for one
512KB io request, which is horrible. Also, in remap_pfn_range, if the x86
page attribute table feature is enabled, lookup_memtype() called by
track_pfn_remap() introduces noticeable overhead as well.
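For reference, the per-page loop looks roughly like this (a minimal sketch;
tcmu_remap_sg_pages is a placeholder name, and it assumes one page per sg
entry and a VM_PFNMAP vma):

/*
 * One remap_pfn_range() call per non-contiguous page; for a 512KB request
 * that is up to 128 calls, and with PAT enabled each call also pays the
 * track_pfn_remap()/lookup_memtype() cost.
 */
static int tcmu_remap_sg_pages(struct vm_area_struct *vma, unsigned long uaddr,
			       struct scatterlist *sgl, unsigned int nents)
{
	struct scatterlist *sg;
	unsigned int i;
	int ret;

	for_each_sg(sgl, sg, nents, i) {
		ret = remap_pfn_range(vma, uaddr, page_to_pfn(sg_page(sg)),
				      PAGE_SIZE, vma->vm_page_prot);
		if (ret)
			return ret;
		uaddr += PAGE_SIZE;
	}
	return 0;
}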
Finally, my question is: is there any simple and efficient helper to map
block device sg pages to user space? It might accept an array of pages as a
parameter, allow anonymous pages to be mapped to user space, and treat the
pages as special ptes (pte_special() returns true), so vm_normal_page()
returns NULL and the above WARN_ON won't trigger. Does this sound
reasonable? I'm not a qualified mm developer, but if you think such a new
helper is reasonable, I can try to add one, thanks.
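Concretely, something along these lines is what I have in mind (the name and
exact signature are only placeholders):

/*
 * Hypothetical helper: insert an array of pages (anonymous pages allowed)
 * into a user VMA as special ptes, so vm_normal_page() returns NULL for them
 * and the dirty/uptodate accounting above is never reached.
 */
int vm_insert_pages_special(struct vm_area_struct *vma, unsigned long addr,
			    struct page **pages, unsigned long *num);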
Regards,
Xiaoguang Wang