On December 15, 2017 at 17:06, Martin Kletzander wrote:
On Thu, Dec 14, 2017 at 07:46:27PM +0800, Eli wrote:
@Eli: Can you help with the testing?

It seems the interface only implements the isolated case; I remember you proposed something for the overlapping case?
Hi, yes. It got a bit more complicated, so I want to do this incrementally: first enable the easiest cases, then add APIs to manage the system's default group, make type='both' allocation work on CDP-enabled hosts, add APIs for modifying cachetunes for live and stopped domains, add support for memory bandwidth allocation, and so on. This is too much stuff to add in one go.
I guess I forgot to add this info to the cover letter (I think I did at least for the previous version).
I also wasted some time on the tests; some of them are not even in this patchset, so have a look at the previous version if you want to see them.
OK, sorry for not watching the libvirt list for some time. I have not seen the whole patch set yet, but I did some quick testing on your patch and will try to find more time to review the patches. (Currently I maintain another daemon dedicated to the RDT feature, called RMD.)
Only issue 1 is a true issue; the others should either be discussed or be treated as 'known issues'.
My env:
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 56320K
NUMA node0 CPU(s): 0-21,44-65
NUMA node1 CPU(s): 22-43,66-87
virsh capabilities:

    <cache>
      <bank id='0' level='3' type='both' size='55' unit='MiB' cpus='0-21,44-65'>
        <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
      </bank>
      <bank id='1' level='3' type='both' size='55' unit='MiB' cpus='22-43,66-87'>
        <control granularity='2816' unit='KiB' type='both' maxAllocs='16'/>
      </bank>
    </cache>
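For anyone cross-checking those numbers against resctrl itself, they line up with the info files described in Documentation/x86/intel_rdt_ui.txt; a rough sketch of the check (the values in the comments are only what a host like this is expected to report, not verified output):

    # Cross-check of the capabilities values against resctrl (run as root,
    # with resctrl already mounted on /sys/fs/resctrl):
    cat /sys/fs/resctrl/info/L3/cbm_mask       # fffff -> 20 cache ways
    cat /sys/fs/resctrl/info/L3/num_closids    # 16    -> maxAllocs='16'
    cat /sys/fs/resctrl/info/L3/min_cbm_bits   # 1     -> smallest allocation, in ways
    # granularity = cache size / number of ways:
    #   56320 KiB / 20 = 2816 KiB, matching granularity='2816' unit='KiB'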
Issues:

1. Allocating on only some of the cache ids is not supported. E.g. I have to provide allocations for all cache ids, but I only care about the allocation on one of them, because the VM won't be scheduled onto the other cache (socket).
Oh, really? This is not written in the kernel documentation. Can't the unspecified caches just inherit the setting from the default group? That would make sense. It would also automatically adjust if the default system one is changed.
Maybe I didn't express myself clearly; yes, the caches will be added to the default resource group.
Do you have any contact with anyone working on RDT in the kernel? I think this would save time and effort for anyone who will be using the feature.
Sure: fenghua.yu@xxxxxxxxx and Tony Luck <tony.luck@xxxxxxxxx>.
The kernel doc is at
https://github.com/torvalds/linux/blob/master/Documentation/x86/intel_rdt_ui.txt
So I got this error when I define the domain like this:

    <vcpu placement='static'>6</vcpu>
    <cputune>
      <emulatorpin cpuset='0,37-38,44,81-82'/>
      <cachetune vcpus='0-4'>
        <cache id='0' level='3' type='both' size='2816' unit='KiB'/>
        ^^^ cache id='1' is not provided
      </cachetune>

    root@s2600wt:~# virsh start kvm-cat
    error: Failed to start domain kvm-cat
    error: Cannot write into schemata file '/sys/fs/resctrl/qemu-qemu-13-kvm-cat-0-4/schemata': Invalid argument
Oh, I have to figure out why there is 'qemu-qemu' in there :D
This behavior is not correct. I expected the CBM to look like:

    root@s2600wt:/sys/fs/resctrl# cat qemu-qemu-14-kvm-cat-0-4/*
    000000,00000000,00000000
    L3:0=80;1=fffff

(no matter what the mask for cache 1 is, because my VM won't be scheduled on it; either I have defined the vcpu->cpu pinning, or I assume the kernel won't schedule it to cache 1)
Well, it matters. It would have to have all zeros there so that that part of the cache is not occupied.
Well, the hardware won't allow you to specify 0 ways; at least 1 is required (on some platforms it's 2 ways). From my previous experience I set it to fffff (it will be treated as 0 in the code). It is decided by min_cbm_bits, see
https://github.com/torvalds/linux/blob/master/Documentation/x86/intel_rdt_ui.txt#L48:14
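That limit is easy to demonstrate by hand; a small sketch (the group name is made up, and whether the failing libvirt write above looked exactly like this is an assumption):

    # Demonstrating min_cbm_bits with a throwaway group (run as root):
    cat /sys/fs/resctrl/info/L3/min_cbm_bits                 # 1 on this host
    mkdir /sys/fs/resctrl/demo
    echo "L3:0=80;1=0" > /sys/fs/resctrl/demo/schemata       # rejected: Invalid argument
    echo "L3:0=80;1=fffff" > /sys/fs/resctrl/demo/schemata   # accepted
    rmdir /sys/fs/resctrl/demo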
Or at least restrict the XML when I define such a domain and tell me that I need to provide all cache ids (even if I have 4 caches but only run my VM on cache 0).
We could do that. It would allow us to make this better (or lift the restriction) in case this is "fixed" in the kernel.
Or at least in the future we could do this to meet the users half-way:
- Any vcpus that have cachetune enabled for them must also be pinned.
- Users need to specify allocations for all cache ids that the vcpu might run on (according to the pinning acquired from before); for all the others we'd just simply set it to all zeros or the same bitmask as the system's default group.
But for now we could just copy the system's setting to unspecified caches or request the user to specify everything.
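To make that concrete, under such a scheme a domain whose cachetuned vcpus are pinned to the CPUs of cache id 0 could end up with a resctrl group roughly like this (the group name and masks are made up, and copying the default group's mask for the unspecified id is the assumption):

    # Hypothetical end result for a domain pinned to the CPUs of bank 0 only:
    cat /sys/fs/resctrl/qemu-14-kvm-cat-0-4/schemata
    # L3:0=80;1=fffff    <- id 1 copied from the system's default group
    #                       instead of being rejected or written as zero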
2. Cache way fragmentation (no good answer).
I see that for now we allocate cache ways starting from the low bits, and a newly created VM will allocate cache from the next way. If a VM whose allocation sits in the middle (e.g. its schemata is 00100) is destroyed, that slot (1 cache way) may not fit anything else and will be wasted. But how can we handle this? There seems to be no good way. Rearrangement? That would cause cache misses during a time window, I think.
Avoiding fragmentation is not a simple thing. It's impossible to do without any moving, which might be unwanted. This will be solved by providing an API that will let you move the allocation if you so desire. For now I at least try allocating into the smallest region into which the requested allocation fits, so that the unallocated parts are as big as possible.
Agree
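To illustrate the fragmentation scenario with some made-up numbers (three hypothetical groups on the 20-way L3 above; CBMs have to be contiguous runs of bits):

    grep L3 /sys/fs/resctrl/vm-*/schemata
    # /sys/fs/resctrl/vm-a/schemata:L3:0=00003;1=fffff   (ways 0-1)
    # /sys/fs/resctrl/vm-b/schemata:L3:0=00004;1=fffff   (way 2)
    # /sys/fs/resctrl/vm-c/schemata:L3:0=00018;1=fffff   (ways 3-4)
    # Destroying vm-b leaves a one-way hole at way 2; since a mask must be one
    # contiguous run of bits, a later two-way request cannot use it and the way
    # stays wasted until a one-way allocation (or a rearrangement) comes along.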
3. The admin/user has to operate on the default resource group manually; that is to say, after resctrl is mounted, the admin/user has to manually change the schemata of the default group. Will libvirt provide an interface/API to handle this?
Yes, this is planned.
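Until that API exists it can at least be done by hand; a minimal sketch (the mask values are only an example):

    # The top-level schemata file is the default group; shrinking it leaves the
    # low ways free for exclusive allocations (example values only):
    cat /sys/fs/resctrl/schemata                  # e.g. L3:0=fffff;1=fffff
    echo "L3:0=ffff0;1=ffff0" > /sys/fs/resctrl/schemata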
4. Will libvirt provide some API like `FreeCacheWay` so the end user can see how many cache ways can still be allocated on the host?
Yes, this should be provided by an API as well.
Other users/orchestrators (e.g. Nova) may need to know whether a VM can be scheduled on the host, but the cache ways are not linear; there may be fragmentation.
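A rough sketch of what such an API could report, assuming it just ORs together the masks of every existing group for one cache id (which is not necessarily how libvirt computes it internally):

    # Free ways for cache id 0, derived from the groups under /sys/fs/resctrl.
    # Assumes bash, resctrl mounted, and a 20-way cbm_mask of fffff; the
    # default (top-level) group is intentionally left out of the OR.
    used=0
    for f in /sys/fs/resctrl/*/schemata; do
        [ -f "$f" ] || continue
        mask=$(sed -n 's/.*L3:0=\([0-9a-fA-F]*\).*/\1/p' "$f")
        [ -n "$mask" ] && used=$(( used | 16#$mask ))
    done
    printf 'free-way mask for cache id 0: %05x\n' $(( 0xfffff & ~used ))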
5. What if another application wants to share some cache ways with one of the VMs?
Libvirt for now tries to read all of the resource groups (instead of maintaining the consumed cache ways itself), so if another resource group is created under /sys/fs/resctrl and its schemata is "FFFFF", then libvirt will report that there is not enough room for a new VM. But the user may actually want another application (e.g. OVS, DPDK PMDs) to share cache ways with the VM created by libvirt.
Adding support for shared allocations is planned, as I said before; however, this is something that will need to be taken care of differently anyway. I don't know how specific the use case would be, but let's say you want to have 8 cache ways allocated for the VM, but share only 4 of them with some DPDK PMD. You can't use "shared", because that would just take some 8 bits even when some of them might be shared with the system's default group. Moreover, it means that the allocation can be shared with machines run in the future. So in this case you need to have the 8 bits exclusively allocated and then (only after the machine is started) pin the PMD process to those 4 cache ways.
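For what it's worth, that last step can already be done by hand outside libvirt; a sketch (the group name, PID variable and masks are all made up):

    # Hypothetical manual sharing: the VM's group holds 8 exclusive ways (0xff0)
    # and a separate group for the PMD overlaps 4 of them (0x0f0).
    mkdir /sys/fs/resctrl/pmd-shared
    echo "L3:0=0f0;1=fffff" > /sys/fs/resctrl/pmd-shared/schemata
    echo "$PMD_PID" > /sys/fs/resctrl/pmd-shared/tasks   # move the PMD thread into the group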
For the missing issue from the other email:
If the host has CDP enabled, it will report the L3 cache types code and data. When the user doesn't want code/data cache ways allocated separately, the current implementation will report that the 'both' type of L3 cache is not supported.
But we can improve this by making the code and data schemata the same. E.g. if the host has CDP enabled but the user requests 2 'both'-type L3 cache ways, we can write a schemata that looks like:

    L3DATA:0=3
    L3CODE:0=3
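For context, on a CDP host resctrl exposes the L3 resource split in two, so a type='both' request really would just mean writing the same mask on both lines; a sketch mirroring the example above (the group name is made up):

    # With resctrl mounted with the cdp option, L3 is split into code and data:
    mount -t resctrl -o cdp resctrl /sys/fs/resctrl
    ls /sys/fs/resctrl/info                      # L3CODE  L3DATA
    mkdir /sys/fs/resctrl/demo
    # a 'both' allocation then writes the same mask to both resources:
    printf 'L3DATA:0=3\nL3CODE:0=3\n' > /sys/fs/resctrl/demo/schemata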
Yes, that's what we want to achieve, but again, in a future patchset.
Hope that answers your questions. Thanks for trying it out; it is really complicated to develop something like this without the actual hardware to test it on.
Yep.
Have a nice day,
Martin