On Wed, Oct 19, 2022 at 1:06 AM Aaron Lewis <aaronlewis@xxxxxxxxxx> wrote:
>
> > > I think setting it up this way makes the data coming from userspace
> > > less error prone and easier for them to work with because the lists
> > > are being separated for them into include and exclude, and the order
> > > they build it in doesn't matter.
> >
> > The above said, I agree that building the separate lists would be obnoxious for
> > userspace. Userspace would either need to know the length of the "includes" list
> > up front, or would have to build the "excludes" list separately and then fold it
> > in at the end. Unless we might run out of bits, let's keep the flag approach for
> > the uAPI.
> >
> > > Isn't having the include list pointer exposed to userspace a little
> > > awkward? As is, I'm not sure why userspace would care about it, and
> > > when we load the filter we'd essentially ignore it.
> >
> > I don't follow. I assume userspace cares about specifying "includes", otherwise
> > why provide that functionality?
>
> Yes, userspace cares about includes, but I'm not sure they will care
> about the "include" pointer because we would ignore it when loading a
> filter (there's no information it provides we can get from the other
> fields) and it would be cumbersome for them to use as you pointed out.
> So, I was just guessing that they wouldn't use it.
>
> > I'm not concerned about the setup cost to create the sub-lists, my thoughts are
> > purely from a "how easy is it to understand" perspective, closely followed by
> > "how performant is the runtime code". IIUC, the runtime performance is basically
> > equivalent since the comparison function will need to mask off bits either way.
> >
> > For the forward+backward, IMO it's easy to understand with a single comment above
> > the forward+backward walks:
> >
> > 	/*
> > 	 * Entries are sorted by eventsel, walk the list in both directions to
> > 	 * process all entries with the target eventsel.
> > 	 */
> >
> > Sorting should probably have a comment too, but critically, understanding how the
> > list is sorted doesn't require full understanding of how lookups are processed.
> > E.g. a comment like this would suffice for sorting:
> >
> > 	/*
> > 	 * Sort entries by eventsel so that all entries for a given eventsel can
> > 	 * be processed efficiently during filtering.
> > 	 */
> >
> > The sub-list approach, on the other hand, requires comments to explain the sub-list
> > concept, document the indexing code (and its comparator?), the #define for the
> > head bits (and/or struct field), and the walk code that processes the sublist.
> >
> > In addition to having more things to document, the comments will be spread out,
> > and IMO it's difficult to understand each individual piece without having a good
> > grasp of the overall implementation. Using "head" instead of "index" would help
> > quite a bit for the lookup+walk, but I think the extra pieces will still leave
> > readers wondering "why?". E.g. Why is there a separate comparator for sorting
> > vs. lookup? Why is there a "head" field? What problem do sub-lists solve?
>
> I'll drop indexing.
>
> > I don't see why encoding is necessary to achieve a common internal implementation
> > though. To make sure we're talking about the same thing, by "encoding" I mean
> > having different layouts for the same data in the uAPI struct than the in-kernel
> > struct. I'm perfectly ok having bits in the uAPI struct that are dropped when
> > building the in-kernel lists, e.g. the EXCLUDE bit.
> > Ditto for having bits in the in-kernel struct that don't exist in the uAPI
> > struct, e.g. the "head" (index) of the sublist if we go that route.
> >
> > My objection to "encoding" is moving the eventsel bits around. AFAICT, it's not
> > strictly necessary, and similar to the sub-list approach, that raises the question
> > of "why?". In all cases, aren't the "eventsel" bits always contained in bits 35:32
> > and 7:0? The comparator can simply mask off all other bits in order to find a
> > filter with the specified eventsel, no?
>
> Strictly speaking, yes... as long as you are aware of some boundaries
> and work within them.
>
> Problem:
> --------
>
> Given the event select + unit mask pairs: {0x101, 0x1}, {0x101, 0x2},
> {0x102, 0x1}
>
> If we use the architectural layout we get: 0x10011, 0x10021, 0x10012.
>
> Sorted by filter_event_cmp(), i.e. event select, it's possible to get:
> 0x10012, 0x10011, 0x10021.
>
> Then if a search for {0x101, 0x2} with cmp_u64() is done it wouldn't
> find anything because 0x10021 is deemed less than 0x10011 in this
> list. That's why adding find_filter_index(..., cmp_u64) > 0 doesn't
> work to special case legacy events.
>
> Similarly, if sorted by cmp_u64() we get: 0x10021, 0x10012, 0x10011.
>
> Then a search with filter_event_cmp() for 0x10021 would fail as well.
>
> Possible workaround #1:
> -----------------------
>
> If we use the internal layout (with the index removed):
>
> struct kvm_pmu_filter_entry {
>         union {
>                 u64 raw;
>                 struct {
>                         u64 mask:8;
>                         u64 match:8; // doubles as umask, i.e. encode_filter_entry()
>                         u64 event_select:12;
>                         u64 exclude:1;
>                         u64 rsvd:35;
>                 };
>         };
> };
>
> Because the event_select is contiguous, all of its bits are above
> match, and the fields in kvm_pmu_filter_entry are ordered from most
> important to least important from a sort perspective, a sort with
> cmp_u64() gives us: 0x10210, 0x10120, 0x10110.
>
> This order will work for both legacy events and masked events. So, as
> you suggested, if we wanted to add the following to
> is_gp_event_allowed() we can special case legacy events and early out:
>
>         if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS))
>                 return find_filter_index(..., cmp_u64) > 0;
>
> Or we can leave it as is and the masked events path will come up with
> the same result.
>
> Possible workaround #2:
> -----------------------
>
> If we use the architectural layout for masked events, e.g.:
>
> struct kvm_pmu_event_filter_entry {
>         union {
>                 __u64 raw;
>                 struct {
>                         __u64 select_lo:8;
>                         __u64 match:8;
>                         __u64 rsvd1:16;
>                         __u64 select_hi:4;
>                         __u64 rsvd2:19;
>                         __u64 mask:8;
>                         __u64 exclude:1;
>                 };
>         };
> };
>
> This becomes more restrictive, but for our use case I believe it will
> work. The problem with this layout in general is where the event
> select ends up. You can't put anything below 'select_lo', and
> 'select_lo' and 'select_hi' straddle the unit mask, so you just have
> less control over the order things end up in. That said, if our goal
> is to just sort by exclude + event select for masked events we can do
> that with filter_event_cmp() on the layout above. If we want to find
> a legacy event we can use cmp_u64() on the layout above as well. The
> only caveat is you can't mix and match during lookup like you can in
> workaround #1; you have to sort and search with the same comparator.
>
> If we convert legacy events to masked events I'm pretty sure we can
> still have a common code path.
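
To illustrate what that conversion amounts to (rough sketch only; the actual
logic ends up in prepare_filter_events() below), a legacy entry with event
select 0xC0 and unit mask 0x01 becomes a masked event that requires an exact
umask match:

        /* Legacy filter entry: event select 0xC0, unit mask 0x01. */
        u64 legacy = 0xC0 | (0x01 << 8);

        /*
         * Equivalent masked event: setting all eight mask bits (mask=0xFF,
         * match=0x01) makes the match succeed only for that exact umask,
         * i.e. legacy | (0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT).
         */
        u64 masked = KVM_PMU_ENCODE_MASKED_ENTRY(0xC0, 0xFF, 0x01, false);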

> > > > We might need separate logic for the existing non-masked mechanism, but again
> > > > that's only more code, IMO it's not more complex. E.g. I believe that the legacy
> > > > case can be handled with a dedicated:
> > > >
> > > > 	if (!(filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS))
> > > > 		return find_filter_index(..., cmp_u64) > 0;
> > >
> > > I'd rather not drop encoding; however, if we did, sorting and
> > > searching would both have to be special cased. Masked events and
> > > non-masked events are not compatible with each other that way. If the
> > > list is sorted for masked events, you can't do a cmp_u64 search on it
> > > and expect to find anything.
> >
> > I'm not sure I follow. As above, doesn't the search for masked vs. non-masked only
> > differ on the comparator?
> >
> > > I suppose you could if using the struct overlay and we included match (which
> > > doubles as the unit mask for non-masked events) in the comparator and dropped
> > > indexing.
> >
> > Why does "match" need to be in the comparator? I thought the sub-lists are created
> > for each "eventsel"?
> >
> > > But that only works because the struct overlay keeps the event select
> > > contiguous. The fact that the architectural format has the event select
> > > straddle the unit mask prevents that from working.
> > >
> > > > Oh, and as a bonus of splitting include vs. exclude, the legacy case effectively
> > > > optimizes exclude since the length of the exclude array will be '0'.
> > >
> > > This should apply to both implementations.
> >
> > If everything is in a single list, doesn't the exclude lookup need to search the
> > entire thing to determine that there are no entries?
>
> Yes. To do what I'm suggesting would require creating an internal
> representation of the struct kvm_pmu_event_filter to track the include
> and exclude portions of the list. But that should be fairly
> straightforward to set up when loading the filter; then we would be
> able to search the individual lists rather than the whole thing.
> Having said that, I'm not saying this is necessary, all I'm saying is
> that it wouldn't be hard to do.
>
> > ...
> >
> > > > > struct kvm_pmu_filter_entry {
> > > > > 	union {
> > > > > 		u64 raw;
> > > > > 		struct {
> > > > > +			u64 mask:8;
> > > > > +			u64 match:8;
> > > > > +			u64 event_index:12;
> > > >
> > > > This is broken. There are 2^12 possible event_select values, but event_index is
> > > > the index into the full list of events, i.e. is bounded only by nevents, and so
> > > > this needs to be stored as a 32-bit value. E.g. if userspace creates a filter
> > > > with 2^32-2 entries for eventsel==0, then the index for eventsel==1 will be
> > > > 2^32-1 even though there are only two event_select values in the entire list.
> > >
> > > nevents <= KVM_PMU_EVENT_FILTER_MAX_EVENTS (300), so there is plenty
> > > of space in event_index to cover any valid index. Though a moot point
> > > if we cut it.
> >
> > Oooh. In that case, won't 9 bits suffice? Using 12 implies there's a connection
> > to eventsel, which is misleading and confusing.
>
> Yes, 9 would suffice. I gave some buffer because I liked the idea of
> round numbers and I had a lot of space in the struct. I didn't see it
> implying a connection to event_select, but in hindsight I can see why
> you may have thought that.
>
> > Regardless, if we keep the indexing, there should be a BUILD_BUG_ON() to assert
> > that the "head" (event_index) field is large enough to hold the max number of
> > events.
> > Not sure if there's a way to get the number of bits, i.e. might require
> > a separate #define to get the compile-time assert. :-/

Here's what I came up with. Let me know if this is what you were thinking:

arch/x86/include/uapi/asm/kvm.h

#define KVM_PMU_EVENT_FLAG_MASKED_EVENTS BIT(0)
#define KVM_PMU_EVENT_FLAGS_VALID_MASK (KVM_PMU_EVENT_FLAG_MASKED_EVENTS)

/*
 * Masked event layout.
 * Bits   Description
 * ----   -----------
 * 7:0    event select (low bits)
 * 15:8   umask match
 * 31:16  unused
 * 35:32  event select (high bits)
 * 54:36  unused
 * 55     exclude bit
 * 63:56  umask mask
 */

#define KVM_PMU_ENCODE_MASKED_ENTRY(event_select, mask, match, exclude) \
        (((event_select) & 0xFFULL) | (((event_select) & 0xF00ULL) << 24) | \
        (((mask) & 0xFFULL) << 56) | \
        (((match) & 0xFFULL) << 8) | \
        ((__u64)(!!(exclude)) << 55))

#define KVM_PMU_MASKED_ENTRY_EVENT_SELECT \
        (GENMASK_ULL(7, 0) | GENMASK_ULL(35, 32))
#define KVM_PMU_MASKED_ENTRY_UMASK_MASK         (GENMASK_ULL(63, 56))
#define KVM_PMU_MASKED_ENTRY_UMASK_MATCH        (GENMASK_ULL(15, 8))
#define KVM_PMU_MASKED_ENTRY_EXCLUDE            (BIT_ULL(55))
#define KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT   (56)

arch/x86/include/asm/kvm_host.h

struct kvm_x86_pmu_event_filter {
        __u32 action;
        __u32 nevents;
        __u32 fixed_counter_bitmap;
        __u32 flags;
        __u32 nr_includes;
        __u32 nr_excludes;
        __u64 *includes;
        __u64 *excludes;
        __u64 events[];
};

-       struct kvm_pmu_event_filter __rcu *pmu_event_filter;
+       struct kvm_x86_pmu_event_filter __rcu *pmu_event_filter;

arch/x86/kvm/pmu.c

static int filter_sort_cmp(const void *pa, const void *pb)
{
        u64 a = *(u64 *)pa & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT |
                              KVM_PMU_MASKED_ENTRY_EXCLUDE);
        u64 b = *(u64 *)pb & (KVM_PMU_MASKED_ENTRY_EVENT_SELECT |
                              KVM_PMU_MASKED_ENTRY_EXCLUDE);

        return (a > b) - (a < b);
}

/*
 * For the event filter, searching is done on the 'includes' list and
 * 'excludes' list separately rather than on the 'events' list (which
 * has both). As a result the exclude bit can be ignored.
 */
static int filter_event_cmp(const void *pa, const void *pb)
{
        u64 a = *(u64 *)pa & KVM_PMU_MASKED_ENTRY_EVENT_SELECT;
        u64 b = *(u64 *)pb & KVM_PMU_MASKED_ENTRY_EVENT_SELECT;

        return (a > b) - (a < b);
}

static int find_filter_index(u64 *events, u64 nevents, u64 key)
{
        u64 *fe = bsearch(&key, events, nevents, sizeof(events[0]),
                          filter_event_cmp);

        if (!fe)
                return -1;

        return fe - events;
}

static bool is_filter_entry_match(u64 filter_event, u64 umask)
{
        u64 mask = filter_event >> (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8);
        u64 match = filter_event & KVM_PMU_MASKED_ENTRY_UMASK_MATCH;

        BUILD_BUG_ON((KVM_PMU_ENCODE_MASKED_ENTRY(0, 0xff, 0, false) >>
                     (KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT - 8)) !=
                     ARCH_PERFMON_EVENTSEL_UMASK);

        return (umask & mask) == match;
}

static bool filter_contains_match(u64 *events, u64 nevents, u64 eventsel)
{
        u64 event_select = eventsel & kvm_pmu_ops.EVENTSEL_EVENT;
        u64 umask = eventsel & ARCH_PERFMON_EVENTSEL_UMASK;
        int i, index;

        index = find_filter_index(events, nevents, event_select);
        if (index < 0)
                return false;

        /*
         * Entries are sorted by the event select. Walk the list in both
         * directions to process all entries with the targeted event select.
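         * E.g. for the entries {0x101, 0x1}, {0x101, 0x2}, {0x102, 0x1},
         * bsearch() may land on either 0x101 entry when looking up event
         * select 0x101, so both neighbors of the returned index need to
         * be checked.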
         */
        for (i = index; i < nevents; i++) {
                if (filter_event_cmp(&events[i], &event_select) != 0)
                        break;

                if (is_filter_entry_match(events[i], umask))
                        return true;
        }

        for (i = index - 1; i >= 0; i--) {
                if (filter_event_cmp(&events[i], &event_select) != 0)
                        break;

                if (is_filter_entry_match(events[i], umask))
                        return true;
        }

        return false;
}

static bool is_gp_event_allowed(struct kvm_x86_pmu_event_filter *filter,
                                u64 eventsel)
{
        if (filter_contains_match(filter->includes,
                                  filter->nr_includes, eventsel) &&
            !filter_contains_match(filter->excludes,
                                   filter->nr_excludes, eventsel))
                return filter->action == KVM_PMU_EVENT_ALLOW;

        return filter->action == KVM_PMU_EVENT_DENY;
}

< All the code above here is for filtering the guest eventsel. >

< All the code below here is for validating and loading the filter. >

static bool is_filter_valid(struct kvm_x86_pmu_event_filter *filter)
{
        u64 mask;
        int i;

        /* To maintain backwards compatibility only validate masked events. */
        if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS) {
                mask = kvm_pmu_ops.EVENTSEL_EVENT |
                       KVM_PMU_MASKED_ENTRY_UMASK_MASK |
                       KVM_PMU_MASKED_ENTRY_UMASK_MATCH |
                       KVM_PMU_MASKED_ENTRY_EXCLUDE;
                for (i = 0; i < filter->nevents; i++) {
                        if (filter->events[i] & ~mask)
                                return false;
                }
        }

        return true;
}

static void prepare_filter_events(struct kvm_x86_pmu_event_filter *filter)
{
        int i, j;

        if (filter->flags & KVM_PMU_EVENT_FLAG_MASKED_EVENTS)
                return;

        for (i = 0, j = 0; i < filter->nevents; i++) {
                /*
                 * Skip events that are impossible to match against a guest
                 * event. When filtering, only the event select + unit mask
                 * of the guest event is used.
                 */
                if (filter->events[i] &
                    ~(kvm_pmu_ops.EVENTSEL_EVENT | ARCH_PERFMON_EVENTSEL_UMASK))
                        continue;

                /*
                 * Convert userspace events to a common in-kernel event so
                 * only one code path is needed to support both events. For
                 * the in-kernel events use masked events because they are
                 * flexible enough to handle both cases. To convert to masked
                 * events all that's needed is to add the umask_mask.
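                 * (A legacy entry's unit mask already sits in bits 15:8,
                 * i.e. the masked event's umask_match field, so setting all
                 * eight umask_mask bits turns it into an exact-umask match.)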
                 */
                filter->events[j++] = filter->events[i] |
                        (0xFFULL << KVM_PMU_MASKED_ENTRY_UMASK_MASK_SHIFT);
        }

        filter->nevents = j;
}

static void setup_filter_lists(struct kvm_x86_pmu_event_filter *filter)
{
        int i;

        for (i = 0; i < filter->nevents; i++) {
                if (filter->events[i] & KVM_PMU_MASKED_ENTRY_EXCLUDE)
                        break;
        }

        filter->nr_includes = i;
        filter->nr_excludes = filter->nevents - filter->nr_includes;
        filter->includes = filter->events;
        filter->excludes = filter->events + filter->nr_includes;
}

int kvm_vm_ioctl_set_pmu_event_filter(struct kvm *kvm, void __user *argp)
{
-       struct kvm_pmu_event_filter tmp, *filter;
+       struct kvm_pmu_event_filter __user *user_filter = argp;
+       struct kvm_x86_pmu_event_filter *filter;
+       struct kvm_pmu_event_filter tmp;
        size_t size;
        int r;

-       if (copy_from_user(&tmp, argp, sizeof(tmp)))
+       if (copy_from_user(&tmp, user_filter, sizeof(tmp)))
                return -EFAULT;

        if (tmp.action != KVM_PMU_EVENT_ALLOW &&
            tmp.action != KVM_PMU_EVENT_DENY)
                return -EINVAL;

-       if (tmp.flags != 0)
+       if (tmp.flags & ~KVM_PMU_EVENT_FLAGS_VALID_MASK)
                return -EINVAL;

        if (tmp.nevents > KVM_PMU_EVENT_FILTER_MAX_EVENTS)
                return -E2BIG;

        size = struct_size(filter, events, tmp.nevents);
-       filter = kmalloc(size, GFP_KERNEL_ACCOUNT);
+       filter = kzalloc(size, GFP_KERNEL_ACCOUNT);
        if (!filter)
                return -ENOMEM;

+       filter->action = tmp.action;
+       filter->nevents = tmp.nevents;
+       filter->fixed_counter_bitmap = tmp.fixed_counter_bitmap;
+       filter->flags = tmp.flags;
+
        r = -EFAULT;
-       if (copy_from_user(filter, argp, size))
+       if (copy_from_user(filter->events, user_filter->events,
+                          sizeof(filter->events[0]) * filter->nevents))
                goto cleanup;

-       /* Restore the verified state to guard against TOCTOU attacks. */
-       *filter = tmp;
+       r = -EINVAL;
+       if (!is_filter_valid(filter))
+               goto cleanup;

-       remove_impossible_events(filter);
+       prepare_filter_events(filter);

        /*
-        * Sort the in-kernel list so that we can search it with bsearch.
+        * Sort entries by event select so that all entries for a given
+        * event select can be processed efficiently during filtering.
         */
-       sort(&filter->events, filter->nevents, sizeof(__u64), cmp_u64, NULL);
+       sort(&filter->events, filter->nevents, sizeof(filter->events[0]),
+            filter_sort_cmp, NULL);
+
+       setup_filter_lists(filter);
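
For completeness, here's roughly what using the new flag + encode macro would
look like from userspace (illustrative sketch only, not part of the patch;
error handling omitted, vm_fd is the VM file descriptor). This builds a filter
that allows event select 0xC0 with any unit mask except 0x01, and denies
everything else:

        struct kvm_pmu_event_filter *filter;
        size_t size = sizeof(*filter) + 2 * sizeof(__u64);

        filter = calloc(1, size);
        filter->action = KVM_PMU_EVENT_ALLOW;
        filter->flags = KVM_PMU_EVENT_FLAG_MASKED_EVENTS;
        filter->nevents = 2;
        /* Include event select 0xC0 for any unit mask (mask=0, match=0)... */
        filter->events[0] = KVM_PMU_ENCODE_MASKED_ENTRY(0xC0, 0x00, 0x00, false);
        /* ...but exclude unit mask 0x01 (mask=0xFF, match=0x01, exclude=1). */
        filter->events[1] = KVM_PMU_ENCODE_MASKED_ENTRY(0xC0, 0xFF, 0x01, true);

        ioctl(vm_fd, KVM_SET_PMU_EVENT_FILTER, filter);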