On 26/04/2020 16:30, Jason Gunthorpe wrote: > On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote: >> On 24/04/2020 21:26, Jason Gunthorpe wrote: >>> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote: >>>> On 24/04/2020 17:59, Jason Gunthorpe wrote: >>>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote: >>>>>> Add a new stat that counts mmap failures, which might help when >>>>>> debugging different issues. >>>>>> >>>>>> Reviewed-by: Firas JahJah <firasj@xxxxxxxxxx> >>>>>> Reviewed-by: Yossi Leybovich <sleybo@xxxxxxxxxx> >>>>>> Signed-off-by: Gal Pressman <galpress@xxxxxxxxxx> >>>>>> drivers/infiniband/hw/efa/efa.h | 3 ++- >>>>>> drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++-- >>>>>> 2 files changed, 9 insertions(+), 3 deletions(-) >>>>>> >>>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h >>>>>> index aa7396a1588a..77c9ff798117 100644 >>>>>> +++ b/drivers/infiniband/hw/efa/efa.h >>>>>> @@ -1,6 +1,6 @@ >>>>>> /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */ >>>>>> /* >>>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved. >>>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. >>>>>> */ >>>>>> >>>>>> #ifndef _EFA_H_ >>>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats { >>>>>> atomic64_t reg_mr_err; >>>>>> atomic64_t alloc_ucontext_err; >>>>>> atomic64_t create_ah_err; >>>>>> + atomic64_t mmap_err; >>>>>> }; >>>>>> >>>>>> /* Don't use anything other than atomic64 */ >>>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c >>>>>> index b555845d6c14..75eef1ec2474 100644 >>>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c >>>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry { >>>>>> op(EFA_CREATE_CQ_ERR, "create_cq_err") \ >>>>>> op(EFA_REG_MR_ERR, "reg_mr_err") \ >>>>>> op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \ >>>>>> - op(EFA_CREATE_AH_ERR, "create_ah_err") >>>>>> + op(EFA_CREATE_AH_ERR, "create_ah_err") \ >>>>>> + op(EFA_MMAP_ERR, "mmap_err") >>>>>> >>>>>> #define EFA_STATS_ENUM(ename, name) ename, >>>>>> #define EFA_STATS_STR(ename, name) [ename] = name, >>>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>>>> ibdev_dbg(&dev->ibdev, >>>>>> "pgoff[%#lx] does not have valid entry\n", >>>>>> vma->vm_pgoff); >>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>>>>> return -EINVAL; >>>>>> } >>>>>> entry = to_emmap(rdma_entry); >>>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext, >>>>>> err = -EINVAL; >>>>>> } >>>>>> >>>>>> - if (err) >>>>>> + if (err) { >>>>>> ibdev_dbg( >>>>>> &dev->ibdev, >>>>>> "Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n", >>>>>> entry->address, rdma_entry->npages * PAGE_SIZE, >>>>>> entry->mmap_flag, err); >>>>>> + atomic64_inc(&dev->stats.sw_stats.mmap_err); >>>>> >>>>> Really? Isn't this something that is only possible with a buggy >>>>> rdma-core provider? Why count it? >>>> >>>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in >>>> the first place. >>>> >>>> If for some reason a customer app steps on a bug we're not aware of, this >>>> counter could serve as a red flag. >>> >>> But there are lots of cases where a buggy provider can cause error >>> exits, why choose this one to count against all the others? >> >> It's not one against all others, most if not all of our userspace facing API >> error flows have a similar counter. > > Hurm, seems a bit strange, but OK > >> And TBH, I think that the mmap flow is quite convoluted with the cookie response >> from the crate verb, so it deserves a counter IMO. > > How so? Userspace takes the u64 from the command and pass it to mmap, > what is convoluted? It's the only flow that involves two phases/userspace calls and not a single command-response call, so it's a bit more error prone. Convoluted was a bit harsh :).