Hi Felix & Oak,

over the weekend I had the idea that we could just use the shadow BOs to have
the current page table state available in a page fault. They are GTT BOs and
CPU accessible anyway.

Regards,
Christian.
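Roughly, that could look like the following sketch: read-only CPU access to a
shadow page table BO through the existing amdgpu_bo_kmap()/amdgpu_bo_kunmap()
helpers. The function name, the index parameter and the single-entry read are
made up for illustration (and reservation/locking is omitted), so this is only
a sketch of the idea, not driver code.

    #include "amdgpu.h"
    #include "amdgpu_object.h"

    /* Sketch: read one PDE/PTE from the CPU-visible shadow copy of a page
     * table BO. The shadow is a GTT BO, so amdgpu_bo_kmap() can give us a
     * CPU pointer to its current contents. (BO reservation omitted here.)
     */
    static int read_shadow_entry(struct amdgpu_bo *shadow, unsigned int idx,
                                 uint64_t *entry)
    {
            void *cpu_addr;
            int r;

            r = amdgpu_bo_kmap(shadow, &cpu_addr);
            if (r)
                    return r;

            *entry = ((uint64_t *)cpu_addr)[idx]; /* current entry value */

            amdgpu_bo_kunmap(shadow);
            return 0;
    }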
On 08.09.2018 09:34, Christian König wrote:
> Hi Felix,
>
>> But why do you want to update page tables when there is no more user
>> mode context that cares about them? Is this just to allow pending work
>> to complete after the user mode context is gone? To prevent hangs?
>
> The problem I see is that we still don't have a reliable way of
> killing a command submission while making sure that no more faults
> from it are in flight.
>
> E.g. the HWS has a function to kill a running job on the hardware, but
> that only works for compute and there is no way of telling when that
> operation has finished.
>
> But Oak has already convinced me that we should probably work on that
> instead of trying to keep the VM alive as long as possible.
>
> Regards,
> Christian.
>
> On 08.09.2018 02:27, Felix Kuehling wrote:
>> Hi Christian,
>>
>> I'm not sure I get your point here.
>>
>> As I understand it, the amdgpu_vm structure represents the user mode VM
>> context. It has the pointers to the root page directory and the rest of
>> the page table hierarchy. If there is no amdgpu_vm, there is no user
>> mode context that cares about the page tables.
>>
>> If we leave the page table pointers in the amdgpu_vm, then handling
>> faults after the VM is destroyed is pointless. We can't find the page
>> tables and we can't update them, so there is nothing we can do in
>> response to VM faults.
>>
>> So I'm guessing that you want to move the page table pointers into a
>> different structure that exists half-independently of the VM. It would
>> be created when the VM is created (that's where we currently allocate
>> the PASID) but can be destroyed slightly later.
>>
>> But why do you want to update page tables when there is no more user
>> mode context that cares about them? Is this just to allow pending work
>> to complete after the user mode context is gone? To prevent hangs?
>>
>> Regards,
>>   Felix
>>
>> On 2018-09-10 07:20 AM, Christian König wrote:
>>>> Am I right that the handling of a page fault needs a valid VM?
>>> No, exactly that view is incorrect.
>>>
>>> The VM is meant to be a memory management structure of page tables and
>>> is completely pointless for fault processing because it represents the
>>> future state of the page tables and not the current one.
>>>
>>> What we need for fault processing is a new structure built around
>>> PASIDs which is fed with addresses when page tables are moved around.
>>>
>>> Alternatively I hope that we can use the SDMA to walk the page tables
>>> and update the required entries by just using the address.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 07.09.2018 16:55, Zeng, Oak wrote:
>>>> Hi Christian,
>>>>
>>>> What is the value of delaying the destruction of the hash until after
>>>> the vm is destroyed? Once the vm is destroyed, you basically don't
>>>> have enough information to page the correct page into the gpuvm. Am I
>>>> right that the handling of a page fault needs a valid VM? For example,
>>>> you need the VM to get the corresponding physical address of the
>>>> faulting page.
>>>>
>>>> After the vm is destroyed, the retry interrupt can be directly
>>>> discarded as you can't find the vm through the pasid. You can think of
>>>> this as one kind of prescreening as well.
>>>>
>>>> So I stand by the point that there is no need to delay the destruction
>>>> of the hash until after the vm is destroyed. The prescreening hash can
>>>> be destroyed at the time of vm_fini.
>>>>
>>>> Thanks,
>>>>
>>>> Oak
>>>>
>>>> *From:* Christian König <ckoenig.leichtzumerken at gmail.com>
>>>> *Sent:* Friday, September 07, 2018 4:39 AM
>>>> *To:* Zeng, Oak <Oak.Zeng at amd.com>; Oak Zeng <zengshanjun at gmail.com>;
>>>> amd-gfx at lists.freedesktop.org
>>>> *Cc:* Zeng, Oak <Oak.Zeng at amd.com>
>>>> *Subject:* Re: [PATCH 1/2] drm/amdgpu: Moved fault hash table to
>>>> amdgpu vm
>>>>
>>>> Hi Oak,
>>>>
>>>> yeah, but this is still a NAK. Making the hash per PASID was not a
>>>> suggestion but rather a must have.
>>>>
>>>> The VM structures must be destroyed while the processing is still
>>>> ongoing, or otherwise we would not have clean OOM handling.
>>>>
>>>> The IDR for PASID lookup can be put into amdgpu_ids.c, you can just
>>>> replace the IDA already used there with an IDR.
>>>>
>>>> Since the PASID is already freed up with a delay, we should have the
>>>> grace period for interrupt processing included. If that still isn't
>>>> sufficient we can still add some delayed work for this.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
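For reference, a rough sketch of what "replace the IDA with an IDR" could look
like. Only the <linux/idr.h> API and the idea of a PASID-keyed lookup come
from the mail above; the amdgpu_pasid_state struct, the lock and the helper
names are assumptions made up for illustration.

    #include <linux/idr.h>
    #include <linux/spinlock.h>

    static DEFINE_IDR(pasid_idr);
    static DEFINE_SPINLOCK(pasid_lock);

    /* Hypothetical per-PASID state the fault path would need. */
    struct amdgpu_pasid_state {
            void *fault_hash; /* e.g. the retry-fault hash */
    };

    /* Allocate a PASID that also carries a pointer payload
     * (an IDR can do this, an IDA cannot).
     */
    static int amdgpu_pasid_alloc_state(struct amdgpu_pasid_state *state,
                                        unsigned int bits)
    {
            int pasid;

            spin_lock(&pasid_lock);
            pasid = idr_alloc(&pasid_idr, state, 1, 1 << bits, GFP_ATOMIC);
            spin_unlock(&pasid_lock);

            return pasid; /* negative errno on failure */
    }

    /* Interrupt path: go from PASID straight to the per-PASID state. */
    static struct amdgpu_pasid_state *amdgpu_pasid_lookup(int pasid)
    {
            struct amdgpu_pasid_state *state;

            spin_lock(&pasid_lock);
            state = idr_find(&pasid_idr, pasid);
            spin_unlock(&pasid_lock);

            return state;
    }

The point being that, unlike an IDA, an IDR lets fault processing find the
per-PASID state without ever touching the amdgpu_vm.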
>>>> On 07.09.2018 06:16, ozeng wrote:
>>>>
>>>>     Hi Christian,
>>>>
>>>>     In this implementation, the fault hash is made per vm, not per pasid
>>>>     as suggested, based on the considerations below:
>>>>
>>>>       * Delaying the destruction of the hash introduces more effort,
>>>>         like how to set the proper grace period after which no retry
>>>>         interrupt will be generated. We want to avoid that complication
>>>>         if we can.
>>>>       * The problem with the per vm implementation is that it is hard
>>>>         to delay the destruction of the fault hash, because when the vm
>>>>         is destroyed, the prescreen function won't be able to retrieve
>>>>         the fault hash. But in this case, the prescreen function can
>>>>         either let the interrupt go through (to gmc) or ignore it.
>>>>         From this perspective, it is not really necessary to delay the
>>>>         destruction of the hash table.
>>>>       * On the other hand, ASICs after vega10 have the HW CAM feature,
>>>>         so the SW IV prescreen is only useful for vega10. For all the
>>>>         HW-implemented CAMs, we can consider delayed CAM acknowledgment.
>>>>       * Making it per pasid requires introducing another idr to
>>>>         correlate pasid and hash table - where to put that idr? You
>>>>         would have to make it a global variable, which is not very
>>>>         desirable.
>>>>
>>>>     The main purpose of this patch is to solve the inter-process lock
>>>>     issue (when the hash table is full), while still keeping the code
>>>>     simple.
>>>>
>>>>     Also with this patch, the faults kfifo is no longer necessary.
>>>>     See patch 2.
>>>>
>>>>     Regards,
>>>>
>>>>     Oak
>>>>
>>>>     On 09/06/2018 11:28 PM, Oak Zeng wrote:
>>>>
>>>>         Instead of sharing one fault hash table per device, make it
>>>>         per vm. This avoids an inter-process lock issue when the fault
>>>>         hash table is full.
>>>>
>>>>         Change-Id: I5d1281b7c41eddc8e26113e010516557588d3708
>>>>         Signed-off-by: Oak Zeng <Oak.Zeng at amd.com>
>>>>         Suggested-by: Christian Konig <Christian.Koenig at amd.com>
>>>>         Suggested-by: Felix Kuehling <Felix.Kuehling at amd.com>
>>>>         ---
>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c | 75 ------------------------
>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h | 10 ----
>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 102 ++++++++++++++++++++++++++++++++-
>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 12 ++++
>>>>          drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 38 +++++-------
>>>>          5 files changed, 127 insertions(+), 110 deletions(-)
>>>>
>>>>         diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>>>         index 06373d4..4ed8621 100644
>>>>         --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>>>         +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.c
>>>>         @@ -197,78 +197,3 @@ int amdgpu_ih_process(struct amdgpu_device *adev)
>>>>                  return IRQ_HANDLED;
>>>>          }
>>>>
>>>>         -/**
>>>>         - * amdgpu_ih_add_fault - Add a page fault record
>>>>         - *
>>>>         - * @adev: amdgpu device pointer
>>>>         - * @key: 64-bit encoding of PASID and address
>>>>         - *
>>>>         - * This should be called when a retry page fault interrupt is
>>>>         - * received. If this is a new page fault, it will be added to a hash
>>>>         - * table. The return value indicates whether this is a new fault, or
>>>>         - * a fault that was already known and is already being handled.
>>>>         - *
>>>>         - * If there are too many pending page faults, this will fail. Retry
>>>>         - * interrupts should be ignored in this case until there is enough
>>>>         - * free space.
>>>>         - *
>>>>         - * Returns 0 if the fault was added, 1 if the fault was already known,
>>>>         - * -ENOSPC if there are too many pending faults.
>>>>         - */
>>>>         -int amdgpu_ih_add_fault(struct amdgpu_device *adev, u64 key)
>>>>         -{
>>>>         -        unsigned long flags;
>>>>         -        int r = -ENOSPC;
>>>>         -
>>>>         -        if (WARN_ON_ONCE(!adev->irq.ih.faults))
>>>>         -                /* Should be allocated in <IP>_ih_sw_init on GPUs that
>>>>         -                 * support retry faults and require retry filtering.
>>>>         -                 */
>>>>         -                return r;
>>>>         -
>>>>         -        spin_lock_irqsave(&adev->irq.ih.faults->lock, flags);
>>>>         -
>>>>         -        /* Only let the hash table fill up to 50% for best performance */
>>>>         -        if (adev->irq.ih.faults->count >= (1 << (AMDGPU_PAGEFAULT_HASH_BITS-1)))
>>>>         -                goto unlock_out;
>>>>         -
>>>>         -        r = chash_table_copy_in(&adev->irq.ih.faults->hash, key, NULL);
>>>>         -        if (!r)
>>>>         -                adev->irq.ih.faults->count++;
>>>>         -
>>>>         -        /* chash_table_copy_in should never fail unless we're losing count */
>>>>         -        WARN_ON_ONCE(r < 0);
>>>>         -
>>>>         -unlock_out:
>>>>         -        spin_unlock_irqrestore(&adev->irq.ih.faults->lock, flags);
>>>>         -        return r;
>>>>         -}
>>>>         -
>>>>         -/**
>>>>         - * amdgpu_ih_clear_fault - Remove a page fault record
>>>>         - *
>>>>         - * @adev: amdgpu device pointer
>>>>         - * @key: 64-bit encoding of PASID and address
>>>>         - *
>>>>         - * This should be called when a page fault has been handled. Any
>>>>         - * future interrupt with this key will be processed as a new
>>>>         - * page fault.
>>>>         - */
>>>>         -void amdgpu_ih_clear_fault(struct amdgpu_device *adev, u64 key)
>>>>         -{
>>>>         -        unsigned long flags;
>>>>         -        int r;
>>>>         -
>>>>         -        if (!adev->irq.ih.faults)
>>>>         -                return;
>>>>         -
>>>>         -        spin_lock_irqsave(&adev->irq.ih.faults->lock, flags);
>>>>         -
>>>>         -        r = chash_table_remove(&adev->irq.ih.faults->hash, key, NULL);
>>>>         -        if (!WARN_ON_ONCE(r < 0)) {
>>>>         -                adev->irq.ih.faults->count--;
>>>>         -                WARN_ON_ONCE(adev->irq.ih.faults->count < 0);
>>>>         -        }
>>>>         -
>>>>         -        spin_unlock_irqrestore(&adev->irq.ih.faults->lock, flags);
>>>>         -}
>>>>         diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>>>         index a23e1c0..f411ffb 100644
>>>>         --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>>>         +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ih.h
>>>>         @@ -32,13 +32,6 @@ struct amdgpu_device;
>>>>          #define AMDGPU_IH_CLIENTID_LEGACY 0
>>>>          #define AMDGPU_IH_CLIENTID_MAX SOC15_IH_CLIENTID_MAX
>>>>
>>>>         -#define AMDGPU_PAGEFAULT_HASH_BITS 8
>>>>         -struct amdgpu_retryfault_hashtable {
>>>>         -        DECLARE_CHASH_TABLE(hash, AMDGPU_PAGEFAULT_HASH_BITS, 8, 0);
>>>>         -        spinlock_t    lock;
>>>>         -        int           count;
>>>>         -};
>>>>         -
>>>>          /*
>>>>           * R6xx+ IH ring
>>>>           */
>>>>         @@ -57,7 +50,6 @@ struct amdgpu_ih_ring {
>>>>                  bool                  use_doorbell;
>>>>                  bool                  use_bus_addr;
>>>>                  dma_addr_t            rb_dma_addr; /* only used when use_bus_addr = true */
>>>>         -        struct amdgpu_retryfault_hashtable *faults;
>>>>          };
>>>>
>>>>          #define AMDGPU_IH_SRC_DATA_MAX_SIZE_DW 4
>>>>         @@ -95,7 +87,5 @@ int amdgpu_ih_ring_init(struct amdgpu_device *adev, unsigned ring_size,
>>>>                                  bool use_bus_addr);
>>>>          void amdgpu_ih_ring_fini(struct amdgpu_device *adev);
>>>>          int amdgpu_ih_process(struct amdgpu_device *adev);
>>>>         -int amdgpu_ih_add_fault(struct amdgpu_device *adev, u64 key);
>>>>         -void amdgpu_ih_clear_fault(struct amdgpu_device *adev, u64 key);
>>>>
>>>>          #endif
>>>>         diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>         index 1d7e3c1..8b220e9 100644
>>>>         --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>         +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>>         @@ -2692,6 +2692,22 @@ void amdgpu_vm_adjust_size(struct amdgpu_device *adev, uint32_t min_vm_size,
>>>>                           adev->vm_manager.fragment_size);
>>>>          }
>>>>
>>>>         +static struct amdgpu_retryfault_hashtable *init_fault_hash(void)
>>>>         +{
>>>>         +        struct amdgpu_retryfault_hashtable *fault_hash;
>>>>         +
>>>>         +        fault_hash = kmalloc(sizeof(*fault_hash), GFP_KERNEL);
>>>>         +        if (!fault_hash)
>>>>         +                return fault_hash;
>>>>         +
>>>>         +        INIT_CHASH_TABLE(fault_hash->hash,
>>>>         +                        AMDGPU_PAGEFAULT_HASH_BITS, 8, 0);
>>>>         +        spin_lock_init(&fault_hash->lock);
>>>>         +        fault_hash->count = 0;
>>>>         +
>>>>         +        return fault_hash;
>>>>         +}
>>>>         +
>>>>          /**
>>>>           * amdgpu_vm_init - initialize a vm instance
>>>>           *
>>>>         @@ -2780,6 +2796,12 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>>>>                          vm->pasid = pasid;
>>>>                  }
>>>>
>>>>         +        vm->fault_hash = init_fault_hash();
>>>>         +        if (!vm->fault_hash) {
>>>>         +                r = -ENOMEM;
>>>>         +                goto error_free_root;
>>>>         +        }
>>>>         +
>>>>                  INIT_KFIFO(vm->faults);
>>>>                  vm->fault_credit = 16;
>>>>
>>>>         @@ -2973,7 +2995,7 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>>>
>>>>                  /* Clear pending page faults from IH when the VM is destroyed */
>>>>                  while (kfifo_get(&vm->faults, &fault))
>>>>         -                amdgpu_ih_clear_fault(adev, fault);
>>>>         +                amdgpu_vm_clear_fault(vm->fault_hash, fault);
>>>>
>>>>                  if (vm->pasid) {
>>>>                          unsigned long flags;
>>>>         @@ -2983,6 +3005,9 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>>>>                          spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>>>>                  }
>>>>
>>>>         +        kfree(vm->fault_hash);
>>>>         +        vm->fault_hash = NULL;
>>>>         +
>>>>                  drm_sched_entity_destroy(&vm->entity);
>>>>
>>>>                  if (!RB_EMPTY_ROOT(&vm->va.rb_root)) {
>>>>         @@ -3183,3 +3208,78 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
>>>>                          }
>>>>                  }
>>>>          }
>>>>         +
>>>>         +/**
>>>>         + * amdgpu_vm_add_fault - Add a page fault record to fault hash table
>>>>         + *
>>>>         + * @fault_hash: fault hash table
>>>>         + * @key: 64-bit encoding of PASID and address
>>>>         + *
>>>>         + * This should be called when a retry page fault interrupt is
>>>>         + * received. If this is a new page fault, it will be added to a hash
>>>>         + * table. The return value indicates whether this is a new fault, or
>>>>         + * a fault that was already known and is already being handled.
>>>>         + *
>>>>         + * If there are too many pending page faults, this will fail. Retry
>>>>         + * interrupts should be ignored in this case until there is enough
>>>>         + * free space.
>>>>         + *
>>>>         + * Returns 0 if the fault was added, 1 if the fault was already known,
>>>>         + * -ENOSPC if there are too many pending faults.
>>>>         + */
>>>>         +int amdgpu_vm_add_fault(struct amdgpu_retryfault_hashtable *fault_hash, u64 key)
>>>>         +{
>>>>         +        unsigned long flags;
>>>>         +        int r = -ENOSPC;
>>>>         +
>>>>         +        if (WARN_ON_ONCE(!fault_hash))
>>>>         +                /* Should be allocated in amdgpu_vm_init
>>>>         +                 */
>>>>         +                return r;
>>>>         +
>>>>         +        spin_lock_irqsave(&fault_hash->lock, flags);
>>>>         +
>>>>         +        /* Only let the hash table fill up to 50% for best performance */
>>>>         +        if (fault_hash->count >= (1 << (AMDGPU_PAGEFAULT_HASH_BITS-1)))
>>>>         +                goto unlock_out;
>>>>         +
>>>>         +        r = chash_table_copy_in(&fault_hash->hash, key, NULL);
>>>>         +        if (!r)
>>>>         +                fault_hash->count++;
>>>>         +
>>>>         +        /* chash_table_copy_in should never fail unless we're losing count */
>>>>         +        WARN_ON_ONCE(r < 0);
>>>>         +
>>>>         +unlock_out:
>>>>         +        spin_unlock_irqrestore(&fault_hash->lock, flags);
>>>>         +        return r;
>>>>         +}
>>>>         +
>>>>         +/**
>>>>         + * amdgpu_vm_clear_fault - Remove a page fault record
>>>>         + *
>>>>         + * @fault_hash: fault hash table
>>>>         + * @key: 64-bit encoding of PASID and address
>>>>         + *
>>>>         + * This should be called when a page fault has been handled. Any
>>>>         + * future interrupt with this key will be processed as a new
>>>>         + * page fault.
>>>>         + */
>>>>         +void amdgpu_vm_clear_fault(struct amdgpu_retryfault_hashtable *fault_hash, u64 key)
>>>>         +{
>>>>         +        unsigned long flags;
>>>>         +        int r;
>>>>         +
>>>>         +        if (!fault_hash)
>>>>         +                return;
>>>>         +
>>>>         +        spin_lock_irqsave(&fault_hash->lock, flags);
>>>>         +
>>>>         +        r = chash_table_remove(&fault_hash->hash, key, NULL);
>>>>         +        if (!WARN_ON_ONCE(r < 0)) {
>>>>         +                fault_hash->count--;
>>>>         +                WARN_ON_ONCE(fault_hash->count < 0);
>>>>         +        }
>>>>         +
>>>>         +        spin_unlock_irqrestore(&fault_hash->lock, flags);
>>>>         +}
>>>>         diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>         index e275ee7..6eb1da1 100644
>>>>         --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>         +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>>>         @@ -178,6 +178,13 @@ struct amdgpu_task_info {
>>>>                  pid_t  tgid;
>>>>          };
>>>>
>>>>         +#define AMDGPU_PAGEFAULT_HASH_BITS 8
>>>>         +struct amdgpu_retryfault_hashtable {
>>>>         +        DECLARE_CHASH_TABLE(hash, AMDGPU_PAGEFAULT_HASH_BITS, 8, 0);
>>>>         +        spinlock_t    lock;
>>>>         +        int           count;
>>>>         +};
>>>>         +
>>>>          struct amdgpu_vm {
>>>>                  /* tree of virtual addresses mapped */
>>>>                  struct rb_root_cached va;
>>>>         @@ -240,6 +247,7 @@ struct amdgpu_vm {
>>>>                  struct ttm_lru_bulk_move lru_bulk_move;
>>>>                  /* mark whether can do the bulk move */
>>>>                  bool                  bulk_moveable;
>>>>         +        struct amdgpu_retryfault_hashtable *fault_hash;
>>>>          };
>>>>
>>>>          struct amdgpu_vm_manager {
>>>>         @@ -355,4 +363,8 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm);
>>>>          void amdgpu_vm_move_to_lru_tail(struct amdgpu_device *adev,
>>>>                                          struct amdgpu_vm *vm);
>>>>
>>>>         +int amdgpu_vm_add_fault(struct amdgpu_retryfault_hashtable *fault_hash, u64 key);
>>>>         +
>>>>         +void amdgpu_vm_clear_fault(struct amdgpu_retryfault_hashtable *fault_hash, u64 key);
>>>>         +
>>>>          #endif
>>>>         diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>         index 5ae5ed2..acbe5a7 100644
>>>>         --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>         +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
>>>>         @@ -265,35 +265,36 @@ static bool vega10_ih_prescreen_iv(struct amdgpu_device *adev)
>>>>                          return true;
>>>>                  }
>>>>
>>>>         -        addr = ((u64)(dw5 & 0xf) << 44) | ((u64)dw4 << 12);
>>>>         -        key = AMDGPU_VM_FAULT(pasid, addr);
>>>>         -        r = amdgpu_ih_add_fault(adev, key);
>>>>         -
>>>>         -        /* Hash table is full or the fault is already being processed,
>>>>         -         * ignore further page faults
>>>>         -         */
>>>>         -        if (r != 0)
>>>>         -                goto ignore_iv;
>>>>         -
>>>>                  /* Track retry faults in per-VM fault FIFO. */
>>>>                  spin_lock(&adev->vm_manager.pasid_lock);
>>>>                  vm = idr_find(&adev->vm_manager.pasid_idr, pasid);
>>>>         +        addr = ((u64)(dw5 & 0xf) << 44) | ((u64)dw4 << 12);
>>>>         +        key = AMDGPU_VM_FAULT(pasid, addr);
>>>>                  if (!vm) {
>>>>                          /* VM not found, process it normally */
>>>>                          spin_unlock(&adev->vm_manager.pasid_lock);
>>>>         -                amdgpu_ih_clear_fault(adev, key);
>>>>                          return true;
>>>>         +        } else {
>>>>         +                r = amdgpu_vm_add_fault(vm->fault_hash, key);
>>>>         +
>>>>         +                /* Hash table is full or the fault is already being processed,
>>>>         +                 * ignore further page faults
>>>>         +                 */
>>>>         +                if (r != 0) {
>>>>         +                        spin_unlock(&adev->vm_manager.pasid_lock);
>>>>         +                        goto ignore_iv;
>>>>         +                }
>>>>                  }
>>>>                  /* No locking required with single writer and single reader */
>>>>                  r = kfifo_put(&vm->faults, key);
>>>>                  if (!r) {
>>>>                          /* FIFO is full. Ignore it until there is space */
>>>>         +                amdgpu_vm_clear_fault(vm->fault_hash, key);
>>>>                          spin_unlock(&adev->vm_manager.pasid_lock);
>>>>         -                amdgpu_ih_clear_fault(adev, key);
>>>>                          goto ignore_iv;
>>>>                  }
>>>>         -        spin_unlock(&adev->vm_manager.pasid_lock);
>>>>
>>>>         +        spin_unlock(&adev->vm_manager.pasid_lock);
>>>>                  /* It's the first fault for this address, process it normally */
>>>>                  return true;
>>>>
>>>>         @@ -386,14 +387,6 @@ static int vega10_ih_sw_init(void *handle)
>>>>                  adev->irq.ih.use_doorbell = true;
>>>>                  adev->irq.ih.doorbell_index = AMDGPU_DOORBELL64_IH << 1;
>>>>
>>>>         -        adev->irq.ih.faults = kmalloc(sizeof(*adev->irq.ih.faults), GFP_KERNEL);
>>>>         -        if (!adev->irq.ih.faults)
>>>>         -                return -ENOMEM;
>>>>         -        INIT_CHASH_TABLE(adev->irq.ih.faults->hash,
>>>>         -                         AMDGPU_PAGEFAULT_HASH_BITS, 8, 0);
>>>>         -        spin_lock_init(&adev->irq.ih.faults->lock);
>>>>         -        adev->irq.ih.faults->count = 0;
>>>>         -
>>>>                  r = amdgpu_irq_init(adev);
>>>>
>>>>                  return r;
>>>>         @@ -406,9 +399,6 @@ static int vega10_ih_sw_fini(void *handle)
>>>>                  amdgpu_irq_fini(adev);
>>>>                  amdgpu_ih_ring_fini(adev);
>>>>
>>>>         -        kfree(adev->irq.ih.faults);
>>>>         -        adev->irq.ih.faults = NULL;
>>>>         -
>>>>                  return 0;
>>>>          }
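To spell out the calling convention the two new helpers are documented with, a
short usage sketch. Only amdgpu_vm_add_fault()/amdgpu_vm_clear_fault() and
their return values come from the patch above; example_process_retry_fault()
and handle_one_fault() are placeholders invented for illustration.

    #include "amdgpu.h"
    #include "amdgpu_vm.h"

    /* Placeholder for whatever actually fixes up the page tables. */
    static void handle_one_fault(u64 key) { }

    /* Usage sketch only: pair the per-VM helpers around fault handling. */
    static void example_process_retry_fault(struct amdgpu_vm *vm, u64 key)
    {
            int r = amdgpu_vm_add_fault(vm->fault_hash, key);

            if (r == 1 || r == -ENOSPC)
                    return; /* already being handled, or hash over 50% full */

            handle_one_fault(key);

            /* Once handled, a future retry for this key is treated as new. */
            amdgpu_vm_clear_fault(vm->fault_hash, key);
    }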
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx at lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx