The first application to open the autodump node gets the right to use it. All others only get -EBUSY until the first application is done with the hardware.
Christian.
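
To make that concrete, here is a minimal usermode sketch of how a tool might consume the node. This is an illustration only, not the official umr code; the debugfs mount point and DRI instance number (0) are assumptions.

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            /* Only one process can hold the node; a second open() fails
             * with -EBUSY until the first one closes it. */
            int fd = open("/sys/kernel/debug/dri/0/amdgpu_autodump", O_RDONLY);
            if (fd < 0) {
                    perror("open amdgpu_autodump");
                    return 1;
            }

            struct pollfd pfd = { .fd = fd, .events = POLLIN };

            /* Blocks until amdgpu wakes the queue ahead of a GPU reset. */
            if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
                    /* ... dump state here, e.g. via dmesg and umr ... */
            }

            /* release() completes the dump and lets the reset proceed. */
            close(fd);
            return 0;
    }

Closing the file descriptor is what signals amdgpu to continue with the reset, so any dump work has to finish before close().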
On 15.05.20 at 04:40, Zhao, Jiange wrote:
Hi Dennis,

This node/feature is for the UMR extension. It is designed for a single user.
Jiange
From: Li, Dennis <Dennis.Li@xxxxxxx>
Sent: Thursday, May 14, 2020 11:15 PM
To: Koenig, Christian <Christian.Koenig@xxxxxxx>; Zhao, Jiange <Jiange.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: RE: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
Hi Jiange,

How do we handle the case where multiple apps do the auto dump? This patch does not seem to be multi-process safe.

Best Regards,
Dennis Li
Hi Jiange,

it probably won't hurt, but I would just drop that. You need roughly 4 billion GPU resets until the UINT_MAX-1 becomes zero again.
Christian.
On 14.05.20 at 09:14, Zhao, Jiange wrote:
wait_for_completion_interruptible_timeout() would decrease autodump.dumping.done to UINT_MAX-1. complete_all() here would restore autodump.dumping to the same state as in amdgpu_debugfs_autodump_init(). I want to make sure that every open() deals with the same situation.
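
For reference, a minimal kernel-style sketch of that lifecycle (an illustration only; the 'done' counter is internal to struct completion, and the comments reflect the behavior as described in this thread):

    #include <linux/completion.h>
    #include <linux/jiffies.h>

    static struct completion dumping;

    static void autodump_completion_sketch(void)
    {
            init_completion(&dumping);      /* done == 0 */
            complete_all(&dumping);         /* done == UINT_MAX: no dump pending */

            /* A timed wait may leave done at UINT_MAX-1 ... */
            wait_for_completion_interruptible_timeout(&dumping, 600 * HZ);

            /* ... so a second complete_all() restores the post-init state,
             * and every subsequent open() sees the same situation. */
            complete_all(&dumping);
    }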
From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>
Sent: Thursday, May 14, 2020 3:01 PM
To: Zhao, Jiange <Jiange.Zhao@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx <amd-gfx@xxxxxxxxxxxxxxxxxxxxx>
Cc: Pelloux-prayer, Pierre-eric <Pierre-eric.Pelloux-prayer@xxxxxxx>; Zhao, Jiange <Jiange.Zhao@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; Deucher, Alexander <Alexander.Deucher@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>; Liu, Monk <Monk.Liu@xxxxxxx>; Zhang, Hawking <Hawking.Zhang@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: Add autodump debugfs node for gpu reset v4
On 14.05.20 at 07:29, jianzh@xxxxxxx wrote:
> From: Jiange Zhao <Jiange.Zhao@xxxxxxx>
>
> When the GPU gets a timeout, it notifies an interested party
> of an opportunity to dump info before the actual GPU reset.
>
> A usermode app would open the 'autodump' node under the debugfs system
> and poll() for readable/writable. When a GPU reset is due,
> amdgpu notifies the usermode app through the wait_queue_head and gives
> it 10 minutes to dump info.
>
> After the usermode app has done its work, the 'autodump' node is closed.
> On node closure, amdgpu gets to know that the dump is done through
> the completion that is triggered in release().
>
> There is no write or read callback because the necessary info can be
> obtained through dmesg and umr. Messages back and forth between
> the usermode app and amdgpu are unnecessary.
>
> v2: (1) changed 'registered' to 'app_listening'
>     (2) add a mutex in open() to prevent race condition
>
> v3 (chk): grab the reset lock to avoid race in autodump_open,
>           rename debugfs file to amdgpu_autodump,
>           provide autodump_read as well,
>           style and code cleanups
>
> v4: add 'bool app_listening' to differentiate situations, so that
>     the node can be reopened; also, there is no need to wait for
>     completion when no app is waiting for a dump.
>
> v5: change 'bool app_listening' to 'enum amdgpu_autodump_state'
>     add 'app_state_mutex' for race conditions:
>     (1) Only 1 user can open this file node
>     (2) wait_dump() can only take effect after poll() executed.
>     (3) eliminated the race condition between release() and
>         wait_dump()
>
> v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex'
>     removed state checking in amdgpu_debugfs_wait_dump
>     Improve on top of version 3 so that the node can be reopened.
>
> v7: move reinit_completion into open() so that only one user
>     can open it.
>
> Signed-off-by: Jiange Zhao <Jiange.Zhao@xxxxxxx>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h         |  2 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 79 ++++++++++++++++++++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h |  6 ++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  |  2 +
>   4 files changed, 88 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 2a806cb55b78..9e8eeddfe7ce 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -992,6 +992,8 @@ struct amdgpu_device {
>   	char				product_number[16];
>   	char				product_name[32];
>   	char				serial[16];
> +
> +	struct amdgpu_autodump		autodump;
>   };
>
>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 1a4894fa3693..efee3f1adecf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -27,7 +27,7 @@
> #include <linux/pci.h>
> #include <linux/uaccess.h>
> #include <linux/pm_runtime.h>
> -
> +#include <linux/poll.h>
> #include <drm/drm_debugfs.h>
>
> #include "amdgpu.h"
> @@ -74,8 +74,83 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>   	return 0;
>   }
>
> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev)
> +{
> +#if defined(CONFIG_DEBUG_FS)
> +	unsigned long timeout = 600 * HZ;
> +	int ret;
> +
> +	wake_up_interruptible(&adev->autodump.gpu_hang);
> +
> +	ret = wait_for_completion_interruptible_timeout(&adev->autodump.dumping, timeout);
> +	complete_all(&adev->autodump.dumping);
Sorry that I'm mentioning this only now. But what is this complete_all() here good for?

I mean we already waited for completion, didn't we?
Christian.
> +	if (ret == 0) {
> +		pr_err("autodump: timeout, move on to gpu recovery\n");
> +		return -ETIMEDOUT;
> +	}
> +#endif
> +	return 0;
> +}
> +
> #if defined(CONFIG_DEBUG_FS)
>
> +static int amdgpu_debugfs_autodump_open(struct inode *inode, struct file *file)
> +{
> +	struct amdgpu_device *adev = inode->i_private;
> +	int ret;
> +
> +	file->private_data = adev;
> +
> +	mutex_lock(&adev->lock_reset);
> +	if (adev->autodump.dumping.done) {
> +		reinit_completion(&adev->autodump.dumping);
> +		ret = 0;
> +	} else {
> +		ret = -EBUSY;
> +	}
> +	mutex_unlock(&adev->lock_reset);
> +
> +	return ret;
> +}
> +
> +static int amdgpu_debugfs_autodump_release(struct inode *inode, struct file *file)
> +{
> +	struct amdgpu_device *adev = file->private_data;
> +
> +	complete_all(&adev->autodump.dumping);
> +	return 0;
> +}
> +
> +static unsigned int amdgpu_debugfs_autodump_poll(struct file *file, struct poll_table_struct *poll_table)
> +{
> +	struct amdgpu_device *adev = file->private_data;
> +
> +	poll_wait(file, &adev->autodump.gpu_hang, poll_table);
> +
> +	if (adev->in_gpu_reset)
> +		return POLLIN | POLLRDNORM | POLLWRNORM;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations autodump_debug_fops = {
> +	.owner = THIS_MODULE,
> +	.open = amdgpu_debugfs_autodump_open,
> +	.poll = amdgpu_debugfs_autodump_poll,
> +	.release = amdgpu_debugfs_autodump_release,
> +};
> +
> +static void amdgpu_debugfs_autodump_init(struct amdgpu_device *adev)
> +{
> +	init_completion(&adev->autodump.dumping);
> +	complete_all(&adev->autodump.dumping);
> +	init_waitqueue_head(&adev->autodump.gpu_hang);
> +
> +	debugfs_create_file("amdgpu_autodump", 0600,
> +		adev->ddev->primary->debugfs_root,
> +		adev, &autodump_debug_fops);
> +}
> +
> /**
>   * amdgpu_debugfs_process_reg_op - Handle MMIO register reads/writes
> *
> @@ -1434,6 +1509,8 @@ int amdgpu_debugfs_init(struct amdgpu_device *adev)
>
> amdgpu_ras_debugfs_create_all(adev);
>
> + amdgpu_debugfs_autodump_init(adev);
> +
>   	return amdgpu_debugfs_add_files(adev, amdgpu_debugfs_list,
>   					ARRAY_SIZE(amdgpu_debugfs_list));
> }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> index de12d1101526..2803884d338d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.h
> @@ -31,6 +31,11 @@ struct amdgpu_debugfs {
> unsigned num_files;
> };
>
> +struct amdgpu_autodump {
> +	struct completion		dumping;
> +	struct wait_queue_head		gpu_hang;
> +};
> +
>   int amdgpu_debugfs_regs_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_init(struct amdgpu_device *adev);
>   void amdgpu_debugfs_fini(struct amdgpu_device *adev);
> @@ -40,3 +45,4 @@ int amdgpu_debugfs_add_files(struct amdgpu_device *adev,
>   int amdgpu_debugfs_fence_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_firmware_init(struct amdgpu_device *adev);
>   int amdgpu_debugfs_gem_init(struct amdgpu_device *adev);
> +int amdgpu_debugfs_wait_dump(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index cc41e8f5ad14..545beebcf43e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3927,6 +3927,8 @@ static int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   	int i, r = 0;
>   	bool need_full_reset = *need_full_reset_arg;
> 
> +	amdgpu_debugfs_wait_dump(adev);
> +
>   	/* block all schedulers and reset given job's ring */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];