Re: [RFC 0/4] GPU/CPU timestamps correlation for relating OA samples with system events

Lionel Landwerlin <lionel.g.landwerlin@xxxxxxxxx> · Thu, 28 Dec 2017 17:13:01 +0000



    On 26/12/17 05:32, Sagar Arun Kamble
      wrote:

    
      On 12/22/2017 3:46 PM, Lionel
        Landwerlin wrote:

      
        On 22/12/17 09:30, Sagar Arun
          Kamble wrote:

        
          On 12/21/2017 6:29 PM, Lionel
            Landwerlin wrote:

          
            Some more findings I made while
              playing with this series & GPUTop.

              Turns out the 2ms drift per second is due to timecounter.
              Adding the delta this way :

              
              https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R607

              
              Eliminates the drift.
          
          I see two important changes: 1. approximation of the start time
          during init_timecounter, 2. overflow handling in delta accumulation.

          With these incorporated, I guess timecounter should also work
          in the same fashion.

        
        I think the arithmetic in timecounter is inherently lossy and
        that's why we're seeing a drift.
      Could you share details about the platform and scenario in which the
      2ms drift per second is being seen with timecounter?

      I did not observe this on SKL.

    
    The 2ms drift was on SKL GT4.

    
    With the patch above, I'm seeing only a ~40us drift over ~7seconds
    of recording both perf tracepoints & i915 perf reports.

    I'm tracking the kernel tracepoints adding gem requests and the i915
    perf reports.

    Here is a screenshot at the beginning of the 7s recording :
    https://i.imgur.com/hnexgjQ.png (you can see the gem request add
    before the work starts in the i915 perf reports).

    At the end of the recording, the gem requests appear later than the
    work in the i915 perf report : https://i.imgur.com/oCd0C9T.png

    
    I'll try to prepare some IGT tests that show the drift using perf
    & i915 perf, so we can run those on different platforms.

    I tend to mostly test on a SKL GT4 & KBL GT2, but BXT definitely
    needs more attention...

    
        Could we be using it wrong?

        
      If we use the two changes highlighted above with timecounter, maybe
      we will get the same results as your current implementation.

        In the patch above, I think there is still a drift because of the
        potential fractional part loss at every delta we add.

        But it should only be a fraction of a nanosecond multiplied by
        the number of reports over a period of time.

        With a report every 1us, that should still be much less than
        1ms of drift over 1s.
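The bound above can be checked with a small sketch. It is a userspace illustration, not code from the series: `ticks_to_ns()` and `truncation_drift_ns()` are hypothetical helpers, and the 12 MHz timestamp frequency used in the note below is an assumption for SKL/KBL-class hardware.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Convert GPU ticks to nanoseconds, truncating the fractional part. */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    return ticks * NSEC_PER_SEC / freq_hz;
}

/*
 * Nanoseconds lost when n_reports deltas of delta_ticks each are converted
 * one by one, compared with converting the summed ticks in one go.
 * Each per-report conversion drops less than 1 ns, so the result is
 * always below n_reports.
 */
static uint64_t truncation_drift_ns(uint32_t delta_ticks, uint64_t n_reports,
                                    uint64_t freq_hz)
{
    uint64_t per_report = ticks_to_ns(delta_ticks, freq_hz) * n_reports;
    uint64_t at_once = ticks_to_ns((uint64_t)delta_ticks * n_reports, freq_hz);

    return at_once - per_report;
}
```

With the assumed 12 MHz clock, one million reports spaced 13 ticks apart (about 1.08us each) lose 333333 ns in total, roughly a third of a millisecond over about one second of reports. Reports spaced exactly 12 ticks apart lose nothing, since 12 ticks is exactly 1000 ns.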

        
      The timecounter interface takes care of fractional parts, so that
      should help us.

      We can either go with timecounter or our own implementation,
      provided the conversions are precise.

    
    Looking at clocks_calc_mult_shift(), it seems clear to me that there
    is less precision when using timecounter :

    
     /*
      * Find the conversion shift/mult pair which has the best
      * accuracy and fits the maxsec conversion range:
      */

    
    On the other hand, there is a performance penalty for doing a div64
    for every report.
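To make the trade-off concrete, here is a rough userspace sketch comparing a mult/shift conversion with a plain 64-bit division. It does not reproduce what clocks_calc_mult_shift() actually picks (that depends on maxsec); `calc_mult()` and the shift values are hypothetical, and the 12 MHz frequency is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Rounded multiplier for a given shift: (NSEC_PER_SEC << shift) / freq_hz. */
static uint64_t calc_mult(uint64_t freq_hz, unsigned int shift)
{
    assert(freq_hz != 0 && shift < 32);
    return ((NSEC_PER_SEC << shift) + freq_hz / 2) / freq_hz;
}

/* Cheap conversion: one multiply and one shift, no division. */
static uint64_t ticks_to_ns_mult_shift(uint64_t ticks, uint64_t mult,
                                       unsigned int shift)
{
    return (ticks * mult) >> shift;
}

/* Reference conversion: one 64-bit division per call. */
static uint64_t ticks_to_ns_div(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    return ticks * NSEC_PER_SEC / freq_hz;
}
```

For one second worth of ticks at the assumed 12 MHz, the division gives exactly 1000000000 ns. A shift of 8 gives 999984375 ns (about 15.6us short), while a shift of 24 gives 999999999 ns. The smaller the shift that has to be chosen to cover the conversion range, the larger the error per converted interval.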

    
        We can probably do better by always computing the clock using the
        entire delta rather than the accumulated delta.

        
      The issue is that the clock cycles reported in the OA report are the
      32 LSBs of the GPU timestamp, whereas the counter is 36 bits. Hence
      we will need to accumulate the delta. Of course, this assumes that
      two reports can't be spaced by more than a count value of 0xffffffff
      apart.
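The accumulation described above can be sketched as follows. This is only an illustration under the stated assumption that consecutive reports are less than 2^32 ticks apart; `struct ts_accum` and its helpers are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator extending 32-bit report timestamps to 64 bits. */
struct ts_accum {
    uint64_t total_ticks; /* ticks accumulated since the first report */
    uint32_t last_ts;     /* low 32 bits seen in the previous report */
};

static void ts_accum_init(struct ts_accum *a, uint32_t first_ts)
{
    assert(a != 0);
    a->total_ticks = 0;
    a->last_ts = first_ts;
}

/*
 * Add the delta between this report and the previous one. Unsigned 32-bit
 * subtraction wraps modulo 2^32, so a single wrap of the low bits between
 * two reports is handled without a special case.
 */
static uint64_t ts_accum_update(struct ts_accum *a, uint32_t ts)
{
    uint32_t delta = ts - a->last_ts;

    a->total_ticks += delta;
    a->last_ts = ts;
    return a->total_ticks;
}
```

For example, going from 0xfffffff0 to 0x00000010 yields a delta of 0x20 ticks rather than a large negative jump. If two reports were a full 2^32 ticks or more apart, the missing wraps would be silently lost.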

    
    You're right :)

    I thought maybe we could do this : 

    
    Look at the opening period parameter; if it's greater than the
    timestamp wrapping period, make sure we schedule some work on the
    kernel context to generate a context switch report (like at least
    once every 6 minutes on gen9).
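The "6 minutes" figure follows from the width of the reported timestamp and the timestamp frequency. A minimal sketch, assuming a 12 MHz timestamp clock on gen9 (SKL/KBL; other parts may differ); both helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a counter of `bits` bits wraps when ticking at freq_hz. */
static double ts_wrap_period_s(unsigned int bits, uint64_t freq_hz)
{
    assert(bits < 64 && freq_hz != 0);
    return (double)(1ULL << bits) / (double)freq_hz;
}

/*
 * Non-zero when the requested sampling period is long enough that the
 * reported timestamp could wrap between two periodic reports, in which
 * case extra reports would have to be generated some other way.
 */
static int needs_forced_report(double sampling_period_s, unsigned int bits,
                               uint64_t freq_hz)
{
    return sampling_period_s >= ts_wrap_period_s(bits, freq_hz);
}
```

With 32 bits at 12 MHz the wrap period is about 357.9 seconds, just under 6 minutes, so any forced report would need to happen more often than that.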

    
              Timelines of perf i915 tracepoints & OA reports now make a
              lot more sense.

              
              There is still the issue that reading the CPU clock &
              the RCS timestamp is inherently not atomic. So there is a
              delta there.

              I think we should add a new i915 perf record type to
              express the delta that we measure this way :

              
              https://github.com/djdeath/linux/commit/7b002cb360483e331053aec0f98433a5bd5c5c3f#diff-9b74bd0cfaa90b601d80713c7bd56be4R2475

              
              So that userspace knows there might be a global offset
              between the 2 times and is able to present it.

            
          agree on this. Delta ns1-ns0 can be interpreted as max drift.
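The ns1-ns0 measurement can be sketched by bracketing the GPU timestamp read with two CPU clock reads. This is a userspace illustration only, not the code in the linked commit: the clock sources are passed in as function pointers, and the `fake_*` readers are stand-ins that exist purely to make the sketch self-contained.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*read_u64_fn)(void);

/* Hypothetical correlation record carrying the measured uncertainty. */
struct correlation_point {
    uint64_t cpu_ns;       /* CPU clock read just before the GPU timestamp */
    uint64_t gpu_ts;       /* GPU timestamp read in between */
    uint64_t max_delta_ns; /* ns1 - ns0: upper bound on the offset */
};

static struct correlation_point
take_correlation_point(read_u64_fn read_cpu_ns, read_u64_fn read_gpu_ts)
{
    struct correlation_point p;
    uint64_t ns0, ns1;

    assert(read_cpu_ns != 0 && read_gpu_ts != 0);

    ns0 = read_cpu_ns();
    p.gpu_ts = read_gpu_ts();
    ns1 = read_cpu_ns();

    p.cpu_ns = ns0;
    p.max_delta_ns = ns1 - ns0;
    return p;
}

/* Stand-in clocks for demonstration: the CPU clock advances 30 ns per read. */
static uint64_t fake_cpu_ns_now = 1000;

static uint64_t fake_read_cpu_ns(void)
{
    uint64_t now = fake_cpu_ns_now;

    fake_cpu_ns_now += 30;
    return now;
}

static uint64_t fake_read_gpu_ts(void)
{
    return 42;
}
```

The GPU timestamp was sampled somewhere between ns0 and ns1, so userspace can treat max_delta_ns as the worst-case global offset between the two timelines.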

          
              Measurements on my KBL system were on the order of a few
              microseconds (~30us).

              I guess we might be able to set up the correlation point
              better (masking interrupts?) to reduce the delta.

            
          already using spin_lock. Do you mean NMI?

        
        I don't actually know much on this point.

        if spin_lock is the best we can do, then that's it :)

        
              Thanks,

              
              -

              Lionel

              
              On 07/12/17 00:57, Robert Bragg wrote:

            
                  On Thu, Dec 7, 2017 at 12:48
                    AM, Robert Bragg <robert@xxxxxxxxxxxxx>
                    wrote:

                    
                             at least from what I wrote back then
                              it looks like I was seeing a drift of a
                              few milliseconds per second on SKL. I
                              vaguely recall it being much worse given
                              the frequency constants we had for
                              Haswell.

                            
                    Sorry I didn't actually re-read my own message
                      properly before referencing it :) Apparently the
                      2ms per second drift was for Haswell, so
                      presumably not quite so bad for SKL. 

                    
                    - Robert

                    
              _______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

            