Re: [RFC] A new SPI API for fast, low-latency regmap peripheral access

David Jander <david@xxxxxxxxxxx> · Thu, 19 May 2022 16:33:27 +0200

On Thu, 19 May 2022 14:14:16 +0200
Andrew Lunn <andrew@xxxxxxx> wrote:

> > > Or otherwise make it unobtrusive (eg, with similar techniques to those
> > > used by the networking API).  
> > 
> > I just tried this out by re-writing the statistics code using u64_stats_sync
> > and per-cpu statistics, which get totaled on sysfs read access as Andrew Lunn
> > suggested.
> > The results are truly amazing!
> > 
> > The overhead caused by statistics in my test dropped from 43us to just 1-2us.  
> 
> When you are potentially dealing with 10 million packets a second, you
> cannot spend long on each individual packet incrementing a counter...

I guess that is a different category of hardware you are talking about
there... ;-)

> > This was tested on a 64-bit machine though, so I don't know how it will affect
> > 32-bit systems. Nor do I have an easy means of testing this. Any ideas?  
> 
> It does make a difference. On a 64 system, you can increment a counter
> in a single instruction so you either see the old value, or the new
> value. With 32 bit systems, which needs multiple instructions to
> increment the counter, so the code takes are you cannot see anything
> odd when it needs to overflow from the lower 32bits into the upper 32
> bits. So 32bit systems will be a little bit more expensive. However,
> not by a lot.

Ok, good to know.

> > Also, I have converted all the struct spi_statistics members to u64_stats_t.
> > It was easier to test this way. Some of the original types were unsigned long,
> > which can have different sizes on 64bit or 32bit systems... is that
> > intentional?  
> 
> You can keep with uint32, if you want to, and keep with the sequence
> counter style locking. For networking, 32bit counters can wrap around
> pretty fast, so the main counters are 64 bit. But the concept works
> O.K. for smaller types.

Since I am already moving the stats to per-cpu structs, wouldn't atomic_inc()
be sufficient and even faster for uint32 stats on 32bit systems?
I still would like to hear Mark's idea on the exact types. All bytes stats
were ULL and transfer counter values were UL. Possibly the reason behind this
was to make the transfer counters as efficient to increment as possible for
the given platform, as should be the case for "unsigned long".
If we move to per-cpu and seqcount, the kernel APIs (u64_stats_t) make us chose
explicitly u64 or u32 though. On most 64bit platforms, all those stats are
64bit anyway, while on most 32bit platforms the UL stats will likely be 32bit
while the ULL are 64bit. I am inclined to make them all u64, but might decide
otherwise if I can detect a performance difference on 32bit systems.
What types should I chose?

Best regards,

-- 
David Jander