Am 10.06.22 um 01:20 schrieb Lucas De Marchi:
Instead of always falling back to memcpy_fromio() for any size, prefer
using read{b,w,l}(). When reading struct members it's common to read
individual integer variables individually. Going through memcpy_fromio()
for each of them poses a high penalty.
Employ a similar trick as __seqprop() by using _Generic() to generate
only the specific call based on a type-compatible variable.
For a pariticular i915 workload producing GPU context switches,
__get_engine_usage_record() is particularly hot since the engine usage
is read from device local memory with dgfx, possibly multiple times
since it's racy. Test execution time for this test shows a ~12.5%
improvement with DG2:
Before:
nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
median = 9.52548e+06; var = 526149;
After:
nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
median = 8.33955e+06; var = 333113;
Other things attempted that didn't prove very useful:
1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop,
comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it
was before the conversion to iosys-map
(3) did gave a small improvement (~3%), but doesn't seem to scale well
to other similar cases in the driver.
Additional test by Chris Wilson using gem_create from igt with some
changes to track object creation time. This happens to accidentaly
stress this code path:
Pre iosys_map conversion of engine busyness:
lmem0: Creating 262144 4KiB objects took 59274.2ms
Unpatched:
lmem0: Creating 262144 4KiB objects took 108830.2ms
With readl (this patch):
lmem0: Creating 262144 4KiB objects took 61348.6ms
s/readl/READ_ONCE/
lmem0: Creating 262144 4KiB objects took 61333.2ms
So we do take a little bit more time than before the conversion, but
that is due to other factors: bringing the READ_ONCE back would be as
good as just doing this conversion.
Signed-off-by: Lucas De Marchi <lucas.demarchi@xxxxxxxxx>
---
If this is acceptable we should probably add the write counterpart, too.
Sending here only the read for now since this fixes the issue we are
seeing and to gather feedback.
As far as I can see looks sane to me, but the kernel test robot tears
the patch apart.
Probably just a typo somewhere in the 32bit handling.
Apart from that looks good to me.
Regards,
Christian.
include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)
diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index e69a002d5aa4..4ae3e459419e 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
memset(dst->vaddr + offset, value, len);
}
+#ifdef CONFIG_64BIT
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_) \
+ u64: val_ = readq(vaddr_iomem_),
+#else
+#define __iosys_map_u64_case(val_, vaddr_iomem_)
+#endif
+
+#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__, \
+ u8: val__ = readb(vaddr_iomem__), \
+ u16: val__ = readw(vaddr_iomem__), \
+ u32: val__ = readl(vaddr_iomem__), \
+ __iosys_map_rd_io_u64_case(val__, vaddr_iomem__) \
+ default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
+
/**
* iosys_map_rd - Read a C-type value from the iosys_map
*
@@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
* Returns:
* The value read from the mapping.
*/
-#define iosys_map_rd(map__, offset__, type__) ({ \
- type__ val; \
- iosys_map_memcpy_from(&val, map__, offset__, sizeof(val)); \
- val; \
+#define iosys_map_rd(map__, offset__, type__) ({ \
+ type__ val; \
+ if ((map__)->is_iomem) { \
+ __iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__);\
+ } else { \
+ memcpy(&val, (map__)->vaddr + offset__, sizeof(val)); \
+ } \
+ val; \
})
/**