Re: [RFC] libdrm_intel: Rework BO allocs to avoid rounding up to bucket size

On 29/08/2014 11:16, Chris Wilson wrote:
On Fri, Aug 29, 2014 at 11:02:01AM +0100, Arun Siluvery wrote:
From: Garry Lancaster <garry.lancaster@xxxxxxxxx>

libdrm includes a scheme where freed buffer objects (BOs)
are held in a cache. This allows incoming allocation requests to be
serviced by re-using an old BO, instead of requiring a new
object to be allocated. This is a performance enhancement.
The cache is divided into "buckets". Each bucket holds unused
BOs of a pre-determined size. When a BO allocation request is seen,
the bucket for BOs of this size or larger is selected. Any BO
currently in the bucket will be re-used for the allocation. If the
bucket was empty, a new BO is created. However, the BO is created
with the size determined by the selected bucket (i.e. the size is
rounded up to the bucket size), rather than being created with the
originally requested size. This is so that when the BO is freed,
it can be released into the bucket and re-used by any other allocation
which selects the same bucket.
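
For reference, a simplified sketch of the pre-patch behaviour (names abbreviated from drm_intel_gem_bo_bucket_for_size() and drm_intel_gem_bo_alloc_internal() in intel_bufmgr_gem.c; an illustration, not the exact code):

struct bo_bucket {
	unsigned long size;	/* nominal bucket size; all cached BOs match it */
	/* ... list head for cached (free) BOs ... */
};

/* Pick the first bucket whose nominal size covers the request. */
static struct bo_bucket *
bucket_for_size(struct bo_bucket *buckets, int num_buckets,
		unsigned long size)
{
	int i;

	for (i = 0; i < num_buckets; i++)
		if (buckets[i].size >= size)
			return &buckets[i];
	return NULL;	/* request larger than the largest bucket */
}

/* Pre-patch: round the allocation up to the bucket's nominal size so
 * the BO can later be cached in, and reused from, that bucket.  A
 * request just over 132K therefore becomes a 160K allocation. */
static unsigned long
alloc_size_for_request(struct bo_bucket *buckets, int num_buckets,
		       unsigned long size)
{
	struct bo_bucket *bucket = bucket_for_size(buckets, num_buckets, size);

	return bucket ? bucket->size : size;
}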

Depending upon the size of the allocation, this rounding up can
result in a significant wastage of memory when allocating a BO. For
example, a BO request just over 132K allocated during GLES context
creation was rounded up to the next bucket size of 160K. Such wastage
can be critical on devices with low memory.

This commit reworks the BO allocation code. On a BO allocation request,
if the selected bucket contains any BOs, each is checked to see
whether it is large enough to fulfill the allocation request. If none is,
a new BO is created, but (due to the new check) it is no longer
necessary to round up its size to match the size determined by the
selected bucket.

So, previously, all of the BOs in a bucket were the same size. Now the
BOs in a bucket can have different sizes, ranging from just above the
nominal size of the next smaller bucket up to the nominal size of the
current bucket.
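
As a rough sketch, the reworked lookup might look like this (illustrative names, not the literal patch):

struct cached_bo {
	unsigned long size;	/* actual size; may be below the bucket's nominal size */
	struct cached_bo *next;
};

/* Scan the selected bucket for any cached BO big enough to satisfy
 * the request; BO sizes within a bucket now vary. */
static struct cached_bo *
find_reusable_bo(struct cached_bo *head, unsigned long size)
{
	struct cached_bo *bo;

	for (bo = head; bo != NULL; bo = bo->next)
		if (bo->size >= size)
			return bo;
	return NULL;	/* miss: the caller allocates a new BO at the
			 * exact (page-aligned) requested size, with no
			 * rounding up to the bucket size */
}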

On a 1GB system, the following reductions in BO memory usage were seen:

BaseMark X 1.0:                324.4MB -> 306.0MB (-18.4MB;  5.7% saving)
BaseMark X 1.1 Medium Quality: 206.9MB -> 201.2MB (- 5.7MB;  2.8% saving)
GFXBench 3.0 TRex:             216.6MB -> 200.0MB (-16.6MB;  7.7% saving)
GFXBench 3.0 Manhattan:        281.4MB -> 246.8MB (-34.6MB; 12.3% saving)

No performance change was seen on BaseMark X. GFXBench 3.0 showed small
performance increases (~0.5fps on Manhattan, ~1-2fps on TRex) which may be
due to reduced activity of the OOM killer.

The principle behind rounding up was to increase the cache hit rate and
thereby reduce allocations. It might be interesting to know whether the
number of BOs allocated also changes. If not, the argument is that the
working set is pretty stable and has a natural set of sizes which it
reuses. A counter-example might then be UXA, glamor, or compositors, which,
off the top of my head, would have more variable object sizes.

Reducing the impact of thrashing should itself be measurable, and a
useful statistic to track.

As a corollary to exact allocations, you can then reduce the number of
buckets again (the number was increased to allow finer-grained
allocations). Again, it is hard to judge whether handing back larger
objects will lead to memory wastage. So yet another statistic to track
is requested versus allocated memory sizes.
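
A minimal sketch of what such bookkeeping could look like (these counters are hypothetical; libdrm does not currently collect them):

struct bo_alloc_stats {
	unsigned long num_requests;		/* all allocation requests   */
	unsigned long num_new_bos;		/* cache misses: fresh BOs   */
	unsigned long long bytes_requested;	/* sum of requested sizes    */
	unsigned long long bytes_allocated;	/* sum of actual BO sizes    */
};

static void
record_alloc(struct bo_alloc_stats *s, unsigned long requested,
	     unsigned long allocated, int cache_hit)
{
	s->num_requests++;
	if (!cache_hit)
		s->num_new_bos++;
	s->bytes_requested += requested;
	s->bytes_allocated += allocated;
	/* bytes_allocated / bytes_requested gives the wastage ratio;
	 * num_new_bos / num_requests gives the cache miss rate. */
}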

Reducing the number of buckets would lead to more memory wastage, right?

The current bucket sizes are:
Bucket[0]: 4K
Bucket[1]: 8K
Bucket[2]: 12K
Bucket[3]: 16K
Bucket[4]: 20K
Bucket[5]: 24K
Bucket[6]: 28K
Bucket[7]: 32K
Bucket[8]: 40K
Bucket[9]: 48K
Bucket[10]: 56K
Bucket[11]: 64K
Bucket[12]: 80K
Bucket[13]: 96K
Bucket[14]: 112K
Bucket[15]: 128K
Bucket[16]: 160K
Bucket[17]: 192K
Bucket[18]: 224K
Bucket[19]: 256K
...
...
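
For reference, this sequence matches the scheme in init_cache_buckets() in intel_bufmgr_gem.c: three page-multiple buckets, then four buckets per power of two (condensed here; the bufmgr argument to add_bucket() is dropped for brevity):

static void
init_cache_buckets(void)
{
	unsigned long size, cache_max_size = 64 * 1024 * 1024;

	add_bucket(4096);	/* Bucket[0]:  4K */
	add_bucket(4096 * 2);	/* Bucket[1]:  8K */
	add_bucket(4096 * 3);	/* Bucket[2]: 12K */

	/* 16K, 20K, 24K, 28K, 32K, 40K, ... up to 64MB */
	for (size = 4 * 4096; size <= cache_max_size; size *= 2) {
		add_bucket(size);
		add_bucket(size + size * 1 / 4);
		add_bucket(size + size * 2 / 4);
		add_bucket(size + size * 3 / 4);
	}
}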

If there are many objects with size 132K, we would end up allocating 160K for each of them. We can track requested vs. allocated sizes, but that depends on the application and usage. What would be the best measure to track this? I mean, do we measure over a given time interval, or by some other criteria?

Also it is important to state what type of system you are measuring the
impact of allocations for -- the behaviour of a cache miss is
dramatically different between LLC and non-LLC systems.

The current data is from a non-LLC system.

regards
Arun

-Chris





