Re: [RFC PATCH v3 0/7] DAMON based tiered memory management for CXL

Honggyu Kim <honggyu.kim@xxxxxx> · Mon, 8 Apr 2024 22:41:04 +0900

Hi Gregory,

On Fri, 5 Apr 2024 12:56:14 -0400 Gregory Price <gregory.price@xxxxxxxxxxxx> wrote:
> On Fri, Apr 05, 2024 at 03:08:49PM +0900, Honggyu Kim wrote:
> > There was an RFC IDEA "DAMOS-based Tiered-Memory Management" previously
> > posted at [1].
> > 
> >   1. YCSB zipfian distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+================================================+=========
> >                |       cold memory occupied by mmap and memset  |
> >                |   0G  440G  450G  460G  470G  480G  490G  500G |
> >   =============+================================================+=========
> >   Execution time normalized to DRAM-only values                 | GEOMEAN
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
> >   CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
> >   default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17 
> >   DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05 
> >   =============+================================================+=========
> >   CXL usage of redis-server in GB                               | AVERAGE
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
> >   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
> >   default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
> >   DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
> >   =============+================================================+=========
> > 
> > Each test result is based on the execution environment as follows.
> > 
> >   DRAM-only   : redis-server uses only local DRAM memory.
> >   CXL-only    : redis-server uses only CXL memory.
> >   default     : default memory policy(MPOL_DEFAULT).
> >                 numa balancing disabled.
> >   DAMON tiered: DAMON enabled with DAMOS_MIGRATE_COLD for DRAM nodes and
> >                 DAMOS_MIGRATE_HOT for CXL nodes.
> > 
> > The above result shows the "default" execution time goes up as the size
> > of cold memory is increased from 440G to 500G because the more cold
> > memory used, the more CXL memory is used for the target redis workload
> > and this makes the execution time increase.
> > 
> > However, "DAMON tiered" result shows less slowdown because the
> > DAMOS_MIGRATE_COLD action at DRAM node proactively demotes pre-allocated
> > cold memory to CXL node and this free space at DRAM increases more
> > chance to allocate hot or warm pages of redis-server to fast DRAM node.
> > Moreover, DAMOS_MIGRATE_HOT action at CXL node also promotes hot pages
> > of redis-server to DRAM node actively.
> > 
> > As a result, it makes more memory of redis-server stay in DRAM node
> > compared to "default" memory policy and this makes the performance
> > improvement.
> > 
> > The following result of latest distribution workload shows similar data.
> > 
> >   2. YCSB latest distribution read only workload
> >   memory pressure with cold memory on node0 with 512GB of local DRAM.
> >   =============+================================================+=========
> >                |       cold memory occupied by mmap and memset  |
> >                |   0G  440G  450G  460G  470G  480G  490G  500G |
> >   =============+================================================+=========
> >   Execution time normalized to DRAM-only values                 | GEOMEAN
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
> >   CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
> >   default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
> >   DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
> >   =============+================================================+=========
> >   CXL usage of redis-server in GB                               | AVERAGE
> >   -------------+------------------------------------------------+---------
> >   DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
> >   CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
> >   default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
> >   DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
> >   =============+================================================+=========
> > 
> > In summary of both results, our evaluation shows that "DAMON tiered"
> > memory management reduces the performance slowdown compared to the
> > "default" memory policy from 17~18% to 4~5% when the system runs with
> > high memory pressure on its fast tier DRAM nodes.
> > 
> > Having these DAMOS_MIGRATE_HOT and DAMOS_MIGRATE_COLD actions can make
> > tiered memory systems run more efficiently under high memory pressures.
> > 
> 
> Hi,
> 
> It's hard to determine from your results whether the performance
> mitigation is being caused primarily by MIGRATE_COLD freeing up space
> for new allocations, or from some combination of HOT/COLD actions
> occurring during execution but after the database has already been
> warmed up.

Thanks for the question.  I didn't include all the details of the
evaluation results, so this is a good chance to share more of them.

I would say the mitigation comes from both.  DAMOS_MIGRATE_COLD demotes
some cold data to CXL, so redis can allocate more of its data on the fast
DRAM at launch time, since the mmap+memset and the redis launch take
several minutes.  But DAMON also promotes some redis data while the
workload is running.

> Do you have test results which enable only DAMOS_MIGRATE_COLD actions
> but not DAMOS_MIGRATE_HOT actions? (and vice versa)
> 
> The question I have is exactly how often is MIGRATE_HOT actually being
> utilized, and how much data is being moved. Testing MIGRATE_COLD only
> would at least give a rough approximation of that.

To explain this, I had better share more test results.  In the
"Evaluation Workload" section, the test sequence can be summarized as
follows.

  *. "Turn on DAMON."
  1. Allocate cold memory(mmap+memset) at DRAM node, then make the
     process sleep.
  2. Launch redis-server and load prebaked snapshot image, dump.rdb.
     (85GB consumed: 52GB for anon and 33GB for file cache)
  3. Run YCSB to make zipfian distribution of memory accesses to
     redis-server, then measure execution time.
  4. Repeat step 3 over 50 times to measure the average execution time
     for each run.
  5. Increase the cold memory size, then go back to step 1.
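
Step 1 above can be sketched as follows.  This is a minimal Python
sketch, not the program used in the evaluation: the function name
allocate_cold_memory and its parameters are made up for illustration,
and binding the process to the DRAM node (for example with numactl) is
assumed to happen outside the script.

```python
import mmap
import time


def allocate_cold_memory(size_bytes, sleep_seconds=0.0):
    """Map anonymous memory, touch every page, then sleep.

    Hypothetical helper: it mimics "mmap + memset, then make the
    process sleep" so the pages are backed by memory but never
    accessed again.
    """
    # Private anonymous mapping, like mmap(MAP_PRIVATE | MAP_ANONYMOUS) in C.
    region = mmap.mmap(-1, size_bytes,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    # Stand-in for memset: one write per page is enough to fault it in.
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1
    # Stay alive without touching the memory so it becomes cold.
    time.sleep(sleep_seconds)
    return region
```

In the evaluation the size would be 440G to 500G and the sleep would
last for the whole test; the defaults here are only placeholders.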

I didn't want to make the evaluation too long in the cover letter, but
I have also evaluated another scenario, which lazily enables DAMON just
before the YCSB run at step 4.  I will call this test "DAMON lazy".
This part is missing from the cover letter.

  1. Allocate cold memory(mmap+memset) at DRAM node, then make the
     process sleep.
  2. Launch redis-server and load prebaked snapshot image, dump.rdb.
     (85GB consumed: 52GB for anon and 33GB for file cache)
  *. "Turn on DAMON."
  4. Run YCSB to make zipfian distribution of memory accesses to
     redis-server, then measure execution time.
  5. Repeat step 4 over 50 times to measure the average execution time
     for each run.
  6. Increase the cold memory size, then go back to step 1.

In the "DAMON lazy" scenario, DAMON starts monitoring late, so the
initial redis-server placement is the same as "default", but it starts
to demote cold data and promote redis data just before the YCSB run.

The full test result is as follows.

  1. YCSB zipfian distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.22     -     -     -     -     -     -     - | 1.22
  default      |    -  1.12  1.13  1.14  1.16  1.19  1.21  1.21 | 1.17
  DAMON tiered |    -  1.04  1.03  1.04  1.06  1.05  1.05  1.05 | 1.05
  DAMON lazy   |    -  1.04  1.05  1.05  1.06  1.06  1.07  1.07 | 1.06
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.4  27.0  33.1  39.5  45.6  50.5  50.3 | 38.1
  DAMON tiered |    -   0.1   0.3   0.8   0.6   0.7   1.3   0.9 |  0.7
  DAMON lazy   |    -   2.9   3.1   3.7   4.7   6.6   8.2   9.7 |  5.6
  =============+================================================+=========
  Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and      |
  DAMOS_MIGRATE_HOT(promotion)                                  | AVERAGE
  -------------+------------------------------------------------+---------
  DAMON tiered |                                                |
  - demotion   |    -   522   510   523   520   513   558   558 |  529
  - promotion  |    -   0.1   1.3   6.2   8.1   7.2    22    17 |  8.8
  DAMON lazy   |                                                |
  - demotion   |    -   288   277   322   343   315   312   320 |  311
  - promotion  |    -    33    44    41    55    73    89   101 |   62
  =============+================================================+=========

I have included the "DAMON lazy" result and also the migration size by
the new DAMOS migrate actions.  Please note that the demotion size is
far larger than the promotion size because promotion targets only redis
data, while demotion targets also include the huge cold memory
allocated by mmap + memset.  (There could be some ping-pong issue,
though.)

As you mentioned, the "DAMON tiered" case benefits because more new
redis allocations go to DRAM than with "default", but it also benefits
from promotion when it is under higher memory pressure, as shown in the
490G and 500G cases, where it promotes 22GB and 17GB of redis data to
DRAM from CXL.

In the case of "DAMON lazy", it shows a larger promotion size, as
expected, and the size increases as memory pressure goes up from left
to right.
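
The GEOMEAN column of the zipfian execution time table can be
re-derived from the per-size values.  The sketch below is only a sanity
check of that arithmetic; the variable names are made up for
illustration and the numbers are copied from the table above.

```python
from statistics import geometric_mean

# Normalized execution times for 440G..500G of cold memory (zipfian).
zipfian = {
    "default":      [1.12, 1.13, 1.14, 1.16, 1.19, 1.21, 1.21],
    "DAMON tiered": [1.04, 1.03, 1.04, 1.06, 1.05, 1.05, 1.05],
    "DAMON lazy":   [1.04, 1.05, 1.05, 1.06, 1.06, 1.07, 1.07],
}

# Geometric mean per policy, rounded as in the table.
geomean = {name: round(geometric_mean(vals), 2)
           for name, vals in zipfian.items()}

# Slowdown over DRAM-only, in percent.
slowdown_pct = {name: round((geometric_mean(vals) - 1.0) * 100)
                for name, vals in zipfian.items()}

print(geomean)
print(slowdown_pct)
```

This gives 1.17, 1.05 and 1.06, i.e. slowdowns of about 17%, 5% and 6%
over DRAM-only.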

I will share the "latest" workload result as well; it shows a similar
tendency.

  2. YCSB latest distribution read only workload
  memory pressure with cold memory on node0 with 512GB of local DRAM.
  =============+================================================+=========
               |       cold memory occupied by mmap and memset  |
               |   0G  440G  450G  460G  470G  480G  490G  500G |
  =============+================================================+=========
  Execution time normalized to DRAM-only values                 | GEOMEAN
  -------------+------------------------------------------------+---------
  DRAM-only    | 1.00     -     -     -     -     -     -     - | 1.00
  CXL-only     | 1.18     -     -     -     -     -     -     - | 1.18
  default      |    -  1.18  1.19  1.18  1.18  1.17  1.19  1.18 | 1.18 
  DAMON tiered |    -  1.04  1.04  1.04  1.05  1.04  1.05  1.05 | 1.04 
  DAMON lazy   |    -  1.05  1.05  1.06  1.06  1.07  1.06  1.07 | 1.06
  =============+================================================+=========
  CXL usage of redis-server in GB                               | AVERAGE
  -------------+------------------------------------------------+---------
  DRAM-only    |  0.0     -     -     -     -     -     -     - |  0.0
  CXL-only     | 52.6     -     -     -     -     -     -     - | 52.6
  default      |    -  20.5  27.1  33.2  39.5  45.5  50.4  50.5 | 38.1
  DAMON tiered |    -   0.2   0.4   0.7   1.6   1.2   1.1   3.4 |  1.2
  DAMON lazy   |    -   5.3   4.1   3.9   6.4   8.8  10.1  11.3 |  7.1
  =============+================================================+=========
  Migration size in GB by DAMOS_MIGRATE_COLD(demotion) and      |
  DAMOS_MIGRATE_HOT(promotion)                                  | AVERAGE
  -------------+------------------------------------------------+---------
  DAMON tiered |                                                |
  - demotion   |    -   493   478   487   516   510   540   512 |  505
  - promotion  |    -   0.1   0.2   8.2   5.6   4.0   5.9    29 |  7.5
  DAMON lazy   |                                                |
  - demotion   |    -   315   318   293   290   308   322   286 |  305
  - promotion  |    -    36    45    38    56    74    91    99 |   63
  =============+================================================+=========
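
To put the CXL usage rows in perspective, they can be read as a share
of the redis-server memory.  The sketch below assumes that the 52.6GB
of the "CXL-only" row is the whole footprint being measured; that is my
reading of the table, and the names are made up for illustration.  It
uses the 500G column of the "latest" workload.

```python
# Assumption: the CXL-only row (52.6GB) is the total measured footprint.
REDIS_TOTAL_GB = 52.6


def cxl_share_pct(cxl_gb, total_gb=REDIS_TOTAL_GB):
    """Percentage of the redis-server footprint that sits on CXL."""
    return 100.0 * cxl_gb / total_gb


# CXL usage in GB at 500G of cold memory ("latest" workload).
latest_500g = {"default": 50.5, "DAMON tiered": 3.4, "DAMON lazy": 11.3}

shares = {name: round(cxl_share_pct(gb), 1)
          for name, gb in latest_500g.items()}
print(shares)
```

Under that assumption, about 96% of redis-server is on CXL with
"default", about 6.5% with "DAMON tiered" and about 21.5% with "DAMON
lazy".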

> Additionally, do you have any data on workloads that exceed the capacity
> of the DRAM tier?  Here you say you have 512GB of local DRAM, but only
> test a workload that caps out at 500G.  Have you run a test of, say,
> 550GB to see the effect of DAMON HOT/COLD migration actions when DRAM
> capacity is exceeded?

I didn't want to remove DRAM from my server, so I kept using 512GB of
DRAM, but I couldn't make a single workload that consumes more than the
DRAM size.

I wanted to use a more realistic workload rather than micro benchmarks.
The core concept of this test is to cover realistic scenarios with a
system-wide view.  I think if a system has 512GB of local DRAM, it
wouldn't be possible to make the entire 512GB of DRAM hot, and it would
have some amount of cold memory, which can be the target of demotion.
Then we can find some workload that is actively used and promote it as
much as possible.  That's why I made the promotion policy aggressive.

> Can you also provide the DRAM-only results for each test?  Presumably,
> as workload size increases from 440G to 500G, the system probably starts
> using some amount of swap/zswap/whatever.  It would be good to know how
> this system compares to swap small amounts of overflow.

It looks like my explanation wasn't clear enough.  The sizes from 440GB
to 500GB are for the pre-allocated cold data that puts memory pressure
on the system, so that redis-server cannot be fully allocated in fast
DRAM and is partially allocated in CXL memory as well.

And my evaluation environment doesn't have swap space, so that it
focuses on migration rather than swap.
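
As a back-of-envelope illustration of that memory pressure, the sketch
below estimates how much of the 85GB redis-server footprint cannot fit
in DRAM once the cold data is in place.  It is an idealized estimate
only: it ignores the kernel and every other process, so it should not
be expected to match the measured CXL usage in the tables, and the
function name is made up for illustration.

```python
DRAM_GB = 512        # local DRAM on node0
REDIS_GB = 52 + 33   # anon + file cache after loading dump.rdb


def naive_spill_gb(cold_gb):
    """Idealized amount of redis-server memory that overflows DRAM."""
    headroom = DRAM_GB - cold_gb           # DRAM left after the cold data
    return max(0, REDIS_GB - headroom)     # whatever does not fit


# One estimate per cold memory size used in the tables.
spill = {cold: naive_spill_gb(cold) for cold in range(440, 501, 10)}
print(spill)
```

Even in this idealized view, the overflow grows from 13GB at 440G to
73GB at 500G, which is the pressure the cold data is meant to create.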

> 
> ~Gregory

I hope this explanation helps.  Please let me know if you have more
questions.

Thanks,
Honggyu