On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > BTW, what is your take on my previous analysis of the current SSD
> > prefer write new cluster can wear out the SSD faster?
>
> No. I don't agree with you on that. However, my knowledge on SSD
> wearing out algorithm is quite limited.

Hi Ying,

Can you please clarify? You said you have limited knowledge of SSD
wear internals. Does that mean you have low confidence in your
verdict? I would like to understand your reasoning for the
disagreement, starting with which part of my analysis you disagree
with. At the same time, we can consult someone who works in the SSD
space to better understand SSD internal wear.

I see this as a serious issue for using SSDs as swap in data center
use cases. In your laptop use case, you are not running LLM training
24/7, right? So it still fits the usage model of an occasional swap
user, and it might not be as big a deal. In a data center workload,
e.g. Google's, swap is written 24/7, and the amount of data swapped
out is much higher than in typical laptop usage as well. There the
SSD wear-out issue is much worse, because the SSD is under constant
write with a much bigger swap footprint.

I am claiming that *some* SSDs have a higher internal write
amplification factor when doing random 4K writes all over the drive
than when doing random 4K writes to a small area of the drive. I do
believe having a swap-out policy option that controls preferring old
vs. new clusters is beneficial to the data center SSD swap use case.

It comes down to:

1) SSDs are slow to erase, so for performance most SSDs erase at a
   large erase-block size.

2) The SSD remaps logical block addresses to internal erase blocks.
   Newly written data, regardless of its logical block address on the
   drive, is grouped together and written into the current erase
   block.

3) When new data overwrites an old logical block address, the SSD
   firmware marks the overwritten data as obsolete. The discard
   command has a similar effect without introducing new data.

4) When the SSD runs out of fresh erase blocks, it needs to GC the
   old, fragmented erase blocks and partially write out the old data
   to make room for new erase blocks. This is where the discard
   command can be beneficial: it tells the SSD firmware which parts of
   the old data the GC process can simply ignore and skip rewriting.

GC of obsolete logical blocks is a generally hard problem for SSDs. I
am not claiming every SSD has this kind of behavior, but it is common
enough to be worth providing an option. (A rough simulation sketch of
this effect appears near the end of this mail.)

> > I think it might be useful to provide users an option to choose to
> > write a non full list first. The trade off is more friendly to SSD
> > wear out than preferring to write new blocks. If you keep doing the
> > swap long enough, there will be no new free cluster anyway.
>
> It depends on workloads. Some workloads may demonstrate better spatial
> locality.

Yes, I agree that it may or may not happen depending on the workload.
The random distribution of swap entries is a common pattern we need to
consider as well, and the odds are against us. As in the quoted email
where I did the calculation, the odds of getting a whole cluster free
in the random model are very low, about 4.4e-15, even if we are only
using 1/16 of the swap entries in the swapfile.
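For reference, here is the arithmetic behind that number as a tiny
stand-alone user space program (just an illustration, not kernel
code), assuming a 512-entry cluster and independently, uniformly used
swap entries:

/* Probability that all 512 entries of a cluster are free when each
 * swap entry is in use independently with probability 1/16.
 * Build with: gcc prob.c -o prob -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double p_entry_free = 15.0 / 16.0;		/* one entry free */
	double p_cluster_free = pow(p_entry_free, 512);	/* whole cluster  */

	printf("P(whole 512-entry cluster free) = %.1e\n", p_cluster_free);
	return 0;
}

It prints roughly 4.5e-15, i.e. the same ballpark as the 4.4e-15
quoted above; the small difference is just rounding.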
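To make points 1) to 4) above more concrete, below is a toy user space
model of the effect (greedy garbage collection over fixed-size erase
blocks). It is not based on any real firmware, and every constant in
it (256 erase blocks, 64 pages per block, ~6% over-provisioning, the
1/64 hot region) is made up for illustration; real drives differ. It
only shows why spreading random 4K writes over the whole logical space
can force GC to copy much more still-valid data than confining the
same writes to a small region:

/*
 * A rough, hypothetical FTL model (greedy GC over fixed erase blocks),
 * NOT real SSD firmware.  It compares the write amplification of
 * random 4K writes over the whole logical space vs. the same writes
 * confined to a small hot region.  Build: gcc -O2 waf_sketch.c
 */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS        256                    /* erase blocks          */
#define PAGES_PER_BLK  64                     /* 4K pages per block    */
#define PHYS_PAGES     (NBLOCKS * PAGES_PER_BLK)
#define LOGICAL_PAGES  (PHYS_PAGES / 16 * 15) /* ~6% over-provisioning */

static int map[LOGICAL_PAGES];  /* lpn -> ppn, -1 if never written    */
static int owner[PHYS_PAGES];   /* ppn -> lpn, -1 if obsolete         */
static int valid[NBLOCKS];      /* valid pages per erase block        */
static int fill[NBLOCKS];       /* pages written so far in each block */
static int open_blk = -1, free_blocks;
static long host_writes, flash_writes;

static int take_free_block(void)
{
	for (int b = 0; b < NBLOCKS; b++)
		if (b != open_blk && fill[b] == 0) {
			free_blocks--;
			return b;
		}
	abort();
}

static void write_page(int lpn, int host)
{
	if (map[lpn] >= 0) {			/* old copy is now obsolete */
		owner[map[lpn]] = -1;
		valid[map[lpn] / PAGES_PER_BLK]--;
	}
	if (open_blk < 0 || fill[open_blk] == PAGES_PER_BLK)
		open_blk = take_free_block();
	int ppn = open_blk * PAGES_PER_BLK + fill[open_blk]++;
	owner[ppn] = lpn;
	valid[open_blk]++;
	map[lpn] = ppn;
	flash_writes++;
	host_writes += host;
}

static void gc_one_block(void)
{
	int victim = -1;
	/* greedy: erase the full block with the fewest still-valid pages */
	for (int b = 0; b < NBLOCKS; b++) {
		if (b == open_blk || fill[b] < PAGES_PER_BLK)
			continue;
		if (victim < 0 || valid[b] < valid[victim])
			victim = b;
	}
	/* rewrite the still-valid pages elsewhere (internal writes), erase */
	for (int p = 0; p < PAGES_PER_BLK; p++) {
		int lpn = owner[victim * PAGES_PER_BLK + p];
		if (lpn >= 0)
			write_page(lpn, 0);
	}
	fill[victim] = valid[victim] = 0;
	free_blocks++;
}

static double run(int span)
{
	for (int i = 0; i < LOGICAL_PAGES; i++) map[i] = -1;
	for (int i = 0; i < PHYS_PAGES; i++) owner[i] = -1;
	for (int b = 0; b < NBLOCKS; b++) valid[b] = fill[b] = 0;
	open_blk = -1;
	free_blocks = NBLOCKS;
	/* fill the drive once so that old data sits everywhere */
	for (int i = 0; i < LOGICAL_PAGES; i++) write_page(i, 1);

	host_writes = flash_writes = 0;		/* measure steady state only */
	for (long i = 0; i < 2000000; i++) {
		while (free_blocks < 2)
			gc_one_block();
		write_page(rand() % span, 1);
	}
	return (double)flash_writes / host_writes;
}

int main(void)
{
	printf("WAF, random 4K writes over whole drive: %.2f\n",
	       run(LOGICAL_PAGES));
	printf("WAF, random 4K writes over 1/64 region: %.2f\n",
	       run(LOGICAL_PAGES / 64));
	return 0;
}

On this toy model the whole-drive case reports a write amplification
factor several times higher than the hot-region case. That is roughly
the behavior I expect *some* data center SSDs to show under constant
full-span random 4K swap writes.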
Chris

>
> > The example I give in this email:
> >
> > https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@xxxxxxxxxxxxxx/
> >
> > Chris
> >>
> >> > /*
> >> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >         ci = lock_cluster(si, offset);
> >> >         memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> >         ci->count = 0;
> >> > +       ci->order = 0;
> >> >         ci->flags = 0;
> >> >         free_cluster(si, ci);
> >> >         unlock_cluster(ci);
> >> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> >         INIT_LIST_HEAD(&p->free_clusters);
> >> >         INIT_LIST_HEAD(&p->discard_clusters);
> >> >
> >> > +       for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> > +               INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> > +
> >> >         for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> >                 unsigned int page_nr = swap_header->info.badpages[i];
> >> >                 if (page_nr == 0 || page_nr > swap_header->info.last_page)
> --
> Best Regards,
> Huang, Ying