On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > BTW, what is your take on my previous analysis of the current SSD
> > prefer write new cluster can wear out the SSD faster?
>
> No. I don't agree with you on that. However, my knowledge on SSD
> wearing out algorithm is quite limited.

Hi Ying,

Can you please clarify? You said you have limited knowledge of SSD
wear internals. Does that mean you have low confidence in your
verdict? I would like to understand your reasoning for the
disagreement, starting with which part of my analysis you disagree
with. At the same time, we can consult someone who works in the SSD
space to better understand SSD internal wear.

I see this as a serious issue for using SSDs as swap in data center
use cases. In your laptop use case, you are not running LLM training
24/7, right? So it still fits the usage model of an occasional swap
user, and it might not be as big a deal. In a data center workload,
e.g. Google's, swap is written 24/7, and the amount of data swapped
out is much higher than in typical laptop usage as well. There the
SSD wear-out issue is much worse, because the SSD is under constant
write with a much bigger swap footprint.

I am claiming that *some* SSDs have a higher internal write
amplification factor when doing random 4K writes all over the drive
than when doing random 4K writes to a small area of the drive. I do
believe having a swap-out policy option that controls preferring old
vs. new clusters is beneficial to the data center SSD swap use case.

It comes down to:

1) SSDs are slow to erase, so for performance most SSDs erase at a
   large erase-block size.

2) The SSD remaps logical block addresses to internal erase blocks.
   Newly written data, regardless of its logical block address on the
   drive, is grouped together and written into the current erase
   block.

3) When new data overwrites an old logical block address, the SSD
   firmware marks the overwritten data as obsolete. The discard
   command has a similar effect without introducing new data.

4) When the SSD runs out of fresh erase blocks, it needs to GC the
   old, fragmented erase blocks and partially write out the old data
   to make room for new erase blocks. This is where the discard
   command can be beneficial: it tells the SSD firmware which parts of
   the old data the GC process can simply ignore and skip rewriting.

GC of obsolete logical blocks is a generally hard problem for SSDs. I
am not claiming every SSD has this kind of behavior, but it is common
enough to be worth providing an option. (A rough simulation sketch of
this effect appears near the end of this mail.)

> > I think it might be useful to provide users an option to choose to
> > write a non full list first. The trade off is more friendly to SSD
> > wear out than preferring to write new blocks. If you keep doing the
> > swap long enough, there will be no new free cluster anyway.
>
> It depends on workloads. Some workloads may demonstrate better spatial
> locality.

Yes, I agree that it may or may not happen depending on the workload.
The random distribution of swap entries is a common pattern we need to
consider as well, and the odds are against us. As in the quoted email
where I did the calculation, the odds of getting a whole cluster free
in the random model are very low, about 4.4e-15, even if we are only
using 1/16 of the swap entries in the swapfile.
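For reference, here is the arithmetic behind that number as a tiny
stand-alone user space program (just an illustration, not kernel
code), assuming a 512-entry cluster and independently, uniformly used
swap entries:

/* Probability that all 512 entries of a cluster are free when each
 * swap entry is in use independently with probability 1/16.
 * Build with: gcc prob.c -o prob -lm
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double p_entry_free = 15.0 / 16.0;		/* one entry free */
	double p_cluster_free = pow(p_entry_free, 512);	/* whole cluster  */

	printf("P(whole 512-entry cluster free) = %.1e\n", p_cluster_free);
	return 0;
}

It prints roughly 4.5e-15, i.e. the same ballpark as the 4.4e-15
quoted above; the small difference is just rounding.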
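To make points 1) to 4) above more concrete, below is a toy user space
model of the effect (greedy garbage collection over fixed-size erase
blocks). It is not based on any real firmware, and every constant in
it (256 erase blocks, 64 pages per block, ~6% over-provisioning, the
1/64 hot region) is made up for illustration; real drives differ. It
only shows why spreading random 4K writes over the whole logical space
can force GC to copy much more still-valid data than confining the
same writes to a small region:

/*
 * A rough, hypothetical FTL model (greedy GC over fixed erase blocks),
 * NOT real SSD firmware.  It compares the write amplification of
 * random 4K writes over the whole logical space vs. the same writes
 * confined to a small hot region.  Build: gcc -O2 waf_sketch.c
 */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS        256                    /* erase blocks          */
#define PAGES_PER_BLK  64                     /* 4K pages per block    */
#define PHYS_PAGES     (NBLOCKS * PAGES_PER_BLK)
#define LOGICAL_PAGES  (PHYS_PAGES / 16 * 15) /* ~6% over-provisioning */

static int map[LOGICAL_PAGES];  /* lpn -> ppn, -1 if never written    */
static int owner[PHYS_PAGES];   /* ppn -> lpn, -1 if obsolete         */
static int valid[NBLOCKS];      /* valid pages per erase block        */
static int fill[NBLOCKS];       /* pages written so far in each block */
static int open_blk = -1, free_blocks;
static long host_writes, flash_writes;

static int take_free_block(void)
{
	for (int b = 0; b < NBLOCKS; b++)
		if (b != open_blk && fill[b] == 0) {
			free_blocks--;
			return b;
		}
	abort();
}

static void write_page(int lpn, int host)
{
	if (map[lpn] >= 0) {			/* old copy is now obsolete */
		owner[map[lpn]] = -1;
		valid[map[lpn] / PAGES_PER_BLK]--;
	}
	if (open_blk < 0 || fill[open_blk] == PAGES_PER_BLK)
		open_blk = take_free_block();
	int ppn = open_blk * PAGES_PER_BLK + fill[open_blk]++;
	owner[ppn] = lpn;
	valid[open_blk]++;
	map[lpn] = ppn;
	flash_writes++;
	host_writes += host;
}

static void gc_one_block(void)
{
	int victim = -1;
	/* greedy: erase the full block with the fewest still-valid pages */
	for (int b = 0; b < NBLOCKS; b++) {
		if (b == open_blk || fill[b] < PAGES_PER_BLK)
			continue;
		if (victim < 0 || valid[b] < valid[victim])
			victim = b;
	}
	/* rewrite the still-valid pages elsewhere (internal writes), erase */
	for (int p = 0; p < PAGES_PER_BLK; p++) {
		int lpn = owner[victim * PAGES_PER_BLK + p];
		if (lpn >= 0)
			write_page(lpn, 0);
	}
	fill[victim] = valid[victim] = 0;
	free_blocks++;
}

static double run(int span)
{
	for (int i = 0; i < LOGICAL_PAGES; i++) map[i] = -1;
	for (int i = 0; i < PHYS_PAGES; i++) owner[i] = -1;
	for (int b = 0; b < NBLOCKS; b++) valid[b] = fill[b] = 0;
	open_blk = -1;
	free_blocks = NBLOCKS;
	/* fill the drive once so that old data sits everywhere */
	for (int i = 0; i < LOGICAL_PAGES; i++) write_page(i, 1);

	host_writes = flash_writes = 0;		/* measure steady state only */
	for (long i = 0; i < 2000000; i++) {
		while (free_blocks < 2)
			gc_one_block();
		write_page(rand() % span, 1);
	}
	return (double)flash_writes / host_writes;
}

int main(void)
{
	printf("WAF, random 4K writes over whole drive: %.2f\n",
	       run(LOGICAL_PAGES));
	printf("WAF, random 4K writes over 1/64 region: %.2f\n",
	       run(LOGICAL_PAGES / 64));
	return 0;
}

On this toy model the whole-drive case reports a write amplification
factor several times higher than the hot-region case. That is roughly
the behavior I expect *some* data center SSDs to show under constant
full-span random 4K swap writes.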
Chris

>
> > The example I give in this email:
> >
> > https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@xxxxxxxxxxxxxx/
> >
> > Chris
> >>
> >> > /*
> >> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> >         ci = lock_cluster(si, offset);
> >> >         memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> >         ci->count = 0;
> >> > +       ci->order = 0;
> >> >         ci->flags = 0;
> >> >         free_cluster(si, ci);
> >> >         unlock_cluster(ci);
> >> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> >         INIT_LIST_HEAD(&p->free_clusters);
> >> >         INIT_LIST_HEAD(&p->discard_clusters);
> >> >
> >> > +       for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> > +               INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> > +
> >> >         for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> >                 unsigned int page_nr = swap_header->info.badpages[i];
> >> >                 if (page_nr == 0 || page_nr > swap_header->info.last_page)
> --
> Best Regards,
> Huang, Ying