Chris Li <chrisl@xxxxxxxxxx> writes:

> On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > BTW, what is your take on my previous analysis that the current
>> > prefer-new-cluster write policy can wear out the SSD faster?
>>
>> No.  I don't agree with you on that.  However, my knowledge of SSD
>> wear-out internals is quite limited.
>
> Hi Ying,
>
> Can you please clarify?  You said you have limited knowledge of SSD
> wear internals.  Does that mean you have low confidence in your
> verdict?

Yes.

> I would like to understand your reasoning for the disagreement,
> starting from which part of my analysis you disagree with.
>
> At the same time, we can consult someone who works in the SSD space
> and understands SSD internal wear better.

I think that is a good idea.

> I see this as a serious issue for using SSD as swap in data center
> use cases.  In your laptop use case, you are not running LLM training
> 24/7, right?  So it still fits the usage model of the occasional swap
> file user, and it might not be as big a deal.  In a data center
> workload, e.g. Google's, swap is written 24/7, and the amount of data
> swapped out is much higher than typical laptop usage as well.  There
> the SSD wear-out issue is much worse because the SSD is under
> constant write and much bigger swap usage.
>
> I am claiming that *some* SSDs have a higher internal write
> amplification factor when doing random 4K writes all over the drive
> than when doing random 4K writes to a small area of the drive.
>
> I do believe that a swap-out policy option controlling the preference
> for old vs. new clusters is beneficial to the data center SSD swap
> use case.
>
> It comes down to:
> 1) SSDs are slow to erase, so for performance most SSDs erase at a
> huge erase-block size.
> 2) The SSD remaps logical block addresses to internal erase blocks.
> Newly written data, regardless of its logical block address, is
> grouped together and written to the current erase block.
> 3) When new data overwrites an old logical address, the SSD firmware
> marks the overwritten data as obsolete.  The discard command has a
> similar effect without introducing new data.
> 4) When the drive runs out of fresh erase blocks, it needs to GC the
> old fragmented erase blocks, partially rewriting the still-live old
> data to make room for new erase blocks.  This is where the discard
> command can be beneficial: it tells the SSD firmware which parts of
> the old data the GC process can simply ignore and skip rewriting.
>
> GC of obsolete logical blocks is a generally hard problem for SSDs.
>
> I am not claiming that every SSD has this kind of behavior, but it is
> common enough to be worth providing an option.
>
>> > I think it might be useful to provide users an option to choose to
>> > write to the non-full list first.  The trade-off is friendlier to
>> > SSD wear than preferring to write new blocks.  If you keep swapping
>> > long enough, there will be no new free cluster anyway.
>>
>> It depends on workloads.  Some workloads may demonstrate better
>> spatial locality.
>
> Yes, I agree that it may or may not happen depending on the workload.
> Randomly distributed swap entries are a common pattern we need to
> consider as well, and the odds are against us.  As in the quoted
> email where I did the calculation, the odds of getting a whole
> cluster free in the random model are very low, about 4.4e-15, even if
> we are only using 1/16 of the swap entries in the swapfile.
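For reference, that figure is consistent with a minimal sketch of the
model, assuming a 512-entry cluster (2MB of 4KB pages) and each entry
independently in use with probability 1/16: (15/16)^512 is roughly
4.5e-15.  A small user-space check (the 512-entry cluster size and the
independence assumption are mine, not taken from your calculation):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
            /*
             * Assumed model: each of the 512 entries in a cluster is
             * independently in use with probability 1/16, so the whole
             * cluster is free with probability (15/16)^512.
             */
            double p_cluster_free = pow(15.0 / 16.0, 512);

            printf("P(whole 512-entry cluster free) = %.2e\n",
                   p_cluster_free);
            return 0;
    }

Built with gcc and -lm, this prints roughly 4.5e-15, in line with the
figure above.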
Do you have real workloads?  For example, some trace?

--
Best Regards,
Huang, Ying