On Tue, May 07, 2019 at 05:31:54PM +0200, Oleg Nesterov wrote: > I am not going to comment the intent, but to be honest I am skeptical too. The general sentiment has been that this is a really bad idea, but I'm just a frustrated Android user who wants his phone to not require mountains of zRAM only to still manage memory poorly. Until I can go out and buy a non-Pixel phone that uses PSI to make these decisions (and does a good job of it), I'm going to stick to my hacky little driver for my personal devices. Many others who like to hack their Android devices to make them last longer will probably find value in this as well, since there are millions of people who use devices that'll never seen any of PSI ported to their ancient 3.x kernels. And yes, I know this would never be accepted to upstream in a million years. I mostly wanted some code review and guidance, since mm code is pretty tricky :) > On 05/06, Sultan Alsawaf wrote: > > > > +static unsigned long find_victims(struct victim_info *varr, int *vindex, > > + int vmaxlen, int min_adj, int max_adj) > > +{ > > + unsigned long pages_found = 0; > > + int old_vindex = *vindex; > > + struct task_struct *tsk; > > + > > + for_each_process(tsk) { > > + struct task_struct *vtsk; > > + unsigned long tasksize; > > + short oom_score_adj; > > + > > + /* Make sure there's space left in the victim array */ > > + if (*vindex == vmaxlen) > > + break; > > + > > + /* Don't kill current, kthreads, init, or duplicates */ > > + if (same_thread_group(tsk, current) || > > + tsk->flags & PF_KTHREAD || > > + is_global_init(tsk) || > > + vtsk_is_duplicate(varr, *vindex, tsk)) > > + continue; > > + > > + vtsk = find_lock_task_mm(tsk); > > Did you test this patch with lockdep enabled? > > If I read the patch correctly, lockdep should complain. vtsk_is_duplicate() > ensures that we do not take the same ->alloc_lock twice or more, but lockdep > can't know this. Yeah, lockdep is fine with this, at least on 4.4. > > +static void scan_and_kill(unsigned long pages_needed) > > +{ > > + static DECLARE_WAIT_QUEUE_HEAD(victim_waitq); > > + struct victim_info victims[MAX_VICTIMS]; > > + int i, nr_to_kill = 0, nr_victims = 0; > > + unsigned long pages_found = 0; > > + atomic_t victim_count; > > + > > + /* > > + * Hold the tasklist lock so tasks don't disappear while scanning. This > > + * is preferred to holding an RCU read lock so that the list of tasks > > + * is guaranteed to be up to date. Keep preemption disabled until the > > + * SIGKILLs are sent so the victim kill process isn't interrupted. > > + */ > > + read_lock(&tasklist_lock); > > + preempt_disable(); > > read_lock() disables preemption, every task_lock() too, so this looks > unnecessary. Good point. > > + for (i = 1; i < ARRAY_SIZE(adj_prio); i++) { > > + pages_found += find_victims(victims, &nr_victims, MAX_VICTIMS, > > + adj_prio[i], adj_prio[i - 1]); > > + if (pages_found >= pages_needed || nr_victims == MAX_VICTIMS) > > + break; > > + } > > + > > + /* > > + * Calculate the number of tasks that need to be killed and quickly > > + * release the references to those that'll live. > > + */ > > + for (i = 0, pages_found = 0; i < nr_victims; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + /* The victims' mm lock is taken in find_victims; release it */ > > + if (pages_found >= pages_needed) { > > + task_unlock(vtsk); > > + continue; > > + } > > + > > + /* > > + * Grab a reference to the victim so it doesn't disappear after > > + * the tasklist lock is released. > > + */ > > + get_task_struct(vtsk); > > The comment doesn't look correct. the victim can't dissapear until task_unlock() > below, it can't pass exit_mm(). I was always unsure about this and decided to hold a reference to the task_struct to be safe. Thanks for clearing that up. > > + pages_found += victim->size; > > + nr_to_kill++; > > + } > > + read_unlock(&tasklist_lock); > > + > > + /* Kill the victims */ > > + victim_count = (atomic_t)ATOMIC_INIT(nr_to_kill); > > + for (i = 0; i < nr_to_kill; i++) { > > + struct victim_info *victim = &victims[i]; > > + struct task_struct *vtsk = victim->tsk; > > + > > + pr_info("Killing %s with adj %d to free %lu kiB\n", vtsk->comm, > > + vtsk->signal->oom_score_adj, > > + victim->size << (PAGE_SHIFT - 10)); > > + > > + /* Configure the victim's mm to notify us when it's freed */ > > + vtsk->mm->slmk_waitq = &victim_waitq; > > + vtsk->mm->slmk_counter = &victim_count; > > + > > + /* Accelerate the victim's death by forcing the kill signal */ > > + do_send_sig_info(SIGKILL, SIG_INFO_TYPE, vtsk, true); > ^^^^ > this should be PIDTYPE_TGID Thanks, I didn't realize the last argument to do_send_sig_info changed in newer kernels. The compiler didn't complain, so it went over my head. > > + > > + /* Finally release the victim's mm lock */ > > + task_unlock(vtsk); > > + } > > + preempt_enable_no_resched(); > > See above. And I don't understand how can _no_resched() really help... Yeah, good point. > > + > > + /* Try to speed up the death process now that we can schedule again */ > > + for (i = 0; i < nr_to_kill; i++) { > > + struct task_struct *vtsk = victims[i].tsk; > > + > > + /* Increase the victim's priority to make it die faster */ > > + set_user_nice(vtsk, MIN_NICE); > > + > > + /* Allow the victim to run on any CPU */ > > + set_cpus_allowed_ptr(vtsk, cpu_all_mask); > > + > > + /* Finally release the victim reference acquired earlier */ > > + put_task_struct(vtsk); > > + } > > + > > + /* Wait until all the victims die */ > > + wait_event(victim_waitq, !atomic_read(&victim_count)); > > Can't we avoid the new slmk_waitq/slmk_counter members in mm_struct? > > I mean, can't we export victim_waitq and victim_count and, say, set/test > MMF_OOM_VICTIM. In fact I think you should try to re-use mark_oom_victim() > at least. This makes the patch less portable across different kernel versions, which is kind of one of its major goals. Thanks for the code review, Oleg. Sultan