On Thu, 5 May 2016 11:07:50 +0200 Jan Kara <jack@xxxxxxx> wrote: > On Thu 05-05-16 10:24:33, Michal Hocko wrote: > > > +/* > > > + * Check whether the request to writeback some pages can be merged with some > > > + * other request which is already pending. If yes, merge it and return true. > > > + * If no, return false. > > > + */ > > > +static bool wb_merge_request(struct bdi_writeback *wb, long nr_pages, > > > + struct super_block *sb, bool range_cyclic, > > > + enum wb_reason reason) > > > +{ > > > + struct wb_writeback_work *work; > > > + bool merged = false; > > > + > > > + spin_lock_bh(&wb->work_lock); > > > + list_for_each_entry(work, &wb->work_list, list) { > > > > Is the lenght of the list bounded somehow? In other words is it possible > > that the spinlock would be held for too long to traverse the whole list? > > I was thinking about this as well. With the merging enabled, the number of > entries queued from wb_start_writeback() is essentially limited by the > number of writeback reasons and there's only a couple of those. What is > more questionable is the number of entries queued from > __writeback_inodes_sb_nr(). Generally there should be a couple at maximum > either but it is hard to give any guarantee since e.g. filesystems use this > function to reduce amount of delay-allocated data when they are running out > of space. Hum, maybe we could limit the merging to scan only the last say > 16 entries. That should give good results in most cases... Thoughts? If it's possible to cause a search complexity meltdown, someone will find a way :( Is there any reason why the requests coming out of writeback_inodes_sb_nr() cannot also be merged? Your wb_merge_request() doesn't check ->tagged_writepages? Why is ->for_sync handled differently? Needs a comment. Suggest turning this into a separate function. Build a local wb_writeback_work in wb_start_writeback(), do: /* comment goes here */ if (new->reason != old->reason) return false; /* comment goes here */ if (new->range_cyclic != old->range_cyclic) retun false; return true; then copy wb_start_writeback()'s local wb_writeback_work into the newly-allocated one if needed (kmemdup()). Or just pass a billion args into that comparison function. bdi_split_work_to_wbs() does GFP_ATOMIC as well. Problem? (Why the heck don't we document the *reasons* for these things, sigh). I suspect it would be best to be proactive here and use some smarter data structure. It appears that all the wb_writeback_work fields except sb can be squeezed into a single word so perhaps a radix-tree. Or hash them all together and use a chained array or something. Maybe fiddle at it for an hour or so, see how it's looking? It's a lot of fuss to avoid one problematic kmalloc(), sigh. We really don't want there to be *any* pathological workload which results in merging failures - if that's the case then someone will hit it. They'll experience the ooms (perhaps) and the search complexity issues (for sure). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>