Hi Jason, On 2017/10/18 22:03, Jason Baron wrote: > > > On 10/17/2017 11:37 AM, Davidlohr Bueso wrote: >> On Fri, 13 Oct 2017, Jason Baron wrote: >> >>> The ep_poll_safewake() function is used to wakeup potentially nested >>> epoll >>> file descriptors. The function uses ep_call_nested() to prevent entering >>> the same wake up queue more than once, and to prevent excessively deep >>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary >>> since we are already preventing these conditions during EPOLL_CTL_ADD. >>> This >>> saves extra function calls, and avoids taking a global lock during the >>> ep_call_nested() calls. >> >> This makes sense. >> >>> >>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC >>> case, since ep_call_nested() keeps track of the nesting level, and >>> this is >>> required by the call to spin_lock_irqsave_nested(). It would be nice to >>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as >>> well, however its not clear how to simply pass the nesting level through >>> multiple wake_up() levels without more surgery. In any case, I don't >>> think >>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch, >>> also >>> apparently fixes a workload at Google that Salman Qazi reported by >>> completely removing the poll_safewake_ncalls->lock from wakeup paths. >> >> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as >> I was tackling the nested epoll scaling issue with loop and readywalk lists >> in mind. >>> > > I'm not sure the details of the workload - perhaps Salman can elaborate > further about it. > > It would seem that the safewake would potentially be the most contended > in general in the nested case, because generally you have a few epoll > fds attached to lots of sources doing wakeups. So those sources are all > going to conflict on the safewake lock. The readywalk is used when > performing a 'nested' poll and in general this is likely going to be > called on a few epoll fds. That said, we should remove it too. I will > post a patch to remove it. > > The loop lock is used during EPOLL_CTL_ADD to check for loops and deep > wakeup paths and so I would expect this to be less common, but I > wouldn't doubt there are workloads impacted by it. We can potentially, I > think remove this one too - and the global 'epmutex'. I posted some > ideas a while ago on it: > > http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html > > We can work through these ideas or others... I have tested these patches on a simple non-nested epoll workload and found that there are performance regression (-0.5% which is better than -1.0% from my patch set) under normal nginx conf and performance improvement (+3%, and the number of my patch set is +4%) under the one-fd-per-request conf. The reason for performance improvement is clear, however, I haven't figured out the reason for the performance regression (maybe cache related ?) It seems that your patch set will work fine both under normal and nested epoll workload, so I don't plan to continue my patch set which only works for normal epoll workload. I have one question about your patch set. It is about the newly added fields in struct file. I had tried to move these fields into struct eventpoll, so only epoll fds will be added into a disjoint set and the target fd will be linked to a disjoint set dynamically, but that method doesn't work out because there are multiple target fds and the order in which these target fds are added to a disjoint set will lead to deadlock. So do you have other ideas to reduce the size of these fields ? Thanks, Tao The following lines are the result of performance result: baseline: result for v4.14 ep_cc: result for v4.14 + ep_cc patch set my: result for v4.14 + my patch set The first 15 columns are the results from wrk and their unit is Requests/sec, the lats two columns are their average value and standard derivation. * normal scenario: nginx + wrk baseline 376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \ 379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \ 378271.4447 1210.673592 ep_cc 373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \ 376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \ 375878.312 1983.190539 my 373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \ 376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \ 374286.1273 2059.92093 * one-fd-per-request scenario: nginx + wrk baseline 124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \ 114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \ 114891.004 5168.30288 ep_cc 124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \ 117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \ 118571.01 2437.050182 my 125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \ 114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \ 119558.7653 4338.18401 > Thanks, > > -Jason > > >>> Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx> >>> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx> >>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> >>> Cc: Salman Qazi <sqazi@xxxxxxxxxx> >> >> Acked-by: Davidlohr Bueso <dbueso@xxxxxxx> >> >>> --- >>> fs/eventpoll.c | 47 ++++++++++++++++++----------------------------- >>> 1 file changed, 18 insertions(+), 29 deletions(-) >> >> Yay for getting rid of some of the callback hell. >> >> Thanks, >> Davidlohr > > . >