On 01/18/2018 06:00 AM, Hou Tao wrote:
> Hi Jason,
>
> On 2017/10/18 22:03, Jason Baron wrote:
>>
>> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>>
>>>> The ep_poll_safewake() function is used to wake up potentially nested
>>>> epoll file descriptors. The function uses ep_call_nested() to prevent
>>>> entering the same wakeup queue more than once, and to prevent
>>>> excessively deep wakeup paths (deeper than EP_MAX_NESTS). However,
>>>> this is not necessary since we are already preventing these conditions
>>>> during EPOLL_CTL_ADD. This saves extra function calls, and avoids
>>>> taking a global lock during the ep_call_nested() calls.
>>>
>>> This makes sense.
>>>
>>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>>> case, since ep_call_nested() keeps track of the nesting level, and
>>>> this is required by the call to spin_lock_irqsave_nested(). It would
>>>> be nice to remove the ep_call_nested() calls for the
>>>> CONFIG_DEBUG_LOCK_ALLOC case as well; however, it's not clear how to
>>>> simply pass the nesting level through multiple wake_up() levels
>>>> without more surgery. In any case, I don't think
>>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>>>> also apparently fixes a workload at Google that Salman Qazi reported,
>>>> by completely removing the poll_safewake_ncalls->lock from wakeup
>>>> paths.
>>>
>>> I'm a bit curious about the workload (which uses lots of
>>> EPOLL_CTL_ADDs) as I was tackling the nested epoll scaling issue with
>>> the loop and readywalk lists in mind.
>>
>> I'm not sure of the details of the workload - perhaps Salman can
>> elaborate further about it.
>>
>> It would seem that the safewake would potentially be the most contended
>> in general in the nested case, because generally you have a few epoll
>> fds attached to lots of sources doing wakeups.
>> So those sources are all going to conflict on the safewake lock. The
>> readywalk is used when performing a 'nested' poll, and in general this
>> is likely going to be called on a few epoll fds. That said, we should
>> remove it too. I will post a patch to remove it.
>>
>> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
>> wakeup paths, and so I would expect this to be less common, but I
>> wouldn't doubt there are workloads impacted by it. I think we can
>> potentially remove this one too - and the global 'epmutex'. I posted
>> some ideas a while ago on it:
>>
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
>
> I have tested these patches on a simple non-nested epoll workload and
> found that there is a performance regression (-0.5%, which is better
> than the -1.0% from my patch set) under the normal nginx conf, and a
> performance improvement (+3%; the number for my patch set is +4%) under
> the one-fd-per-request conf. The reason for the performance improvement
> is clear; however, I haven't figured out the reason for the performance
> regression (maybe cache related?).
>
> It seems that your patch set will work fine both under normal and
> nested epoll workloads, so I don't plan to continue my patch set, which
> only works for the normal epoll workload. I have one question about
> your patch set. It is about the newly added fields in struct file. I
> had tried to move these fields into struct eventpoll, so that only
> epoll fds would be added into a disjoint set and the target fd would be
> linked to a disjoint set dynamically, but that method didn't work out
> because there are multiple target fds and the order in which these
> target fds are added to a disjoint set can lead to deadlock. So do you
> have other ideas to reduce the size of these fields?
>
> Thanks,
> Tao

Hi,

I agree that increasing the size of struct file is a no-go.
One idea is to simply have a single pointer in 'struct file' for epoll
usage, which could be dynamically allocated when a file is added to an
epoll set. That would reduce the size of things from where we currently
are.

I'm also not sure that the target fds (or non-epoll fds) need to be
added to the disjoint sets. The sets are there to serialize loop checks,
of which the target fds cannot be a part.

If there is interest in resurrecting this patchset, I can pick it back
up. I hadn't been pushing it forward because I wasn't convinced it
mattered on real workloads...

Thanks,

-Jason

> The following lines are the performance results:
>
> baseline: result for v4.14
> ep_cc: result for v4.14 + ep_cc patch set
> my: result for v4.14 + my patch set
>
> The first 15 columns are the results from wrk and their unit is
> Requests/sec; the last two columns are their average value and
> standard deviation.
>
> * normal scenario: nginx + wrk
>
> baseline
> 376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
> 379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
> 378271.4447 1210.673592
>
> ep_cc
> 373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
> 376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
> 375878.312 1983.190539
>
> my
> 373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
> 376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
> 374286.1273 2059.92093
>
> * one-fd-per-request scenario: nginx + wrk
>
> baseline
> 124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
> 114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
> 114891.004 5168.30288
>
> ep_cc
> 124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
> 117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
> 118571.01 2437.050182
>
> my
> 125090.99 121408.92 116444.09 120055.4 \
> 119399.24 114938.04 122889.56 \
> 114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
> 119558.7653 4338.18401
>
>> Thanks,
>>
>> -Jason
>>
>>>> Signed-off-by: Jason Baron <jbaron@xxxxxxxxxx>
>>>> Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
>>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>>> Cc: Salman Qazi <sqazi@xxxxxxxxxx>
>>>
>>> Acked-by: Davidlohr Bueso <dbueso@xxxxxxx>
>>>
>>>> ---
>>>>  fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>>>  1 file changed, 18 insertions(+), 29 deletions(-)
>>>
>>> Yay for getting rid of some of the callback hell.
>>>
>>> Thanks,
>>> Davidlohr