Re: [PATCH] nonblocking epoll_wait loop, sched events, ISER/IB polling

FUJITA Tomonori <fujita.tomonori@xxxxxxxxxxxxx> · Fri, 26 Sep 2008 19:25:18 +0900

On Thu, 25 Sep 2008 19:09:09 +0300
Alexander Nezhinsky <nezhinsky@xxxxxxxxx> wrote:

> This patch introduces a few interconnected improvements, that 
> ultimately lead to significant reduction in interrupts rate 
> for iser/ib, while adding flexibility to tgtd event processing scheme.
> 
> First, it implements a kind of "scheduler" list, to replace the
> counter events. Event handlers can schedule other events that are
> placed on the scheduler's loop. This has some small advantages in 
> itself, because the same event descriptor is used for all events 
> in the system, events are executed in order they'd been scheduled, 
> they can be removed from list, few instances of the same event 
> may be scheduled. 
> 
> But the most important change is the logic of the processing loop,
> as a whole. It goes as follows: 
> 
> The scheduler checks events on the list, does their processing, 
> which can schedule more items, but does not process the new items 
> (in order to avoid infinite/long loops).
> 
> If after processing all "old" events, some "new" ones were scheduled,
> epoll_wait() is executed in non-blocking manner. This guarantees
> that the file descriptors that became ready during the scheduler
> processing are handled, but if there no ready fds, the remaining 
> scheduler events are processed immediately.  
> 
> But if after processing the scheduler list nothing new is added, 
> then epoll_wait is blocking as in the current scheme.
> 
> This way we can be very flexible, because event handlers and deferred 
> work can not block one another. Potentially long handlers can be 
> split into shorter ones easily without blocking the entire target.
> 
> Finally, IB completion queue is the first guy who benefits, because 
> we can postpone interrupt re-arming, until no completion entries 
> remain, and to schedule CQ "draining" after all events are serviced.
> 
> When a completion event is handled, CQ is polled so that up to a 
> given number (now set to 8) of WCs are processed. 
> 
> If more completions are left on the CQ, essentially the same 
> handler is scheduled, to carry out the next round of polling, 
> but in the meanwhile, other events in the system can be serviced.
> 
> If CQ is found empty, interrupts are re-armed and a handler is
> scheduled to consume the completions which could sneak in 
> between the moment the CQ was seen empty and just before 
> the interrupts were re-armed.
> 
> Thus in iSER IB there is a marked interrupts rate reduction.
> Here are typical results:
> Current target: 62-65 KIOPS, ~110,000 interrupt/sec, CPU% ~68%
> Patched target: 65-70 KIOPS,  ~65,000 interrupt/sec, CPU% ~65%
> 
> Signed-off-by: Alexander Nezhinsky <nezhinsky@xxxxxxxxx>

Thanks,

Seems that you fixed most of the style issues but there are still some:

ERROR: trailing whitespace
#294: FILE: usr/iscsi/iscsi_rdma.c:1058:
+/* Scheduled to poll cq after a completion event has been $

ERROR: trailing whitespace
#304: FILE: usr/iscsi/iscsi_rdma.c:1068:
+   after the cq had been seen empty but just before $

ERROR: trailing whitespace
#305: FILE: usr/iscsi/iscsi_rdma.c:1069:
+   the notification interrupts were re-armed. $

ERROR: trailing whitespace
#338: FILE: usr/iscsi/iscsi_rdma.c:1102:
+^I/* if a poll was previosuly scheduled, remove it, $

total: 4 errors, 0 warnings, 565 lines checked

> diff --git a/usr/tgtd.c b/usr/tgtd.c
> index 0b1cb4c..ac3ff76 100644
> --- a/usr/tgtd.c
> +++ b/usr/tgtd.c
> @@ -38,26 +38,13 @@
>  #include "work.h"
>  #include "util.h"
>  
> -struct tgt_event {
> -	union {
> -		event_handler_t *handler;
> -		counter_event_handler_t *counter_handler;
> -	};
> -	union {
> -		int fd;
> -		int *counter;
> -	};
> -	void *data;
> -	struct list_head e_list;
> -};

exporting struct tgt_event fine for now but I don't fancy it much. If
we have proper APIs for it, we could avoid it, I guess. But I've not
see rdma code so I'm not sure.

> -static void event_loop(void)
> +void tgt_init_sched_event(struct tgt_event *evt,
> +			  sched_event_handler_t sched_handler, void *data)
>  {
> -	int nevent, i, done, timeout = TGTD_TICK_PERIOD * 1000;
> -	struct epoll_event events[1024];
> +	evt->sched_handler = sched_handler;
> +	evt->scheduled = 0;
> +	evt->data = data;
> +	INIT_LIST_HEAD(&evt->e_list);
> +}
> +
> +void tgt_add_sched_event(struct tgt_event *evt)
> +{
> +	if (!evt->scheduled) {
> +		evt->scheduled = 1;
> +		list_add_tail(&evt->e_list, &tgt_sched_events_list);
> +	}
> +}
> +
> +void tgt_remove_sched_event(struct tgt_event *evt)
> +{
> +	if (evt->scheduled) {
> +		evt->scheduled = 0;
> +		list_del_init(&evt->e_list);
> +	}
> +}
> +
> +static void tgt_exec_scheduled(void)
> +{
> +	struct list_head *last_sched;
>  	struct tgt_event *tev, *tevn;
>  
> -retry:
> -	/*
> -	 * Check the counter events to see if they have any work to run.
> -	 */
> -	do {
> -		done = 1;
> -		list_for_each_entry_safe(tev, tevn, &tgt_counter_events_list,
> -					e_list) {
> -			if (*tev->counter) {
> -				done = 0;
> -				tev->counter_handler(tev->counter, tev->data);
> -			}
> -		}
> -	} while (!done);
> +	if (list_empty(&tgt_sched_events_list))
> +		return;
> +
> +	/* execute only work scheduled till now */
> +	last_sched = tgt_sched_events_list.prev;
> +	list_for_each_entry_safe(tev, tevn, &tgt_sched_events_list, e_list) {
> +		tgt_remove_sched_event(tev);
> +		tev->sched_handler(tev);
> +		if (&tev->e_list == last_sched)
> +			break;
> +	}
> +}
> +
> +static void tgt_poll_events(int timeout)
> +{
> +	int nevent, i;
> +	struct tgt_event *tev;
> +	struct epoll_event events[1024];
>  
>  	nevent = epoll_wait(ep_fd, events, ARRAY_SIZE(events), timeout);
>  	if (nevent < 0) {
> @@ -255,9 +229,19 @@ retry:
>  		}
>  	} else
>  		schedule();
> +}
>  
> -	if (system_active)
> -		goto retry;
> +static void event_loop(void)
> +{
> +	int timeout, wait_timeout = TGTD_TICK_PERIOD * 1000;
> +
> +	while (system_active) {
> +		tgt_exec_scheduled();
> +		/* wait if no scheduled work, poll if there is */
> +		timeout = list_empty(&tgt_sched_events_list) ?
> +				wait_timeout : 0;
> +		tgt_poll_events(timeout);
> +	}

Can you avoid rewriting (or cleaning up) even_loop()?

That's not related with this patch. You want to replace the first half
(counter_events code) in the original even_loop(). I don't think that
you need to rewrite event_loop in the above way, right?

If you do something like the following, your patch is more smaller, I
think.

static void event_loop(void)
{
	int nevent, i, done, timeout = TGTD_TICK_PERIOD * 1000;
	struct epoll_event events[1024];
	struct tgt_event *tev, *tevn;

retry:
	done = schedule_event(); /* choose whatever name you like */
	if (!done)
		timeout = 0;

	nevent = epoll_wait(ep_fd, events, ARRAY_SIZE(events), timeout);
	if (nevent < 0) {
		if (errno != EINTR) {
			eprintf("%m\n");
			exit(1);
		}
	} else if (nevent) {
		for (i = 0; i < nevent; i++) {
			tev = (struct tgt_event *) events[i].data.ptr;
			tev->handler(tev->fd, events[i].events, tev->data);
		}
	} else
		schedule();

	if (system_active)
		goto retry;
}
--
To unsubscribe from this list: send the line "unsubscribe stgt" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html