On Tue, Oct 04, 2016 at 07:55:12PM -0300, Mauricio Faria de Oliveira wrote:
> Hi Benjamin, Kent, and others,
>
> Would you please comment / answer about this possible problem?
> Any feedback is appreciated.
>
> Since commit e1bdd5f27a5b ("aio: percpu reqs_available") the maximum
> number of aio nr_events may be a function of num_possible_cpus() and
> actually be /inversely proportional/ to it (i.e., more CPUs lead to
> less system-wide aio nr_events). This is a problem on larger systems.
>
> That's because if "nr_events < num_possible_cpus() * 4" (for example
> nr_events == 1) that counts as "num_possible_cpus() * 4" into aio_nr
> and against aio_max_nr
>
>     static struct kioctx *ioctx_alloc(unsigned nr_events)
>     ...
>         nr_events = max(nr_events, num_possible_cpus() * 4);
>         nr_events *= 2;
>     ...
>         /* limit the number of system wide aios */
>         ....
>         if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
>         ...
>             err = -EAGAIN;
>         ...
>         aio_nr += ctx->max_reqs;
>     ...
>
> That problem is easily noticeable on a common POWER8 system: 160 CPUs
> (2 sockets * 10 cores/socket * 8 threads/core = 160 CPUs) limits the max
> AIO contexts with "io_setup(1, )" to 102 out of 64k (default aio_max_nr):
>
>     # cat /sys/devices/system/cpu/possible
>     0-159
>
>     # cat /proc/sys/fs/aio-max-nr
>     65536
>
>     # echo $(( 65536 / (160 * 4) ))
>     102
>
> test-case snippet & output:
>
>     for (i = 0; i < 65536; i++)
>         if (rc = io_setup(1, &ioctx[i]))
>             break;
>
>     printf("rc = %d, i = %d\n", rc, i);
>
>     > rc = -11, i = 102
>
> (another problem is that the sysctl aio-nr grows larger than aio-max-nr,
> since it's checked against "aio_max_nr * 2")
>
> So,
>
> I've been trying to understand/fix this, but soon got stuck on options
> as I didn't quite get a few points.. if you could provide some insight,
> please, that would be really helpful:
>
> - why "num_possible_cpus() * 4", and why "max(nr_events, <it>)" ?

For the scheme to work (percpu allocation of slots), we have to ensure
that there aren't too many unused slots stranded on other CPUs. The
stranding is limited to 1/4th of the slots, as I figured any more than
that could be too unpredictable: the effective maximum number of
in-flight iocbs would vary too much.

For systems with large numbers of CPUs, what I'd prefer to do is make it
per core or NUMA node or somesuch. But we don't have any infrastructure
for that equivalent to the alloc_percpu() stuff, so that's why I didn't
do it at the time.