Strange problems with SCHED_RR

Dietmar Maurer <dietmar@xxxxxxxxxxx> · Tue, 15 Jan 2013 15:37:44 +0000

We always run into strange problems when we enable the RT scheduler (SCHED_RR).

After a random amount of time we get:

Jan 10 19:29:22 host1 corosync[1700]:   [TOTEM ] Retransmit List: a38e a38f a392 a393 a3a1 a3a7 a3a8 a3a9 a3aa a3ab a381 a382 a383 a384 a385 a386 a387 a388 a389 a38a a38b a38c a38d a390 a391 a394 a395
Jan 10 19:29:32 host1 corosync[1700]:   [TOTEM ] A processor failed, forming new configuration.

Any ideas?

The same node runs without problems when we use a kernel with CONFIG_RT_GROUP_SCHED disabled, or
when we start corosync with '-p'.

This happens with a RHEL6.3 (openvz) based kernel, but also with newer 3.X kernels.

But running without raised priority also seems dangerous, so I tried the following (in exec/main.c):

       setpriority(PRIO_PGRP, 0, -20);

That seems to work. Does this approach have any serious drawbacks?
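
For reference, here is a minimal sketch of how that call could be wrapped with error checking and a read-back. The helper names raise_nice_priority() and current_nice() are hypothetical and are not taken from the corosync source. Note that a nice value of -20 still leaves the process in the normal time-sharing class (SCHED_OTHER), so it is not equivalent to SCHED_RR, and a negative nice value normally requires root or CAP_SYS_NICE.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/*
 * Hypothetical helper (not part of corosync): set the nice value of the
 * calling process group. Returns 0 on success, -1 on failure.
 */
static int raise_nice_priority(int nice_value)
{
    /* With PRIO_PGRP, who == 0 means the caller's own process group. */
    if (setpriority(PRIO_PGRP, 0, nice_value) == -1) {
        fprintf(stderr, "setpriority(%d) failed: %s\n",
                nice_value, strerror(errno));
        return -1;
    }
    return 0;
}

/*
 * Hypothetical helper: read back the nice value of the process group.
 * getpriority() can legitimately return -1, so errno has to be cleared
 * before the call and checked afterwards.
 */
static int current_nice(int *out)
{
    int value;

    errno = 0;
    value = getpriority(PRIO_PGRP, 0);
    if (value == -1 && errno != 0)
        return -1;
    *out = value;
    return 0;
}
```

Calling raise_nice_priority(-20) early in main() and logging a failure would at least make it visible when the priority could not be raised.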

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss