Hello together,
On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:
Cc'ing mtk, Manfred and linux-api.
See below.
On Thu, 15 Mar 2018, Waiman Long wrote:
On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
Waiman Long <longman@xxxxxxxxxx> writes:
On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
The define IPCMNI was originally the size of a statically sized
array in
the kernel and that has long since been removed. Therefore there
is no
fundamental reason for IPCMNI.
The only remaining use IPCMNI serves is as a convoluted way to format
the ipc id to userspace. It does not appear that anything except for
the CHECKPOINT_RESTORE code even cares about this variety of
assignment
and the CHECKPOINT_RESTORE code only cares about this weirdness
because
it has to restore these peculiar ids.
My assumption is that if an array is recreated, it should get a
different id.
a=semget(1234,,);
semctl(a,,IPC_RMID);
b=semget(1234,,);
now a!=b.
Rational: semop() calls only refer to the array by the id.
If there is a stale process in the system that tries to access the "old"
array and the new array has the same id, then the locking gets corrupted.
Therefore make the assignment of ipc ids match the description in
Advanced Programming in the Unix Environment and assign the next id
until INT_MAX is hit then loop around to the lower ids.
Ok, sounds good.
That way we really cycle through INT_MAX, right now a==b would happen
after 128k RMID calls.
This can be implemented trivially with the current code using
idr_alloc_cyclic.
Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?
semop() that do not sleep are fairly fast.
The same applies for msgsnd/msgrcv, if the message is small enough.
@Davidlohr:
Do you know if there are application that frequently call semop() and it
doesn't have to sleep?
From the scalability that was pushed into the kernel, I assume that
this exists.
I have myself only checked postgresql, and postgresql always sleeps.
(and this was long ago)
To make it possible to keep checkpoint/restore working I have renamed
the sysctls from xxx_next_id to xxx_nextid. That is enough change
that
a smart CRIU implementation can see that what is exported has
changed,
and act accordingly. New kernels will be able to restore the old
id's.
This code still needs some real world testing to verify my
assumptions.
And some work with the CRIU implementations to actually add the code
that deals with the new for of id assignment.
It means that all existing checkpoint/restore application will not work
with a new kernel.
Everyone must first update the checkpoint/restore application, then
update the kernel.
Is this acceptable?
--
Manfred