Re: [RFC][PATCH] ipc: Remove IPCMNI

Manfred Spraul <manfred@xxxxxxxxxxxxxxxx> · Thu, 29 Mar 2018 10:47:45 +0200

Hello all,

On 03/29/2018 04:14 AM, Davidlohr Bueso wrote:
Cc'ing mtk, Manfred and linux-api.

See below.

On Thu, 15 Mar 2018, Waiman Long wrote:

On 03/15/2018 03:00 PM, Eric W. Biederman wrote:
Waiman Long <longman@xxxxxxxxxx> writes:

On 03/14/2018 08:49 PM, Eric W. Biederman wrote:
The define IPCMNI was originally the size of a statically sized array in
the kernel and that has long since been removed. Therefore there is no
fundamental reason for IPCMNI.

The only remaining use IPCMNI serves is as a convoluted way to format
the ipc id to userspace.  It does not appear that anything except for
the CHECKPOINT_RESTORE code even cares about this variety of assignment
and the CHECKPOINT_RESTORE code only cares about this weirdness because
it has to restore these peculiar ids.

My assumption is that if an array is recreated, it should get a different id.
    a=semget(1234,,);
    semctl(a,,IPC_RMID);
    b=semget(1234,,);
Now a != b.

Rationale: semop() calls refer to the array only by its id.
If a stale process in the system tries to access the "old" array and
the new array has the same id, then the locking gets corrupted.
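
For reference, a minimal userspace sketch of the legacy id encoding that makes a != b today. It is modelled on ipc_buildid() in ipc/util.h and assumes IPCMNI is 32768, as in kernels of this period; the two decode helpers carry illustrative names and are not the kernel's own.

```c
#include <assert.h>
#include <limits.h>

/* Assumed value of IPCMNI; it doubles as the sequence multiplier. */
#define IPCMNI 32768
#define SEQ_MULTIPLIER IPCMNI

/* Compose a user-visible id from a slot index and a sequence number. */
static inline int ipc_buildid(int idx, int seq)
{
    assert(idx >= 0 && idx < IPCMNI);
    assert(seq >= 0 && seq <= INT_MAX / SEQ_MULTIPLIER);
    return SEQ_MULTIPLIER * seq + idx;
}

/* Illustrative helpers: split a user-visible id back into its parts. */
static inline int ipc_id_to_idx(int id) { return id % SEQ_MULTIPLIER; }
static inline int ipc_id_to_seq(int id) { return id / SEQ_MULTIPLIER; }
```

In this model, an array recreated in the same slot with the next sequence number gets a different id: ipc_buildid(5, 1) and ipc_buildid(5, 2) differ by exactly IPCMNI, so a stale id no longer matches the new array.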
Therefore make the assignment of ipc ids match the description in
Advanced Programming in the Unix Environment and assign the next id
until INT_MAX is hit then loop around to the lower ids.

Ok, sounds good.
That way we really cycle through INT_MAX; right now a==b would happen
after 128k RMID calls.
This can be implemented trivially with the current code using
idr_alloc_cyclic.
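
To make the intended behaviour concrete, here is a toy userspace model of cyclic id assignment. It is not the kernel's idr code and does not call idr_alloc_cyclic; the struct, the function names and the MAX_LIVE cap are assumptions made only for this sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LIVE 16             /* hypothetical cap on live objects in this toy */

struct cyclic_ids {
    int live[MAX_LIVE];         /* ids currently in use */
    size_t nlive;               /* number of valid entries in live[] */
    int next;                   /* where the next search starts */
};

static inline bool id_in_use(const struct cyclic_ids *c, int id)
{
    for (size_t i = 0; i < c->nlive; i++)
        if (c->live[i] == id)
            return true;
    return false;
}

/* Hand out the first free id at or after c->next, wrapping from max to 0.
 * Returns -1 if the toy table is full. */
static inline int cyclic_alloc(struct cyclic_ids *c, int max)
{
    assert(max >= MAX_LIVE && c->next >= 0 && c->next <= max);
    if (c->nlive == MAX_LIVE)
        return -1;
    int id = c->next;
    /* Fewer than MAX_LIVE ids are busy, so a free one is found quickly. */
    while (id_in_use(c, id))
        id = (id == max) ? 0 : id + 1;
    c->live[c->nlive++] = id;
    c->next = (id == max) ? 0 : id + 1;
    return id;
}

/* Release an id; the search cursor is deliberately left untouched. */
static inline void cyclic_free(struct cyclic_ids *c, int id)
{
    for (size_t i = 0; i < c->nlive; i++) {
        if (c->live[i] == id) {
            c->live[i] = c->live[--c->nlive];
            return;
        }
    }
}
```

Because freeing does not move the cursor back, an id that was just removed is not handed out again until the cursor has travelled all the way to max and wrapped around.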

Is there a performance impact?
Right now, the idr tree is only large if there are lots of objects.
What happens if we have only 1 object, with id=INT_MAX-1?

semop() calls that do not sleep are fairly fast.
The same applies to msgsnd/msgrcv, if the message is small enough.

@Davidlohr:
Do you know if there are applications that frequently call semop()
without it having to sleep?
Given the scalability work that was pushed into the kernel, I assume
that such applications exist.

I have only checked postgresql myself, and postgresql always sleeps.
(And that was long ago.)
To make it possible to keep checkpoint/restore working I have renamed
the sysctls from xxx_next_id to xxx_nextid.  That is enough change that
a smart CRIU implementation can see that what is exported has changed,
and act accordingly.  New kernels will be able to restore the old id's.
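
A rough sketch of how a restore tool might tell the two kernels apart by probing which sysctl file exists. The enum, the function name and the expansion of "xxx" to a prefix such as "sem" are assumptions made for illustration; this is not CRIU code, and the xxx_nextid names exist only in the proposed patch.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

enum id_scheme { ID_SCHEME_UNKNOWN, ID_SCHEME_LEGACY, ID_SCHEME_CYCLIC };

/* Probe dir for <prefix>_nextid (proposed name) and <prefix>_next_id
 * (existing name) and report which id assignment scheme is exported. */
static inline enum id_scheme detect_id_scheme(const char *dir, const char *prefix)
{
    char path[256];
    int n;

    assert(dir != NULL && prefix != NULL);

    /* The renamed sysctl is checked first: it signals the new scheme. */
    n = snprintf(path, sizeof(path), "%s/%s_nextid", dir, prefix);
    if (n > 0 && (size_t)n < sizeof(path) && access(path, F_OK) == 0)
        return ID_SCHEME_CYCLIC;

    /* Fall back to the name exported by current kernels. */
    n = snprintf(path, sizeof(path), "%s/%s_next_id", dir, prefix);
    if (n > 0 && (size_t)n < sizeof(path) && access(path, F_OK) == 0)
        return ID_SCHEME_LEGACY;

    return ID_SCHEME_UNKNOWN;
}
```

A caller would use something like detect_id_scheme("/proc/sys/kernel", "sem") and refuse to restore when the result is ID_SCHEME_UNKNOWN, for example on a kernel built without CONFIG_CHECKPOINT_RESTORE.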

This code still needs some real world testing to verify my assumptions.
And some work with the CRIU implementations to actually add the code
that deals with the new form of id assignment.

It means that all existing checkpoint/restore applications will not work
with a new kernel.
Everyone must first update the checkpoint/restore application, then
update the kernel.

Is this acceptable?

--
    Manfred