KNK SS7-27 - first experiences - part 1

marcelo@xxxxxxxxxx (Marcelo Pacheco) · Tue, 25 Jun 2013 09:56:17 -0300

What code are you using?
Is this not stock libss7? Stock libss7 can't decode ISUP SUS/RES like that.

In my code, I explicitly ignore ALL SUS/RES messages; Brazilian ISUP
requires no processing for them.

Which Asterisk and kernel DAHDI versions are you running?

If you enable dahdi_pcap:
# dahdi_pcap -c 16 -f /tmp/mycap.ss7
Capturing protocol mtp2 on channels 16 to file /tmp/mycap.ss7
Packets captured: 7

Then you can analyze the capture in Wireshark / Ethereal.
But it has one bug: if you shut down the owner of the link while
dahdi_pcap is running, the system will reset on its own.
As long as you don't leave dahdi_pcap running, it's not a problem.

On 06/25/13 09:38, Pavel Troller wrote:
> Hello Marcelo,
>   so I did some tracing. It was really hard to isolate MSUs for one particular
> connection, I had to collect them from about 5 MB file, but ok, it's done,
> and it's in total harmony with my original ideas. So, let's look at it with
> me:
>
> Initial conditions: There is a call running on LS1, DPC 4097, CIC 12.
>
> Our Asterisk decided to clear this call down:
>
> [1] ISUP timer t1 (15000ms) started on CIC 12 DPC 4097
> [1] ISUP timer t5 (300000ms) started on CIC 12 DPC 4097
> [1] Len = 16 [ bc c3 0d 85 01 10 02 c0 0c 00 0c 02 00 02 81 90 ]
> [1] FSN: 67 FIB 1
> [1] BSN: 60 BIB 1
> [1] >[4097:0] MSU
> [1] [ bc c3 0d ]
> [1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
> [1] 	[ 85 ]
> [1] 	OPC 8 DPC 4097 SLS 12
> [1] 	[ 01 10 02 c0 ]
> [1] 		CIC: 12
> [1] 		[ 0c 00 ]
> [1] 		Message Type: REL(0x0c)
> [1] 		[ 0c ]
> [1] 		--VARIABLE LENGTH PARMS[1]--
> [1] 		Cause Indicator:
> [1] 			Coding Standard: 0
> [1] 			Location: 1
> [1] 			Cause Class: 1
> [1] 			Cause Subclass: 0
> [1] 			Cause: Normal call clearing (16)
> [1] 			[ 02 81 90 ]
> [1] 
>
> But, the remote party also decided to hang up, and our REL just crossed
> their SUS going back (please look at BSN and compare with our FSN, they
> don't know about our REL yet).
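The FSN/BSN comparison above can be reproduced from the first three octets of each MSU in the trace. What follows is a minimal Python sketch, not libss7 code: the function names are made up for illustration, and the half-window test used for "acknowledged" is a simplifying assumption about how the modulo-128 sequence numbers wrap.

```python
def parse_mtp2_header(hex_bytes: str) -> dict:
    """Decode the first three octets (BSN/BIB, FSN/FIB, LI) of an MTP2 signal unit."""
    b = bytes.fromhex(hex_bytes)
    return {
        "bsn": b[0] & 0x7F,  # backward sequence number (7 bits)
        "bib": b[0] >> 7,    # backward indicator bit
        "fsn": b[1] & 0x7F,  # forward sequence number (7 bits)
        "fib": b[1] >> 7,    # forward indicator bit
        "li": b[2] & 0x3F,   # length indicator (6 bits)
    }


def is_acknowledged(fsn: int, peer_bsn: int) -> bool:
    """True if the peer's BSN already covers our FSN (assumed half-window rule)."""
    return (peer_bsn - fsn) % 128 < 64


our_rel = parse_mtp2_header("bc c3 0d")    # header of the outgoing REL
their_sus = parse_mtp2_header("c0 bd 0a")  # header of the incoming SUS
their_rlc = parse_mtp2_header("c4 be 09")  # header of the first incoming RLC

print(our_rel)    # bsn=60, bib=1, fsn=67, fib=1, li=13
print(is_acknowledged(our_rel["fsn"], their_sus["bsn"]))  # False: SUS carries BSN 64
print(is_acknowledged(our_rel["fsn"], their_rlc["bsn"]))  # True: RLC carries BSN 68
```

The SUS carries BSN 64 while the REL went out with FSN 67, so at the MTP2 level the peer had not yet acknowledged the REL when it sent the SUS.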
>
> [1] Len = 13 [ c0 bd 0a 85 08 40 00 c4 0c 00 0d 01 00 ]
> [1] FSN: 61 FIB 1
> [1] BSN: 64 BIB 1
> [1] <[4097:0] MSU
> [1] [ c0 bd 0a ]
> [1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
> [1] 	[ 85 ]
> [1] 	OPC 4097 DPC 8 SLS 12
> [1] 	[ 08 40 00 c4 ]
> [1] 		CIC: 12
> [1] 		[ 0c 00 ]
> [1] 		Message Type: SUS(0x0d)
> [1] 		[ 0d ]
> [1] 		--FIXED LENGTH PARMS[1]--
> [1] 		Suspend/Resume Indicators:
> [1] 			SUS/RES indicator: Network initiated (1)
> [1] 			[ 01 ]
> [1] 
>
> And what happens now is a clear ******** BUG ******** in libss7: As RLC
> has not been received yet, the call must still be considered as active!
> But we already forgot it and now we are surprised that we got some MSU
> about it.
>
> [1] Got SUS but no call on CIC 12 PC 4097
> [1] reseting the cic
>
> The situation is getting complicated, we are sending RSC.
>
> [1] ISUP timer t1 stopped on CIC 12 DPC: 4097
> [1] ISUP timer t5 stopped on CIC 12 DPC: 4097
> [1] ISUP timer t17 (300000ms) started on CIC 12 DPC 4097
> [1] Len = 11 [ bd c4 08 85 01 10 02 c0 0c 00 12 ]
> [1] FSN: 68 FIB 1
> [1] BSN: 61 BIB 1
> [1] >[4097:0] MSU
> [1] [ bd c4 08 ]
> [1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
> [1] 	[ 85 ]
> [1] 	OPC 8 DPC 4097 SLS 12
> [1] 	[ 01 10 02 c0 ]
> [1] 		CIC: 12
> [1] 		[ 0c 00 ]
> [1] 		Message Type: RSC(0x12)
> [1] 		[ 12 ]
>
> And we get an RLC. IMHO it is the RLC confirming our REL, not our
> RSC (according to BSN, the peer had already received all our MSUs,
> but they probably already had the RLC queued, so they sent it)
>
> [1] 
> [1] Len = 12 [ c4 be 09 85 08 40 00 c4 0c 00 10 00 ]
> [1] FSN: 62 FIB 1
> [1] BSN: 68 BIB 1
> [1] <[4097:0] MSU
> [1] [ c4 be 09 ]
> [1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
> [1] 	[ 85 ]
> [1] 	OPC 4097 DPC 8 SLS 12
> [1] 	[ 08 40 00 c4 ]
> [1] 		CIC: 12
> [1] 		[ 0c 00 ]
> [1] 		Message Type: RLC(0x10)
> [1] 		[ 10 ]
> [1] 
> [1] ISUP timer t17 stopped on CIC 12 DPC: 4097
> Linkset 1: Processing event: ISUP_EVENT_RLC
>
> And now, we get a second RLC, probably to our RSC. There is a jump
> in FSN because there was a MSU sent from them, which was not
> related to our call.
>
> [1] Len = 12 [ c4 c0 09 85 08 40 00 c4 0c 00 10 00 ]
> [1] FSN: 64 FIB 1
> [1] BSN: 68 BIB 1
> [1] <[4097:0] MSU
> [1] [ c4 c0 09 ]
> [1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
> [1] 	[ 85 ]
> [1] 	OPC 4097 DPC 8 SLS 12
> [1] 	[ 08 40 00 c4 ]
> [1] 		CIC: 12
> [1] 		[ 0c 00 ]
> [1] 		Message Type: RLC(0x10)
> [1] 		[ 10 ]
> [1] 
>
> And this RLC seems unsolicited to us, because we were taking the
> first RLC as a response to our RSC, which was not the case.
>
> [1] Got RLC but we didn't send REL/RSC on CIC 12 PC 4097 
>
> So, no MSUs were received from other linksets; everything fits together
> perfectly...
>
> This trace is a clear demonstration of an existing bug in libss7, which
> may be formulated as follows: "When we are terminating the call and sending
> REL to the remote party, we must keep the record of the connection and 
> silently accept and absorb all MSUs, which may come back, until we receive
> a RLC or T5 expires".
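The rule formulated above can be sketched as a tiny per-CIC state machine. This is an illustrative Python model only, not libss7's actual code: the class, state and method names are hypothetical, the returned strings stand in for real actions, and only the REL/SUS/RLC crossing from the trace is modelled.

```python
from enum import Enum, auto


class CicState(Enum):
    IDLE = auto()
    ACTIVE = auto()
    WAIT_RLC = auto()  # REL sent, RLC not yet received


class Cic:
    """Hypothetical per-circuit record that survives until RLC or T5 expiry."""

    def __init__(self) -> None:
        self.state = CicState.IDLE

    def call_established(self) -> None:
        self.state = CicState.ACTIVE

    def local_release(self) -> str:
        # Keep the call record: the circuit is not idle until RLC arrives.
        self.state = CicState.WAIT_RLC
        return "send REL, start T1 and T5"

    def on_message(self, msg_type: str) -> str:
        if self.state is CicState.WAIT_RLC:
            if msg_type == "RLC":
                self.state = CicState.IDLE
                return "stop T1 and T5, free the circuit"
            # Anything else (e.g. a crossing SUS) belongs to the dying call.
            return "absorb"
        if self.state is CicState.IDLE and msg_type in ("SUS", "RES", "RLC"):
            # This is the situation the trace shows today.
            return "no call on this CIC, reset the circuit"
        return "process normally"

    def on_t5_expiry(self) -> str:
        # Simplification: stop waiting for RLC once T5 has expired.
        self.state = CicState.IDLE
        return "give up waiting for RLC"


cic = Cic()
cic.call_established()
print(cic.local_release())     # send REL, start T1 and T5
print(cic.on_message("SUS"))   # absorb
print(cic.on_message("RLC"))   # stop T1 and T5, free the circuit
```

With the record kept in a waiting state, the crossing SUS is dropped quietly, no RSC is sent, and the single RLC that follows matches the REL.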
>
> What do you think about it ?
>
> With regards,
>   Pavel
>
>> Another possibility is you're mixing the whole thing in a single linkset
>> where you must use two linksets in the way you explained.
>>
>> Can you see those errors with just a few test calls ?
>>
>>
>> I found about 20 bugs / structural design flaws in stock libss7 / dahdi
>> mtp2 support. With my changes the mtp2/mtp3 layers are far more robust
>> than stock libss7.
>> I fixed all but one, which concerns knowing when the linkset is up or
>> down and not trying to send ISUP messages, especially IAM, through a down
>> linkset (all sigchans down).
>>
>> If there's a bug, use ss7 set debug on linkset X to trace ss7 messages
>> and track isup message flow.
>>
>> I used libss7 successfully with telcobridges tmedia, digitro switches,
>> ericsson AXE, huawei NGN, Nortel DMS, several STPs, EWSS, Nec NEAX, and
>> I'm probably missing a couple of switch types.
>> I never ran into SS7 / ISUP bugs in the other switches, only in libss7, but
>> the bugs I found are nothing like what you're reporting.
>> I started testing libss7 with those kinds of switches 5 years ago, so I
>> have some mileage to make those statements, especially from reading and
>> understanding a large portion of the libss7 / sig_ss7 / chan_dahdi code.
>>
>> The issue you're describing is caused by Asterisk getting ss7 messages
>> that belong to another linkset or sending ss7 messages on the wrong ss7
>> link.
>> Check for UCIC or CFN ISUP responses.
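One way to test this hypothesis against a debug trace is to decode the SIO and routing label of each MSU and compare OPC/DPC with what the linkset expects. The Python sketch below assumes ITU-style 14-bit point codes, which is consistent with the 4-octet routing labels in the traces quoted in this thread. The function names are hypothetical, and the message-type table lists only the codes that appear in those traces.

```python
# ISUP message types seen in the traces of this thread.
ISUP_TYPES = {0x0C: "REL", 0x0D: "SUS", 0x10: "RLC", 0x12: "RSC"}


def parse_itu_msu(hex_bytes: str) -> dict:
    """Decode SIO, routing label, CIC and message type of an ITU ISUP MSU."""
    b = bytes.fromhex(hex_bytes)
    sio = b[3]                                # follows the 3-octet MTP2 header
    label = int.from_bytes(b[4:8], "little")  # DPC(14) | OPC(14) | SLS(4)
    return {
        "ni": sio >> 6,
        "priority": (sio >> 4) & 0x3,
        "user_part": sio & 0x0F,              # 5 = ISUP
        "dpc": label & 0x3FFF,
        "opc": (label >> 14) & 0x3FFF,
        "sls": label >> 28,
        "cic": int.from_bytes(b[8:10], "little") & 0x0FFF,
        "msg": ISUP_TYPES.get(b[10], hex(b[10])),
    }


def is_inbound_for_linkset(msu: dict, local_pc: int, remote_pc: int) -> bool:
    """True if an incoming MSU is addressed to us and comes from the expected peer."""
    return msu["dpc"] == local_pc and msu["opc"] == remote_pc


sus = parse_itu_msu("c0 bd 0a 85 08 40 00 c4 0c 00 0d 01 00")
print(sus["opc"], sus["dpc"], sus["cic"], sus["msg"])  # 4097 8 12 SUS
print(is_inbound_for_linkset(sus, local_pc=8, remote_pc=4097))  # True
```

If the check fails for a received MSU, or the peer answers with UCIC or CFN, the message is arriving on the wrong linkset. If it passes for every MSU on the affected CIC, misrouting is unlikely to be the cause.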
>>
>>
>>
>> You need to define chan_dahdi.conf basically like this:
>>
>> ; basic ss7 / isup parameters, usually the same for the whole libss7 setup
>> signalling=ss7
>> ss7type=itu/ansi
>> ss7_called_nai=subscriber/national/international/unknown
>> ss7_calling_nai=subscriber/national/international/unknown
>> networkindicator=national/international/...
>>
>> ; Your local pointcode
>> pointcode = X
>>
>> ; Start definition for linkset N
>> linkset = N
>>
>> adjpointcode = STP point code otherwise switch point code
>> ; Instantiate a signalling link on channel 16 belonging to linkset N,
>> ; with adjacency to adjpointcode
>> sigchan = 16
>> ; Define more signalling links if needed, with adjpointcode and sigchan
>>
>> defaultdpc = pointcode for ISUP messages
>> cicbeginswith= CIC of the next voice channel defined
>> ; Instantiate voice channel on linkset N, talking to PC defaultdpc, CIC
>> ; numbering incremented automatically
>> channel => dahdi channel range
>>
>> cicbeginswith= next CIC range, if non contiguous
>> channel => dahdi channel range
>>
>> defaultdpc = another point code belonging to the same linkset
>> ; (if links share signalling to multiple switches, typically links through an STP)
>> ;repeat cicbeginswith, channel
>>
>> ; Starts definition of another linkset
>> linkset = M
>> ; repeat same sequence as above
>>
>>
>> On 06/25/13 05:13, Pavel Troller wrote:
>>> Hello Marcelo,
>>>
>>>> Per usual, read the fine manual. Wait, there's no manual !
>>> You're right :-).
>>>
>>>> Since you seem to have done your part and actually know some SS7 and
>>>> ISUP, here is a hint.
>>>>
>>>> You created two or more linksets where you must have a single one.
>>>> libss7 doesn't have an SS7 routing feature.
>>> It seems strange to me. Let me try to explain this in more detail.
>>> There is 1 (one) Asterisk box.
>>> It has 2 (two) "linksets" configured, with 1 (one) signalling link per linkset.
>>> Linkset 1 is configured for one DPC and with CICs 1 - 496.
>>> Linkset 2 is configured for another (different) DPC and also with CICs 1 - 496.
>>> Both the systems connected to this Asterisk box are configured to respond
>>> directly to the linkset between them and the Asterisk, so it's sure that
>>> a MSU from DPC1 cannot come over LS2 and vice versa.
>>> I hope that this extremely simple setup is in the scope of current libss7
>>> functionality. Or am I wrong ?
>>>
>>>> In libss7 the linkset concept is different from the official SS7 linkset.
>>>>
>>>> All signalling links that carry ISUP traffic for a given set of channels
>>>> must be kept on a single linkset, as well as all ISUP channels that go
>>>> through those links.
>>> I hope that my setup is conformant with this limitation.
>>>
>>>> It looks like you're getting incoming signalling for ISUP channels that
>>>> are on another linkset.
>>> It really looks like this, but I still hope it's not the case. Please note that
>>> the traffic on the box is rather high; such an error occurs in about one of, say,
>>> 10000 call attempts. I think that with a fatal routing problem of the kind
>>> you are talking about, it wouldn't be possible to use the system
>>> regularly.
>>>
>>>> I'm sure you didn't find any libss7 bug.
>>> Really strong words! I wouldn't say it for any of my programs :-).
>>>
>>>> I have a highly customized version of libss7/dahdi/asterisk, fixing lots
>>>> of issues, but this isn't one of them.
>>> Possibly your setup/usage scenario is a bit different ?
>>>
>>>
>>>> Processed over one million call setups, with a very complex setup (6
>>>> linksets, 7 links, 6E1 on a single switch, plus another 6E1 on remote
>>>> switches using my simple STP solution, sharing the local links over SS7
>>>> over UDP - my simpler proprietary alternative to sigtran).
>>> These switches (I have two of them, but the second one is still on a regular
>>> unpatched SS7 stack) make approx. 3 million call setups per week. My
>>> record (without restarting/crashing Asterisk) is about 3 weeks with more than
>>> 10 million calls.
>>>
>>>> If you need commercial support, contact me off list.
>>> Thanks for your offer.
>>>
>>> With regards, Pavel
>>>
>>>> On 06/24/13 09:02, Pavel Troller wrote:
>>>>> Hi!
>>>>>   I would like to share my experience with the deployment of this experimental SS7
>>>>> branch.
>>>>>   The first impressions are good, especially the timers seem to work well,
>>>>> saving many calls from being frozen.
>>>>>   However, there are still some strange things, which I would like to discuss
>>>>> here, one by one.
>>>>>   The first one is that the channel sometimes doesn't recognize a message
>>>>> (mostly RLC), even though it results from an action initiated by the channel
>>>>> itself. Typically, the following appears often:
>>>>>
>>>>> [Jun 24 13:33:41] ERROR[3975]: chan_dahdi.c:14406 dahdi_ss7_error: [1] ISUP timer t17 expired on CIC 27 DPC 4097
>>>>> [1] Got RLC but we didn't send REL/RSC on CIC 27 PC 4097 reseting the cic
>>>>>
>>>>>   As I understand, there were some timeouts and now the channel tries to
>>>>> recover by sending RSC and firing T17. However, it seems that it immediately
>>>>> rejects RLC, which comes back as a response to the RSC which was just sent
>>>>> upon expiry of T17. And this appears again and again in the rhythm of T17,
>>>>> and the channel is not operational.
>>>>> ss7 show calls shows the following line for the misbehaving CIC:
>>>>>    27  4097  11  IAM                       IAM
>>>>>  
>>>>>   Or, a very similar situation:
>>>>> [2] Got SUS but no call on CIC 48 PC 4096 reseting the CIC
>>>>> [2] Got RLC but we didn't send REL/RSC on CIC 48 PC 4096 reseting the CIC
>>>>>
>>>>>   The first question is why there was no call when SUS was received. My
>>>>> idea is that both parties hung up their phones at the same time and
>>>>> that the call was undergoing destruction on the Asterisk side (REL just sent
>>>>> or something like this) when SUS arrived. Maybe the call was marked as
>>>>> cleared even before RLC came back? OK, I can understand this. But
>>>>> if the CIC was reset as the first message says (i.e. RSC was sent), why is
>>>>> the RLC coming back not recognized then?
>>>>>
>>>>> Or, just now the following appeared:
>>>>>
>>>>> [1] Got ACM but we didn't send IAM on CIC 10 PC 4097 reseting the cic
>>>>> [1] Got RLC but we didn't send REL/RSC on CIC 10 PC 4097 reseting the cic
>>>>>
>>>>> Again, it's questionable why this happened, but the second line seems
>>>>> to indicate some brokenness again.
>>>>>
>>>>> To explain: The channel is operating on a gateway equipped with 16 E1s
>>>>> and current traffic is about 10 CAPS, there are two linksets to two
>>>>> cooperating exchanges. They are EWSDs, which have very mature and stable
>>>>> SS7, so I'm almost sure that they are not making signalling errors.
>>>>>
>>>>> With regards,
>>>>>   Pavel
>>>>>
>>>>> --
>>>>> _____________________________________________________________________
>>>>> -- Bandwidth and Colocation Provided by http://www.api-digital.com --
>>>>>
>>>>> asterisk-ss7 mailing list
>>>>> To UNSUBSCRIBE or update options visit:
>>>>>    http://lists.digium.com/mailman/listinfo/asterisk-ss7
>>>>>
>>
>> -- 
>> Atenciosamente,
>>
>> Marcelo Pacheco
>> M2J Comunicações e Informática
>> Fixo: (27)2222-8118 / (27)2233-2296
>> Vivo: (27)9964-5440
>> Claro: (27)9312-5319
>> MSN: marcelo at macp.eti.br
>> E-mail: marcelo at m2j.com.br

-- 
Atenciosamente,

Marcelo Pacheco
M2J Comunicações e Informática
Fixo: (27)2222-8118 / (27)2233-2296
Vivo: (27)9964-5440
Claro: (27)9312-5319
MSN: marcelo at macp.eti.br
E-mail: marcelo at m2j.com.br