KNK SS7-27 - first experiences - part 1


Hello Marcelo,
  I did some tracing. It was really hard to isolate the MSUs for one particular
connection (I had to extract them from a file of about 5 MB), but it's done,
and the result agrees completely with my original ideas. Let's walk through
it:

Initial conditions: There is a call running on LS1, DPC 4097, CIC 12.

Our Asterisk decided to clear this call down:

[1] ISUP timer t1 (15000ms) started on CIC 12 DPC 4097
[1] ISUP timer t5 (300000ms) started on CIC 12 DPC 4097
[1] Len = 16 [ bc c3 0d 85 01 10 02 c0 0c 00 0c 02 00 02 81 90 ]
[1] FSN: 67 FIB 1
[1] BSN: 60 BIB 1
[1] >[4097:0] MSU
[1] [ bc c3 0d ]
[1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
[1] 	[ 85 ]
[1] 	OPC 8 DPC 4097 SLS 12
[1] 	[ 01 10 02 c0 ]
[1] 		CIC: 12
[1] 		[ 0c 00 ]
[1] 		Message Type: REL(0x0c)
[1] 		[ 0c ]
[1] 		--VARIABLE LENGTH PARMS[1]--
[1] 		Cause Indicator:
[1] 			Coding Standard: 0
[1] 			Location: 1
[1] 			Cause Class: 1
[1] 			Cause Subclass: 0
[1] 			Cause: Normal call clearing (16)
[1] 			[ 02 81 90 ]
[1] 
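
The decoded fields in this trace can be reproduced from the raw bytes. The
following is a minimal sketch in Python, not libss7 code. It assumes the ITU
layout (14-bit point codes, 4-bit SLS, 12-bit CIC), and the function name
decode_itu_msu is made up for illustration:

```python
def decode_itu_msu(raw: bytes) -> dict:
    """Decode MTP2 sequence numbers, the SIO, the ITU routing label and the ISUP CIC/type."""
    # The 4-byte routing label is transmitted least significant byte first.
    label = int.from_bytes(raw[4:8], "little")
    return {
        "bsn": raw[0] & 0x7F, "bib": raw[0] >> 7,   # backward sequence number / indicator bit
        "fsn": raw[1] & 0x7F, "fib": raw[1] >> 7,   # forward sequence number / indicator bit
        "li": raw[2] & 0x3F,                        # length indicator (6 bits)
        "ni": raw[3] >> 6,                          # network indicator
        "priority": (raw[3] >> 4) & 0x03,
        "user_part": raw[3] & 0x0F,                 # 5 = ISUP
        "dpc": label & 0x3FFF,
        "opc": (label >> 14) & 0x3FFF,
        "sls": (label >> 28) & 0x0F,
        "cic": int.from_bytes(raw[8:10], "little") & 0x0FFF,
        "msg_type": raw[10],
    }


# Raw bytes of the REL shown in the trace above.
rel = bytes.fromhex("bc c3 0d 85 01 10 02 c0 0c 00 0c 02 00 02 81 90")
fields = decode_itu_msu(rel)
print(fields)
```

For the REL above this gives FSN 67, BSN 60, OPC 8, DPC 4097, SLS 12, CIC 12
and message type 0x0c, matching the decoded trace.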

But the remote party also decided to hang up, and our REL crossed their SUS
on the wire (compare their BSN of 64 with our FSN of 67: they don't know
about our REL yet).

[1] Len = 13 [ c0 bd 0a 85 08 40 00 c4 0c 00 0d 01 00 ]
[1] FSN: 61 FIB 1
[1] BSN: 64 BIB 1
[1] <[4097:0] MSU
[1] [ c0 bd 0a ]
[1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
[1] 	[ 85 ]
[1] 	OPC 4097 DPC 8 SLS 12
[1] 	[ 08 40 00 c4 ]
[1] 		CIC: 12
[1] 		[ 0c 00 ]
[1] 		Message Type: SUS(0x0d)
[1] 		[ 0d ]
[1] 		--FIXED LENGTH PARMS[1]--
[1] 		Suspend/Resume Indicators:
[1] 			SUS/RES indicator: Network initiated (1)
[1] 			[ 01 ]
[1] 
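
The BSN/FSN comparison can be written down explicitly. This is only a sketch:
MTP2 sequence numbers are 7 bits wide and wrap at 128, and the half-window
rule used here to decide "acknowledged or not" is a simplifying assumption,
as is the name peer_has_acked. Note also that a BSN only shows that the
peer's MTP2 accepted the signal unit, not that its ISUP layer has already
acted on it.

```python
SEQ_MODULUS = 128  # MTP2 FSN/BSN are 7-bit counters


def peer_has_acked(our_fsn: int, peer_bsn: int) -> bool:
    """Return True if peer_bsn acknowledges the MSU we sent with our_fsn."""
    # Distance from our FSN forward to the peer's BSN, modulo 128.
    distance = (peer_bsn - our_fsn) % SEQ_MODULUS
    # Assumption: fewer than 64 MSUs are ever outstanding, so a small
    # forward distance means "acknowledged" and a large one "not yet".
    return distance < SEQ_MODULUS // 2


print(peer_has_acked(67, 64))  # our REL (FSN 67) vs. the BSN carried by their SUS
print(peer_has_acked(67, 68))  # our REL (FSN 67) vs. the BSN carried by the first RLC
```

The first call returns False (the SUS was sent before the peer had seen our
REL), the second returns True.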

And what happens now is a clear ******** BUG ******** in libss7: since the
RLC has not been received yet, the call must still be considered active!
But we have already forgotten it, and now we are surprised to receive an MSU
about it.

[1] Got SUS but no call on CIC 12 PC 4097
[1] reseting the cic

The situation is getting complicated, we are sending RSC.

[1] ISUP timer t1 stopped on CIC 12 DPC: 4097
[1] ISUP timer t5 stopped on CIC 12 DPC: 4097
[1] ISUP timer t17 (300000ms) started on CIC 12 DPC 4097
[1] Len = 11 [ bd c4 08 85 01 10 02 c0 0c 00 12 ]
[1] FSN: 68 FIB 1
[1] BSN: 61 BIB 1
[1] >[4097:0] MSU
[1] [ bd c4 08 ]
[1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
[1] 	[ 85 ]
[1] 	OPC 8 DPC 4097 SLS 12
[1] 	[ 01 10 02 c0 ]
[1] 		CIC: 12
[1] 		[ 0c 00 ]
[1] 		Message Type: RSC(0x12)
[1] 		[ 12 ]

And we get an RLC. IMHO it is the RLC confirming our REL, not our
RSC (according to the BSN, the peer had already received all our MSUs,
but it probably had this RLC queued already, so it was sent anyway).

[1] 
[1] Len = 12 [ c4 be 09 85 08 40 00 c4 0c 00 10 00 ]
[1] FSN: 62 FIB 1
[1] BSN: 68 BIB 1
[1] <[4097:0] MSU
[1] [ c4 be 09 ]
[1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
[1] 	[ 85 ]
[1] 	OPC 4097 DPC 8 SLS 12
[1] 	[ 08 40 00 c4 ]
[1] 		CIC: 12
[1] 		[ 0c 00 ]
[1] 		Message Type: RLC(0x10)
[1] 		[ 10 ]
[1] 
[1] ISUP timer t17 stopped on CIC 12 DPC: 4097
Linkset 1: Processing event: ISUP_EVENT_RLC

And now we get a second RLC, probably in response to our RSC. There is a
jump in FSN because the peer sent an MSU in between that was not
related to our call.

[1] Len = 12 [ c4 c0 09 85 08 40 00 c4 0c 00 10 00 ]
[1] FSN: 64 FIB 1
[1] BSN: 68 BIB 1
[1] <[4097:0] MSU
[1] [ c4 c0 09 ]
[1] 	Network Indicator: 2 Priority: 0 User Part: ISUP (5)
[1] 	[ 85 ]
[1] 	OPC 4097 DPC 8 SLS 12
[1] 	[ 08 40 00 c4 ]
[1] 		CIC: 12
[1] 		[ 0c 00 ]
[1] 		Message Type: RLC(0x10)
[1] 		[ 10 ]
[1] 

And this RLC seems unsolicited to us, because we were taking the
first RLC as a response to our RSC, which was not the case.

[1] Got RLC but we didn't send REL/RSC on CIC 12 PC 4097 

So, no MSUs were received from other linksets; everything fits together
perfectly...

This trace is a clear demonstration of an existing bug in libss7, which
may be formulated as follows: "When we are terminating the call and sending
REL to the remote party, we must keep the record of the connection and 
silently accept and absorb all MSUs, which may come back, until we receive
a RLC or T5 expires".
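
The rule formulated above can be sketched as a small per-CIC state machine.
This is an illustration only, not libss7 code: the class, state and action
names are hypothetical, and falling back to a circuit reset on T5 expiry is
an assumption about the recovery path.

```python
from enum import Enum, auto


class CicState(Enum):
    IDLE = auto()
    ACTIVE = auto()
    RELEASING = auto()  # REL sent, RLC not yet received, T5 still running


class Cic:
    def __init__(self) -> None:
        self.state = CicState.IDLE

    def send_rel(self) -> None:
        # Keep the call record instead of forgetting it right away.
        self.state = CicState.RELEASING

    def on_message(self, msg_type: str) -> str:
        """Return the action taken for an incoming ISUP message."""
        if self.state is CicState.RELEASING:
            if msg_type == "RLC":
                self.state = CicState.IDLE
                return "released"
            # Anything else (e.g. a crossing SUS) is silently absorbed,
            # so no RSC is sent and no second RLC is provoked.
            return "absorbed"
        if self.state is CicState.IDLE:
            # Simplified: mirrors the "no call on CIC" handling in the trace.
            return "send_rsc"
        return "deliver"

    def on_t5_expired(self) -> str:
        if self.state is CicState.RELEASING:
            self.state = CicState.IDLE
            return "send_rsc"  # assumed fallback when RLC never arrives
        return "ignored"


cic = Cic()
cic.state = CicState.ACTIVE
cic.send_rel()
actions = [cic.on_message("SUS"), cic.on_message("RLC")]
print(actions)
```

With the sequence from the trace (REL sent, SUS received, RLC received) the
sketch absorbs the SUS and clears the call on the first RLC, so the RSC and
the "unsolicited" second RLC never occur.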

What do you think about it ?

With regards,
  Pavel

> Another possibility is you're mixing the whole thing in a single linkset
> where you must use two linksets in the way you explained.
> 
> Can you see those errors with just a few test calls ?
> 
> 
> I found about 20 bugs / structural design flaws in stock libss7 / dahdi
> mtp2 support. With my changes the mtp2/mtp3 layers are far more robust
> than stock libss7.
> Fixed all but a single one, related to knowing when the linkset is up or
> down, and not trying to send isup messages, especially IAM, through a down
> linkset - all sigchans down.
> 
> If there's a bug, use ss7 set debug on linkset X to trace ss7 messages
> and track isup message flow.
> 
> I used libss7 successfully with telcobridges tmedia, digitro switches,
> ericsson AXE, huawei NGN, Nortel DMS, several STPs, EWSS, Nec NEAX, and
> I'm probably missing a couple of switch types.
> I never ran into SS7 / ISUP bugs in other switches, always in libss7, but
> the nature of the bugs found is nothing like what you're reporting.
> I started testing libss7 with those kinds of switches 5 years ago, so I
> have some mileage to make those statements, especially from reading and
> understanding a large portion of the libss7 / sig_ss7 / chan_dahdi code.
> 
> The issue you're describing is caused by Asterisk getting ss7 messages
> that belong to another linkset or sending ss7 messages on the wrong ss7
> link.
> Check for UCIC or CFN ISUP responses.
> 
> 
> 
> You need to define chan_dahdi.conf basically like this:
> 
> ; basic ss7 / isup parameters, usually the same for the whole libss7 setup
> signalling=ss7
> ss7type=itu/ansi
> ss7_called_nai=subscriber/national/international/unknown
> ss7_calling_nai=subscriber/national/international/unknown
> networkindicator=national/international/...
> 
> ; Your local pointcode
> pointcode = X
> 
> ; Start definition for linkset N
> linkset = N
> 
> adjpointcode = STP point code otherwise switch point code
> ; Instantiate a signalling link on channel 16 belonging to linkset N,
> ; with adjacency to adjpointcode
> sigchan = 16
> ; Define more signalling links if needed, with adjpointcode and sigchan
> 
> defaultdpc = pointcode for ISUP messages
> cicbeginswith= CIC of the next voice channel defined
> ; Instantiate voice channel on linkset N, talking to PC defaultdpc, CIC
> ; numbering incremented automatically
> channel => dahdi channel range
> 
> cicbeginswith= next CIC range, if non contiguous
> channel => dahdi channel range
> 
> ; another point code belonging to the same linkset (if links share
> ; signalling to multiple switches, typically links through an STP)
> defaultdpc = another point code
> ;repeat cicbeginswith, channel
> 
> ; Starts definition of another linkset
> linkset = M
> ; repeat same sequence as above
> 
> 
> On 06/25/13 05:13, Pavel Troller wrote:
> > Hello Marcelo,
> >
> >> Per usual, read the fine manual. Wait, there's no manual !
> > You're right :-).
> >
> >> Since you seem to have done your part and actually know some ss7 and
> >> isup, here comes a hint.
> >>
> >> You created two or more linksets where you must have a single one.
> >> libss7 doesn't have the ss7 routing feature.
> > It seems strange to me. Let me try to explain this in a more detailed way.
> > There is 1 (one) Asterisk box.
> > It has 2 (two) "linksets" configured, with 1 (one) signalling link per linkset.
> > Linkset 1 is configured for one DPC and with CICs 1 - 496.
> > Linkset 2 is configured for another (different) DPC and also with CICs 1 - 496.
> > Both the systems connected to this Asterisk box are configured to respond
> > directly to the linkset between them and the Asterisk, so it's sure that
> > a MSU from DPC1 cannot come over LS2 and vice versa.
> > I hope that this extremely simple setup is in the scope of current libss7
> > functionality. Or am I wrong ?
> >
> >> In libss7, the linkset concept is different from the official ss7 linkset.
> >>
> >> All signalling links that carry ISUP traffic for a given set of channels
> >> must be kept on a single linkset, as well as all ISUP channels that go
> >> through those links.
> > I hope that my setup is conformant with this limitation.
> >
> >> It looks like you're getting incoming signalling for ISUP channels that
> >> are on another linkset.
> > It really looks like this, but I still hope it's not the case. Please note that
> > the traffic on the box is rather high, such an error occurs for one of, say,
> > 10000 call attempts. I think that in case of such a fatal routing problem,
> > which you are talking about, it wouldn't be possible to use the system
> > regularly.
> >
> >> I'm sure you didn't find any libss7 bug.
> > Really strong words! I wouldn't say it for any of my programs :-).
> >
> >> I have a highly customized version of libss7/dahdi/asterisk, fixing lots
> >> of issues, but this isn't one of them.
> > Possibly your setup/usage scenario is a bit different ?
> >
> >
> >> Processed over one million call setups, with a very complex setup (6
> >> linksets, 7 links, 6E1 on a single switch, plus another 6E1 on remote
> >> switches using my simple STP solution, sharing the local links over SS7
> >> over UDP - my simpler proprietary alternative to sigtran).
> > These switches (I have two of them, but the second one is still on a regular
> > unpatched SS7 stack) make approx. 3 million call setups per week. My
> > record (without restarting/crashing Asterisk) is about 3 weeks with more than
> > 10 million calls.
> >
> >> If you need commercial support, contact me off list.
> > Thanks for your offer.
> >
> > With regards, Pavel
> >
> >> On 06/24/13 09:02, Pavel Troller wrote:
> >>> Hi!
> >>>   I would like to share my experience with the deployment of this experimental SS7
> >>> branch.
> >>>   The first impressions are good, especially the timers seem to work well,
> >>> saving many calls from being frozen.
> >>>   However, there are still some strange things, which I would like to discuss
> >>> here, one by one.
> >>>   The first one is that the channel sometimes doesn't recognize a message
> >>> (mostly RLC), even though it results from an action initiated by the channel itself.
> >>> Typically, the following appears often:
> >>>
> >>> [Jun 24 13:33:41] ERROR[3975]: chan_dahdi.c:14406 dahdi_ss7_error: [1] ISUP timer t17 expired on CIC 27 DPC 4097
> >>> [1] Got RLC but we didn't send REL/RSC on CIC 27 PC 4097 reseting the cic
> >>>
> >>>   As I understand, there were some timeouts and now the channel tries to
> >>> recover by sending RSC and firing T17. However, it seems that it immediately
> >>> rejects RLC, which comes back as a response to the RSC which was just sent
> >>> upon expiry of T17. And this appears again and again in the rhythm of T17,
> >>> and the channel is not operational.
> >>> ss7 show calls shows the following line for the misbehaving CIC:
> >>>    27  4097  11  IAM                       IAM
> >>>  
> >>>   Or, a very similar situation:
> >>> [2] Got SUS but no call on CIC 48 PC 4096 reseting the CIC
> >>> [2] Got RLC but we didn't send REL/RSC on CIC 48 PC 4096 reseting the CIC
> >>>
> >>>   The first question is why there was no call when SUS was received. My
> >>> idea is that both parties hung up at the same time and
> >>> that the call was being torn down on the Asterisk side (REL just sent
> >>> or something like this) when SUS arrived. Maybe the call was marked as
> >>> cleared even before RLC came back? OK, I can understand this. But
> >>> if the CIC was reset as the first message says (i.e. RSC was sent), why is the
> >>> RLC coming back not recognized then?
> >>>
> >>> Or, just now the following appeared:
> >>>
> >>> [1] Got ACM but we didn't send IAM on CIC 10 PC 4097 reseting the cic
> >>> [1] Got RLC but we didn't send REL/RSC on CIC 10 PC 4097 reseting the cic
> >>>
> >>> Again, it's questionable why this happened, but the second line seems
> >>> to indicate some brokenness again.
> >>>
> >>> To explain: The channel is operating on a gateway equipped with 16 E1s
> >>> and current traffic is about 10 CAPS, there are two linksets to two
> >>> cooperating exchanges. They are EWSDs, which have very mature and stable
> >>> SS7, so I'm almost sure that they are not making signalling errors.
> >>>
> >>> With regards,
> >>>   Pavel
> >>>
> >>>
> 
> 
> -- 
> Atenciosamente,
> 
> Marcelo Pacheco
> M2J Comunicações e Informática
> Fixo: (27)2222-8118 / (27)2233-2296
> Vivo: (27)9964-5440
> Claro: (27)9312-5319
> MSN: marcelo at macp.eti.br
> E-mail: marcelo at m2j.com.br


