This post is a deviation from my typical aviation topics toward a problem from my work life. In my day job, I do not fly airplanes; I am a Network Engineer on a team that supports the data and voice network infrastructure for a company in town with a global presence. If you have ever said or heard the phrase, "the network is slow", I work on the team that fields those problem reports daily.
This morning started off with a trouble report that outbound calls to our external conference bridge number were resulting in a "fast busy" after 2-3 minutes on a call. Of course, there also happened to be a corporate-wide "all managers" meeting this morning using the bridging service, which greatly increased the urgency. Upon testing, the issue was easily reproducible: sure enough, the call would go "fast busy" at about 2 minutes 53 seconds.
Now, your first course of action is always to blame a vendor. In this case, we had options: a) the telco carrier, b) the conference service provider, or c) Cisco. There is no wrong answer here. I decided to keep it interesting and went with option b, the conference provider, mostly because 1) beating up on the telco is a routine occurrence in network engineering, and 2) I had a hunch I would be talking to Cisco TAC about this issue anyway, so they would get their turn.
Blaming a vendor not only buys you some time to troubleshoot further internally, but it also gives you something to say you have done when the calls come in for a status report on the outage, which is usually about 30 seconds before you are actually made aware of the initial problem.
The first thing I noticed when we reproduced the issue was that no call timer showed up on the IP Phone display after the conferencing service answered. This told me the voice signaling platform was not aware the call had even connected. Our first clue.
At this point, I decided to see if the problem would occur if I sent these calls out a different PSTN trunk group on another Media Gateway Control Protocol (MGCP) gateway. I made the route pattern configuration changes in Cisco Unified Communications Manager (CUCM) and placed a test call; it worked like it should. Well dang, that likely exonerated the conference bridge service provider. Next, I modified the remaining route patterns to steer all other call types away from the impacted gateway to alternate gateways, which restored service for the users.
At this point, I created a test route pattern in CUCM for a specific number (a co-worker's cell phone) so that I could continue troubleshooting the root cause of the problem without the end users suffering. Next it was off to the gateways to see what the carrier and the gateway were telling each other. Cue the Command Line Interface (CLI) command: debug isdn q931. Best. TDM. Debug. Ever.
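If you have not run this one before, here is roughly how I go about capturing it. The interface name below is just an example, and the `debug condition` line is an optional way to scope the output on a busy gateway:

```
! Timestamp the debugs so message timing is visible (handy when a timer
! is the suspect) and watch them live on your vty session.
service timestamps debug datetime msec localtime
terminal monitor

! Optional: scope debug output to a single D-channel interface.
debug condition interface Serial0/2/0:23

debug isdn q931
! ...place the test call, then turn it all off...
undebug all
```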
Once I enabled the debug, I saw the problem right away (the blog post title gives you a clue). Let's take a closer look.
Here are the debugs of a failed call:
```
Aug 25 09:07:05.863 CDT: ISDN Se0/2/0:23 Q931: RX <- CALL_PROC pd = 8 callref = 0x800B
        Channel ID i = 0xA98397
                Exclusive, Channel 23
Aug 25 09:07:08.707 CDT: ISDN Se0/2/0:23 Q931: RX <- ALERTING pd = 8 callref = 0x800B
        Progress Ind i = 0x8488 - In-band info or appropriate pattern is now available
Aug 25 09:07:10.067 CDT: ISDN Se0/2/0:23 Q931: RX <- CONNECT pd = 8 callref = 0x800B
Aug 25 09:10:03.946 CDT: ISDN Se0/2/0:23 Q931: TX -> DISCONNECT pd = 8 callref = 0x000B
        Cause i = 0x8AE6 - Recovery on timer expiry
Aug 25 09:10:03.970 CDT: ISDN Se0/2/0:23 Q931: RX <- RELEASE pd = 8 callref = 0x800B
Aug 25 09:10:03.994 CDT: ISDN Se0/2/0:23 Q931: TX -> RELEASE_COMP pd = 8 callref = 0x000B
```
That sucks. Clearly "we" are the ones transmitting the DISCONNECT message, but why (other than some timer expiring)? The cause code confirms it: 0x8AE6 decodes to Q.850 cause 102, "recovery on timer expiry" (0xE6 minus the 0x80 extension bit is 0x66, or decimal 102), meaning our side gave up on the call rather than the far end rejecting it. This means the telco carrier is likely innocent too.
Let's compare the failed debug to the debug of a good call:
```
Aug 25 08:57:39.591 CDT: ISDN Se0/3/1:23 Q931: RX <- CALL_PROC pd = 8 callref = 0xCFDD
        Channel ID i = 0xA98397
                Exclusive, Channel 23
Aug 25 08:57:41.515 CDT: ISDN Se0/3/1:23 Q931: RX <- ALERTING pd = 8 callref = 0xCFDD
        Progress Ind i = 0x8488 - In-band info or appropriate pattern is now available
Aug 25 08:57:41.527 CDT: ISDN Se0/3/1:23 Q931: RX <- CONNECT pd = 8 callref = 0xCFDD
Aug 25 08:57:41.531 CDT: ISDN Se0/3/1:23 Q931: TX -> CONNECT_ACK pd = 8 callref = 0x4FDD
Aug 25 08:57:43.487 CDT: ISDN Se0/3/1:23 Q931: TX -> CONNECT pd = 8 callref = 0x807C
        Display i = 'Voicemail'
Aug 25 08:57:43.523 CDT: ISDN Se0/3/1:23 Q931: RX <- CONNECT_ACK pd = 8 callref = 0x007C
Aug 25 08:57:44.703 CDT: ISDN Se0/3/0:23 Q931: RX <- CONNECT pd = 8 callref = 0xD072
        Progress Ind i = 0x8482 - Destination address is non-ISDN
Aug 25 08:57:44.707 CDT: ISDN Se0/3/0:23 Q931: TX -> CONNECT_ACK pd = 8 callref = 0x5072
```
Can you spot the missing messages? Yep, a missing CONNECT_ACK message (plus some additional subsequent messages). In a normal call, the gateway transmits a CONNECT_ACK back to the carrier as soon as the CONNECT arrives; on the failed call it never did, the connection was never acknowledged, and roughly three minutes later a timer on our side expired and tore the call down. Why was the CONNECT_ACK missing? These gateways are MGCP-controlled, which means CUCM drives the call signaling and pushes most of the key configuration to the gateway upon registration. Maybe MGCP was messed up. With calls already redirected, a gateway "reset" seemed like an easy thing to try. No change.
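As background, the IOS side of an MGCP-controlled gateway carries only a few bootstrap commands, and CUCM downloads the rest at registration. A rough sketch is below (the CUCM hostname is hypothetical); `show ccm-manager` is also the quick way to confirm which CUCM node the gateway is registered to after a reset:

```
! Minimal MGCP bootstrap config on the gateway (hostname is made up);
! CUCM pushes the remaining configuration when the gateway registers.
ccm-manager config server cucm-pub.example.com
ccm-manager config
ccm-manager mgcp
mgcp
mgcp call-agent cucm-pub.example.com 2427 service-type mgcp version 0.1

! Verify registration and call-agent state:
show ccm-manager
show mgcp
```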
Well, at this point it was time to admit defeat, collect the relevant information (including CUCM traces), and phone a friend at Cisco TAC (Technical Assistance Center). I typed up my Service Request (SR) online and called TAC's toll-free number to bump it to a Severity 2 (Network Degraded), since Cisco doesn't allow you to create anything other than a Severity 3 service request online. The contact center agent gladly changed the priority for me and routed me to an engineer. I gave the engineer a high-level overview and he said, "Let me do some quick research; I've seen this before and I think it's a bug." After a few minutes he came back and asked if we could try something. "First, place an inbound call to the MGCP gateway." I did; it worked fine. He said, "Now try your outbound call again." Well hot dang, it worked. The CONNECT_ACK was sent and the call completed. He indicated this confirmed the bug.
The defect is known as Bug ID CSCto98848 (Cisco login required) within Cisco's DDTS (Distributed Defect Tracking System). It does have a workaround, which is good; however, fully resolving it requires a CUCM patch. We restored normal call routing using the workaround for now, but if the MGCP gateway re-registers or the CCM service restarts, we will hit the bug again.
How did we hit this bug? Well, the night before we had performed maintenance on half of our CUCM servers, taking them out of service to add 2 GB of additional RAM in preparation for some upgrades later in the year. Nothing had changed configuration-wise, but we did "mess" with the system the night before.
A few lessons learned here:
- Notice the small things. I happened to notice right away that the call timer was not displaying on the IP Phone. This was a significant first clue as to what was going on and told us where to focus our next troubleshooting efforts.
- Know your gateway debugs. debug isdn q931 is a great Cisco IOS TDM debugging command. You don't always need to memorize the sequence of the messages, but you can at least compare the output to a known good debug to spot the differences (see the sketch after this list).
- Know your alternate call routing options. Once we knew we could route calls around the problem, we took action to do so. Understanding your dial plan and knowing how to redirect calls quickly could save you some valuable minutes of network availability.
- Know what has changed. We knew we had done maintenance the night before. We didn't make any configuration changes, but we did reboot three nodes in our seven-node CUCM cluster. It seemed insignificant, but I told the TAC engineer that detail up front, and he later thanked me for it; he indicated it helped him zero in on the specific bug.
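On the debug-comparison point, a simple trick is to capture both the failing call and a known-good call to the logging buffer, then filter out just the Q.931 lines so the two message sequences line up for a side-by-side read. A rough sketch (buffer size and filter strings are just examples):

```
! Capture debugs to the buffer, reproduce both calls, then filter.
logging buffered 512000 debugging
debug isdn q931
! ...place a failing call and a known-good call...
undebug all

! Pull just the message exchange for comparison:
show logging | include Q931
! Or check specifically whether the acknowledgement ever went out:
show logging | include CONNECT_ACK
```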
Hopefully this was a helpful post, not only for how we approached this strange bug, but also for some things to think about when troubleshooting a voice issue. I have really enjoyed adding voice components to my traditional data network engineering background. Voice issues are certainly different; however, a lot of the same troubleshooting steps apply. Know your systems, always have a plan B, and don't be afraid to ask for help when you need it. Huh, all items that apply to aviation as well as network engineering.