This is the first in a series of (long overdue) posts related to odd bugs and behavior experienced in the Cisco Unified Border Element (CUBE) which is built into Cisco IOS. I will spare you all the details, but high level our environment looks like this:
- Cisco Unified Communications Manager (CUCM) – multisite deployment with centralized call processing with geographical diversity
- Contact Center – Cisco CVP including Call Studio, UCCE, Nuance ASR/TTS, Cisco Unified Presence Server (SIP Proxy)
- SIP Trunks with CUBE for Local/Long-Distance and Inbound Toll-Free
With our SIP trunking service for inbound toll-free there is an advanced feature you can add (for a fee) which provides you with the calling party number or billing number for any inbound toll-free call even if the caller is blocking their callerID information with *67 or other various methods. With this advanced feature added, when a call comes ingress (inbound) into CUBE the FROM field of the SIP INVITE it looks like this (yes, I’ve changed the numbers and IP addresses):
From: Anonymous <sip:firstname.lastname@example.org>
When it leaves the CUBE egress (outbound) it looks like this:
From: "anonymous" <sip:email@example.com>;tag=12345678-1A23 P-Asserted-Identity: "Anonymous " <sip:firstname.lastname@example.org>
By default, CUPS (SIP Proxy) will strip off the PAI header unless you configure the “Privacy With PAI” Service Parameter to Off. When Off, the Proxy will pass along any PAI header in INVITE requests as-is without any processing. In order to do ANI matching (the process of trying to identify a customer by the calling party number), we need to get this info into our front-end Studio application in CVP. After working with TAC, they said if we change this service parameter to on, we could proceed to enhance our scripts to use the data in the PAI field.
Change scheduled, change tested, PAI was now passing to CVP & UCCE. All was good until the next morning, calls started coming in about callers hearing “dead air” at various levels within the contact center applications and a few reports of “dropped calls”. Dropped calls are never good but we knew what had changed and the back-out was simple; edit the CUPS service parameter back. Problem resolved. Here is where the fun begins.
What was going on? This was very odd call behavior and the change had been tested with no issues the night before. Well there were two things going on:
- A Bug (for me, it’s usually always a bug)
- An imperfect test plan for our change record
To isolate this issue, it takes a good understanding how CVP/UCCE call flow works. There are a lot of moving peices with CVP & UCCE. For a very good overview and some pictures (pictures are always good), check out this post over at the Cisco Support Community. This post is a great overview of all the working parts when you are in a support role when CVP & Contact Center Enterprise is involved. When it works well, it works. When it breaks (and it will break), you best know how to troubleshoot on all these components and sub-processes very quickly.
Since the change was on CUPS, I focused my attention there. The SIP Proxy traces in CUPS are fantastic and mostly human comprehensible (always a bonus). I could see the calls were starting out fine, but most of the problems I saw were when the caller was being played ringback (that artificial ringing noise you hear when you are being transferred). Yes folks, that ringing sound you here is just a wave file sitting on an IOS gateway. After the INVITE to play ringback, I was seeing no response from the IOS gateway (CUBE in our case, but this could also be your VXML Gateway depending on how you are setup) followed by a CANCEL from CVP and a 408 (Request Timeout) error in CUPS.
Here is what a working call looks like in CUPS (SIP Proxy) for ringback:
14:23:40.379 |PID(11948) sip_sm.c(1266) Sent 1483 bytes UDP packet to 10.1.1.1:5060 INVITE sip:email@example.com:5060;transport=udp SIP/2.0 14:23:40.383 |ID(11740) sip_protocol.c(5925) Received 1131 bytes UDP packet from 10.1.1.1:58004 SIP/2.0 200 OK
Here is what that same call flow looked like after our service parameter change in CUPS to pass the PAI field:
07:13:28.577 |PID(1307) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.1.1.1:5060 INVITE sip:firstname.lastname@example.org:5060;transport=udp SIP/2.0 07:13:29.080 |ID(1309) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.1.1.1:5060 INVITE sip:email@example.com:5060;transport=udp SIP/2.0 07:13:30.130 |PID(1289) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.1.1.1:5060 INVITE sip:firstname.lastname@example.org:5060;transport=udp SIP/2.0 07:13:32.181 |PID(1266) mod_sip_rtf.c(195) rtf_add_bad_route() 07:13:32.181 |PID(1266) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.2.2.2:5060 INVITE sip:email@example.com:5060;transport=udp SIP/2.0 07:13:32.730 |PID(1315) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.2.2.2:5060 INVITE sip:firstname.lastname@example.org:5060;transport=udp SIP/2.0 07:13:33.781 |PID(1288) sip_sm.c(1266) Sent 1559 bytes UDP packet to 10.2.2.2:5060 INVITE sip:email@example.com:5060;transport=udp SIP/2.0 07:13:35.555 |PID(1312) sip_protocol.c(5925) Received 413 bytes UDP packet from <CVP IP>:7574 CANCEL sip:9191919102@<CUPS-SIP DOMAIN>;transport=udp SIP/2.0 07:13:35.556 |PID(1312) sip_sm.c(1266) Sent 364 bytes UDP packet to <CVP IP>:5060 SIP/2.0 200 OK 07:14:28.580 |PID(1267) sip_sm.c(1266) Sent 399 bytes UDP packet to <CVP IP>:5060 SIP/2.0 408 Request Timeout
The key here is to note the packet size differences. If you know your basic data networking (and voice engineers need to), you know that generally anything ~1500 bytes will be fragmented to fit into an Ethernet Frame. Armed with this information, it was time to work with Cisco TAC (yes I had done most of the work for them). Enter bug CSCtt38880, titled, “IOS gateway not handling fragmented SIP UDP message properly”.
Why did we not see this when we tested the night before? Well our test plan was somewhat flawed (actually it would have been sufficient had we not hit the bug), but we didn’t test a call all the way through to a rep. As a SIP call gets bounced around, headers get added, removed, and modified. We didn’t push the limits of 1500 bytes until the caller needed a rep. You need a rep, we play you ringback. The INVITE messages for ringback were dropped, callers heard silence, and some hung up. Those who actually stayed on the line were able to get to a rep, but people do funny things when things don’t work right.
The silver lining was that we enhanced our standard test plan and made some call flow changes to allows us to test calls all the way through our front-end CVP Studio applications all the way to queue and finally to a test contact center agent while in a maintenance window. Hopefully this helps enhance future change testing, however, typically no two outages are exactly the same.
Until the next adventure…