Recently at work we had two separate incidents with our SIP service provider in which both their primary and secondary Acme Session Border Controller (SBC) clusters went into a “hung” state, taking us off the air as far as the outside telephone world was concerned. Despite all the provisioning precautions of having two geographically diverse carrier SBCs accessed from two geographically diverse MPLS transport circuits (used exclusively for SIP trunking) that route to two geographically diverse data centers with a dedicated CUBE router in each, we were still hosed. A quick packet capture on the CUBE’s external interface showed the provider’s SBCs responding with SIP 503 “Service Unavailable” to every outbound call attempt. Inbound callers got an “All Circuits Busy” message, and no signaling was reaching our CUBEs from the provider.
The provider’s SBCs continued to respond to Out of Dialog (OOD) OPTIONS pings, so the dial-peers did not automatically “busy out” and our existing monitors didn’t alarm. For both outages, we had to wait until a user reported the issue before we knew to isolate the problem and escalate with the carrier (let me say, we were much faster the second time this happened). In the voice space of networking, the more proactive you can be the better; users get grumpy when their dial tone doesn’t work. After two of these outages, my quest was to figure out how to automate detection so we get alerted proactively if this happens again…
Here is what we saw in the packet capture for an outbound call (IP Addresses removed):
Cisco has a great white paper called “Cisco Unified Border Element (CUBE) Management and Manageability Specification” that is worth a read if you have not seen it before. It is a tad dated, as it doesn’t cover some of the new stuff in IOS 15.2T, but it still has lots of good material. In this guide you will see that the CUBE keeps running counters of all SIP error responses (4XX, 5XX & 6XX). Armed with the SNMP MIB information from this white paper (OID 22.214.171.124.126.96.36.199.188.8.131.52.7), I had the data I needed, but I still had to get the CUBE to tell me when there was a problem. Although I had never used it in production, thanks to my background in routing and switching I was aware of EEM and figured it was my best bet for building this solution.
For a while now, the Embedded Event Manager (EEM) features have quietly existed in IOS. If you spend some time researching EEM, you can quickly see its power and capabilities. Cisco describes EEM like this: “Cisco IOS Embedded Event Manager (EEM) is a powerful tool integrated with Cisco IOS Software for system management from within the device itself. EEM offers the ability to monitor events and take informational, corrective, or any desired action when the monitored events occur or when a threshold is reached. Capturing the state of the router during such situations can be invaluable in taking immediate recovery actions and gathering information to perform root-cause analysis. Network availability is also improved if automatic recovery actions are performed without the need to fully reboot the routing device.”
I had my counters and my toolkit; now I had to build the automated solution. I tried a few things at first, but didn’t have any luck getting them to work. Since I was using out-of-the-box functionality and nothing custom with TCL, I tried opening a Cisco TAC service request, but got this response:
Unfortunately, TAC does not provide customized configuration of these tools, just support of the features in the IOS itself, like when there is a bug in the feature.
Dang. Well, I guess I’m on my own. A Google search for EEM will yield a bunch of good resources, but nothing that was doing exactly what I wanted. This post on the Cisco Support Community was the most help. If you look very closely at the thread, you will catch that “You have to specify an instance for your object to be able to do a get-type of exact”. This means adding a .0 to the end of the OID. I guess this is some EEM caveat, because I could query the original OID in my existing SNMP tools with no issues. This was my problem all along.
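To illustrate why the trailing .0 matters, here is a toy Python model of an SNMP agent. It is only a sketch: the OID is a made-up placeholder (not the CUBE counter), the agent is a plain dictionary, and the idea that my other SNMP tools got away with the shorter OID because they walk or use GETNEXT is my assumption, not something I confirmed.

```python
# Toy model of SNMP GET vs GETNEXT lookups; not a real SNMP stack.
# The OIDs below are hypothetical placeholders, not the CUBE counter.

def oid_key(oid: str) -> tuple:
    """Turn '1.3.6.1' into (1, 3, 6, 1) so OIDs sort numerically."""
    return tuple(int(part) for part in oid.split("."))

# A scalar object is stored at <object OID>.0 in the agent's table.
AGENT = {
    oid_key("1.3.6.1.4.1.99999.1.7.0"): 42,  # hypothetical 503 counter instance
    oid_key("1.3.6.1.4.1.99999.1.8.0"): 0,   # hypothetical neighbouring counter
}

def snmp_get(oid: str):
    """Exact GET: succeeds only if the full instance OID is present."""
    return AGENT.get(oid_key(oid))  # None stands in for a 'no such instance' reply

def snmp_getnext(oid: str):
    """GETNEXT: return the first instance that sorts after the given OID."""
    key = oid_key(oid)
    for candidate in sorted(AGENT):
        if candidate > key:
            return ".".join(map(str, candidate)), AGENT[candidate]
    return None

base = "1.3.6.1.4.1.99999.1.7"
print(snmp_get(base))         # None: the object OID alone is not an instance
print(snmp_get(base + ".0"))  # 42: the exact instance is found
print(snmp_getnext(base))     # ('1.3.6.1.4.1.99999.1.7.0', 42)
```

In this model, an exact lookup only works with the full instance OID, while a next-style lookup happily lands on the .0 instance by itself.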
What did my final configuration look like? Here you go:
event manager applet cube-monitor-sip-503-inbound
 description Poll OID for SIP 503 errors ingress to CUBE every 30 seconds. If incrementing over previous poll, generate a critical level syslog alert.
 event snmp oid 184.108.40.206.220.127.116.11.18.104.22.168.7.0 get-type exact entry-op ge entry-val 1 entry-type increment exit-op eq exit-val 0 exit-type increment poll-interval 30
 action 1.0 syslog priority critical msg "SIP 503 Service Unavailable messages are incrementing inbound. Check SIP network health."
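To make the entry and exit thresholds easier to reason about, here is a rough Python simulation of how I think about the applet. It is a simplified model, not IOS's actual implementation: it assumes the increment is just the difference between two consecutive polls, and the function name and sample counter values are made up for illustration.

```python
def simulate_applet(samples, entry_val=1, exit_val=0):
    """Return the poll indexes at which the modelled applet would fire.

    samples   -- counter values read at each poll (one per poll-interval)
    entry_val -- fire when the increment is >= this (entry-op ge)
    exit_val  -- re-arm when the increment equals this (exit-op eq)
    """
    fired = []
    armed = True  # the applet can only fire while armed
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if armed and delta >= entry_val:
            fired.append(i)   # would generate the critical syslog here
            armed = False     # stay quiet until the exit condition is met
        elif not armed and delta == exit_val:
            armed = True      # counter stopped climbing, so re-arm
    return fired

# Hypothetical 503 counter readings, one per 30-second poll.
polls = [10, 10, 12, 15, 15, 16]
print(simulate_applet(polls))  # [2, 5]
```

In this model the applet alerts once when the 503 counter starts climbing, stays silent while it keeps climbing, and only alerts again after at least one quiet poll.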
You also need to make sure the EEM SNMP Event Detector has access to the IOS SNMP server with this command:
With this configuration in place, here is what I get in the logging buffer when the 503 errors increment:
%HA_EM-2-LOG: cube-monitor-sip-503-inbound: SIP 503 Service Unavailable messages are incrementing inbound. Check SIP network health.
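For anyone wiring this into a monitor, here is a small Python sketch that pulls the pieces out of a log line shaped like the one above. The regular expression and the `parse_eem_log` name are my own illustration and assume the `%FACILITY-SEVERITY-MNEMONIC: applet: text` layout seen in this one message; it is not how our monitoring product actually parses syslog.

```python
import re

# Assumed layout: %FACILITY-SEVERITY-MNEMONIC: <applet name>: <message text>
LOG_PATTERN = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>[0-7])-(?P<mnemonic>[A-Z0-9_]+):\s*"
    r"(?P<applet>[^:]+):\s*(?P<text>.*)"
)

# Standard syslog severity names, indexed by severity number.
SEVERITY_NAMES = [
    "emergency", "alert", "critical", "error",
    "warning", "notification", "informational", "debugging",
]

def parse_eem_log(line):
    """Return a dict describing the log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    severity = int(match.group("severity"))
    return {
        "facility": match.group("facility"),
        "severity": severity,
        "severity_name": SEVERITY_NAMES[severity],
        "applet": match.group("applet").strip(),
        "text": match.group("text").strip(),
    }

line = ("%HA_EM-2-LOG: cube-monitor-sip-503-inbound: SIP 503 Service Unavailable "
        "messages are incrementing inbound. Check SIP network health.")
event = parse_eem_log(line)
print(event["severity_name"], event["applet"])  # critical cube-monitor-sip-503-inbound
```

The 2 in %HA_EM-2-LOG is the syslog severity, which is what `priority critical` in the applet sets.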
The critical level syslog alert gets picked up by our existing monitors and we get a ticket sent to our group for tracking and resolution. Mission accomplished. Could I make this check more sexy and have more smarts and functionality? Yes, but for 4 lines of config, this gets the job done nicely.
- EEM Configuration for Cisco Integrated Services Router Platforms
- Cisco Unified Border Element (CUBE) Management and Manageability Specification
- Writing Embedded Event Manager Policies Using the Cisco IOS CLI
- Cisco Live Presentations:
  - BRKNMS-2030 – Onboard Automation with Cisco IOS Embedded Event Manager
  - BRKNMS-3021 – Advanced Cisco IOS Device Instrumentation