To quote the late Bob Ross, “It’s hard to see things when you are too close. Take a step back and look.” When you are in the middle of one of those early morning troubleshooting sessions, it can be easy to panic. First, you are dealing with being woken up abruptly; everything is fuzzy. Then the person on the other end of the line might not be able to explain exactly what is happening. This combination usually leads to some fun times, sarcastically speaking. The best advice is to take that step back and look at the overall picture. No matter what your troubleshooting approach might be, the goal is to have one. The worst thing you can do is wander aimlessly, hoping to bump into the solution. In this entry I will dig into my mental ticket system of random issues I’ve run into over the years, pull out three, and cover how each was resolved.
A pair of Palo Alto firewalls in the data center were to be replaced with newer Palo Alto firewalls. Everything was set up in Palo Alto’s Panorama, which centrally manages Palo Alto firewalls. The new pair was configured exactly the same as the old pair; it should have been a simple one-for-one replacement. During the change window, cables were moved from the old firewalls to the new ones. During testing, services seemed fine internally, but then it was discovered that some services with Network Address Translations (NATs) that should have been publicly accessible were not working.
My first thought was that I had simply forgotten to configure something on the new firewalls. I had gone through the config before deploying them, but maybe I missed something. Internally, I was able to reach those services, but from my external connection the public services were unreachable. Was it something with my NAT rules? It was not; I went through my rules and routes one more time just to be sure. The issue only affected connections coming from outside the company, so I hopped on one of the internet routers, which should hold some truth about what was going on. The routes into the company looked fine. I then took a look at the router’s ARP table. The IP addresses for the services in question were there, so I noted the MAC address listed for one of the non-working services and compared it to the MAC address of the egress interface on the new firewall. It did not match! The old firewalls were still up but no longer configured; only the management IP was reachable. I logged in and checked the MAC address of the egress interface on the old firewall. It matched the MAC the internet router was showing in its ARP table: the router was holding a stale ARP entry pointing at the old hardware. After clearing the ARP entries for the affected services on the internet routers, we were back in business, and the new firewall’s MAC address showed up in the ARP table.
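For anyone wanting to replicate the check, here is a rough sketch of what it looked like on a Cisco internet router (the IP address and interface below are placeholders, not the actual values from this change):

```
! Check which MAC address the router has cached for the service IP
show ip arp 203.0.113.10

! On the Palo Alto side, the egress interface MAC can be checked with:
!   show interface ethernet1/1

! Clear the stale entry so the router re-ARPs and learns the new MAC
clear ip arp 203.0.113.10

! Confirm the new firewall's MAC now appears
show ip arp 203.0.113.10
```

A gratuitous ARP from the new firewall would normally update neighbors on its own, but when that does not happen, clearing the cached entry forces the refresh.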
The help desk sent over a ticket stating that the main phone number for one of the facilities was not reachable. They had tried internally and externally with the same result: a voice stating the number was not reachable. Callers should have been hearing an automated call handler, and there was no indication of how long the number had been unavailable.
Well, the first thing to do in this situation is to replicate the issue. You want to make sure the issue is not sporadic; an intermittent issue changes the way you troubleshoot. I found the main line was down no matter when or where you called from. This is where having a call flow diagram helps. I took a look at the diagram and ran a few debugs on the router. Yes, running debugs in production can be useful in troubleshooting, but you must always be careful not to overwhelm the device. Know what each debug will produce before you turn it on; sometimes a conditional debug is necessary to minimize impact. First, I wanted to make sure the call was actually reaching the site. The debugs would also show what the call looked like and how it was being manipulated. Since the site’s WAN router also serves as a voice gateway terminating PRIs, a couple of debugs are especially helpful. The debug voip ccapi inout command follows the call as it comes into the site and tracks what happens to it, while debug isdn q931 tracks the signaling as the ISDN connection is established. I turned these debugs on and called the number. The call did come into the router, which was a good sign that there was no issue with the provider or with that number from the outside. The public number being dialed was being translated to an internal extension by the router. I followed the site’s call flow and saw the extension pointed to a CTI Route Point in Call Manager. A CTI Route Point is, in simple terms, a virtual device that can redirect calls to another system. What should have happened is that the call would be routed to Unity, which hosts the call handler for the main line. Someone had deleted that particular CTI Route Point! This was an easy fix; I just needed to recreate it. However, after recreating it, I needed to track down who deleted it.
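For reference, a minimal version of that debug session on a Cisco IOS voice gateway might look like this, with the test being a call placed to the affected number while watching the output:

```
! Send debug output to the current vty session
terminal monitor

! Follow the call through the gateway's call control API
debug voip ccapi inout

! Watch the ISDN Q.931 signaling (SETUP, CONNECT, DISCONNECT cause codes)
debug isdn q931

! ...place the test call and read the output...

! Always disable debugging when finished
undebug all
```

The Q.931 output confirms the call is arriving from the provider; the ccapi output shows the digit manipulation and where the call is handed off internally.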
Using Cisco’s Real-Time Monitoring Tool (RTMT), I was able to look at the Call Manager audit logs and track the person down. It had been deleted by mistake. Good laughs were had by all.
A tech at a site called and reported that users could not reach the internet. It had been working earlier in the day; then suddenly a few users started to lose connectivity.
The first thing I asked for was a list of the affected users. It seemed like only a few were affected so far, all in the office area. Other users around the building (on other networks) were fine, for the moment. I asked for the IP of one of the users so I could try to ping the machine and find its switch port. The tech gave me an IP that did not match the network the users should have been on. That right there was a good clue to what might be going on. I asked him to verify the IP of at least one more user; it was in the same subnet of the unknown network. Users at the location should have been pulling addresses from the site’s DHCP server, but a few people were picking up IPs from an unknown server. We had no idea when or where it had appeared, but now it was time to find out what was handing out those addresses. I configured DHCP Snooping on the network the users were connected to. The goal was to trust only the ports where traffic from the known DHCP server arrived; all other ports would be untrusted by default, which would include whatever rogue DHCP server was out there. As I was in the middle of configuring and saving the world, the tech called me back. He simply said, “I found it!” “Who?” I asked. He explained that he had just received a call from an onsite contractor who had connected a personal router to the network and had a few questions. Apparently someone who came in for an audit had plugged the router into an open port, and the router started serving DHCP to clients. The tech apologized and handled the situation. I decided it was best to keep the DHCP Snooping config in place. This was not the first time I had run into this situation. Years earlier, a decommissioned wireless controller was plugged back into the network and started serving clients DHCP addresses, and the same contractor scenario has happened elsewhere since. DHCP Snooping took care of those problems.
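For illustration, a minimal DHCP Snooping configuration on a Cisco IOS switch looks roughly like this (the VLAN and interface numbers are placeholders):

```
! Enable DHCP snooping globally and on the affected VLAN
ip dhcp snooping
ip dhcp snooping vlan 10

! Trust only the uplink toward the legitimate DHCP server;
! all other ports are untrusted by default, so server-side messages
! (DHCPOFFER/DHCPACK) from a rogue device are dropped
interface GigabitEthernet1/0/48
 description Uplink toward DHCP server
 ip dhcp snooping trust

! Verify operation and the learned bindings
show ip dhcp snooping
show ip dhcp snooping binding
```

One caveat worth knowing: by default, snooping inserts DHCP option 82 into relayed requests, which some DHCP servers reject. If clients suddenly stop getting leases after you enable snooping, no ip dhcp snooping information option turns that behavior off.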
Step back and take it all in. Ask the right questions. Many situations might seem full of chaos, but you will navigate through the fog if you move carefully. That is the beauty of being in our field: situations like these will always happen, so there is plenty of opportunity for practice. It’s not that we want things to break, but they will, which is why we need to be prepared. Also remember: always blame DNS.