Swack’s Cisco Live To-Do List


My company pays a lot of money to send me here to Cisco Live. That’s likely the case for you as well (if you’re also here). I’ve had a list at past conferences of what I wanted to accomplish, but I never really published it outside my head. This year I’m holding myself more accountable and putting it here. Many are things I could do quite easily back in the office if I didn’t have distractions. Now I can focus AND talk to the smartest folks in the industry about how they do business. Here are some of the many things I hope to accomplish this year.

1. Better understand the Catalyst 4500 series and how I can use it as an aggregation point for 10-gig-connected closet switches. I’ve never really worked with the platform, so getting a better idea of how it works, its benefits and drawbacks, and the deployment options is key. How else could I provide resilient aggregation for 27 network closets with 2x10G links each (that’s 54 ten-gig ports of aggregation capacity before counting uplinks)?

2. Learn AMAP (as much as possible) about 802.1X and how Cisco switches and phones handle it. What are the deployment methods and models? How can we use certificates or other methods like MAC Authentication Bypass (MAB) for Cisco VoIP phones where we have a client connected behind the phone? (A rough switchport sketch follows this list.) What are the capabilities of Cisco Secure ACS and Cisco Identity Services Engine (ISE), and how do they compare with other RADIUS options such as Aruba Networks ClearPass Policy Manager (CPPM) or a simple Windows RADIUS server?

3. Talk in more detail with the SolarWinds Head Geeks and other smart engineers about how route polling works in the latest version of Orion NPM. How can we map over 1,200 locations in Orion so our retail support teams can better take advantage of Orion’s power and knowledge? How can we use Orion NPM and NCM to possibly replace our existing legacy Linux-based config generation tool for store routers and provision them in an automated way?

4. How should I troubleshoot high received errors on ASA and router interfaces (specifically 7200 series)?

5. What are my options for expanding a pair of Nexus 5548UP switches as I keep adding FEXes and running out of ports? If I add another pair, I add another point of management (boo!). If I replace them with 5596s, how do I handle the transition, and what can I get for trading in the 5548s?

6. How can I get our NX-OS gear properly sending syslogs to our syslog server? (I already know this is a great question for the TAC folks who are here.) A rough config sketch follows this list.

7. Learn more about how IP Address Management (IPAM) vendors can prepare us for an 802.1X deployment, especially in terms of learning our existing MAC addresses for a MAB table. I’ve heard of Infoblox and BlueCat. Are there any others worth looking at?

8. Get familiar with Cisco’s next-generation firewall capabilities and how they compare with certain competitors, particularly Palo Alto Networks.
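
For item 2, here’s the kind of access-port template I’d expect an 802.1X-plus-MAB deployment behind an IP phone to land on. This is only a rough sketch on Catalyst IOS; the interface, VLANs, and timers are placeholders I made up, not anything from our environment:

! Hypothetical access port: try 802.1X first, fall back to MAB, with one device
! on the voice domain (the phone) and one on the data domain (the PC behind it).
interface GigabitEthernet1/0/10
 switchport mode access
 switchport access vlan 10
 switchport voice vlan 110
 authentication host-mode multi-domain
 authentication order dot1x mab
 authentication priority dot1x mab
 authentication port-control auto
 mab
 dot1x pae authenticator
 dot1x timeout tx-period 10
 spanning-tree portfast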
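
And for item 6, a minimal sketch of what I understand the NX-OS side of remote syslog to look like, assuming the collector is reached through the management VRF; the server IP and severity level here are placeholders:

! Hypothetical NX-OS logging config; 6 = informational severity.
logging server 192.0.2.50 6 use-vrf management
logging timestamp milliseconds
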

I welcome your comments/feedback below or directly on Twitter (@swackhap).

-Swack


The Case of the Mysterious Disappearing VPN

Many of us in the networking world use IPSEC VPNs over the Internet. The ISP connection is, or at least can be, cheaper than alternatives like MPLS, and of course we all need to connect our networks to the Internet (unless you’re the DoD, CIA, or some other secretive organization with a classified network). This mystery begins with a VPN outage.

Refer to the reference network diagram below. For these two sites, the primary connectivity is the IPSEC VPN over the Internet; the MPLS VPN is the secondary connection.

Problem: IPSEC VPN Down
At 2:44am CT the primary 10Mbps IPSEC VPN went down, and the 3Mbps MPLS circuit took over flawlessly after route reconvergence. As the day progressed, though, traffic between the two sites increased and began causing performance problems for users at Site B (3Mbps doesn’t stretch as far as 10Mbps).

As we continued to troubleshoot what had happened, we found this syslog entry in Splunk that came from FW A:

Oct 14 02:44:33 fw.fw.fw.21 Oct 14 2010 02:44:33: %ASA-4-106023: Deny protocol 47 src inside:a.a.a.1 dst outside:b.b.b.254 by access-group "inside_access_in"
(Note: IP addresses have been changed here for security reasons.)

Nobody had made any changes at 2:44am. So what changed? After digging some more into our change management system, we found this change to FW A that was made back on 9/23:

Last Month – 9/23/2010 12:00:18 AM
ADDS 0, DELETES 0, CHANGES 1

BEFORE:
access-list inside_access_in extended permit gre host a.a.a.1 host b.b.b.254

AFTER:
access-list inside_access_in extended permit gre host a.a.a.254 host b.b.b.254

This change was logged during a nightly config backup/compare, hence the midnight timestamp. It turns out that day (9/23) we had added another VPN connecting a third site (we’ll call it Site C) back to Site A. For that VPN we chose a.a.a.254 as the GRE endpoint on RTR A, since we prefer to reserve .1 addresses for managing routers, and with .1 in use as a GRE endpoint we can’t ping it. Unfortunately, we didn’t realize the other VPN to Site B was still active and using the original entry. Apparently the IPSEC security association (SA) remained up, as did the stateful firewall connection in FW A, until 2:44am CT. So we ask ourselves again: what changed at that time?

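To make the moving parts concrete, here’s a hypothetical sketch of how the Site B tunnel on RTR A would be sourced. The interface number and inner addressing are invented for illustration; only the GRE source and destination correspond to the a.a.a.1 and b.b.b.254 endpoints discussed above:

! Hypothetical GRE tunnel on RTR A toward Site B, carried inside the IPSEC VPN.
interface Tunnel1
 description GRE to Site B over the Internet IPSEC VPN
 ip address 10.255.255.1 255.255.255.252
 tunnel source a.a.a.1
 tunnel destination b.b.b.254
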
Splunk to the Rescue
Diving more into the logs we index with Splunk, we found visually when the problem started: it’s where the histogram suddenly jumps from 17 events per hour to over 1,500.

Clicking on the 2AM timeframe brings up many iterations of the “Deny protocol 47” message shown above. Immediately prior to that stream of messages we see these three events:
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-3-713123: Group = [FW B InternetIP], IP = [FW B InternetIP], IKE lost contact with remote peer, deleting connection (keepalive type: DPD)
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-5-713259: Group = [FW B InternetIP], IP = [FW B InternetIP], Session is being torn down. Reason: Lost Service
  • Oct 14 02:44:26 fw.fw.fw.21 Oct 14 2010 02:44:26: %ASA-4-113019: Group = [FW B InternetIP], Username = [FW B InternetIP], IP = [FW B InternetIP], Session disconnected. Session Type: IPsec, Duration: 21d 15h:00m:15s, Bytes xmt: 181785169, Bytes rcv: 3049561298, Reason: Lost Service
Correct me if I’m wrong, but it appears there may have been some connectivity problem on the Internet that happened just long enough for dead-peer-detection (DPD) to take effect and tear down the existing session. When that happened, a new IPSEC SA was created, still using the GRE endpoint of a.a.a.1. Since the firewall was previously changed to allow a.a.a.254 instead of a.a.a.1, this traffic got denied on the inside interface of FW A and prevented the GRE tunnel from coming up.
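
As a point of reference, the DPD behavior mentioned in that first message is tuned on the ASA under the tunnel-group, along the lines of the sketch below; the peer name and timer values are purely illustrative, not our actual settings:

! Illustrative only: send a DPD keepalive after 10 idle seconds, retry every 2.
tunnel-group [FW B InternetIP] ipsec-attributes
 isakmp keepalive threshold 10 retry 2
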
To fix, we added a rule to FW A allowing GRE from a.a.a.1 to b.b.b.254.
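
In configuration terms, the fix was essentially restoring the original entry alongside the newer one; assuming the same access list shown in the change record above, it looks like this:

access-list inside_access_in extended permit gre host a.a.a.1 host b.b.b.254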

Mystery solved!