Thursday, April 9, 2015

Juniper SRX High Availability (HA) & clustering: Part 2

In Juniper SRX High Availability (HA) & clustering: Part 1 I discussed the theory behind the key terms and concepts used in configuring redundancy groups, reth interfaces, fab links and clustering in general. In this part we work through the actual configuration.

The broad steps that will be covered in this part:

1) Setting up control links
2) Setting up fab links
3) Creating a redundancy group
4) Configuring reth interfaces
5) Enabling clustering

Pre-config steps:

1) Make sure both SRX devices run the same Junos version.
2) Make sure both have the same additional licenses installed.
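
Both points can be checked from operational mode before any cabling is done. The two commands below are standard Junos operational commands; run them on each SRX and compare the output by hand. Lines starting with # are annotations for the reader and are not meant to be typed.

```
# Run on SRX A and on SRX B, then compare the reported Junos release
show version
# Lists the installed licenses, to confirm both boxes have the same feature set
show system license
```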

Control links

On the SRX 550 and SRX 650 there can be only one control link. All control plane data is synced over this link. On these models, ge-0/0/1 on one SRX has to be connected to ge-0/0/1 on the other.

When we enable clustering, ge-0/0/1 on the secondary SRX is automatically renumbered as ge-9/0/1. This is a hardcoded value and cannot be modified.

Fab links

To configure fab links, connect ge-0/0/2 on one SRX to ge-0/0/2 on the other. You can use any port except ge-0/0/0 and ge-0/0/1.

Remember that we have not enabled clustering yet. To do that, run the following commands:

On SRX A: > set chassis cluster cluster-id 1 node 0 reboot
On SRX B: > set chassis cluster cluster-id 1 node 1 reboot

Note:
1) These commands have to be run in operational mode, not in configuration mode.
2) You cannot use 0 as the cluster-id for a working cluster, because setting it to 0 disables clustering. The range is 1 to 15.

This will enable clustering, and ge-0/0/0 will become fxp0 on both SRX devices. It is used for the management traffic of each SRX.

ge-0/0/1 on both SRX devices will become fxp1.

All the interfaces on the secondary SRX are renumbered with a slot offset of 9 (on the SRX 550 and 650): ge-0/0/<> becomes ge-9/0/<>, and ge-2/0/<> becomes ge-11/0/<>.
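
Because fxp0 exists on both nodes but the two nodes share one configuration, per-node settings such as the hostname and the fxp0 address are usually placed in node-specific groups. A minimal sketch, assuming hypothetical hostnames (srx-a, srx-b) and addresses from the 192.0.2.0/24 documentation range; substitute your own management subnet. Lines starting with # are annotations and are not meant to be typed.

```
# node0 group: applied only on SRX A (hostname and address are placeholders)
set groups node0 system host-name srx-a
set groups node0 interfaces fxp0 unit 0 family inet address 192.0.2.11/24
# node1 group: applied only on SRX B (hostname and address are placeholders)
set groups node1 system host-name srx-b
set groups node1 interfaces fxp0 unit 0 family inet address 192.0.2.12/24
# Each node applies the group that matches its own node name
set apply-groups "${node}"
```

These are configuration mode commands, so they take effect only after a commit.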

If clustering has been successful, running show chassis cluster status will display redundancy group 0 information.
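
Beyond the status command, one more operational command is handy for a quick sanity check at this stage. This is only a checklist sketch; the output varies by model and Junos release, so none is shown here. Lines starting with # are annotations and are not meant to be typed.

```
# Node priorities and primary/secondary state per redundancy group
show chassis cluster status
# Control link heartbeat counters; counts climbing in both directions suggest the control link is healthy
show chassis cluster control-plane statistics
```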

At this moment it is important to understand the special role of redundancy group 0.

Redundancy group 0

You can define multiple redundancy groups. In each group, one SRX acts as master and the other as backup. All the properties and objects associated with a redundancy group point to one master and one backup.

For example, I can map a couple of interfaces to redundancy group 1 and specify SRX B as master. For a couple of other interfaces I can specify redundancy group 2 with SRX A as the master.

Redundancy group 0 holds the routing engine. The active SRX in this redundancy group determines which SRX has the active routing engine.

As soon as you enable clustering, redundancy group 0 becomes active and need not be configured. (You can set its priority, which will be discussed later.)
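
To see redundancy group 0 change hands, a manual failover can be triggered from operational mode. A hedged sketch, assuming node 1 is currently the secondary for redundancy group 0; try this in a lab first, since moving the routing engine role is disruptive. Lines starting with # are annotations and are not meant to be typed.

```
# Make node 1 the primary for redundancy group 0 (the routing engine role)
request chassis cluster failover redundancy-group 0 node 1
# Clear the manual failover flag so that normal priority-based election applies again
request chassis cluster failover reset redundancy-group 0
```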

Setting up fab links

Fabric links are responsible for carrying the transit traffic. For example, if an uplink from one SRX goes down, that SRX will seamlessly start sending traffic to the other SRX over this link.

It is advised to connect ge-0/0/2 --> ge-9/0/2 and ge-0/0/3 --> ge-9/0/3.

Yes, fab links can be bundled. Once configured, one link carries internal messages while the other is used purely for transit traffic.

set interfaces fab0 fabric-options member-interfaces ge-0/0/2 
set interfaces fab0 fabric-options member-interfaces ge-0/0/3
set interfaces fab1 fabric-options member-interfaces ge-9/0/2 
set interfaces fab1 fabric-options member-interfaces ge-9/0/3 

NOTE: fab0 contains one or two interfaces on SRX A, and fab1 contains the corresponding interfaces on SRX B. This is unlike link aggregation, where a single ae0 can contain interfaces from two switches.
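
Once the fab interfaces are committed, the state of the fabric links can be checked from operational mode. A small sketch using standard commands; no output is shown because it differs between models and releases. Lines starting with # are annotations and are not meant to be typed.

```
# Shows control and fabric link status along with the child interfaces of fab0 and fab1
show chassis cluster interfaces
# Counters for the session objects synchronized across the fabric link
show chassis cluster data-plane statistics
```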

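Next, set the redundancy group 0 priorities so that node 0 (SRX A) is preferred as primary. The node with the higher priority wins:

set chassis cluster redundancy-group 0 node 0 priority 100
set chassis cluster redundancy-group 0 node 1 priority 1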

Creating a redundancy group and reth interfaces

You can assign interfaces to a particular redundancy group. In this simple scenario, we will configure the downlinks to the SRX as reth0 in redundancy group 1 and the uplinks as reth1 in redundancy group 2.

We could put both reth interfaces in the same group if we wanted to. Here we will configure them so that SRX A is primary for the downlink (reth0) while SRX B is primary for the uplink (reth1). Note that a reth interface cannot be placed in redundancy group 0, which is reserved for the routing engine.

set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 1
set chassis cluster redundancy-group 2 node 0 priority 1
set chassis cluster redundancy-group 2 node 1 priority 100
set chassis cluster reth-count 2

Assign a physical port on SRX A and the matching port on SRX B to each reth interface:

set interfaces ge-2/0/1 gigether-options redundant-parent reth0
set interfaces ge-11/0/1 gigether-options redundant-parent reth0
set interfaces ge-2/0/2 gigether-options redundant-parent reth1
set interfaces ge-11/0/2 gigether-options redundant-parent reth1

Map reth0 to redundancy group 1 and reth1 to redundancy group 2, then place them in security zones:

set interfaces reth0 redundant-ether-options redundancy-group 1
set interfaces reth1 redundant-ether-options redundancy-group 2
set security zones security-zone trust interfaces reth0.0
set security zones security-zone untrust interfaces reth1.0

We have basically configured node 0 (i.e. SRX A) to be primary for redundancy group 1 (the higher priority leads), while node 1 (SRX B) acts as primary for redundancy group 2.
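
The zone assignments above refer to reth0.0 and reth1.0, so each reth interface also needs a logical unit with an address. A minimal sketch, assuming hypothetical addresses (10.0.1.1/24 on the inside and 203.0.113.2/30 on the outside); these values are placeholders and do not come from the setup described here. Lines starting with # are annotations and are not meant to be typed.

```
# Logical unit 0 on the downlink reth (address is a placeholder)
set interfaces reth0 unit 0 family inet address 10.0.1.1/24
# Logical unit 0 on the uplink reth (address is a placeholder)
set interfaces reth1 unit 0 family inet address 203.0.113.2/30
```

The address lives on the reth interface rather than on the physical child ports, which is why it stays reachable when a redundancy group fails over to the other node.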

Commit your changes and the cluster is fully configured.
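
Before committing, it can help to review and validate the candidate configuration. A short sketch of the usual sequence in configuration mode, using standard Junos commands. Lines starting with # are annotations and are not meant to be typed.

```
# Show the pending changes relative to the active configuration
show | compare
# Validate the candidate configuration without activating it
commit check
# Activate the configuration; in a cluster the commit is applied to both nodes
commit
```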

Traffic analysis

All traffic from the internal network will hit SRX A, since it is the primary node for the downlink. To go out, it will pass over the fab link to SRX B and exit from there. This is because we have configured SRX B to be the primary for the outbound link.

Our model SRX architecture

This is just a simple configuration, and complications can arise as the topology grows. Moreover, interesting features like link monitoring have not been discussed. These will form part 3 of this series.

Monday, May 5, 2014

Juniper SRX High Availability (HA) & clustering: Part 1

SRX is an enterprise-grade firewall solution built by Juniper, one of the largest network equipment providers. In our organisation we have used the SRX extensively as a firewall and IPS device. One of its great features is the ability to run two devices as a cluster, working in tandem.

High Availability (HA)

Devices failing, links getting damaged and bugs crashing the OS are only a tiny fraction of the things that terrorize a network/system administrator. Remember: things will fail. That is the mantra one has to live by. Systems will go down, and CPU spikes will crash your aggregation switch. For this reason, first-hop redundancy protocols such as HSRP (Cisco proprietary) and VRRP (an open standard) were introduced for routers. Similarly, the concept of Virtual Chassis is used extensively in Juniper switches to bundle switches together.

Juniper's SRX chassis cluster is a similar feature. Most of the features described in these articles have been tested on the SRX 550 and 650. Some things differ from one SRX model to another, so consult the manual for your model. I have posted important links at the bottom.

This particular article will discuss the theory behind clustering and how SRX uses fab and control links to manage a failover.

When can clustering be useful:

Protecting from routing engine failures:

You can configure the routing engine role to shift from one SRX to the other. Routing engine failure can occur because of bugs in Junos. Sudden power surges or outages can also cause routing engine failure (like pulling the plug out of the SRX). Sustained high CPU load can also freeze it.

Upstream and downstream link failures:

Suppose you are using an SRX as an inline firewall and the upstream link to your core router fails or becomes congested. Without clustering, all your traffic will start getting dropped.

Internet connectivity failure:

With the ip-monitoring feature (not supported on every SRX) you can send ICMP probes to multiple IP addresses. If those probes fail, the traffic automatically shifts to the other SRX.
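
As a rough illustration of what this looks like in configuration, here is a sketch of ip-monitoring on a redundancy group. The monitored address, the secondary address, the reth0.0 interface and all the weights are hypothetical values chosen for the example, and the feature is not available on every model or release, so check the documentation for your platform. Lines starting with # are annotations and are not meant to be typed.

```
# Weight deducted from the redundancy group once IP monitoring as a whole is declared failed
set chassis cluster redundancy-group 1 ip-monitoring global-weight 255
# Monitoring is declared failed once the weights of unreachable addresses add up to this value
set chassis cluster redundancy-group 1 ip-monitoring global-threshold 100
# Probe 192.0.2.1 through reth0.0; the secondary node sources its probes from 192.0.2.3
set chassis cluster redundancy-group 1 ip-monitoring family inet 192.0.2.1 weight 100 interface reth0.0 secondary-ip-address 192.0.2.3
```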

SRX redundancy

In the worst case, if an SRX device goes down entirely, the backup SRX will assume mastership and the traffic will automatically shift to it. I have tested this, and it hardly caused any ping loss.

Before getting into the actual configuration there are a few terms that you absolutely need to know.

Control Links:

The control plane of the two SRX devices is synced over this link. Different ports are designated as control ports on different SRX models. For example, on the SRX 650 and 550, ge-0/0/1 is the control port. Remember that this port behaves as a control port only if you enable clustering.

The ge-0/0/1 control ports of the two SRX devices have to be physically connected to each other. Logically, ge-0/0/1 on SRX1 is connected to ge-9/0/1 on SRX2. This is how the SRX behaves: these are hardcoded values and cannot be changed. Once the devices become part of the cluster, the primary SRX continuously synchronizes all the control plane information via this link.

Note that only in select models (high end SRX) can you have two links configured as control links in bundle. Otherwise you can have only one link as the control link. More about this in the next article.

FAB links:

Fab links (fabric links) are responsible for syncing session state between the nodes and passing transit traffic across the SRX devices. Refer to the above diagram. If the SRX1-VC1 link has failed, all the traffic coming from RTR1 will reach SRX1, move across the fab link and pass downstream via SRX2.
Fortunately, you can bundle two links into each fabric interface (fab0 and fab1). This will be configured in the next article.

Redundancy groups

This is where Juniper SRX's HA feature scores heavily. A redundancy group is a set of objects and properties that moves to the backup SRX in case of a failure. You get the flexibility to define what counts as a failure: it can be a link failing on either SRX, or a ping check failing (known as ip-monitoring). Each redundancy group specifies which SRX node gets to be primary. For example, you can send traffic for one subnet via SRX1 by making it the primary node in one redundancy group. In another redundancy group you can make SRX2 primary and associate a different subnet with it. This way, each subnet can prefer a different SRX.
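
In configuration terms, the idea looks roughly like the sketch below: two redundancy groups with opposite primaries, plus an interface monitor that treats a link going down as a failure. The group numbers, the priorities and the monitored port ge-2/0/1 are assumptions made for illustration. Lines starting with # are annotations and are not meant to be typed.

```
# Redundancy group 1 prefers node 0 (SRX1); the higher priority wins
set chassis cluster redundancy-group 1 node 0 priority 100
set chassis cluster redundancy-group 1 node 1 priority 1
# Redundancy group 2 prefers node 1 (SRX2)
set chassis cluster redundancy-group 2 node 0 priority 1
set chassis cluster redundancy-group 2 node 1 priority 100
# If ge-2/0/1 goes down, a weight of 255 is enough to fail redundancy group 1 over
set chassis cluster redundancy-group 1 interface-monitor ge-2/0/1 weight 255
```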

Reth interface

A reth (Redundant ETHernet) interface is a pseudo interface which has a "child interface" on each node. This is important for maintaining high availability. A reth interface belongs to a particular redundancy group. All the traffic goes to the primary node of that redundancy group via the child interface of the corresponding reth interface configured on that node. Too hard to digest?

Refer to the network diagram above.

For high availability I need at least two interfaces, one connecting to each SRX node. Since these two interfaces serve a specific purpose, redundancy, Juniper groups them into a virtual interface known as a reth interface.

Remember that they are not in bonding/LAG. In a LAG all the interfaces are used simultaneously. Not in this case: traffic will only travel via one particular interface. (We can have reth LAGs, but that is an advanced topic.)

Now how does SRX decide which interface to send traffic through? We can specify a primary node (node refers to SRX) and a secondary node. But that is the purpose of a redundancy group. Thus, we can configure a reth interface to be part of a redundancy group and it will inherit all the properties.

For example, in the network diagram if reth1 belongs to redundancy group 1 whose primary node is SRX1 all the traffic coming from RTR1 will travel the left most link and reach SRX1.

If you are still confused the configuration in the next article should solve it.

I will be using the following network topology to set up a redundant SRX cluster which will automatically fail over if any of the upstream or downstream links fail. Some basic troubleshooting steps will also be discussed.

SRX provides a robust chassis clustering feature. It is fairly easy to configure once the key concepts are understood.

Important links:

1. Configuring SRX chassis clustering
2. Blog post on SRX clustering
3. Juniper techpub on SRX chassis clustering