9 Sept 2011

ActiveMQ Network Bridge to Master/Slave broker pair

This article got updated on 10/20/11 in order to include the latest changes to the failover protocol for version 5.6.

There are scenarios where in a network of brokers each broker instance needs to be highly available. For example in hub and spoke architectures it might not be tolerable to be disconnected from the hub broker for a certain amount of time. This would prevent processing messages from spokes in real-time by consumers attached to the hub broker.

Consider an ActiveMQ network of brokers configuration of two nodes where each node also needs to be highly available. To achieve High Availability, each node needs to have a slave broker attached like in the following picture:







As shown in the above figure, Master Broker 1 needs to be able to create a network bridge to either Master Broker 2 or in case it is unavailable, Slave Broker 2.
Likewise if Master Broker 1 dies, Slave Broker 1 will take over and it also needs to be able to connect to either Master or Slave of Broker 2.
That way you can achieve high availability of all broker instances in a network of brokers. You already guess that such configuration can become really complex for a larger number of brokers in the network (also depending on the cluster topology).


The configuration of Master/Slave is decently well explained in the ActiveMQ documentation. But how do we configure the brokers network connector so that the network bridge is able to failover to a slave if the master crashes?

When trying to setup a bridge from Master Broker 1 to the master/slave pair of Broker 2, a configuration like

<networkConnector name="NC_1"
    uri="static://(tcp://broker2Master:61616,tcp://broker2Slave:61616)"/>


won’t be ideal. Because of the static: url list, the network connector will try to connect to both Master Broker 2 and Slave Broker 2. It will connect to Master Broker 2 just fine but will also retry connecting to Slave Broker 2 forever. Any failed attempt will be logged inside Broker 1 with a warning

WARN | Could not start network bridge between vm://localhost and tcp://broker2Slave:61616 due to: java.net.ConnectExcption: Connection refused.


This does not only pollute the log file and makes it harder to spot any other issues in the log but it also takes some amount of CPU every couple of seconds for trying a reconnect.

Side note: When using static:// in the network connector uri, lost connections will be re-established automatically. On this regards it works like the failover protocol but it tries to connect to all urls in the list.

Instead of trying to connect to all brokers specified in the above network connector uri list (i.e. Master Broker 2 and Slave Broker 2), we only want the bridge to connect to one broker of the list. Therefore a better idea is to use the failover:// protocol inside the network connector configuration. We only want the Master Broker 1 to connect to either Master Broker 2 or Slave Broker 2 but not both. So the configuration now becomes

<networkConnector name="NC_1"
   uri="static:failover:(tcp://broker2Master:61616,tcp://broker2Slave:61616)?randomize=false"/>


By also setting randomize=false we will always try connecting to Master Broker 2 first, before attempting to connect to the Slave Broker 2. This configuration will keep trying to connect to one of the two brokers until a connection gets established. It will not try to connect to both brokers in the list!

This configuration however still has a problem.
In case of a connection loss the failover transport will keep trying to reconnect to the specified urls transparently and will not propagate any exceptions up to the higher layer. So if after a crash of the Master 2 broker the Slave 2 broker has finally started up, the tcp transport connection will get re-established transparently by the failover transport in Master Broker 1 without flagging the loss of connection to any higher layers. That implies the broker’s discovery agent that is responsible for creating the network bridge will never be notified and hence will never try to re-create the network bridge (which involves exchanging a couple of messages between brokers). 

You can confirm this in jconsole, looking at the bridge MBean. The name of the MBean will refer to the previous master broker whereas the RemoteAddress property got updated with the new master broker.
Prior to version 5.6 the bridge would most likely not work anymore. Things have improved in 5.6 thanks to the fix of AMQ-3542 and the bridge may continue to work but the original problem remains. The bridge need to get re-established after failing over.

So what is needed to re-establish the network bridge after a failover? The solution is to use the failover transport for trying to connect to the urls in the connection list without any reconnects. 
When the tcp connection of the bridge is lost, the failover protocol will not try to reconnect but raise an exception to the DiscoveryAgent. The agent will clean up its bridge and ask the underlying failover transport to re-establish a connection to either master or slave of Broker 2. Now the failover transport will try all urls in its list only once, it won't attempt any reconnects on its own. If it still can't connect, it raises an exception back to the agent again, which after a timeout will ask the failover transport again to reconnect. This continues until the failover transport succeeds in connecting to one of the specified urls. Once the tcp connection is established, the DiscoveryAgent can recreate the bridge based on the new connection.
By not letting the failover transport reconnect on its own, the DiscoveryAgent is now aware that the connection of the network bridge got lost (as the error was propagated) and will re-establish the network bridge once the connection got restored. 

The failover transport needs to be explicitly configured to not reconnect on its own but raise an exception instead. By default it will reconnect forever.
From version 5.6 onwards, you should use the failover property maxReconnectAttempts=0 for that reason.

So the network bridge configuration finally becomes (for versions 5.6 onwards):

<networkConnector name="NC_1" uri="static:failover:(tcp://broker2Master:61616,tcp://broker2Slave:61616)?randomize=false&maxReconnectAttempts=0"/>



Also from version 5.6 onwards you will be able to use "masterslave:" instead of "static:failover:()". So the above example becomes


<networkConnector name="NC_1" uri="masterslave:(tcp://broker2Master:61616,tcp://broker2Slave:61616)"/>

See AMQ-3564 for more details but "masterslave:" simply maps to 
"static:failover:()?randomize=false&maxReconnectAttempts=0"

Prior to version 5.6 you cannot really configure for no reconnects. A value of 0 means "reconnect forever". From version 5.6 the value 0 means "do not reconnect". See the updated failover transport reference.


So for versions < 5.6 instead use maxReconnectAttempts=1 and allow one reconnect, which by default happen within 10 milliseconds (unless configured otherwise using initialReconnectDelay property of the failover transport). This is generally to short for the slave broker to take over. 



The network bridge configuration for Slave Broker 1 would be the same as it also needs to connect to either Master Broker 2 or Slave Broker 2. 

Summary: When trying to setup a network bridge to a master/slave broker pair, use the failover transport with maxReconnectAttempts=0.

Side note: When the remote broker is shutdown gracefully, the network bridge will get unregistered and closed down so that the discovery agent will always be aware of it. In that case the network bridge will be re-established correctly when the remote broker gets restarted even when not setting maxReconnectAttempts. But it won’t help in case the remote broker crashes.