
Monitoring Conductor and Worker Nodes

You should monitor the nodes regularly to ensure that they are still all online.

  1. Go to the Status > Overview screen and view the Nodes information. Typically, all of your nodes are online, unless you have deliberately taken a node offline.

    
     [Image: monitor-nodes-overview.png]
  2. When this screen shows failed or offline nodes, go to the Cluster > Nodes screen and identify the problem node or nodes.

    
     [Image: monitor-nodes-cluster.png]
  3. Look for nodes that have a red or yellow background and an orange icon in the Status column.

  4. Choose an orange icon to go to the Status > Alerts & Messages screen and display detailed information about the alerts and messages. The Alerts & Messages screen appears with the filter set to show only the information for that node.

  5. Review the alerts and messages to determine why the node failed.

  6. For detailed information on dealing with problems, see the appropriate section below. If you want to automate this kind of status check, see the sketch that follows this list.
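If you prefer to automate this check rather than watching the web interface, you can poll the Conductor REST API for node status. The following is a minimal sketch in Python; the endpoint path (/api/nodes), the XML element names, and the hostname are assumptions, so verify them against the AWS Elemental Conductor Live 3 API documentation for your version before relying on it.

    #!/usr/bin/env python3
    """Poll Conductor for node status and report anything not online.

    A sketch only, not a supported tool. The endpoint path (/api/nodes),
    the XML element names, and the authentication details are assumptions;
    check the AWS Elemental Conductor Live 3 API documentation for your
    version before using this.
    """
    import sys
    import xml.etree.ElementTree as ET

    import requests  # third-party: pip install requests

    CONDUCTOR_HOST = "conductor.example.com"           # hypothetical hostname
    NODES_URL = f"http://{CONDUCTOR_HOST}/api/nodes"   # assumed endpoint path


    def fetch_nodes():
        """Return the parsed XML document listing cluster nodes."""
        resp = requests.get(NODES_URL, headers={"Accept": "application/xml"}, timeout=10)
        resp.raise_for_status()
        return ET.fromstring(resp.text)


    def report_problem_nodes(doc):
        """Print any node whose status is not 'online'; element names are assumptions."""
        problems = []
        for node in doc.iter("node"):
            name = node.findtext("hostname", default="unknown")
            status = node.findtext("status", default="unknown")
            if status.lower() != "online":
                problems.append((name, status))
                print(f"ATTENTION: node {name} is {status}")
        return problems


    if __name__ == "__main__":
        failed = report_problem_nodes(fetch_nodes())
        sys.exit(1 if failed else 0)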

Offline Nodes

Investigate an offline node if you were not expecting any nodes to be offline. Try to determine why the node was taken offline (speak to other engineers and operators) and, if necessary, take steps to bring it back online.

Failed Worker Nodes with Worker Redundancy

When worker redundancy is implemented on the cluster and a node switches to the failed status, any channels that are running on the worker node move to a backup node, as described in “How Worker Node Failover Occurs” below.

Setting up for Notification

We recommend that you set up AWS Elemental Conductor Live 3 so that it sends you an email or posts a notification to your web server when the following alerts or messages are raised:

  • 4009

  • 4010

  • 4018

See the AWS Elemental Conductor Live 3 Configuration Guide for information on setting up notifications.
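If you use web server notification, the receiving endpoint is yours to implement. The following is a minimal sketch of a receiver that flags the three alert codes listed above; the payload format and the field that carries the alert code are assumptions, so inspect a real notification from your Conductor and adapt the parsing accordingly.

    #!/usr/bin/env python3
    """Minimal webhook receiver that flags alert codes 4009, 4010, and 4018.

    A sketch only: the payload shape (JSON vs. XML vs. form-encoded) and the
    field that carries the alert code are assumptions; inspect a real
    notification from your Conductor before relying on this.
    """
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    WATCHED_CODES = {"4009", "4010", "4018"}  # redundancy-related alerts from this section


    class NotificationHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length).decode("utf-8", errors="replace")

            # Assumption: the notification body is JSON with a "code" field.
            # Replace this parsing with whatever format Conductor actually sends.
            try:
                code = str(json.loads(body).get("code", ""))
            except (ValueError, AttributeError):
                code = ""

            if code in WATCHED_CODES:
                print(f"ALERT {code} received -- check the Redundancy screen")
            else:
                print(f"Notification received (code={code or 'unknown'})")

            self.send_response(200)
            self.end_headers()


    if __name__ == "__main__":
        # Listen on port 8080; point the Conductor web callback at this host and port.
        HTTPServer(("", 8080), NotificationHandler).serve_forever()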

Immediate Action

When a node goes to the failed status, follow this procedure to deal with the failed node and with the redundancy setup.

  1. Go to the Cluster > Redundancy screen and look for the redundancy group that the failed node belongs to: choose each group in the Redundancy Groups section and look for the node in the Active Nodes tab and the Backup Nodes tab.

    If the node appears in the Backup Nodes tab, see “If a Reserve Node Fails”, below. Otherwise, continue this procedure.

  2. Verify whether at least one node is still listed in the Backup Nodes tab.

    • If yes, then there is no immediate need to deal with the failed node, but you should still deal with it in a timely manner.

    • If not, the failover from the failed node has used up the last of your backup nodes. Solve the problem on the failed node as soon as possible and bring it back into service, so that you return to having at least one backup node.

      You receive an alert if you have a redundancy group set up but do not have any backup nodes available.

  3. To investigate the failed node (either now or later):

    • Go to the Status > Nodes screen. The node should have an orange icon in the Status column. Choose this icon; the Status > Alerts & Messages screen appears, filtered to show only the information for that node.

    • Review the alerts and messages to determine why the node failed.

  4. Make sure you have the desired number of backup nodes set up.

How Worker Node Failover Occurs

  1. AWS Elemental Conductor Live 3 determines the action to attempt:

    • If the node was online/idle before it failed, AWS Elemental Conductor Live 3 takes no failover action. The node simply goes to the failed status.

    • If the node was online/running, AWS Elemental Conductor Live 3 attempts to fail over this node to one of the reserve nodes, as described in the following steps.

  2. AWS Elemental Conductor Live 3 identifies the redundancy group that the failed node belongs to and selects a reserve node (node_Y) in that group.

  3. AWS Elemental Conductor Live 3 then attempts to move all channels (in the case of a failed AWS Elemental Live node) or MPTS outputs (in the case of a failed AWS Elemental Statmux node) to node_Y and restart the previously running channels or MPTS outputs on this new node. The role for node_Y changes from reserve to active. This node is no longer eligible to be selected as a failover node if another active node fails.
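The following is a conceptual sketch, in Python, of the failover logic described in the steps above. It is not Conductor's implementation; the data structures, status strings, and function names are illustrative only.

    from dataclasses import dataclass, field

    # Illustrative model only -- not Conductor's internal implementation.

    @dataclass
    class WorkerNode:
        name: str
        role: str            # "active" or "reserve"
        status: str          # "online/idle", "online/running", "failed", ...
        workload: list = field(default_factory=list)  # channels or MPTS outputs


    @dataclass
    class RedundancyGroup:
        name: str
        nodes: list


    def handle_node_failure(group: RedundancyGroup, failed: WorkerNode):
        """Sketch of the failover decision described in steps 1-3 above."""
        was_running = failed.status == "online/running"
        failed.status = "failed"

        # Step 1: an idle node simply goes to failed; there is nothing to move.
        if not was_running:
            return None

        # Step 2: pick a reserve node from the same redundancy group.
        reserve = next(
            (n for n in group.nodes if n.role == "reserve" and n.status != "failed"),
            None,
        )
        if reserve is None:
            return None  # no reserve available; see "If a Node Does Not Fail Over"

        # Step 3: move the channels or MPTS outputs and promote the reserve node.
        reserve.workload.extend(failed.workload)
        failed.workload.clear()
        reserve.role = "active"   # no longer eligible to absorb another failover
        return reserve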

If a Reserve Node Fails

If a node fails while it has the reserve role, it stays as a reserve node but its status changes to offline.

If a reserve node has switched to active and then fails, it is eligible to fail over to another reserve node, in the same way as any other active node.

When a Failed Node Recovers

When a failed node is brought back into service, it returns to the role it had when it failed: active or backup.

False Failure

AWS Elemental Conductor Live 3 may determine that node_X has failed, when in fact it has only become disconnected from the management network (and is continuing to run channels) but has not shut down.

Meanwhile, because AWS Elemental Conductor Live 3 has determined that a failure has occurred, it attempts to perform a failover. The failover routine does not include any attempt to stop the channels running on node_X. If the failover succeeds, the channels are running on both node_X and the failover node.

However, if the network connection is later re-established (so that AWS Elemental Conductor Live 3 can now view activity on node_X), Conductor attempts to shut down the channels or MPTS outputs that are running there.

If a Node Does Not Fail Over

If a node fails but there is no reserve node ready to take over for it, the node remains active/offline. When the node problem is resolved and the node goes back online, it still has its original channels. Channels that were running before the failure start running again.

Monitoring the Distribution of Nodes

After a failover, you should check the state of the redundancy group and take steps to ensure that the distribution of active and reserve nodes matches the desired redundancy type (the intended split of active versus backup nodes).

For example, you need to make sure that there is always at least one reserve node in each redundancy group. Each time a node fails, a reserve node switches to active. It is possible for all nodes to become active, in which case you need to reassign at least one node to reserve in order to be prepared for a possible new failover.

On the Redundancy screen, make sure that the Redundancy type has a non-zero number as the second number:


[Image: monitor-node-distribution.png]
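If you want to monitor the distribution programmatically as well as on the Redundancy screen, the following sketch polls a redundancy-group listing and warns when a group has no reserve nodes left. The endpoint path and element names are assumptions; check the AWS Elemental Conductor Live 3 API documentation for the actual resource.

    #!/usr/bin/env python3
    """Warn when a redundancy group has no reserve (backup) nodes left.

    A sketch only: the endpoint path (/api/redundancy_groups) and the XML
    element names are assumptions; verify them against the AWS Elemental
    Conductor Live 3 API documentation for your version.
    """
    import xml.etree.ElementTree as ET

    import requests  # third-party: pip install requests

    CONDUCTOR_HOST = "conductor.example.com"                        # hypothetical hostname
    GROUPS_URL = f"http://{CONDUCTOR_HOST}/api/redundancy_groups"   # assumed endpoint path


    def check_reserves():
        resp = requests.get(GROUPS_URL, headers={"Accept": "application/xml"}, timeout=10)
        resp.raise_for_status()
        doc = ET.fromstring(resp.text)

        for group in doc.iter("redundancy_group"):          # assumed element name
            name = group.findtext("name", default="unnamed")
            reserves = [n for n in group.iter("node")
                        if n.findtext("role", default="").lower() in ("reserve", "backup")]
            if not reserves:
                print(f"WARNING: redundancy group '{name}' has no reserve nodes")
            else:
                print(f"OK: '{name}' has {len(reserves)} reserve node(s)")


    if __name__ == "__main__":
        check_reserves()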

Failed Worker Nodes without Worker Redundancy

When worker redundancy is not implemented on the cluster and a worker node fails, you must:

  • Determine if failure of the node has caused channels to fail and then take steps to re-start those failed channels.

  • Deal with the problem node.

To troubleshoot nodes

  1. Go to the Channels screen and determine whether any channels have failed. If any have, move those channels to other nodes as soon as possible: see Modifying a Channel and change the node associated with each channel. (A sketch for listing failed channels programmatically follows this procedure.)

  2. Go to the Status > Nodes screen. The node should have an orange icon in the Status column. Choose this icon; the Status > Alerts & Messages screen appears, filtered to show only the information for that node.

  3. Review the alerts and messages to determine why the node failed.

  4. Take the necessary steps to resolve the problem and bring the node back into service.
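To find failed channels programmatically (step 1 above), you can poll a channel listing. The following sketch is illustrative only: the endpoint path (/api/channels), element names, and status strings are assumptions, so confirm them against the AWS Elemental Conductor Live 3 API documentation for your version.

    #!/usr/bin/env python3
    """List channels in an error or failed state so they can be moved.

    A sketch only: the endpoint path, the XML element names, and the set of
    status strings are assumptions; verify them against the AWS Elemental
    Conductor Live 3 API documentation for your version.
    """
    import xml.etree.ElementTree as ET

    import requests  # third-party: pip install requests

    CONDUCTOR_HOST = "conductor.example.com"                 # hypothetical hostname
    CHANNELS_URL = f"http://{CONDUCTOR_HOST}/api/channels"   # assumed endpoint path


    def list_failed_channels():
        resp = requests.get(CHANNELS_URL, headers={"Accept": "application/xml"}, timeout=10)
        resp.raise_for_status()
        doc = ET.fromstring(resp.text)

        failed = []
        for channel in doc.iter("channel"):                     # assumed element name
            name = channel.findtext("name", default="unnamed")
            status = channel.findtext("status", default="unknown")
            node = channel.findtext("node", default="unknown")  # assumed field
            if status.lower() in ("error", "failed"):
                failed.append(name)
                print(f"Channel '{name}' on node '{node}' is {status}: move it to another node")
        return failed


    if __name__ == "__main__":
        if not list_failed_channels():
            print("No failed channels found")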

Failed Conductor Nodes with Conductor Redundancy

When you have redundant Conductor nodes set up and the primary fails, the backup automatically takes over management of the cluster. This change in role takes a few seconds.

Once failover has occurred, the two Conductor nodes continue in their new roles.

Even when the failed Conductor comes back online, it does not take back the primary role: it comes back into service as the backup.

Note

Typically, there is no reason for you to switch back to the node that was previously the primary: there is no need to prefer one Conductor node over the other as the primary. In fact, there is no mechanism for you to switch the node roles.

Failed Conductor Nodes without Conductor Redundancy

When your cluster has only one Conductor node and that node fails, you cannot use AWS Elemental Conductor Live 3 to control worker nodes. The worker nodes themselves are not affected by the Conductor node failure.

To troubleshoot a Conductor node

  1. Go to the Status > Nodes screen. The Conductor node should have an orange icon in the Status column. Choose this icon; the Status > Alerts & Messages screen appears, filtered to show only the information for that node.

  2. Review the alerts and messages to determine why the node failed.

  3. Take the necessary steps to resolve the problem and bring the node back into service.

Returning a Node from Failure

When the Conductor node comes back online, it automatically takes over management of the cluster again. It brings itself up to date on the activity and status of all the nodes:

  • If an alert was active when the node failed and the problem no longer exists, the alert is automatically cleared.

  • If an alert was active and the problem still exists, the alert is not cleared.

  • If a problem occurred on a worker node while the Conductor node was offline, the Conductor now detects this problem and displays a new alert or message.

  • If a problem occurred and was resolved on a worker node while the Conductor node was offline, the Conductor has no knowledge of that problem ever having existed. This is the only information lost during the outage.