Corosync & Pacemaker

July 10, 2018

Corosync and Pacemaker offer a toolset for ensuring high availability of services on Linux and other platforms. You set up a cluster of machines and start services on them. Services can be moved between machines, and will fail over to another machine in the cluster if the one they are running on fails. Here are my notes from a quick investigation - more links at the end.

I first tried to do this using some EC2 instances on AWS. This failed - AWS doesn't support multicast networking, which is how these tools communicate. This example was done with two CentOS VMs running in VirtualBox on a Windows 10 machine.

The two VMs are named web1 and web2, and these entries are in each VM's /etc/hosts:

192.168.56.68 web1
192.168.56.69 web2

First, install the required software on both nodes:

# yum install pacemaker corosync pcs resource-agents
# systemctl start pcsd.service
# systemctl enable pcsd.service

Set the password for the hacluster user on each machine:

# passwd hacluster

Then, on one of the nodes, create the cluster auth (using the password specified above), start the cluster, and set a couple of cluster properties. The last two commands disable fencing (stonith) and quorum enforcement, which keeps a two-node test cluster simple but isn't something you'd want in production:

# pcs cluster auth web1 web2 -u hacluster -p PASSWORD --force
# pcs cluster setup --force --name pacemaker1 web1 web2
# pcs cluster start --all
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore

The nodes are now both running, and you should be able to see this using the following command:

# pcs status
Cluster name: pacemaker1
Stack: corosync
Current DC: web2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 10 09:52:48 2018
Last change: Tue Jul 10 09:28:27 2018 by root via crm_resource on web2

2 nodes configured
5 resources configured

Online: [ web1 web2 ]

Full list of resources:

Clone Set: clone-test-clone [clone-test]
     Started: [ web1 web2 ]
 virtual_ip     (ocf::heartbeat:IPaddr2):       Started web2
 webserver      (ocf::heartbeat:nginx): Started web2

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

(This example already has some resources configured, including a cloned resource running on both nodes - more on these below.)

NGINX

We're going to run the nginx process on one of the boxes, with the ability for it to fail over to the other box. We'll also manually move it between boxes. Before you proceed, make sure nginx is installed on both boxes but not running. Change the default HTML served so that you can tell which box it is served from.
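
For example, something like this on each node would do (assuming the stock CentOS nginx package, which serves its default page from /usr/share/nginx/html - adjust the path if your install differs):

# echo "Hello from web1" > /usr/share/nginx/html/index.html

with the equivalent text on web2.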

Here's the setup we need. We first create a floating IP address (which we can point to either machine in the cluster), and then create a resource called webserver.

# pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.56.70 cidr_netmask=32 op monitor interval=10s
# pcs resource create webserver ocf:heartbeat:nginx configfile=/etc/nginx/nginx.conf op monitor timeout="5s" interval="5s"
# pcs constraint colocation add webserver virtual_ip INFINITY
# pcs constraint order virtual_ip then webserver
# pcs constraint location webserver prefers web1=50
# pcs cluster stop --all
# pcs cluster start --all

Remember to make sure that nginx isn't running outside of pacemaker control - the cluster software should start it as required.
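
If nginx was previously enabled as a normal systemd service, stop and disable it on both nodes so that only pacemaker manages it:

# systemctl stop nginx
# systemctl disable nginx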

At this point nginx should be running on web1. If you access the servers by their individual IP addresses from a browser, you should see that the IP for web1 returns a page whilst the IP for web2 times out. Using the virtual IP (192.168.56.70 in the example above) lets you access the resource whichever node it is on - at the moment it should return the page from web1.
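
A quick way to check this from the command line (assuming curl is available on the host machine or another machine on the 192.168.56.0/24 network):

# curl http://192.168.56.68/
# curl http://192.168.56.69/
# curl http://192.168.56.70/

The first should return web1's page, the second should time out, and the third (the virtual IP) should also return web1's page.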

You can move it to web2 using:

# pcs resource move webserver web2

Once this is done you can use pcs status to see what's running where. Accessing the virtual IP as before should now return the page from web2.
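
Note that pcs resource move works by adding a location constraint that pins the resource to web2. If you later want the resource to be free to run on either node again, that constraint can be removed with:

# pcs resource clear webserver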

Other commands

These commands can be used to show information about the resource:

# pcs resource show webserver
 Resource: webserver (class=ocf provider=heartbeat type=nginx)
  Attributes: configfile=/etc/nginx/nginx.conf
  Operations: monitor interval=5s timeout=5s (webserver-monitor-interval-5s)
              reload interval=0s timeout=40s (webserver-reload-interval-0s)
              start interval=0s timeout=60s (webserver-start-interval-0s)
              stop interval=0s timeout=60s (webserver-stop-interval-0s)

Configuration can be changed using commands such as:

# pcs resource meta webserver failure-timeout=9s

Here's a command to show details of the members of the cluster:

# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.56.68)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0 
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.56.69)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined

Cloned Resources

Cloned resources are ones that run on all nodes in the cluster: the same resource runs everywhere, rather than a single instance that moves between machines.
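
A cloned resource like the clone-test one in the status output above can be created along these lines - a sketch using the Dummy resource agent (a no-op agent that's handy for testing):

# pcs resource create clone-test ocf:pacemaker:Dummy op monitor interval=30s
# pcs resource clone clone-test

The clone then appears in pcs status as clone-test-clone, running on both nodes.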

References

Command references