Corosync & Pacemaker
July 10, 2018
Corosync and Pacemaker offer a toolset to ensure high availability of services on Linux and other platforms. You set up a cluster of machines and start services on them. Services can be moved between machines, and will fail over to other machines in the cluster if they fail. Here are my notes from a quick investigation - more links at the end.
I first tried to do this using some EC2 instances on AWS. That failed - AWS doesn't support multicast networking, which is how these tools communicate by default. This example was done with two CentOS VMs running in VirtualBox on a Windows 10 machine.
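(As an aside, corosync can also be configured to use unicast UDP rather than multicast, which may get around this limitation - I didn't try it, but something along these lines should set the cluster up in that mode:)
# pcs cluster setup --force --name pacemaker1 web1 web2 --transport udpu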
The two VMs are named web1 and web2, and these entries are in each VM's /etc/hosts:
192.168.56.68 web1
192.168.56.69 web2
First, install the required software on both nodes:
# yum install pacemaker corosync pcs resource-agents
# systemctl start pcsd.service
# systemctl enable pcsd.service
Set the password for the hacluster user on each machine:
# passwd hacluster
Then, on one of the nodes, create the cluster auth (using the password specified above) and then start the cluster:
# pcs cluster auth web1 web2 -u hacluster -p PASSWORD --force
# pcs cluster setup --force --name pacemaker1 web1 web2
# pcs cluster start --all
# pcs property set stonith-enabled=false
# pcs property set no-quorum-policy=ignore
The nodes are now both running, and you should be able to see this using the following command:
# pcs status
Cluster name: pacemaker1
Stack: corosync
Current DC: web2 (version 1.1.18-11.el7_5.3-2b07d5c5a9) - partition with quorum
Last updated: Tue Jul 10 09:52:48 2018
Last change: Tue Jul 10 09:28:27 2018 by root via crm_resource on web2

2 nodes configured
5 resources configured

Online: [ web1 web2 ]

Full list of resources:

 Clone Set: clone-test-clone [clone-test]
     Started: [ web1 web2 ]
 virtual_ip (ocf::heartbeat:IPaddr2): Started web2
 webserver (ocf::heartbeat:nginx): Started web2

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
(This example already has some resources configured, including a cloned resource running on both nodes.)
NGINX
We're going to run the nginx process on one of the boxes, with the ability for it to fail over to the other box. We'll also manually move it between boxes. Before you proceed, make sure nginx is set up on both boxes but not running. Change the default HTML served so that you can tell which box the page is served from.
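As a rough sketch, the preparation on each box could look like this (assuming nginx from EPEL and the default web root of /usr/share/nginx/html - adjust for your own install):
# yum install epel-release nginx
# echo "Served from $(hostname)" > /usr/share/nginx/html/index.html
# systemctl stop nginx
# systemctl disable nginx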
Here's the setup we need. We first create a floating IP address (which we can point to either machine in the cluster), and then create a resource called webserver.
# pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.168.56.70 cidr_netmask=32 op monitor interval=10s
# pcs resource create webserver ocf:heartbeat:nginx configfile=/etc/nginx/nginx.conf op monitor timeout="5s" interval="5s"
# pcs constraint colocation add webserver virtual_ip INFINITY
# pcs constraint order virtual_ip then webserver
# pcs constraint location webserver prefers web1=50
# pcs cluster stop --all
# pcs cluster start --all
Remember to make sure that nginx isn't running outside of pacemaker control - the cluster software should start it as required.
At this point nginx should be running on web1. If you access the servers by their individual IP addresses from a browser, you should see that the IP for web1 returns the page whilst the IP for web2 times out. Using the virtual IP (192.168.56.70 in the example above) allows you to access the resource whichever node it is on - at the moment it should return the page from web1.
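You can check this from another machine on the same network (assuming the per-box index pages suggested earlier):
# curl http://192.168.56.70/
Served from web1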
You can move it to web2 using:
# pcs resource move webserver web2
Once this is done you can use pcs status to see what's running where. Accessing the virtual IP as before should now return the page from web2.
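Note that pcs resource move works by adding a location constraint pinning the resource to the target node; if you later want the cluster to place the resource itself again, that constraint can be removed with:
# pcs resource clear webserver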
Other commands
These commands can be used to show information about the resource:
# pcs resource show webserver
 Resource: webserver (class=ocf provider=heartbeat type=nginx)
  Attributes: configfile=/etc/nginx/nginx.conf
  Operations: monitor interval=5s timeout=5s (webserver-monitor-interval-5s)
              reload interval=0s timeout=40s (webserver-reload-interval-0s)
              start interval=0s timeout=60s (webserver-start-interval-0s)
              stop interval=0s timeout=60s (webserver-stop-interval-0s)
Configuration can be changed using commands such as:
# pcs resource meta webserver failure-timeout=9s
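To test failover without shutting a VM down, you can put the active node into standby, watch the resources move with pcs status, and then bring it back (on newer pcs versions the equivalent commands are pcs node standby / unstandby):
# pcs cluster standby web1
# pcs status
# pcs cluster unstandby web1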
Here's a command to show details of the members of the cluster:
# corosync-cmapctl | grep members
runtime.totem.pg.mrp.srp.members.1.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(192.168.56.68)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(192.168.56.69)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
Cloned Resources
Cloned resources are resources that run on all nodes in the cluster. The same resource runs everywhere, rather than there being a single resource that moves between machines.
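The clone-test-clone entry in the pcs status output above was created along these lines (a minimal sketch, using the Dummy resource agent as a stand-in for a real service):
# pcs resource create clone-test ocf:heartbeat:Dummy op monitor interval=30s
# pcs resource clone clone-test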
References
- http://www.alexlinux.com/pacemaker-corosync-nginx-cluster/
- http://clusterlabs.org/quickstart.html
- http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html
- https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/configuring_the_red_hat_high_availability_add-on_with_pacemaker/ch-advancedresource-haar