The Heat API – A template based orchestration framework

Over the last year, Angus Salkeld and I have been developing a IAAS high availability service called Pacemaker Cloud.  We learned that the problem we were really solving was orchestration.  Another dev group was also looking at this problem inside Red Hat from the launching side.  We decided to take two weeks off from our existing work and see if we could join together to create a proof of concept implementation from scratch of AWS CloudFormation for OpenStack.  The result of that work was a proof of concept project which provided launching of a WordPress template, as had been done in our previous project.

The developers decided to take another couple weeks to determine if we could get a more functional system that would handle composite virtual machines.  Today, we released that version, our second iteration of  the Heat API.  Since we have many more developers, and a project that exceeded our previous functionality of Pacemaker Cloud, the Heat Development Community has decided to cease work on our previous orchestration projects and focus our efforts on Heat.

A bit about Heat:  The Heat API implements the AWS Cloud Formations API.  This API provides a rest interface for creating composite VMs called Stacks from template files.  The goal of the software is to be able to accurately launch AWS CloudFormation Stacks on OpenStack.  We will also enable good quality high availability based upon the technologies we created in Pacemaker Cloud including escalation.

Given that C was a poor choice of implementation language for making REST based cloud services, Heat is implemented in Python which is fantastic for REST services.  The Heat API also follows OpenStack design principles.  Our initial design after our POC shows the basics of our architecture and our quickstart guide can be used with our second iteration release.

mailing list is available for developer and user discussion.  We track milestones and issues using github’s issue tracker.  Things are moving fast – come join our project on github or chat with the devs on #heat on freenode!

Corosync 2.0.0 released!

A few short weeks after Corosync 1.0.0 was released, the developers huddled for our future planning of Corosync 2.0.0.  The major focus of that meeting was “Corosync as implemented is too complicated”.  We had threads, semaphores, mutexes, an entire protocol, plugins, a bunch of unused services, a backwards compatability layer, multiple cryptographic engines.

Going for us, we did have a world class group communication system implementation (if not a little complicated) developed by a large community of developers, battle hardened by thousands of field deployments, tested by tens of thousands of community members.

As a result of that meeting, we decided to keep the good and throw out the bad, as we did between the openais and corosync transitions.  Gone are threads.  Gone are compatibility layers.  Gone are plugins.  Gone are unsupported encryption engines.  Gone are a bunch of other user-invisible junk that was crudding up the code base.

Shortly after Corosync 2.0.0 development was started, Angus Salkeld had the great idea of taking the infrastructure in corosync (IPC, Logging, Timers, Poll loop, shared memory, etc) and putting that into a new project called libqb.  The objective of this work was obvious:  To create a world-class infrastructure library specifically focused on the needs of cluster developers with a great built-in make-check test suite.

This helped us reach even closer to our goals of simplification.  As we pushed the infrastructure out of base Corosync, we could focus more on protocols/APIs.  You would be surprised to find that implementing the infrastructure took about as much effort as the rest of the system (APIs and Totem).

All of this herculean effort wouldn’t be possible without our developer and user community.  I’d especially like to acknowledge Jan Friesse in his leadership role of helping to coordinate the upstream release process and drive the upstream feature set to 2.0.0 resolution.  Angus Salkeld was invaluable in his huge libqb effort which occurred on time and with great quality.  Finally I want to thank Fabio Di Nitto for beating various parts of the Corosync code base into submission and his special role in designing the votequorum API.  There are many other contributors including developers and tested who I won’t mention individually, but I’d also like to thank for their improvements to the code base.

Great job devs!!  Now its up to the users of Corosync to tell us if we delivered on our objective we set out with 18 months ago – making Corosync 2.0 faster, simpler, smaller, and most importantly higher quality.

The software can be downloaded from Corosync’s Website.  Corosync 2.0, as well as the rest of the improved community developed cluster stack will show up in distros as they refresh their stacks.

Announcing Pacemaker Cloud 0.6.0 release

I am super pleased to announce the release of Pacemaker Cloud 0.6.0.

Pádraig Brady will be providing a live demonstration of Pacemaker Cloud integrated with OpenStack at FOSDEM.

 

What is pacemaker cloud?

Pacemaker Cloud is a high scale high availability system for virtual machine and cloud environments.  Pacemaker Cloud uses the techniques of fault detection, fault isolation, recovery and notification to provide a full high availability solution tailored to cloud environments.

Pacemaker Cloud combines multiple virtual machines (called assemblies) into one application group (called a deployable).  The deployable is then managed to maintain an active and running state in in the face of failures.  Recovery escalation is used to recover from repetitive failures and drive the deployable to a known good working state.

New in this release:

  • OpenStack integration
  • Division of supported infrastructures into separate packaging
  • Ubuntu assembly support
  • WordPress vm + MySql deployable demonstration
  • Significantly more readable event notification output
  • Add ssh keys to generated assemblies
  • Recovery escalation
  • Bug fixes
  • Performance enhancements

Where to get the software:

The software is available for download on the project’s website.

Adding second monitoring method to Pacemaker Cloud – sshd

Recently Angus Salkeld and I have decided to start working on a second approach to Pacemaker Cloud monitoring. Today we monitor with Matahari. We would also like the ability to monitor with OpenSSH’s sshd.  With this model, sshd becomes a second monitoring agent in addition to Matahari.  Since sshd is everywhere, and everyone is comfortable with the SSH security model, we believe this makes a superb alternative monitoring solution.

To help kick that work off, I’ve started a new branch in our git repository where this code will be located called topic-ssh.

To summarize the work, we are taking the dped binary and making a second libssh2 specific binary based on the work of the dped. We will also integrate directly with libdeltacloud as part of this work. The output of this topic will be the major work in the 0.7.0 release.

We looked at python as the language for dped, but testing showed that not to be particularly feasible without drastically complicating our operating model. With our model of running thousands of dpe processes on one system, one dpe per deployable, we would need python to have a small footprint. Testing showed that python consumes 15 times as much memory per dpe instance vs a comparable C binary.

We think there are many opportunities for people without a strong C skillset, but with a strong python skillset to contribute tremendously to the project in the CPE component. We plan to rework the CPE process into a python implementation.

If you want to get involved in the project today, working on the CPE C++ to python rework would be a great place to start!

Release schedule for Corosync Needle (2.0)

Over the last 18 months, the Corosync development community has been hard at work making Corosync Needle (version 2.0.0) a reality.  This release offers an evolutionary step in Corosync by adding several community requested features, removing the troubling threads and plugins, and tidying up the quorum code base.

I would like to point out the dilligent work of Jan Friesse (Honza) for tackling the 15 or so feature backlog items on our feature list.  Angus Salkeld has taken the architectural step of moving the infrastructure (ipc, logging, and other infrastructure components) of Corosync into a separate project (http://www.libqb.org).  Finally I’d like to point out the excellent work of Fabio Di Nitto and his cabal for tackling the quorum code base to make it truly usable for bare metal clusters.

The release schedule is as follows:

Alpha		January 17, 2012	version 1.99.0
Beta		January 31, 2012	version 1.99.1
RC1		February 7, 2012	version 1.99.2
RC2		February 14, 2012	version 1.99.3
RC3		February 20, 2012	version 1.99.4
RC4		February 27, 2012	version 1.99.5
RC5		March 6, 2012		version 1.99.6
Release 2.0.0	March 13, 2012		version 2.0.0