Background

Recently, many users have been reporting that they cannot launch NeCTAR instances.  The typical symptom is that the launch fails with the cryptic error message "no valid host was found".

What does "no valid host was found" mean?

The fundamental meaning of the "no valid host was found" error is that there is no compute node in the relevant availability zone (or zones) that can satisfy your instance launch request.  The failure is typically due to a combination of three factors (see the sketch after the list):

  • General availability of unallocated compute resources in the availability zone that you requested.  (Or if you didn't specify a specific availability zone, then across the entire cluster.)
  • Availability of enough compute resources on a single compute node.
  • If you are requesting a non-standard flavor, availability of resources that satisfy the constraints of that flavor.  (For example, if you are requesting a special instance with a GPU and all GPU nodes are in use.)
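To make that concrete, here is a toy sketch of the feasibility test that (in effect) fails.  This is not the actual Nova scheduler code, and the host names, zones and numbers are invented; it just shows how all three factors have to hold on a single compute node at the same time.

    # Simplified sketch of the scheduler's per-host feasibility test.
    # Hypothetical data; the real Nova scheduler uses filters and weighers.

    hosts = [
        # name, zone, free VCPUs, free RAM (MB), has GPU
        {"name": "qh2-node1", "zone": "qriscloud", "free_vcpus": 3,  "free_ram_mb": 12288,  "has_gpu": False},
        {"name": "qh2-node2", "zone": "qriscloud", "free_vcpus": 15, "free_ram_mb": 61440,  "has_gpu": False},
        {"name": "np-node1",  "zone": "NCI",       "free_vcpus": 60, "free_ram_mb": 245760, "has_gpu": False},
    ]

    request = {"zone": "qriscloud", "vcpus": 16, "ram_mb": 65536, "needs_gpu": False}

    def host_can_satisfy(host, req):
        """A host is only valid if it meets *all* of the constraints at once."""
        return (host["zone"] == req["zone"]                     # factor 1: right availability zone
                and host["free_vcpus"] >= req["vcpus"]          # factor 2: enough room on one node
                and host["free_ram_mb"] >= req["ram_mb"]
                and (host["has_gpu"] or not req["needs_gpu"]))  # factor 3: flavor-specific constraints

    valid = [h["name"] for h in hosts if host_can_satisfy(h, request)]
    if not valid:
        print("No valid host was found.")   # this is (roughly) where the error comes from
    else:
        print("Candidate hosts:", valid)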

Does this mean that NeCTAR is full?

Well, it depends on how you look at it.

On the one hand, since there are periods of time where launch attempts are repeatedly failing, from the user's perspective NeCTAR is effectively unavailable for new instances.

On the other hand, if you look at the various internal monitoring, and some of the external dashboards, NeCTAR is not full.  And even if it is apparently full, it is not being used to its full capacity.

Cause #1 - The Fragmentation Problem

The compute nodes on NeCTAR typically have 32 or 64 available cores (processors), together with sufficient memory and local storage for 32 or 64 VCPUs at the standard NeCTAR sizing.  However, the VCPUs, memory and local disk of an OpenStack / NeCTAR instance all need to be on the same compute node.  So, if a compute node has only 15 cores currently unused, the NeCTAR scheduler cannot launch (say) a 16-VCPU instance there.

In an ideal world, the scheduler would take account of this, and optimize the "packing" of large and small instances within each zone, and across the entire federation.  In practice, this doesn't happen, and the net result is resource fragmentation; i.e. the available resources are scattered, and we may see that there are insufficient resources on any single compute node to launch a large instance.
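Here is a small, made-up illustration of the effect: the zone has plenty of free cores in aggregate, but no single compute node has room for a 16-VCPU instance, so the launch fails anyway.

    # Hypothetical free-core counts per compute node in one availability zone.
    free_cores_per_node = [15, 9, 12, 7, 13, 8]            # 64 free cores in total

    requested_vcpus = 16                                    # e.g. a large multi-core flavor

    total_free = sum(free_cores_per_node)
    largest_gap = max(free_cores_per_node)

    print(f"Free cores in the zone : {total_free}")         # 64 -- looks like plenty of room
    print(f"Largest gap on one node: {largest_gap}")        # 15 -- but no node can take 16 VCPUs

    if largest_gap < requested_vcpus:
        print("Launch fails with 'no valid host was found' despite spare capacity.")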

Cause #2 - Shut down Instances

The second issue is that instances that are "shut down" still tie down VCPUs and memory.  The reason for this is that the file systems for NeCTAR instances are stored on the local storage devices of the compute nodes. This gives fast access (faster than if the file systems were on another machine), but it also means that an instance is effectively tied to a given compute node. Furthermore, since the user may want to restart an instance at a moment's notice:

  • the file system has to stay on the disk, and
  • the compute node can't allocate the idle cores and memory to anything else ... because there would be no way to take them back when the instance restarts.

Cause #3 - Under-utilized Instances

The resource shortage problem is exacerbated by the fact that most NeCTAR instances are seriously under-utilized. That is to say, most of the time, most NeCTAR instances are idle.  It has been observed that the average CPU utilization rate across the NeCTAR federation is less than 5%, and many large instances sit idle for days, weeks or months at a time.

The problem is that people typically want the compute resources to be available when *they* are ready to use them.  But they don't make the connection that if they launch and "sit" on an instance to guarantee that, the resources *won't* be available for other people.  Combine this with the facts that:

  • there is no cost or penalty for a user "sitting" on an under-utilized instance,
  • there is no fairness in when people actually succeed in launching,
  • the NeCTAR node operators don't have the authority to "claw back" underutilized instances, and
  • it can't be done without potential loss of data anyway,

and we have a "perfect storm": a situation which positively encourages people to behave in a selfish and wasteful fashion, and penalizes people who give back resources they aren't using.

Things that OpenStack doesn't support

In an ideal world, we could tweak some OpenStack configuration and tuning parameters and a number of these problems would be solved. But it isn't that simple.

Scheduler-driven migration of instances

The OpenStack scheduler can be configured to operate in "fill-first" and "spread-first" modes (at the cell level). In "fill-first" mode, it will attempt to fill up compute nodes, which is beneficial from the fragmentation standpoint. However, you can still get into a situation where the available capacity is too fragmented to place multi-core instances.
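As a toy illustration of the difference (this is just the idea, not Nova's actual weigher code; in Nova the behaviour is tuned via weigher settings such as ram_weight_multiplier), fill-first packs a new instance onto the fullest node that still fits, while spread-first puts it on the emptiest node:

    # Toy illustration of "fill-first" vs "spread-first" host selection.
    hosts = {"node-a": 8, "node-b": 30, "node-c": 14}   # free VCPUs per host
    request = 8                                         # VCPUs needed

    candidates = {h: free for h, free in hosts.items() if free >= request}

    fill_first   = min(candidates, key=candidates.get)  # pack onto the fullest node that still fits
    spread_first = max(candidates, key=candidates.get)  # put it on the emptiest node

    print("fill-first picks  :", fill_first)    # node-a -> leaves node-b free for big instances
    print("spread-first picks:", spread_first)  # node-b -> fragments the big block of free cores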

Ideally, the scheduler would notice this and attempt to automatically migrate instances between compute nodes to combat the fragmentation.  In practice this functionality is not available.  If the resource pools become fragmented, manual intervention is needed to defragment them; e.g. by migrating instances, or by terminating and relaunching them.

Usage quotas

When a researcher applies to NeCTAR for an allocation, one of the details they provide is an estimate of the project's anticipated usage in "core hours".  However, while NeCTAR monitoring records your resource usage in VCPU-hours, the OpenStack infrastructure provides no out-of-the-box enforcement of VCPU-hours quotas.

Having said that, it shouldn't be that difficult to implement something that blocks a project from launching new instances once it has exceeded its VCPU-hours quota.
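For instance, a pre-launch check along the following lines would do it.  This is only a sketch: the function names are mine, and where the usage records actually come from (e.g. the NeCTAR usage reporting) is left abstract.

    from datetime import datetime, timezone

    # Hypothetical pre-launch check: refuse new instances for projects that
    # have exceeded their approved VCPU-hours budget.

    def vcpu_hours_used(usage_records, now=None):
        """Sum VCPU-hours over (vcpus, started_at, ended_at) records.
        ended_at is None for instances that are still running."""
        now = now or datetime.now(timezone.utc)
        total = 0.0
        for vcpus, started_at, ended_at in usage_records:
            end = ended_at or now
            total += vcpus * (end - started_at).total_seconds() / 3600.0
        return total

    def may_launch(usage_records, vcpu_hours_quota):
        return vcpu_hours_used(usage_records) < vcpu_hours_quota

    # Example: a project that asked for 10,000 VCPU-hours in its allocation.
    records = [(4, datetime(2014, 1, 1, tzinfo=timezone.utc),
                   datetime(2014, 3, 1, tzinfo=timezone.utc))]
    print(may_launch(records, vcpu_hours_quota=10_000))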

A possible way to deal with existing Instances is to resize them and move them onto an over-allocated compute node. This would preserve the Instance's ephemeral storage and its IP addresses, while freeing up most of the Instance's resources. (There is a potential issue with disk space management, so this may not work for all nodes.)
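A rough sketch of what that might look like with the openstacksdk is below.  The cloud, server and flavor names are hypothetical, and in practice the operator would pick a target flavor / host aggregate that has over-allocation enabled.

    import openstack

    # Sketch of the "park it on an over-allocated node" idea.
    conn = openstack.connect(cloud="nectar")            # credentials come from clouds.yaml

    server = conn.compute.find_server("idle-instance")  # hypothetical instance name
    tiny   = conn.compute.find_flavor("m1.small")       # smaller flavor on the over-allocated aggregate

    conn.compute.resize_server(server, tiny)            # keeps ephemeral disk and IP addresses
    conn.compute.wait_for_server(server, status="VERIFY_RESIZE")
    conn.compute.confirm_server_resize(server)          # commit the resize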

Fair queuing in the Instance launch process

When a researcher attempts to launch in an availability zone that is "full", the launch attempt simply fails.

Ideally, the researcher should be able to put in a request to have an instance launched when capacity becomes available. A simple model would be to implement a "first come, first served" queue, and have the queuing system email the user when the instance was launched.  However, it would be fairer if the queuing system took account of what resources people are currently using and requesting, and / or their respective "priorities" in the allocation system.
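A minimal sketch of the "first come, first served" variant is shown below; the actual OpenStack launch call and the email notification are stubbed out as parameters, and the class and method names are my own.

    from collections import deque

    # Minimal sketch of a "first come, first served" launch queue.
    class LaunchQueue:
        def __init__(self):
            self.pending = deque()

        def submit(self, user, flavor, zone):
            """Record a launch request instead of failing with 'no valid host'."""
            self.pending.append((user, flavor, zone))

        def on_capacity_freed(self, can_launch, launch, email_user):
            """Called whenever resources are released; launch as many queued
            requests as now fit, strictly in arrival order."""
            while self.pending and can_launch(*self.pending[0][1:]):
                user, flavor, zone = self.pending.popleft()
                instance = launch(flavor, zone)
                email_user(user, instance)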

NeCTAR Allocation Policy

Notwithstanding the various technical issues above, the fundamental problem is that NeCTAR Allocation Policy is hopelessly idealistic.

The fundamental premises of the NeCTAR allocation policy and procedures seem to be that:

  • Researchers know how to do "computational research" in a responsible and efficient fashion.
  • Researchers will only ask for the resources that they actually need, and are good judges of what those needs are.
  • Researchers will only launch and run instances when there is an (objectively) good reason to do so.
  • Researchers will proactively return instances when they have finished using them.
  • None of these things need to be monitored and "policed".

Unfortunately, there is ample anecdotal evidence that these premises are not borne out in reality. 

We have seen examples of researchers requesting and using resources for things that are (variously) wasteful, computationally implausible and / or bad science. The worst examples (e.g. people asking for ridiculous numbers of cores) tend to get weeded out by the NeCTAR allocation process. However, there is little doubt that some cases do slip through the cracks. The NeCTAR allocation procedures should include periodic review of projects to check that the resources are being used appropriately, and that the researchers are getting worthwhile results.

We often see examples where users launch large instances and then don't do anything useful with them for months at a time.  NeCTAR policy is silent on this. Certainly, there is no penalty to the users for doing this. 

(In practice, some NeCTAR node operators are starting to take steps to "encourage" users to kill off idle instances. However, depending on who you talk to, it appears that the node operators are contractually constrained in managing the problem.  And certainly, I have observed that some users are not amenable to "encouragement" unless there is some kind of "big stick" on display.)

Another problem is that there are a number of disconnects between the NeCTAR model of resource allocation, and what happens at the implementation level:

  • I am told that the NeCTAR Resource Allocation Committee (RAC) approves resource allocations purely on the basis of the scientific and technical merit of the application and its requirements. No account is taken of the resources that are available in the NeCTAR cloud. The instance, VCPU and memory quotas issued to projects across the federation bear no relation to the federation's capacity to satisfy the demand they imply in terms of attempts to launch Instances.
  • The NeCTAR quota system (as currently implemented) takes no account of where the resources are.  For example, to access a QRIScloud RDSI collection in Polaris via NFS, an instance has to be launched in the "qriscloud" availability zone. However, quotas are not zone specific.
  • While the NeCTAR allocation form includes a place to specify the VCPU-hours required, it is not properly explained. In practice, the RAC pays no attention to what the researcher puts into that field, and it doesn't feed into an enforced / enforceable quota anyway.
  • While the NeCTAR allocation form includes a place to include a project end date, the RAC pays little attention to this, and researchers tend to effectively ask for "as long as possible".  The net result is that it is harder to "turn over" NeCTAR projects; e.g. so that a different set of users gets to use the resources.

Finally, there has been considerable "lack of will" to deal with these issues in the past.

So what can be done?

NeCTAR WP8: Research Cloud Resource Allocations

NeCTAR has recently allocated resources to various nodes to address a number of "work packages". One of these work packages deals with improvements to the policies, procedures and technical implementation of NeCTAR Resource Allocation. Many people are hopeful that WP8 will lead to substantial improvements in this area.  However, we are unlikely to see these improvements for a number of months.

Proactive management by NeCTAR node operators

I am told that some NeCTAR node operators have started being proactive in the way that the compute resources are managed at the operational level.  They are apparently doing things like:

  • Using OpenStack monitoring to identify Instances in their data centres that are egregiously under-utilized.
  • "Leaning" on researchers to shutdown and terminate instances that they don't need.
  • Resizing instances, and moving instances onto compute nodes that have been configured to allow over-allocation at the VM level.  (These have to be done carefully because they have performance implications, both for the targeted instance and for other instances on the same compute node.)

Elastic Compute Pools

QCIF / RCC are trying an alternative approach. Rather than addressing fairness directly, we are attempting to solve the problem by forcing instances to be "recycled" after 7 days. We have set aside a pool with 256 cores which is currently available to invited users. A user can launch instances in this pool (using their regular NeCTAR quotas) with the proviso that an instance older than 7 days will be automatically Terminated ... without a snapshot.

The idea is to encourage the instances to be used in an "elastic" fashion; i.e. spin up, use, discard.  Indeed, one of our initial use-cases is the QCIF Nimrod Portal, which involves Nimrod spinning up instances on the fly to do parameter-scan computations.
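For the curious, the recycling rule amounts to something like the following openstacksdk sketch.  The cloud name is hypothetical, and a production version would presumably warn users before terminating anything.

    import openstack
    from datetime import datetime, timedelta, timezone

    # Sketch of the 7-day recycling rule for the elastic pool.
    MAX_AGE = timedelta(days=7)
    conn = openstack.connect(cloud="nectar")
    now = datetime.now(timezone.utc)

    for server in conn.compute.servers():
        created = datetime.fromisoformat(server.created_at.replace("Z", "+00:00"))
        if now - created > MAX_AGE:
            print(f"Terminating {server.name} (launched {server.created_at})")
            conn.compute.delete_server(server)   # no snapshot is taken first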

For more information, refer to "Using the QCIF Elastic Compute Pool".