... what to do when snapshots fail or get stuck.

The Problem

Snapshots are a simple way of recording the state of a NeCTAR instance. Snapshotting can be used for backup purposes, for creating a master image that is going to be "cloned", or as part of a "snapshot and relaunch" migration. 

The problem is that the procedure for taking a snapshot of an instance via the NeCTAR Dashboard doesn't always work. 

Creating a Snapshot - the normal procedure

The normal procedure for creating a snapshot is to go to the Instances panel in the NeCTAR Dashboard, find the instance, and run the "Create Snapshot" action from the instance's action pulldown menu.  You will be asked to provide a name for the snapshot, and then the Dashboard will switch to the Images panel, where your image should show up as "queued".  After the snapshot has been completed and uploaded to the Image store, the image status should change to "Active".

Unfortunately, snapshotting doesn't always work:

  • Sometimes the process fails completely, either with an error message, or silently.  In the latter case, the entry for the new image just disappears from the Dashboard's Images panel.
  • At other times, the process gets stuck with the image showing in "queued" state.

Primary troubleshooting

Snapshotting is stuck

First, it is possible that the snapshot is just taking a long time. In the past, there have been cases where the NeCTAR Swift cluster (where images are stored) was under extreme load and image uploads took a long time.  However, under normal circumstances it would be unusual for an instance snapshot to take more than 5 to 10 minutes.  If a snapshot has taken more than half an hour, then it most likely won't succeed.
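The timing guidance above can be written down as a small helper. This is only a sketch: the function name and the verdict strings are made up for illustration, and the 10 and 30 minute thresholds are the rough figures from the paragraph above, not hard limits enforced by NeCTAR.

```python
from datetime import timedelta

# Rough thresholds from the guidance above; assumptions, not hard limits.
USUAL_MAX = timedelta(minutes=10)
LIKELY_STUCK = timedelta(minutes=30)


def assess_snapshot(status: str, elapsed: timedelta) -> str:
    """Return a rough verdict for a snapshot image.

    status:  the image status as shown in the Images panel
             (compared case-insensitively).
    elapsed: the time since the snapshot was requested.
    """
    if status.lower() == "active":
        return "done"
    if elapsed > LIKELY_STUCK:
        return "probably stuck"
    if elapsed > USUAL_MAX:
        return "slow, keep waiting"
    return "in progress"
```

For example, an image that is still "queued" after 45 minutes would be classed as "probably stuck", which is the point at which deleting it and retrying becomes reasonable.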

If you decide that it is really stuck, then one strategy is to delete the image and retry the snapshot.  (I would advise doing this only once.)

When you retry, it is a good idea to first shut down the instance.  Apart from the fact that you eliminate the risk of an "unclean" snapshot, doing this eliminates a couple of the potential causes of snapshot failures.
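The "delete, shut down, retry once" advice can be sketched as follows. The three callables are hypothetical placeholders for however you drive the cloud (Dashboard clicks, command-line calls, or an SDK); nothing here is a NeCTAR or OpenStack API.

```python
from typing import Callable


def retry_snapshot_once(
    stop_instance: Callable[[], None],
    delete_image: Callable[[], None],
    create_snapshot: Callable[[], bool],
) -> str:
    """Delete the stuck image, shut the instance down, and retry one time.

    All three callables are hypothetical stand-ins supplied by the caller.
    create_snapshot should return True if the new image becomes active.
    """
    delete_image()   # remove the stuck image first
    stop_instance()  # a shut-down instance avoids an "unclean" snapshot
    if create_snapshot():
        return "snapshot created"
    # Deliberately no loop: retry once, then escalate.
    return "raise a support request"
```

The function retries exactly once on purpose. A second failure suggests a cause that repeated attempts will not fix.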

Snapshotting failed

As with a stuck snapshot, the first remedy is to try again (once), making sure that you start with the instance in the shut down state.

What next?

At this point, it would be a good idea to raise a support request with your local NeCTAR node, or via the NeCTAR RC support channel.  The problem is that snapshotting is a complicated multi-step process under the hood. There are lots of things that can go wrong, and without admin access it is difficult to diagnose and remedy them.

However, there are some situations where the root cause of the problem can be something that you have done, so it may be worth reviewing them.

Possible causes of Snapshot problems

The following is a (non-exhaustive) list of the things that can go wrong.

NeCTAR / OpenStack / hypervisor services not working

The snapshotting process involves a large number of OpenStack and lower-level services, any of which might not be working.  The list includes:

  • The Dashboard's web service.
  • The Nova and Nova-cells services run by the lead node.
  • The Nova and Nova-cells services in the cell / availability zone running your instance.
  • Various hypervisor level services on the compute node.
  • The Glance service.
  • The Swift service and the cell-level Swift proxy service. 
  • The Keystone service.
  • The various database and message bus services that underpin many of the above.

Problems in any one of these services can potentially derail snapshots.

Launch image no longer accessible

The snapshotting process needs access to the original image that your instance was launched from.

We have seen a case where someone made an image public in one tenant, launched an instance in a second tenant, and then made the image private again. When they came to snapshot the instance, it failed because access to the (now) private image was no longer permitted from the second tenant.

The solution in that case was to make the original image public again so that the snapshot could be created.

(However, this does flag that "sharing" images between tenants by temporarily making them public is not good practice.  Unfortunately, the Dashboard doesn't allow you to share images the right way.  You have to use the "glance" command-line client to do this; see "glance help member-create" for details.)
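As a rough illustration of sharing an image with one specific tenant, the sketch below builds and runs the "glance member-create" command mentioned above. It assumes the glance command-line client is installed, that your OpenStack credentials are already set in the environment, and that the command takes the image ID followed by the tenant ID; check "glance help member-create" for the exact arguments in your client version. The function names are made up for this example.

```python
import subprocess


def member_create_command(image_id: str, tenant_id: str) -> list[str]:
    """Build the glance CLI call that shares an image with one tenant.

    Assumes the argument order <image id> <tenant id>; confirm with
    "glance help member-create".
    """
    return ["glance", "member-create", image_id, tenant_id]


def share_image(image_id: str, tenant_id: str) -> None:
    """Run the command; raises CalledProcessError if glance reports failure."""
    subprocess.run(member_create_command(image_id, tenant_id), check=True)
```

Calling share_image() with the ID of the image and the ID of the second tenant gives that tenant access without making the image public to everyone.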

Snapshot image too big for Image store

While QCOW2 is a relatively efficient encoding, snapshot images can still be pretty large.  If the image size is too large, the upload to Glance could fail.  Things that could cause an image to be excessively large include:

  • An excessively large primary file system.  (The file system doesn't even need to be full. If you have previously filled the file system then the "free" disk blocks will still be non-zero, and the QCOW2 image won't be sparse.)
  • If you snapshot a running or paused instance, then the snapshot also includes a copy of the instance's memory. For an instance with a significant memory footprint, this can result in a very large snapshot image.

In some cases, this can be remedied by shutting down the instance, since that will result in a smaller snapshot.
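To see why shutting down helps, here is a back-of-envelope estimate based only on the two causes listed above. It is a crude upper bound under stated assumptions: it ignores QCOW2 overheads and any compression, and the function name and state strings are illustrative rather than taken from any OpenStack tool.

```python
GIB = 1024 ** 3


def estimate_snapshot_bytes(disk_written_bytes: int, ram_bytes: int, state: str) -> int:
    """Rough upper bound on snapshot size, per the description above.

    disk_written_bytes: blocks ever written on the primary file system
                        (not just the space currently in use).
    ram_bytes:          the instance's memory size.
    state:              "running", "paused", or "shutoff" (illustrative labels).
    """
    # Memory is only counted when the instance is running or paused.
    memory_part = ram_bytes if state in ("running", "paused") else 0
    return disk_written_bytes + memory_part
```

Under these assumptions, an instance with 20 GiB of written disk blocks and 16 GiB of memory comes out at about 36 GiB while running, but only about 20 GiB once it is shut down.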

Snapshot image too big for staging

The previous "cause" talks about images that are too large to upload. In fact, there may be another "size related" bottleneck.

When you snapshot a running system, the hypervisor will write a staging copy of the image onto the root file system of the host system.  If the image is too large, then it won't fit into the available space.  (How much space is available depends on site-specific compute node configuration details.)

Instance in inconsistent state

We have observed that snapshotting can fail if the instance's state is inconsistent.  For example, we have seen a case where a QRIScloud "special compute" instance, which was supposed to have a GPU attached, got "confused", and this manifested as snapshot failures.

If you suspect that this might be the problem, then you need to raise a support request.