The Problem

It is possible to get a NeCTAR instance into a state where it won't boot normally, or you cannot get into it.  The "nova rescue" command provides a possible way to repair an instance that is in such a state.

Some background on system rescue

System rescue is a technique that is designed to cope with a number of scenarios.  For example:

  • Damage to the boot loader or partition table on a computer's hard drive that prevents it from booting properly.
  • Loss of critical system files that results in a system that won't boot properly.
  • Corruption of the (physical) file system that requires manual intervention to repair.
  • Loss of the administrator credentials; e.g. the root password or SSH private keys.
  • Suspicion that booting up the system normally may damage things; e.g. destroy files to remove evidence.

The idea of system rescue is that you boot up a rescue image on your system, using a different boot medium; e.g. a "rescue" disk or memory stick.  (For OpenStack instances, a special rescue image is used.)  The rescue system then attaches the system's regular disks in such a way that you can repair them; e.g. run file system checkers, restore missing system files from a backup, fix broken configurations, change credentials and so on.  What you will actually need to do depends on the situation you find yourself in.

Once you are done, you shut down the system and reboot normally.
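As an illustration of the kind of repair work described above, here is a minimal sketch for a Linux rescue system.  It assumes that the disk to be repaired appears as "/dev/vdb1", holds an ext2/3/4 file system, and is mounted on "/mnt"; all of these are assumptions that you should check (e.g. with "lsblk") before running anything.  The "run" helper is a hypothetical wrapper, not part of any tool: it only prints each command unless you set RUN=1.

```shell
#!/bin/sh
# Sketch of typical repair steps inside a rescue system.
# DEVICE and MOUNT_POINT are assumptions; check them with lsblk first.
DEVICE="${DEVICE:-/dev/vdb1}"
MOUNT_POINT="${MOUNT_POINT:-/mnt}"

# Hypothetical helper: print the command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

repair_plan() {
    # Identify which block device holds the file system to be repaired.
    run lsblk
    # A file system should not be checked while it is mounted.
    run sudo umount "$MOUNT_POINT"
    # Force a check of the (assumed ext2/3/4) file system.
    run sudo fsck -f "$DEVICE"
    # Remount it so that files can be inspected or restored.
    run sudo mount "$DEVICE" "$MOUNT_POINT"
    # Reset a lost root password using the damaged system's own passwd.
    run sudo chroot "$MOUNT_POINT" passwd root
}

repair_plan
```

Run as shown, the script only prints the plan.  Once the device name and mount point have been confirmed, running it with RUN=1 in the environment executes the same commands.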

Using Nova rescue

Before you start, you need a working system (e.g. desktop or laptop) on which you can run the "nova" command-line client. I recommend that you use a Linux, UNIX or Mac OS X system, and use the command shell.  (You don't need local root access to run "nova".)

This page explains how to install OpenStack clients.  It also explains how to obtain the NeCTAR credentials that you will need to use to run "nova".  (You will need an OpenStack RC file for your project AND your personal OpenStack password.)

The procedure is as follows:

  1. Set up your credentials on your system.  This is typically done by "sourcing" the OpenStack RC file and entering the password when prompted.  Note that this saves the credentials and other information as environment variables for the current shell session.
  2. Put the instance into rescue mode by running "nova rescue <instance>" where "<instance>" is the name or id of the instance that you want to "rescue".  (The command outputs a temporary password, but you can ignore that.)
  3. The instance will shut down and reboot into "rescue" mode.
  4. Use SSH to log in to the instance, using the standard "admin" account.  To do this, you will need the private key of the key-pair that is associated with the instance (e.g. as shown in the NeCTAR dashboard).
  5. When you log in, you will find a full Linux system (not just a cut-down rescue system), with your file system to-be-rescued mounted (typically) on "/mnt".
  6. Do your rescue work.
  7. When you are done, log out from the system.
  8. Run "nova unrescue <instance>" to take the instance out of rescue mode.  This should cause the instance to reboot in normal mode.  You can check this by logging in using SSH, and / or by looking at the instance "state" as shown by the Dashboard.
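The steps above can be sketched as a short shell session.  This is a sketch under stated assumptions, not a definitive script: the RC file path, instance name, key file and IP address are all placeholders, and the "run" helper is a hypothetical wrapper that only prints each command unless RUN=1 is set.  In a real session you would also need to wait for the instance to finish rebooting before the SSH step will succeed.

```shell
#!/bin/sh
# Sketch of the rescue procedure.  All values below are placeholders.
RC_FILE="${RC_FILE:-$HOME/my-project-openrc.sh}"   # hypothetical RC file path
INSTANCE="${INSTANCE:-my-instance}"                # name or id of the instance
KEY_FILE="${KEY_FILE:-$HOME/.ssh/my-nectar-key}"   # private key of the key-pair
ADDRESS="${ADDRESS:-203.0.113.10}"                 # placeholder IP address

# Hypothetical helper: print the command; execute it only when RUN=1.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    fi
}

rescue_session() {
    # Step 1: source the RC file so that the credentials become
    # environment variables of this shell (prompts for the password).
    if [ "${RUN:-0}" = "1" ]; then
        . "$RC_FILE"
    else
        echo "+ . $RC_FILE"
    fi
    # Steps 2-3: put the instance into rescue mode; it then reboots.
    run nova rescue "$INSTANCE"
    # Steps 4-7: log in as "admin", do the rescue work, then log out.
    run ssh -i "$KEY_FILE" "admin@$ADDRESS"
    # Step 8: leave rescue mode, then look at the instance status.
    run nova unrescue "$INSTANCE"
    run nova show "$INSTANCE"
}

rescue_session
```

Run as shown, the script prints the commands in order without contacting the cloud, which makes it easy to check the placeholders before doing it for real.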

How Nova rescue works

The mechanics of rescue mode are not documented in the official OpenStack documentation, but my understanding is that the process is as follows:

  1. When the Nova API gets the initial request, it generates a password and calls through to the Compute API.
  2. The Compute API checks that the system isn't already being rescued, and if not, it shuts down the instance and launches a new instance with the same metadata; e.g. network address, primordial instance image, ssh keypair and so on.
  3. On launch, the rescue instance presumably goes through the first-time boot procedure, and cloud-init inserts the public key for the admin account.  (I have noticed that the rescue instance comes up with a different MAC address, but you get the old MAC address back when the instance is "unrescued".  This is good.)
  4. Once the rescue instance is launched, something then attaches the instance-to-be-rescued's image, and it (typically) gets mounted on "/mnt".
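The "already being rescued" check in step 2 happens on the server side, but a similar check can be made from the client before asking for a rescue.  The sketch below is only a client-side analogue, not how the Compute API does it.  It assumes that "nova show" prints its usual table with a "status" row, and that a rescued instance reports the status "RESCUE"; the function names are hypothetical.

```shell
#!/bin/sh
# Client-side sketch: skip "nova rescue" if the instance is already rescued.

# Read "nova show" table output on stdin and print the value of the
# "status" row, e.g. "| status | RESCUE |" gives "RESCUE".
parse_status() {
    awk -F'|' '{
        key = $2; val = $3
        gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim the padding around the key
        gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim the padding around the value
        if (key == "status") print val
    }'
}

# Print the current status of the instance given as the first argument.
instance_status() {
    nova show "$1" | parse_status
}

# Request a rescue only when the instance is not already in rescue mode.
rescue_if_needed() {
    status=$(instance_status "$1")
    if [ "$status" = "RESCUE" ]; then
        echo "$1 is already in rescue mode"
    else
        nova rescue "$1"
    fi
}
```

With credentials loaded, "rescue_if_needed my-instance" would either report that the instance is already rescued or pass the request on to "nova rescue".  The same "instance_status" function can be used after "nova unrescue" to see when the instance has returned to its normal status.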

This page provides more information.

A few things are not clear:

  • What would happen if the original boot image no longer exists in the image store? Is there a fallback rescue image?
  • What would happen if OpenStack was unable to launch a rescue instance because the resources were not available? Is it possible to use a smaller flavour?
  • Would it be possible to launch the rescue instance with a different keypair?
  • What would happen if the instance-to-be-rescued's image had file system corruption; i.e. it needed "fsck"-ing?  Would this happen automatically?  Would the file system be left unmounted so that you could "fsck" it from the rescue system?