... what to do if you can't SSH to a NeCTAR instance

1 - The Problem

These days, a lot of my time is spent providing tier 1 and tier 2 support for NeCTAR Research Cloud users.  One of the issues that we see fairly regularly is new users who cannot SSH to their NeCTAR instances.  In some cases, the root cause is that the user has made a mistake.  In other cases, the root cause is ... something else.

The goal of this article is to provide "self help" instructions for diagnosing and remedying common causes of SSH problems.

(This is a complex area, and I may have missed something important.  If so, please bring it to my attention.)

1.1 - A check-list for common mistakes

Before we get into detailed problem diagnosis and troubleshooting, it is worth running through some common mistakes that can lead to SSH connections failing:

  1. Did you configure the instance's Security Groups to allow access on the SSH port you are using?  (Normally you need TCP/IP access on port 22, from at least the system you are attempting to connect from.)
  2. Have you created or uploaded a public key to the NeCTAR Dashboard?
  3. If you have uploaded multiple keys, did you select one of them at launch time? If no key is selected, your newly launched instance won't plant a key in the admin account, and you typically won't be able to log in.
  4. Did you tell "ssh" to use the correct private key? (The one that matches the public key that you used when launching.)
  5. Did you tell "ssh" to use the correct account name?  (A NeCTAR instance's admin account name varies depending on the Linux image used.)
  6. Are you connecting to the right host?  (Check the IP address you are using ...)
  7. Is the instance running?
  8. Are the ownership and permissions correct on your ~/.ssh directory and its contents?  (Client and server side ...)
  9. Starting with OpenSSH 7.0, use of DSA keys has been deprecated as these are considered to be "less secure".  If you are using OpenSSH 7.0 or later with DSA keys, you need to take extra steps; see https://www.gentoo.org/support/news-items/2015-08-13-openssh-weak-keys.html for details.  (A better solution is to stop using DSA key-pairs)

If one of those things "rings bells", then the solution may be self-evident, and you can avoid reading the rest of this article.

2 - Some general background on SSH

The name "ssh" is a contraction of "secure shell".  It is:

  • a protocol for providing a secure terminal connection from one computer to another one over a network connection, and
  • the name of the UNIX / Linux / BSD / MacOS command that is typically used to establish an SSH connection.

(On Windows, you typically use a third party tool such as "putty" instead of the "ssh" command.)

The SSH protocol is TCP/IP-based. It uses port 22 as the default port number for making connections.  The server side of the protocol is typically provided by an "sshd" application running with "root" privilege.

SSH does a number of things from a security perspective:

  • It ensures that you connect to the correct server, and not to some other server that (in the worst case) is attempting to impersonate it.
  • It authenticates you to the server.
  • It protects all information sent over the established connection.

SSH security is based on the use of public key cryptography.  In particular, the "host key" that is used to identify a server is the public key of a key pair, and the remote server will typically have the public key of the key-pair that identifies you. This offers strong protection, provided that sufficient measures are taken to avoid the disclosure of the private keys.  Private keys should either be kept on a system or media that is secure against physical or network access at all times, or be protected with a strong passphrase.

2.1 - What happens when an SSH connection is established.

The life-cycle of an SSH connection is as follows.  (Note that this has been simplified to make it easier to understand.)

  1. The user on machine "host1.example.com" runs the "ssh" command as follows:
    $ ssh fred@host2.example.com
  2. The "ssh" command on "host1" opens a TCP connection to "host2" on port 22.
  3. The "sshd" service running on "host2" accepts the TCP connection.
  4. The "ssh" command and "sshd" services exchange host keys.  The "ssh" command typically uses the host key of the remote host to check that it has connected to the right server.
  5. The "ssh" and "sshd" complete the "negotiation" and set up a secure connection which encrypts all data exchanged from now on until the end of the session.
  6. The "sshd" end may send a pre-login banner to the "ssh" end, and it is output to the user's terminal.  (The "message of the day" itself is shown after a successful login.)
  7. The "ssh" command sends the user name ("fred") to the remote "sshd" service together with a password and/or an agreed token that has been signed with the user's private key.
  8. The "sshd" service uses the supplied credentials to identify and authenticate the user, and see if s/he is permitted to log in.
  9. If access is granted, a "shell" is created for the user, and "wired up" to the secure network connection.
  10. At the "host1" end, the user types characters.  The characters are sent to the user's shell on "host2" which interprets them as commands and runs the commands, and sends the output back to "host1" for the user to read.
  11. Finally, the user exits the shell or disconnects the SSH connection, or the connection is disconnected by the "sshd" service.

The life-cycle is the same for SSH connections to NeCTAR instances with one important restriction:  The default behaviour of "sshd" on a NeCTAR instance is to refuse to use a password to authenticate connections over the network. You have to authenticate using public key authentication, i.e. you need the private key that corresponds to the public key used when the instance was launched. This restriction is for security reasons.  It is to protect against hackers who will attempt to break into your instances by guessing passwords. 

(We strongly recommend that you DON'T change the "sshd" configuration to allow "ssh" access with password authentication.  It is a dangerous thing to do.)

3 - Can you talk to the sshd service?

When you attempt to connect using your ssh client, a number of things can go wrong. The first step in diagnosing the problem is to figure out if your ssh client is able to contact the "sshd" service.  Look at the initial output from your ssh client:

  • Does nothing happen for a number of seconds, followed by a "Connection timed out" message?  This means that the client got no response to the TCP/IP "connect" messages that it sent to the server. 
  • Do you see a "Connection refused" message?  This means that you have reached the server, but there was no service "listening" for incoming requests.
  • Do you see a "No route to host" or "no route to network" message?  This most likely means there is a more general networking problem.

For other messages, you have most likely managed to establish (at least) a low-level network connection to the sshd service.  If you are unsure about this diagnosis and you are on a Linux or Mac OSX system, you can run "telnet <ip-address> 22" to see if you can connect.  If the connection establishes, and telnet tells you what its escape character is, then you have managed to talk to sshd.
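
If it helps, the triage above can be sketched as a small shell function.  This is only an illustration: the function name and the labels it prints are my own, and the "Network is unreachable" pattern is an assumption about how your system words the "no route to network" case.

```shell
# Map the connection error in ssh's output to the case described above.
# (Hypothetical helper; the labels are arbitrary.)
classify_ssh_error() {
    case $1 in
        *"Connection timed out"*)
            echo "timeout" ;;      # no response at all: see section 3.1
        *"Connection refused"*)
            echo "refused" ;;      # host reached, nothing listening: see 3.2
        *"No route to host"*|*"Network is unreachable"*)
            echo "no-route" ;;     # general networking problem: see 3.3
        *)
            echo "other" ;;        # most likely you reached sshd: see 4
    esac
}
```

You could feed it the output of a real attempt, for example:

  $ classify_ssh_error "$(ssh -o ConnectTimeout=10 <account>@<ip-address> true 2>&1)"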

3.1 - Diagnosis for "connection timed out"

A "connection timed out" message means that the ssh client got no response to the TCP/IP "connect" messages that it sent to the server. The most likely explanations are that access is blocked by a firewall or by "messed up" network routing.  You will also get this if you attempt to connect to a NeCTAR instance that is stopped, paused or shut down, or that has been terminated and whose IP address has not been recycled.  Some of these possible causes you can diagnose and fix for yourself.  For others, you will need some help.

3.1.1 - Check that your instance is running.

Login to the NeCTAR Dashboard, select your project, and choose the "Instances" panel.

Do you see the instance you are trying to connect to?  Does it have state "Active"? If not, that is the likely cause of your being unable to SSH to it.

  • If it is currently in state "Build", the instance is still being created.  This can take a varying length of time, but if an instance has been in "Build" state for 10 minutes or more, there is probably something wrong with the NeCTAR infrastructure.
  • If it is currently in state "Shutoff", you will need to "Hard Reboot" it using the "Actions" pull-down.
  • If it is state "Paused" or "Suspended", you will need to "Resume" it using the "Actions" pull-down.

If the instance is in state "Active", you can attempt to connect to its virtual console. Use the "Console" action in the "Actions" pull-down, and click on "Click here to show console". Assuming that the VNC console infrastructure is working, you should now see your instance's virtual console.  If the instance is up and running, there should be a login prompt. (If you have set a password on the "root" account, then you can login, and check for problems inside your instance.)

3.1.2 - Check your instance's Security Groups.

Login to the NeCTAR Dashboard, select your project and choose the "Instances" panel.  Next, click on the link for your instance in the Instance Name column.  This should take you to the "Instance Overview" page.  Scroll down until you can see "Security Groups".

For SSH access to work, at least one of the rules needs to allow TCP access on port 22 with a CIDR that includes the apparent IP address that you are connecting from.  (For more information on CIDR notation, please refer to the Wikipedia article on "Classless Inter-domain Routing".) 

For example:

ALLOW 22:22 from 0.0.0.0/0

This says "allow incoming and outgoing traffic on port 22 from / to any external IP address".  The CIDR "0.0.0.0/0" means "anywhere", and is the most common way to open an SSH port.  (It is safe to do this provided that you only allow public key authentication AND you take appropriate steps to keep your private key secure.)

ALLOW 22:22 from 130.102.131.0/24

This says "allow incoming and outgoing traffic on port 22 from / to any IP addresses in the range 130.102.131.0 to 130.102.131.255".
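
If you are unsure whether a rule's CIDR covers the address you are connecting from, the arithmetic can be sketched in shell.  The function names here are my own, and the sketch assumes IPv4 addresses without leading zeros and a shell with 64-bit arithmetic (true of bash and dash on current 64-bit systems).

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split "a.b.c.d" into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 falls inside CIDR block $2 (e.g. 130.102.131.0/24).
ip_in_cidr() {
    net=${2%/*}          # part before the slash
    bits=${2#*/}         # prefix length after the slash
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}
```

For example, "ip_in_cidr 130.102.131.77 130.102.131.0/24" succeeds, while the same test with 130.102.132.1 fails.  Remember that the address to test is the one your connection appears to come from, which may not be the address configured on your laptop / PC.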

If there is no Security Group rule allowing access on port 22 (or your chosen alternative SSH port), then connection attempts will time out.  You can remedy this by adding a new Security Group to your instance, or modifying one of its existing Security Groups.  The former can be done via the Dashboard, using the "Edit Security Groups" action on the instance.  The latter can be done using the "Access & Security > Security Groups" panel.

Changes to access rules should propagate to the instance immediately.

3.1.3 - "Messed up" Networking

This covers a couple of failure syndromes that have been observed in NeCTAR OpenStack which can lead to connection timeouts.

The NeCTAR federation is built using OpenStack "nova cells". There is a long-standing issue in "nova cells" where custom Security Group definitions do not propagate reliably to the compute node when a new instance is launched.  The result is that the access rules for an instance don't always match what the Dashboard says. 

There is no way for you to diagnose this problem.  However, we have observed that "Hard reboot"-ing an instance can help.  Another workaround is to modify the rules after the launch.  These workarounds may be sufficient to get the access rules in sync.

A second syndrome relates to NeCTAR instances that have two ethernet interfaces.  (Currently, this only applies to instances in QRIScloud.) We have seen cases where the local DHCP server has incorrectly advertised that the external gateway is accessible via both ethernet interfaces.  This caused the interfaces to set up their network routing incorrectly, leading to the instance not accepting any network connections.

If you have this kind of problem (and the workarounds don't help) you need to raise a support ticket with your local cloud support team. 

3.1.4 - Other causes of "connection timed out"

These problems could also be caused by:

  • Networking issues (including firewalls) on your laptop / PC, on your local network, on AARNET or in the data centre that hosts your instance.
  • Incorrectly configured networking / fire-walling within your instance.  (However, I would imagine that you would know about this.)

3.2 - Diagnosis for "Connection Refused"

A "connection refused" message means that you have reached the server, but there was no service "listening" for incoming requests.  In that situation, the operating system will respond to incoming requests by "refusing" them.

The possible explanations include:

  • The "sshd" service has not been enabled to start automatically on system start-up.
  • The "sshd" service has failed to start because its configurations are wrong.
  • The "sshd" service has been stopped manually.
  • The system has not started up properly.  For example, it may have detected a file system corruption and booted into single user mode.
  • The "sshd" service has been configured to "listen" for requests on a non-standard port.

None of these things should happen if you are trying to get into a new instance that has been launched from one of the standard NeCTAR images.

Remedying the various forms of this problem generally requires "root" access to the system.  This is problematic unless you had the foresight to set a "root" password before the problem occurred.

3.3 - Diagnosis of other network errors.

There are a variety of other possible network-level errors that you could experience.  However, since they typically indicate problems that need a network administrator to fix, there is little to be gained in trying to diagnose the problem yourself.  Contact your local IT support for help.

4 - SSH problems

Assuming that your "ssh" client has managed to contact the "sshd" service, there are three new error syndromes to consider:

  • A message of the form "The authenticity of host .... cannot be established" is a warning rather than an error.
  • A message of the form "WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!" is also a warning, but one that you can't just "click through".
  • A message of the form "Permission denied (...)" means that your login attempt has been refused.

4.1 - Dealing with "Authenticity cannot be established"

As I mentioned in the "background" section, when an SSH connection is being established, the "sshd" service uses public key cryptography to try to tell the client who it really is. 

In a nutshell, the normal process is as follows:

  • The server encrypts a message including a timestamp using its host private key, and sends it to the "ssh" client, together with the host public key.
  • The client looks up the host public key in the user's "~/.ssh/known_hosts" file, and checks that the key tendered by the server is the right one for the remote IP address and/or DNS name.
  • The client then uses the public key to decrypt the message.  If that succeeds, and the timestamp is current, then the client knows that that server is actually the correct one to talk to.

In this case the message is saying that "ssh" has no prior knowledge of the server to which you are trying to connect.  The problem is that the "~/.ssh/known_hosts" file does not contain a record for the IP address of your NeCTAR instance. Assuming that this is a new instance, or it is the first time that you have connected from the client machine, this is expected behaviour.  However, it could also be an indication that you have used the wrong IP address.

The normal solution is to allow "ssh" to add the host's key (recorded against its IP address) to "~/.ssh/known_hosts".  This will allow the connection establishment to proceed to the next step.

4.2 - Dealing with "Remote host identification changed"

The issue here is related to the previous one, but it is potentially more alarming.

The message is saying (in effect) "last time you talked to host X it had key A, but now it has key B".  There are a couple of ways that this could happen:

  • Whenever a new NeCTAR instance is launched, its operating system generates a new host key-pair to identify it.  Suppose that you have previously talked to a NeCTAR instance with address X and it has now been terminated.  If you launch another instance in the same availability zone, there is a significant chance that the new instance will have the same IP address as the previous one.  But when you connect, "ssh" will notice that the host key is different.
  • Suppose that someone else has managed to set up an instance that has "borrowed" your instance's IP address.  (They would need to break security or subvert network routing to do that, but let's assume that that is possible.)  Unless they have also managed to steal your instance's host key as well, when you connect using "ssh", the remote server will send a host key that doesn't match the one that was used previously.

As you can see, distinguishing the "real" and "false" alarms requires you to make a judgement.

Assuming that this is just a false alarm, then the solution is to open the "~/.ssh/known_hosts" file in a text editor and remove the line that matches the IP address you tried to connect to.  Then you need to retry the "ssh" command, and allow it to add the new host key to the file when it asks.
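
As an alternative to hand-editing, the "ssh-keygen" command that ships with OpenSSH can remove the stale entry for you.  Here is a minimal sketch; the wrapper function name is my own.  One advantage is that this should also cope with "known_hosts" files in which the host names have been hashed, where a text editor search for the IP address finds nothing.

```shell
# Remove every known_hosts entry for one host.
#   $1: IP address or host name you tried to connect to
#   $2: known_hosts file (optional; defaults to your own)
forget_host() {
    ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}
```

After running "forget_host <ip-address>", retry the "ssh" command and accept the new host key when prompted.  The previous contents are kept in a backup file with ".old" appended to the name.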

4.3 - Dealing with "Permission denied"

The "Permission denied" message is a generic message that says that the remote "sshd" service has not accepted your credentials.  There are a number of different reasons why this may occur:

  • You may have used the wrong admin account name.
  • You may have attempted to use an account that is not enabled for SSH access.
  • The account's "~/.ssh" directory or "~/.ssh/authorized_keys" file may have incorrect permissions or ownership.
  • The account you are trying to use could be locked or be a "no login" account.
  • You are trying to login as "root" via SSH.
  • You may have tried to authenticate using a password (rather than public key authentication) which a NeCTAR configured "sshd" will not allow.
  • Your "ssh" client may be trying too many keys, resulting in "fail2ban" blocking the login.  This can happen if you put lots of private keys into your "~/.ssh" directory.
  • You may have a problem with client-side permissions on key files or "~/.ssh".
  • You may have used "-i" and specified the key pathname incorrectly.
  • A "fail2ban" block from a previous failed attempt to connect could still be in force.  Fail2ban blocks all logins from a given IP address for 5 to 10 minutes.
  • A problem at instance boot time may have prevented "cloud-init" from retrieving the admin public key from the compute node's metadata service.

Unfortunately, it is difficult, if not impossible, to distinguish these cases based on error messages. This is largely a result of a standard security design principle: you don't want to give hackers clues that will make it easier for them to break in.

(Suppose you were a hacker and you needed to guess a username / password pair. If the log in system tells you that you have the password wrong, you can infer that you have the username correct.  That drastically reduces the "search space".  You now only need to guess the correct password.)

While the "Permission denied" messages tend to be uninformative, you can sometimes glean clues as to what might be going wrong by running "ssh" with debug messages enabled.  I won't go into this, because it requires deep knowledge of what ought to happen.  If you get to the point where looking at debug output is your only remaining option, you should raise a support request with your local cloud support team.

4.3.1 - Incorrect admin account names

Different Linux images have different names for the default admin account: 

  • for CentOS, Scientific Linux & Fedora, use "ec2-user",
  • for Ubuntu use "ubuntu", and
  • for Debian use "debian".

The actual default admin account name is determined by the "cloud-init" configurations on your instance.

It is also possible that you, your team, or the people who created the image you are using have set up an alternative admin account. It may be worth asking, or checking the relevant documentation.

4.3.2 - Account not enabled for SSH access

If you simply create an account on your server using the standard "adduser" or "useradd" utility, it will not be enabled for SSH access.

To enable SSH access to a user account, you need to do the following as a minimum:

  1. Obtain the user's SSH public key file in a form that you can copy-and-paste.
  2. Make sure that the user's home directory is owned by the user, and does not have "group" or "other" write access.
  3. Create a directory called ".ssh" in the user's home directory.
  4. Make sure that the "~/.ssh" directory is owned by the user, and has access "drwx------".
  5. Using a text editor, open "~/.ssh/authorized_keys", copy-and-paste the public key into it, and save it.
  6. Make sure that the "~/.ssh/authorized_keys" file is owned by the user, and has access "rw-------".

Note that it is critical to get the ownership and permissions correct.  If the "sshd" service thinks that the keys file is potentially insecure, it will "deny" attempts to login on that account without explanation.
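
The permission-related steps above can be sketched as a shell function.  This is an illustration rather than a complete recipe: the function name is my own, it takes the home directory and the public key text as arguments, and it does not set ownership.  If you run it as "root", you still need to "chown" the ".ssh" directory and its contents to the user afterwards.

```shell
# Install a public key for the account whose home directory is $1.
#   $1: the user's home directory
#   $2: the public key, as one line of text
install_pubkey() {
    home_dir=$1
    pubkey=$2
    chmod go-w "$home_dir"                      # no group / other write access
    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"                  # drwx------
    touch "$home_dir/.ssh/authorized_keys"
    chmod 600 "$home_dir/.ssh/authorized_keys"  # rw-------
    # Append the key only if that exact line is not already present.
    grep -qxF -e "$pubkey" "$home_dir/.ssh/authorized_keys" ||
        printf '%s\n' "$pubkey" >> "$home_dir/.ssh/authorized_keys"
}
```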

4.3.3 - Locked or "no login" accounts

Linux accounts can be locked (e.g. by running "chage -E 0 <user>") and can be configured with a "no login" shell.  The former will definitely lead to login "denial" without explanation, and the latter may too.

(Note that "passwd -l <user>" only locks out password-based login.)

4.3.4 - Root login won't work

The default configuration for the "sshd" service on NeCTAR instances (launched from the official images) will not allow "root" login via SSH.  Adding an "authorized_keys" file (as above) won't help.

This configuration choice has been made for security reasons, and you should not change it.

  • If you need a "root" shell, log in using your admin account and run "sudo bash".
  • If you actually need to log in as "root", you can set a password for the root account ("sudo passwd root") and use the NeCTAR Dashboard to connect to the virtual machine's console.  (In fact, setting a root password is recommended.)

4.3.5 - Password authentication won't work

The default configuration for the "sshd" service on NeCTAR instances (launched from the official images) does not allow password-based authentication via SSH.  Setting a password on an account won't allow you access over SSH.

This configuration choice has been made for security reasons, and you should not change it.

Password-based authentication over SSH will make you vulnerable to hacking by password guessing.  NeCTAR instances are potentially open to access from the public internet (depending on your Security Group settings). We see a continual stream of attempts by hackers to break into machines over SSH by guessing passwords.

4.3.6 - Blocking by fail2ban

The standard NeCTAR images have the "fail2ban" service installed and active by default.  Fail2ban is a service that responds to repeated failed attempts to authenticate from the same IP address by blocking all further attempts from that IP address for a period. It is an anti-hacker measure that is designed to make it harder to use "brute-force" methods to gain access to your system.

The problem is that Fail2ban blocking does not (and cannot) distinguish between a hacker guessing passwords, and you (for example) using the wrong key or account name.  They are both liable to result in blocking.

The default ban time on NeCTAR instances is 10 minutes.  If you suspect that your current IP address has been blocked, wait 10 minutes before trying again, or try connecting from a different IP address.  (If you can login by other means using an account with "root" access, you can cancel a ban.  However, this is inapplicable to a newly launched NeCTAR instance.)

4.3.7 - Too many keys in "~/.ssh"

When you try to ssh to a machine in the normal way, like this:

$ ssh <account>@<ip-address>

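the "ssh" command offers the remote "sshd" service every private key that it knows about: the key files with default names in your "~/.ssh/" directory (such as "id_rsa"), any keys named in your "ssh" configuration, and any keys held by a running SSH agent.  (Some desktop key managers load every key they find in "~/.ssh/" into the agent.)  If you have lots of different keys, "ssh" will offer them one at a time until the remote service accepts one of them.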

Unfortunately, each one of these offers appears to "fail2ban" to be a failed login attempt.  Thus, if you have multiple keys and "ssh" tries them in an inconvenient order, "fail2ban" can block your IP address before "ssh" offers the right key.

There are a few ways to deal with this:

  • Remove any outdated key files from your "~/.ssh/" directory.
  • Move all but your most commonly used key file to a different directory (making sure that it is owned by you and has access "drwx------").
  • Use the "-i <keyfile-pathname>" option to tell "ssh" to use a specific keyfile.
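
One caveat with the last option: if you have an SSH agent running, "ssh" may still offer the agent's keys as well as the one named with "-i".  The "IdentitiesOnly" option tells "ssh" to restrict itself to the configured key files.  A minimal sketch follows; the wrapper function name is my own.

```shell
# Run ssh offering only the named key file.
#   $1: path to the private key; remaining arguments are passed to ssh
ssh_with_key() {
    keyfile=$1
    shift
    ssh -o IdentitiesOnly=yes -i "$keyfile" "$@"
}
```

For example:

  $ ssh_with_key ~/.ssh/nectar.pem <account>@<ip-address>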

4.3.8 - Client-side private key file permissions

I mentioned above that "sshd" is picky about keyfile and directory ownership and permissions. In fact, the same thing applies to the "ssh" command. If "ssh" thinks that your private key files could be readable or writeable by other users, it is liable to ignore them.  Sometimes it tells you, but sometimes it doesn't.  Either way, the correct key is not offered to the remote "sshd", and you get a "denied" login.

The solution to this is to make sure that your private key files are owned by you and have access "rw-------" (or less).  It is also a good idea to put them in a "drwx------" directory owned by you.
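
Here is a rough way to check a key file before blaming the server.  The function name is my own, and the check simply reads the "group" and "other" permission columns from "ls" output; it does not check ownership.

```shell
# Succeed only if the file $1 grants no access at all to "group" or "other".
key_is_private() {
    perms=$(ls -ld "$1" | cut -c5-10)   # the six group / other permission flags
    [ "$perms" = "------" ]
}
```

If the check fails for your key file, "chmod 600 <keyfile-pathname>" will normally put things right.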

4.3.9 - Key pathname issues with "-i"

A common mistake is to provide an incorrect pathname for the key file when using the "-i" option.  For example, if my keyfile is "~/.ssh/nectar.pem", I might try to use it as follows:

  $ ssh -i nectar.pem <user>@<host>

The problem is that if "ssh" cannot find a key file in the expected place (in this case "./nectar.pem") then it prints a warning and attempts to use other keys that it can find.  It is easy to miss the warning message, and focus on the "Permission denied" message.
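
One way to avoid missing the warning is to check the pathname yourself before running "ssh".  A small sketch, with a function name of my own choosing:

```shell
# Fail loudly if the key file $1 is missing or unreadable.
require_keyfile() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "key file not found or not readable: $1" >&2
        return 1
    fi
}
```

Used like this, the "ssh" command only runs if the key file is really there:

  $ require_keyfile ~/.ssh/nectar.pem && ssh -i ~/.ssh/nectar.pem <user>@<host>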

4.3.10 - Cloud-init problems at boot time

You may have wondered how a newly launched NeCTAR instance knows what public key to associate with the admin account.  Does it patch the key into the boot image?

In fact, this and similar tasks are handled by the "cloud-init" service, and the compute node's metadata service.  When an instance is launched, the metadata service is populated with name-value pairs, including the instance's name and the public key to be used for the instance's admin account.  When the instance boots up, the "cloud-init" service fetches the name-value pairs from the metadata service, and uses them to initialise things.  For example, "cloud-init" gets the default admin account name from its configuration file, creates the account (if necessary), and sets up a "~/.ssh" directory.  Then it fetches the key from the metadata service, and adds it to the "authorized_keys" file if it isn't there already.  (The "cloud-init" service does other things too; check the manual entry and config files if you are interested.)
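
If you can get a shell on the instance by other means (for example, via the console with a "root" password), you can ask the metadata service for the key yourself.  The sketch below assumes that the instance can reach the EC2-style metadata address that OpenStack normally provides and that "curl" is installed; the function name is my own.

```shell
# Print the public key that the instance was launched with, if the
# metadata service is reachable.  Run this on the instance itself.
fetch_launch_key() {
    # -s: no progress output; -f: fail on HTTP errors;
    # --max-time: give up after 10 seconds rather than hanging
    curl -sf --max-time 10 \
        http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key
}
```

If this prints nothing or times out, that is consistent with a metadata service problem on the compute node.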

Sometimes "cloud-init" doesn't work properly.  One problem that occasionally happens on NeCTAR OpenStack is that the metadata service on a compute node dies. Then, when the "cloud-init" service attempts to fetch the key for the admin account, the request fails.  After a bit, "cloud-init" gives up and the instance startup continues.  Unfortunately, you are left with an instance where the admin account doesn't have an SSH key.  As a result, when you attempt to login, you get a "Permission denied" error.

If "cloud-init" is failing, there are often clues in the system's console log file.  You can access this from the NeCTAR Dashboard.

When "cloud-init" is misbehaving, you will need to raise a support ticket to get the node operators to look at the problem. Fortunately, the problem can be fixed if they restart the metadata service and then reboot your instance.  When "cloud-init" runs again during the reboot, it completes the initialisation.