Setting up the cluster

From APL_wiki

Revision as of 03:03, 24 February 2016

Building a Beowulf Cluster (MPICH2)

Prerequisites

Let's Go

  1. Get a copy of Ubuntu 14.04 Server.  If you really want to, you can install a graphical desktop such as Unity on it later.

1? Install Ubuntu

This in itself can be a process, and is slow.  At least on my computers it was.

  • Make the bootable USB.
    • At the time of writing this, you can download Ubuntu 14.04 Server here.
    • http://www.ubuntu.com/download/server
    • I'm sure that if a newer version comes out, it should work as well, although I tried initially with 15.10 (Desktop) and had some trouble with that.
  • Choose a master node
    • People recommend that the master node be the most powerful.  Of course if you are using identical computers just choose whichever.  In my case, I had 1 computer with 4 cores, whereas the rest had 2.
  • Plop the USB drive into the master node and get to the install screen.
    • If the machine isn't booting from the USB drive, the drive is probably not at the top of the boot sequence.  To change this you'll need to go to the BIOS setup screen.
    • While the computer is booting up, press F2 or the Delete key.  (Manufacturers choose different keys, but it is usually one of those two.)
    • Find the setting called something along the lines of "Boot Sequence".
    • Identify the USB drive and move it to the top.  Save and reboot.

Once you've got that all working we can start installing Ubuntu!

1. Installing Ubuntu For Reals

The installation is mostly straightforward but just follow along to make sure we're on the same page.  There are a few things that need to be done during this process - like not encrypting the home directory.

  1. Do not detect keyboard layout - Choose EN - US (or whichever you'd prefer) manually
    • Why?  I'm actually not sure, it was just what I was directed to do.  Follow directions and stop asking questions.  These were directions for installing 12.04, so maybe there was a bug in the autodetect feature.
  2. Set host name to ub<x>, where x is the node number.
    • For example, the master node should be ub0.
  3. Set the name of the new user to "new-user".
  4. Set the account name to "new-user"
    • This might not be an option for you.  I don't think I was given this option.
  5. Do not encrypt the home directory.
    • If you do, setting up a shared folder with nfs will be rather difficult.
  6. Partition Method: Guided Use Entire Disk
  7. Remove existing logical volume data
    • Basically just allow it to overwrite whatever it wants.
  8. Leave HTTP proxy information blank
  9. No Automatic Updates
  10. Choose software to install.  Just select OpenSSH
    • Note that you actually have to press the spacebar to select the option.  If you press Enter it will just go on without installing.  If you do this, it's OK.  You can install it later with sudo apt-get install openssh-server.
  11. Install GRUB
  12. Hooray!
  13. Repeat this process for your other nodes.  You can of course do some now and some later (I did), though it may be easier to just do them all at once.

2. Setting up your hosts file

You'll want to do this so that you don't have to type in an entire IP address every time you want to communicate with another node.  Write a list of each IP address and which node it corresponds to.

  • Set this up on every node.  Go edit the hosts file like so.

allnodes: sudo nano /etc/hosts

Write to the file so it looks like this

127.0.0.1	localhost
192.168.1.6	ub0
192.168.1.7	ub1
192.168.1.8	ub2
192.168.1.9	ub3

Be sure that each name is only used once, and replace the IP addresses with yours.  You can find the IP address of each node by running ifconfig in the terminal.
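If you have more than a handful of nodes, you may prefer to generate these lines rather than type them.  Here is a rough Python 3 sketch of that idea.  It renders the hosts entries from a list of (address, name) pairs and refuses to continue if a name is used twice.  The function name render_hosts is made up for this example, and the addresses are just the example ones from this page.

```python
def render_hosts(nodes):
    """Return /etc/hosts text for a list of (ip_address, hostname) pairs."""
    seen = {"localhost"}
    lines = ["127.0.0.1\tlocalhost"]
    for ip, name in nodes:
        # Each name may only be used once, as noted above.
        if name in seen:
            raise ValueError("hostname used more than once: {}".format(name))
        seen.add(name)
        lines.append("{}\t{}".format(ip, name))
    return "\n".join(lines) + "\n"


# Example addresses copied from this page; replace them with your own.
NODES = [
    ("192.168.1.6", "ub0"),
    ("192.168.1.7", "ub1"),
    ("192.168.1.8", "ub2"),
    ("192.168.1.9", "ub3"),
]

print(render_hosts(NODES), end="")
```

The output can be reviewed and then pasted into /etc/hosts on each node.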

3. Adding the Cluster User

Now we'll make a new user that will be our cluster user.  This user will have the same name, password, and user ID on every node.  I will call my user "beo", for Beowulf.  We also need to specify the user ID.  Pick a number between 900 and 999, so that the user doesn't show up in the usual GUI login screen.

Take note that I will write which node the command needs to be run on, followed by a colon, before writing the actual command.

allnodes: sudo adduser beo --uid 999

4. Sharing Your Home Directory With NFS

  • Now we need to set up nfs on the master node so that you can share a folder for programs and whatnot.  This is so you don't have to install a lot of things on every single node, or put a script on every single computer.  As you can imagine that would be quite tedious, especially if you have many nodes.
  • Install nfs-kernel-server on the master node

masternode: sudo apt-get install nfs-kernel-server

  • And on the children nodes install nfs-common

childnode: sudo apt-get install nfs-common

  • We will need to indicate which folder we want to share in our exports file.  Edit it with nano

masternode: sudo nano /etc/exports

/home/beo *(rw,sync,no_subtree_check)

Add the above line to the bottom of /etc/exports and restart the server

masternode: sudo service nfs-kernel-server restart
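One thing that is easy to get wrong in /etc/exports is whitespace.  The options in parentheses are comma-separated, and as far as I know the NFS tools do not accept spaces inside that list.  The Python sketch below formats the line and flags an option list that contains whitespace.  Both function names are hypothetical helpers for this example, not part of any NFS tool.

```python
def export_line(path, client="*", options=("rw", "sync", "no_subtree_check")):
    """Format one /etc/exports entry; options are joined with no spaces."""
    return "{} {}({})".format(path, client, ",".join(options))


def has_space_in_options(line):
    """Return True if the parenthesised option list contains whitespace."""
    start, end = line.find("("), line.find(")")
    if start == -1 or end == -1:
        return False
    return any(ch.isspace() for ch in line[start:end])


print(export_line("/home/beo"))
```

If has_space_in_options returns True for a line in your exports file, that line is a likely suspect when the share refuses to start.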

  • Now here, the York article discusses running sudo ufw allow from <ipaddress>, but I didn't have to do this.  If, after you run the next few steps, you find that your shared folder is not showing up on your other nodes, you may need to check out the York article I mentioned near the beginning of this page.

  • Now we need to edit the /etc/fstab file on the child nodes (nfs-common should already be installed on them from the step above).  This tells each child node where to mount the shared folder coming from the master node.

childnode: sudo nano /etc/fstab

Add this line to the file.

ub0:/home/beo /home/beo nfs

  • Now, when a child node boots it should automatically mount the home directory from the master node onto its own /home/beo.  To mount it right away without rebooting, run sudo mount -a on the child node.  Check that it worked with

childnode: ls /home/beo/

This should mirror what is on the master node
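Looking at the folder with ls works, but an empty folder can look the same whether or not the share is mounted.  As a small sketch, the Python below reads text in the format of /proc/mounts and reports whether /home/beo is an NFS mount.  The function names are made up for this example, and the check only looks at the filesystem type field.

```python
def nfs_mounts(mounts_text):
    """Return {mount_point: source} for NFS entries in /proc/mounts-style text."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: source, mount point, filesystem type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            found[fields[1]] = fields[0]
    return found


def is_shared_home_mounted(mounts_text, mount_point="/home/beo"):
    """Return True if mount_point appears as an NFS mount."""
    return mount_point in nfs_mounts(mounts_text)
```

On a child node you could call it with the contents of /proc/mounts, for example is_shared_home_mounted(open("/proc/mounts").read()).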

5. Passwordless SSH

To get communication working smoothly between the nodes, we're going to want to set up passwordless ssh.

  • Get on master node
  • Change into your cluster user (beo)

masternode: su beo

masternode: ssh-keygen

  • When asked for a passphrase, do not enter one.  Just leave the field blank, so that it will be "passwordless".
  • Once that finishes, run the command

masternode: ssh-copy-id localhost

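  • Because /home/beo is shared over NFS, the key you just authorized is visible on every node, so you should now be able to log into the other nodes in your cluster through ssh without a password.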

masternode: ssh ub1
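To check every node in one go instead of logging into each by hand, something like the following Python sketch could be run as the cluster user on the master node.  It uses the standard ssh options BatchMode=yes (fail instead of asking for a password) and ConnectTimeout.  The function names are hypothetical, and the node names are the ones used on this page.

```python
import subprocess


def ssh_check_command(node):
    """Build an ssh command that fails rather than prompting for a password."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
            node, "hostname"]


def check_nodes(nodes, run=subprocess.call):
    """Return {node: True/False}; True means ssh exited with status 0."""
    return {node: run(ssh_check_command(node)) == 0 for node in nodes}
```

Calling check_nodes(["ub1", "ub2", "ub3"]) returns a dictionary showing which nodes accepted the key.  The run parameter is only there so the function can be tried out without a real cluster.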

If you want to change the default port that is used for ssh, we have to make some changes to config files.  Unfortunately we have to do this on all nodes.

allnodes: sudo nano /etc/ssh/ssh_config

Port xxxxx

allnodes: sudo nano /etc/ssh/sshd_config

Port xxxxx
PermitRootLogin no

Note that there is already a line that says "PermitRootLogin".  Replace that line with the above.

6. MPICH

As far as I can tell, there are two main options for setting up a message passing interface on your cluster: MPICH and Open MPI.  I am using MPICH.  Before installing mpich2, we might want to install some other software.

beo@ub0:~$ sudo apt-get install build-essential gfortran gfortran-multilib autoconf

Now install MPICH2.  There are two ways to do this: the easy way, and the way that worked for me.  Here's the easy way.

beo@ub0:~$ sudo apt-get install mpich2

7. Process Manager

Once mpich2 is installed we need to set up the machine file so that mpich2 knows how many processes to use on each node.  We can make this file somewhere in the shared directory.

beo@ub0:~$ sudo nano /home/beo/machinefile

ub0:4
ub1:2
ub2:2
ub3:4

Each line is a node name and the number of processes we want to run on that node, separated by a colon.
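To see how many processes the machinefile allows in total, here is a short Python sketch that parses the host:count format shown above and adds up the counts.  The function name parse_machinefile is made up for this example, and treating a line with no count as one process is my assumption about the default.

```python
def parse_machinefile(text):
    """Return a list of (hostname, process_count) from 'host:count' lines."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        host, _, count = line.partition(":")
        # Assumption: a line with no ":count" part means one process.
        entries.append((host.strip(), int(count) if count else 1))
    return entries


# Same contents as the example machinefile on this page.
MACHINEFILE = """ub0:4
ub1:2
ub2:2
ub3:4
"""

entries = parse_machinefile(MACHINEFILE)
total = sum(count for _, count in entries)
print("total process slots:", total)  # prints: total process slots: 12
```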

Great!  Now everything is set up and should be working, theoretically.  Try to test it with this helloworld program from https://help.ubuntu.com/community/MpichCluster.  Place the program in your home directory as "mpi_hello.c"

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}

You need to compile it and then run it with...

beo@ub0:~$ mpicc mpi_hello.c -o mpi_hello

beo@ub0:~$ mpiexec -n 8 -f /home/beo/machinefile /home/beo/mpi_hello

Here -n is the number of processes to start, and -f is the path to the machinefile.

Installing mpi4py

If you're like me and you really like using python, you might want to install mpi4py onto your cluster.  This was easy enough for me.  I just used pip to install the module.  

  • Don't use sudo apt-get to install mpi4py.  Apparently that has given people problems when being used with mpich2.

beo@ub0:~$ pip install mpi4py --user

  • Notice that I added a --user flag to my pip install.  Without it, pip installs the module somewhere under /usr/, but we want it in the shared home directory (with --user it goes under ~/.local).

Troubleshooting Errors

During this process I had quite a few errors.  I'll try to go through the things that happened to me and tell you how I fixed them.

Installing Ubuntu

  • Got an error right off the bat.  "Error loading cdrom".
    • Moved usb to a new port and hit "retry" and it detected everything fine.  Strange Error.

Installing mpich

  • Like I said before, I had problems with using the apt-get install method to install mpich2.
  • "cannot find hydraproxy file"
    • This file is installed when you install mpich2.  What I had to do is add its location to my path variable
    • When all was said and done, my .bashrc file had these lines added to it
    • nano ~/.bashrc
    • export PATH=/home/beo/mpich-install/bin:/home/beo/mpich-install/lib:$PATH
      export LD_LIBRARY_PATH=/usr/lib:/home/beo/mpich-install/lib:$LD_LIBRARY_PATH
      export LD_LIBRARY_PATH=/home/beo/.local/python2.7/site-packages:$LD_LIBRARY_PATH
      export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
      export HYDRA_HOST_FILE=/home/beo/machinefile
      export PYTHONPATH=/home/beo/.local/python2.7/site-packages:$PYTHONPATH
      export PYTHONPATH=/home/beo/mpich-install/lib:$PYTHONPATH
    • The HYDRA_HOST_FILE variable didn't actually do what I was hoping it would do, so that line may be unnecessary.
  • Some error about not being able to find libmpich.so files
    • sudo apt-get install libcr-dev
    • I actually ran that on each node
  • Problems with getting nodes to communicate
    • This was a strange error, but when I was running python programs on the cluster and trying to get the nodes to exchange information, it just refused to work because it couldn't find libmpich.so.10
    • For this I actually physically moved the file to somewhere on the path variable.  Also, I couldn't find a file called libmpich.so.10, but I had one called libmpich.so.10.0.4.  So here's what I did
      • Move libmpich.so.10.0.4 from /usr/lib/x86_64-linux-gnu to /home/beo/mpich-install/lib then
    • Create a symbolic link while in the mpich-install/lib directory
      • sudo ln -s libmpich.so.10.0.4 libmpich.so.10
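Several of the problems above come down to a program not being on the PATH.  A quick way to check is sketched below in Python 3, using shutil.which from the standard library.  The names mpiexec and mpicc appear on this page.  hydra_pmi_proxy is my guess at the real name of the "hydraproxy" file in the error message, so treat it as an assumption.

```python
import os
import shutil


def missing_programs(names, path=None):
    """Return the names that cannot be found as executables on the PATH."""
    return [name for name in names if shutil.which(name, path=path) is None]


# hydra_pmi_proxy is an assumed name for the "hydraproxy" file; adjust as needed.
PROGRAMS = ["mpiexec", "mpicc", "hydra_pmi_proxy"]
print("not on PATH:", missing_programs(PROGRAMS, os.environ.get("PATH")))
```

Running this on each node, as the cluster user, shows whether the .bashrc changes were picked up there too.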

Python Packages "not installed"

  • Errors where after installing a python package via pip install <package> --user, other nodes could not import the module.
    • Saw that the nodes did not have read/write permissions in the shared python package folder.  Changed permissions with
      • sudo chmod -R 755 /home/beo/.local/
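To find out which files are causing this kind of permission problem before changing anything, a sketch like the following may help.  It walks a directory and lists everything that is missing the read permission for "other" users.  The function name is hypothetical, and checking only that one permission bit is a simplification of what the other nodes actually need.

```python
import os
import stat


def unreadable_by_others(root):
    """Return sorted paths under root that lack the world-read permission bit."""
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue  # skip symbolic links; their own mode bits are not what matters
            if not mode & stat.S_IROTH:
                flagged.append(path)
    return sorted(flagged)
```

For example, unreadable_by_others("/home/beo/.local") returns the paths that other users cannot read.  An empty list means this particular check found nothing.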