Installation to use AWS
This section assumes that you have installed neuropointillist already and are planning to run it on AWS (in the cloud). The benefit of doing this is that AWS steeply discounts unused compute capacity (this is called the “spot” market). You can use this to run jobs that would otherwise take too long, or simply to make your jobs finish more quickly.
In this section, I refer to things that you should do or should install on your “development computer”. This is the computer you are using to write and test your neuropointillist code, in contrast to any virtual cloud resources you might use to run your code.
To make sure that everything is working correctly, set up neuropoint to run locally or on a cluster before moving on to AWS. This will allow you to test your Docker container and workflow.
Get an AWS account if you do not have one
You will need to create an AWS account if you do not have one. If you are part of a participating educational institution and have not joined AWS Educate, you might consider doing that to get some credits to try things out.
Install Docker if not installed
To use AWS in the way that neuropoint does, you need to package up your R environment in a Docker container. You will need to install Docker on your development computer.
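Once Docker is installed, a quick sanity check (these are standard Docker commands; the exact installation steps depend on your platform) is to confirm that the daemon is running and can pull and run an image:
docker --version
docker run --rm hello-world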
Install the AWS CLI and AWS SDK for Python
You will need to install some tools that allow you to control AWS resources from your development computer using the shell (AWS CLI) and using Python (AWS SDK for Python).
First, install the AWS CLI on your platform. You will also need to configure it.
Then install the AWS SDK for Python (boto3).
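As a sketch of what this looks like (assuming you install the SDK with pip; your Python setup may differ), configure your credentials and verify that they work:
aws configure                  # prompts for access key, secret key, default region, and output format
pip install boto3              # the AWS SDK for Python
aws sts get-caller-identity    # verifies that the CLI can authenticate with your credentials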
Install Nextflow
Nextflow is an amazing workflow language that I discovered while trying to get neuropoint working with AWS Batch. I moved from GNU make reluctantly, but Nextflow is good enough, elegant enough, and powerful enough for me to fall in love. Trust me. Install Nextflow. Make sure your Java release is not too new to run Nextflow.
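For example, the standard Nextflow installer can be run from the shell. Check your Java version first; the installer downloads the nextflow launcher into the current directory, which you can then move somewhere on your PATH:
java -version
curl -s https://get.nextflow.io | bash
./nextflow -version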
Configure AWS Batch to Run Neuropointillist
There are four steps to using AWS Batch to run neuropointillist jobs. Scripts to execute these steps are in the directory aws. Change directory to aws:
cd aws
Step 1. Create a Docker container
The first step is to create a Docker container that has the R packages and other software that you need. The commands to do this are in the dockerbuild directory. In particular, look at the file docker_install.Rscript and see if there are R packages you need that are not installed. In general, you don’t want to make the container enormous.
It is also possible that you might need to call some Linux commands that are not installed by default. The base image is based on Amazon Linux, which uses the yum package manager. You can edit the Dockerfile to install other Linux packages that you need to run your application.
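For example, if your code needed the jq utility (a hypothetical example package), you could add a line like this to the Dockerfile, cleaning the yum cache afterwards to keep the image small:
RUN yum install -y jq && yum clean all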
To build your container, from within the aws directory, type
00_dockerbuild
This will create a container called neuropointillist-nextflow.
You should test this container before trying to run it on AWS to make sure that there are no unexpected problems. To do this, assuming that you have run neuropointillist already and generated a directory with a Makefile, try the following.
Modify the line that begins with npointrun to begin with the following prefix.
docker run -v $(pwd):/containerdat -w /containerdat -it neuropointillist-nextflow
This will run npointrun in the container that you have just created. If there are problems finding R packages or programs, modify the Dockerfile or the docker_install.Rscript file as described above, rerun 00_dockerbuild, and try again.
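For example, if you copy the modified line out of the Makefile and run it directly from a shell in that directory, it would look something like this (the trailing arguments are placeholders for whatever arguments your generated Makefile already passes to npointrun):
docker run -v $(pwd):/containerdat -w /containerdat -it neuropointillist-nextflow npointrun <original npointrun arguments>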
If you have difficulty finding files, make sure that they are all located in the directory with your Makefile. This matters because all the files you need will eventually be copied to AWS storage.
Step 2. Create an AWS ECR Registry (Optional)
You do not need to do this step if you prefer to use a different registry that can be accessed by AWS Batch (for example, Docker Hub or Quay). You just need to know the path to your container.
There is a charge for storing an image in AWS ECR, depending on its size. You can determine the size of your Docker image with the command docker images. As an example, my image costs approximately 18 cents per month. Note that if you upload multiple versions of the image after making corrections, you should delete the old ones, or you can set rules in the AWS console to delete them automatically.
However, if you wish to use AWS ECR, run the following command:
01_createECRRegistry
This will print out the path to your image and a link to the AWS console, where you can see it and do any housekeeping to delete old images.
Note the path to your image for future configuration.
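ECR image paths follow a standard format, so the path will look something like this (the account ID, region, and tag below are placeholders for your own):
123456789012.dkr.ecr.us-east-1.amazonaws.com/neuropointillist-nextflow:latest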
Step 3. Create an AWS Batch Queue
You need to create and configure a structure called an AWS Batch Queue that uses spot pricing to execute jobs as cheaply as possible. The drawback is that jobs might be killed when spot capacity is reclaimed. This is why Nextflow is so critical to the process; it can restart jobs that need to be rerun.
Data will be staged in an Amazon S3 bucket. Before you can create the AWS Batch Queue, create a bucket either using the console or the CLI. Note that the bucket name needs to be globally unique, so common names like mybucket will be taken.
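For example, to create a bucket from the CLI (the bucket name and region here are placeholders; choose your own globally unique name):
aws s3 mb s3://my-npoint-data --region us-east-1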
To create the AWS Batch Queue, provide the bucket name as follows.
02_createNpointBatchQueue mybucket
Cleaning Up When Done
After you have finished running your npoint jobs, you will want to delete any resources that you have used on AWS.
Cleanup Step 1. Delete the AWS Batch Queue
You can remove the AWS Batch queue with the command
03_deleteNpointBatchQueue
This will remove any jobs and resources you have provisioned, but it will not remove your S3 bucket storage or your ECR registry. Note that there is no charge for the AWS Batch queue unless you are running things. However, there is a charge for the S3 bucket storage and for the storage of your image on the ECR Registry.
This command will print commands for deleting these resources, or you can refer to the commands below.
Cleanup Step 2. Delete the ECR Registry
You can remove your ECR registry with the command
aws ecr delete-repository --force --repository-name neuropointillist-nextflow
Note that there is a small charge to store containers in ECR, so when you are done you will want to delete the registry, saving a copy of your container locally first if you need it for archival purposes.
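If you want to keep an archival copy of the container before deleting the registry, one standard way (using docker save) is to write it to a compressed tar file on local disk:
docker save neuropointillist-nextflow | gzip > neuropointillist-nextflow.tar.gz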
Cleanup Step 3. Remove your S3 Bucket (if desired)
You can copy data from your S3 bucket (mybucket) to local disk as follows:
aws s3 sync s3://mybucket mybucket-localcopy
You can delete the bucket and all of its contents permanently:
aws s3 rb s3://mybucket --force