# Measure the training throughput on V100 or RTX6000 GPUs
For V100, you need access to GPU resources on a major GPU cloud platform such as Amazon EC2. We use Amazon EC2 as the example below.
For RTX6000, you need to have access to a local machine that has at least one RTX6000 GPU.
If you have never used Amazon EC2 before, you should request an account from https://aws.amazon.com/ec2 to get access to the GPU VMs.
After getting access, navigate to the “EC2 Dashboard” -> “Launch instance” pane to create a VM with V100 GPUs:
- Instance type: p3.2xlarge
- AMI: NVIDIA Deep Learning AMI v20.06.3
- Note: please make sure to turn off your VM instances as soon as you finish the experiments, as the instances are quite costly.
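If you prefer the command line to the web console, the instance can also be stopped with the AWS CLI. The sketch below assumes the AWS CLI is installed and configured with your credentials; `stop_instance` is a hypothetical helper name and the instance ID in the example is a placeholder.

```shell
# Hypothetical helper: stop an EC2 instance by ID using the AWS CLI.
# Assumes `aws` is installed and configured for your account and region.
stop_instance() {
  local instance_id="${1:-}"
  if [ -z "${instance_id}" ]; then
    echo "usage: stop_instance <instance-id>" >&2
    return 1
  fi
  aws ec2 stop-instances --instance-ids "${instance_id}"
}

# Example (replace the placeholder with your own instance ID):
# stop_instance i-0123456789abcdef0
```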
If you have a machine with at least one RTX6000, you need to configure the NVIDIA driver and nvidia-docker properly. Please refer to the steps in https://github.com/NVIDIA/nvidia-docker#getting-started on how to set up a fresh machine for our experiments.
The software specification of the RTX6000 machine used in our experiments is:
- NVIDIA GPU Driver Version: 450.66
- OS: Ubuntu 18.04.5 LTS
- Docker Version: 20.10.2, build 2291f61
- nvidia-docker2 Version: 2.5.0-1
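To compare a local machine against the versions listed above, something like the following sketch may help. `print_versions` is a hypothetical helper name; it only probes tools that are found on `PATH`, and it assumes a Debian/Ubuntu system for the `dpkg-query` lookup.

```shell
# Hypothetical helper: print the locally installed versions of the
# driver, OS, Docker, and nvidia-docker2 for manual comparison.
print_versions() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  else
    echo "NVIDIA driver: nvidia-smi not found"
  fi
  if command -v lsb_release >/dev/null 2>&1; then
    echo "OS: $(lsb_release -ds)"
  else
    echo "OS: lsb_release not found"
  fi
  if command -v docker >/dev/null 2>&1; then
    echo "Docker: $(docker --version)"
  else
    echo "Docker: docker not found"
  fi
  if command -v dpkg-query >/dev/null 2>&1; then
    # dpkg-query exits non-zero when the package is not installed.
    echo "nvidia-docker2: $(dpkg-query -W -f='${Version}' nvidia-docker2 2>/dev/null || echo 'not installed')"
  else
    echo "nvidia-docker2: dpkg-query not found"
  fi
}

print_versions
```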
First, clone the repo and navigate to the project.
# clone the code base
git clone https://github.com/UofT-EcoSystem/hfta.git
cd hfta
git checkout releases/mlsys21
We require two installation files (`.deb`), for Nsight Systems and DCGM, to be pre-downloaded in order to build the docker image:
- Nsight Systems 2020.3.1.72, downloaded under `third_party/nsys/nsys_cli_2020.3.1.72.deb`
- DCGM 2.0.10, downloaded under `third_party/dcgm/datacenter-gpu-manager_2.0.10_amd64.deb`
To download the `.deb` files, you need to register an NVIDIA developer account via https://developer.nvidia.com/login. After that, you can download the `.deb` files.
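Before building the image, it can be worth confirming that both installers are in the expected locations. The sketch below uses a hypothetical helper, `check_debs`, which takes the repository root as an optional argument (defaulting to the current directory) and returns non-zero if either file is missing.

```shell
# Hypothetical helper: verify that the two pre-downloaded .deb files
# exist under the repository root (first argument, default ".").
check_debs() {
  local repo_root="${1:-.}"
  local missing=0
  local f
  for f in \
    third_party/nsys/nsys_cli_2020.3.1.72.deb \
    third_party/dcgm/datacenter-gpu-manager_2.0.10_amd64.deb; do
    if [ -f "${repo_root}/${f}" ]; then
      echo "found:   ${f}"
    else
      echo "missing: ${f}"
      missing=1
    fi
  done
  return "${missing}"
}

# Example, run from the repository root:
# check_debs . && bash docker/build.sh native1.6-cu10.2
```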
Follow the commands below to prepare and launch the docker image. Building the image takes approximately 10 minutes.
# build the image, select native1.6-cu10.2 for V100 and RTX6000
# this will take about 10 mins to complete
bash docker/build.sh native1.6-cu10.2
If you do not wish to build the docker image from scratch, you can reuse the prebuilt docker image that we provide:
docker pull wangshangsam/hfta:mlsys21_native1.6-cu10.2
docker tag wangshangsam/hfta:mlsys21_native1.6-cu10.2 hfta:dev
# launch the image
# you will need to provide a placeholder mount point for the data directory
# default is under ${HOME}/datasets
ubuntu@ip-xxxxxxxx:~/hfta$ bash docker/launch.sh <optional: data directory mount point> <optional: image tag>
root@c7ee88f34a48:/home/ubuntu/hfta# pip install -e .
... # additional outputs not shown
Installing collected packages: pandas, hfta
Attempting uninstall: pandas
Found existing installation: pandas 1.1.3
Uninstalling pandas-1.1.3:
Successfully uninstalled pandas-1.1.3
Running setup.py develop for hfta
Successfully installed hfta pandas-1.1.5
cd /home/ubuntu/hfta
source datasets/prepare_datasets.sh
# Download the dataset by calling the helper functions defined in `prepare_datasets.sh`.
# For example, run `prepare_bert` for the BERT experiment.
prepare_bert
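`benchmarks/workflow.sh` takes up to three arguments: the target device (e.g. `cuda`), the device model (e.g. `v100`), and an optional output root directory (default: `benchmarks`). The first two arguments are needed in every run. Please refer to the script for the usage of the arguments.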
# The command below sets the target device to CUDA and the device model to V100.
# The output directory defaults to ./benchmarks, but you can specify a different one.
source benchmarks/workflow.sh cuda v100 <optional: output root dir>
# For CUDA, RTX6000
source benchmarks/workflow.sh cuda rtx6000 <optional: output root dir>
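If the same commands are reused on both kinds of machine, the device-model argument could be derived from the GPU name instead of typed by hand. This is only a sketch: `device_model_from_name` is a hypothetical helper, and the name patterns assume that `nvidia-smi` reports names such as "Tesla V100-SXM2-16GB" or "Quadro RTX 6000", which may differ on your system.

```shell
# Hypothetical helper: map a GPU product name to the device-model
# argument expected by benchmarks/workflow.sh (v100 or rtx6000).
# Returns non-zero for names that match neither pattern.
device_model_from_name() {
  case "${1:-}" in
    *V100*) echo "v100" ;;
    *"RTX 6000"*|*RTX6000*) echo "rtx6000" ;;
    *) return 1 ;;
  esac
}

# Example, run inside the container from the repository root:
# gpu_name="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n 1)"
# source benchmarks/workflow.sh cuda "$(device_model_from_name "${gpu_name}")"
```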
The workflow functions for each experiment are defined in the `_workflow_<modelname>.sh` files under `<repo root>/benchmarks`.
# The functions are generally named `workflow_<modelname>`.
# For example, to run the BERT experiment, run
workflow_bert
# For the partially fused ResNet experiment, run
workflow_resnet_partially_fused
# For the ResNet convergence experiment, run
workflow_convergence
After the workflow experiment is done, run the bash functions below to process the output and plot the speedup curves. The plot functions are also defined in the `_workflow_<modelname>.sh` files.
# In general, plot functions are named plot_<exp name>.
# For example, for the BERT experiment, run
plot_bert
# For the partially fused ResNet experiment, run
plot_resnet_partially_fused
# For the ResNet convergence experiment, run
plot_resnet_convergence
Finally, you should be able to see the `.csv` and `.png` files under the output directory (`./benchmarks`, or the directory you specified above).
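A quick way to confirm that results were produced is to list the generated files. The sketch below uses a hypothetical helper, `list_outputs`, which takes the output directory as an optional argument and defaults to `./benchmarks`; the layout of any subdirectories is not assumed, so it searches recursively.

```shell
# Hypothetical helper: list all .csv and .png files under the output
# directory (first argument, default ./benchmarks), sorted by path.
list_outputs() {
  local outdir="${1:-./benchmarks}"
  find "${outdir}" -type f \( -name '*.csv' -o -name '*.png' \) | sort
}

# Example:
# list_outputs ./benchmarks
```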