{"id":8169,"date":"2021-04-05T00:17:15","date_gmt":"2021-04-05T00:17:15","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/05\/on-demand-spark-clusters-with-gpu-acceleration\/"},"modified":"2021-04-05T00:17:15","modified_gmt":"2021-04-05T00:17:15","slug":"on-demand-spark-clusters-with-gpu-acceleration","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/05\/on-demand-spark-clusters-with-gpu-acceleration\/","title":{"rendered":"On-Demand Spark clusters with GPU acceleration"},"content":{"rendered":"<div>\n<p>Apache Spark has become the de-facto standard for processing large amounts of stationary and streaming data in a distributed fashion. The addition of the MLlib library, consisting of common learning algorithms and utilities, opened up Spark for a wide range of machine learning tasks and paved the way for running complex machine learning workflows on top of Apache Spark clusters. Some of the key benefits of using Spark for machine learning include:\u00a0<\/p>\n<ul>\n<li><strong>Distributed Learning<\/strong> \u2013 Parallelize compute-heavy workloads such as distributed training or hyper-parameter tuning\u00a0<\/li>\n<li><strong>Interactive Exploratory Analysis<\/strong> \u2013 Efficiently load large data sets in a distributed manner. Explore and understand the data using a familiar interface with Spark SQL\u00a0<\/li>\n<li><strong>Featurization and Transformation<\/strong> \u2013 Sample, aggregate, and re-label large data sets.<\/li>\n<\/ul>\n<p>At the same time, the use of Spark by Data Scientists presents its own set of challenges:\u00a0<\/p>\n<ul>\n<li><strong>Complexity<\/strong> \u2013 Apache Spark uses a layered architecture that mandates a master node, a cluster manager, and a set of worker nodes. Quite often Spark is not deployed in isolation but sits on top of a virtualized infrastructure (e.g. virtual machines or OS-level virtualization). 
Maintaining the cluster and the underlying infrastructure configuration can be a complex and time-consuming task\u00a0<\/li>\n<li><strong>Lack of GPU acceleration <\/strong>\u2013 Complex machine learning workloads, especially the ones involving Deep Learning, benefit from GPU architectures that are well adapted for vector and matrix operations. The executor-level, CPU-centric parallelization that Spark provides is typically no match for the large and fast registers and optimized bandwidth of the GPU architecture\u00a0<\/li>\n<li><strong>Cost<\/strong> \u2013 Keeping a Spark cluster up and running while using it only intermittently can quickly become a costly exercise (especially if Spark is running in the cloud). Quite often Spark is only needed for a fraction of the ML pipeline (e.g. data pre-processing), as the result set it produces fits comfortably in something like a cuDF DataFrame<\/li>\n<\/ul>\n<p>To address the challenges associated with complexity and cost, Domino offers the ability to dynamically provision and orchestrate a Spark cluster directly on the infrastructure backing the Domino instance. This allows Domino users to get quick access to Spark without having to rely on their IT team to create and manage one for them. The Spark workloads are fully containerized on the Domino Kubernetes cluster, and users can access Spark interactively through a Domino workspace (e.g. JupyterLab) or in batch mode through a Domino job or spark-submit. 
Moreover, because Domino can provision and de-provision clusters automatically, users can spin up Spark clusters on-demand, use them as part of a complex pipeline, and tear them down once the stage they were needed for is complete.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/on-demand-spark.png\" alt=\"Spark driver interacting with worker nodes within Domino\" class=\"wp-image-7448\" width=\"523\" height=\"432\"><\/figure>\n<\/div>\n<p>To solve the need for GPU-accelerated Spark, Domino has teamed up with Nvidia. The Domino platform has been capable of leveraging GPU-accelerated hardware (both in the cloud and on-premises) for quite some time, and thanks to its underlying Kubernetes architecture can natively deploy and use <a href=\"https:\/\/www.nvidia.com\/en-gb\/gpu-cloud\/containers\/\">NGC containers<\/a> out of the box. This, for example, enables Data Scientists to natively use <a href=\"https:\/\/developer.nvidia.com\/rapids\">NVIDIA RAPIDS<\/a>\u00a0 \u2013 a suite of software libraries, built on CUDA-X AI, that gives them the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. In addition, Domino supports the integration of <a href=\"https:\/\/nvidia.github.io\/spark-rapids\/\">RAPIDS Accelerator for Apache Spark<\/a>, which combines the power of the RAPIDS cuDF library and the scale of the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on <a href=\"https:\/\/github.com\/openucx\/ucx\/\">UCX<\/a> that can be configured to leverage GPU-to-GPU communication and RDMA capabilities. 
These capabilities allow Domino to provide streamlined access to GPU-accelerated ML\/DL frameworks and GPU-accelerated Apache Spark components through a unified and Data Scientist-friendly UI.\u00a0<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/nvidia-spark-3.png\" alt=\"Showing GPU accelerated architecture - Spark components with Nvidia SQL\/DF plugin and accelerated ML\/DL frameworks on top of Spark 3.0 core\" class=\"wp-image-7452\" width=\"521\" height=\"435\"><\/figure>\n<\/div>\n<h3>Configuring Spark clusters with RAPIDS Accelerator in Domino\u00a0<\/h3>\n<p>By default, Domino does not come with a Spark-compatible Compute Environment (Docker image), so our first task is to create one. Creating a new Compute Environment is a well-documented process, so feel free to check the <a href=\"https:\/\/docs.dominodatalab.com\/en\/4.3\/reference\/environments\/Environment_management.html\">official documentation<\/a> if you need a refresher.\u00a0\u00a0<\/p>\n<p>The key steps are to give the new environment a name (e.g. Spark 3.0.0 GPU) and use <em>bitnami\/spark:2.4.6<\/em> as the base image. Domino\u2019s on-demand Spark functionality has been developed and tested using open-source Spark images from Bitnami (<a href=\"https:\/\/hub.docker.com\/r\/bitnami\/spark\">this is why<\/a>, in case you are interested). 
However, you could also use the <em>bitnami\/spark:3.0.0<\/em> image; since we replace the Spark installation inside it anyway, the exact base image version doesn\u2019t really matter.\u00a0\u00a0<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/new-environment-912x1024.png\" alt=\"Screenshot of the Domino New Environment UI\" class=\"wp-image-7453\" width=\"456\" height=\"512\"><\/figure>\n<\/div>\n
<p>Next, we need to edit the Compute Environment\u2019s Dockerfile to bring Spark up to 3.0.0, add the NVIDIA CUDA drivers, the RAPIDS accelerator, and the GPU discovery script. Adding the code below to the Dockerfile instructions triggers a compute environment rebuild.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: bash; gutter: false; title: ; notranslate\" title=\"\">\n# SPARK 3.0.0 GPU ENVIRONMENT DOCKERFILE\nUSER root\n\n#\n#  SPARK AND HADOOP\n#\n\nRUN apt-get update &amp;&amp; apt-get install -y wget &amp;&amp; rm -r \/var\/lib\/apt\/lists \/var\/cache\/apt\/archives\n\nENV HADOOP_VERSION=3.2.1\nENV HADOOP_HOME=\/opt\/hadoop\nENV HADOOP_CONF_DIR=\/opt\/hadoop\/etc\/hadoop\nENV SPARK_VERSION=3.0.0\nENV SPARK_HOME=\/opt\/bitnami\/spark\n\n### Remove the pre-installed Spark since it is pre-bundled with hadoop but preserve the python env\nWORKDIR \/opt\/bitnami\nRUN rm -rf ${SPARK_HOME}\n\n### Install the desired Hadoop-free Spark distribution\nRUN wget -q https:\/\/archive.apache.org\/dist\/spark\/spark-${SPARK_VERSION}\/spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} &amp;&amp; \\\n    chmod -R 777 ${SPARK_HOME}\/conf\n\n### Install the desired Hadoop libraries\nRUN wget -q http:\/\/archive.apache.org\/dist\/hadoop\/common\/hadoop-${HADOOP_VERSION}\/hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    tar -xf hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    rm hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}\n\n### Setup the Hadoop libraries classpath\nRUN echo 'export SPARK_DIST_CLASSPATH=\"$(hadoop classpath):${HADOOP_HOME}\/share\/hadoop\/tools\/lib\/*:\/opt\/sparkRapidsPlugin\"' &gt;&gt; ${SPARK_HOME}\/conf\/spark-env.sh\nENV LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:$HADOOP_HOME\/lib\/native\"\n\n### This is important to maintain compatibility with Bitnami\nWORKDIR \/\nRUN \/opt\/bitnami\/scripts\/spark\/postunpack.sh\nWORKDIR ${SPARK_HOME}\n\n#\n# NVIDIA CUDA\n#\n\nRUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \\\n    gnupg2 curl ca-certificates &amp;&amp; \\\n    curl -fsSL https:\/\/developer.download.nvidia.com\/compute\/cuda\/repos\/ubuntu1804\/x86_64\/7fa2af80.pub | apt-key add - &amp;&amp; \\\n    echo \"deb https:\/\/developer.download.nvidia.com\/compute\/cuda\/repos\/ubuntu1804\/x86_64 \/\" &gt; \/etc\/apt\/sources.list.d\/cuda.list &amp;&amp; \\\n    echo \"deb https:\/\/developer.download.nvidia.com\/compute\/machine-learning\/repos\/ubuntu1804\/x86_64 \/\" &gt; \/etc\/apt\/sources.list.d\/nvidia-ml.list &amp;&amp; \\\n    apt-get purge --autoremove -y curl &amp;&amp; \\\n    rm -rf \/var\/lib\/apt\/lists\/*\n\nENV CUDA_VERSION 10.1.243\n\nENV CUDA_PKG_VERSION 10-1=$CUDA_VERSION-1\n\n# For libraries in the cuda-compat-* package: https:\/\/docs.nvidia.com\/cuda\/eula\/index.html#attachment-a\nRUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \\\n        cuda-cudart-$CUDA_PKG_VERSION \\\n        cuda-compat-10-1 &amp;&amp; \\\n    ln -s cuda-10.1 \/usr\/local\/cuda &amp;&amp; \\\n    rm -rf \/var\/lib\/apt\/lists\/*\n\n# Required for nvidia-docker v1\nRUN echo \"\/usr\/local\/nvidia\/lib\" &gt;&gt; \/etc\/ld.so.conf.d\/nvidia.conf &amp;&amp; \\\n    echo \"\/usr\/local\/nvidia\/lib64\" &gt;&gt; \/etc\/ld.so.conf.d\/nvidia.conf\n\nENV PATH \/usr\/local\/nvidia\/bin:\/usr\/local\/cuda\/bin:${PATH}\nENV LD_LIBRARY_PATH \/usr\/local\/nvidia\/lib:\/usr\/local\/nvidia\/lib64\n\n# nvidia-container-runtime\nENV NVIDIA_VISIBLE_DEVICES all\nENV NVIDIA_DRIVER_CAPABILITIES compute,utility\nENV NVIDIA_REQUIRE_CUDA \"cuda&gt;=10.1 brand=tesla,driver&gt;=384,driver&lt;385 brand=tesla,driver&gt;=396,driver&lt;397 brand=tesla,driver&gt;=410,driver&lt;411\"\n\nENV NCCL_VERSION 2.4.8\n\nRUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \\\n    cuda-libraries-$CUDA_PKG_VERSION \\\n    cuda-nvtx-$CUDA_PKG_VERSION \\\n    libcublas10=10.2.1.243-1 \\\n    libnccl2=$NCCL_VERSION-1+cuda10.1 &amp;&amp; \\\n    apt-mark hold libnccl2 &amp;&amp; \\\n    rm -rf \/var\/lib\/apt\/lists\/*\n\nRUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \\\n        cuda-nvml-dev-$CUDA_PKG_VERSION \\\n        cuda-command-line-tools-$CUDA_PKG_VERSION \\\n        cuda-libraries-dev-$CUDA_PKG_VERSION \\\n        cuda-minimal-build-$CUDA_PKG_VERSION \\\n        libnccl-dev=$NCCL_VERSION-1+cuda10.1 \\\n        libcublas-dev=10.2.1.243-1 &amp;&amp; \\\n    rm -rf \/var\/lib\/apt\/lists\/*\n\nENV LIBRARY_PATH \/usr\/local\/cuda\/lib64\/stubs\n\nENV CUDNN_VERSION 7.6.5.32\nLABEL com.nvidia.cudnn.version=\"${CUDNN_VERSION}\"\n\nRUN apt-get update &amp;&amp; apt-get install -y --no-install-recommends \\\n    libcudnn7=$CUDNN_VERSION-1+cuda10.1 \\\n    libcudnn7-dev=$CUDNN_VERSION-1+cuda10.1 &amp;&amp; \\\n    apt-mark hold libcudnn7 &amp;&amp; \\\n    rm -rf \/var\/lib\/apt\/lists\/*\n\n#\n# GPU Discovery Script\n#\nENV SPARK_RAPIDS_DIR=\/opt\/sparkRapidsPlugin\nRUN wget -q -P $SPARK_RAPIDS_DIR https:\/\/raw.githubusercontent.com\/apache\/spark\/master\/examples\/src\/main\/scripts\/getGpusResources.sh\nRUN chmod +x $SPARK_RAPIDS_DIR\/getGpusResources.sh\nRUN echo 'export SPARK_WORKER_OPTS=\"-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=\/opt\/sparkRapidsPlugin\/getGpusResources.sh\"' &gt;&gt; ${SPARK_HOME}\/conf\/spark-env.sh\n\nENV PATH=\"$PATH:$HADOOP_HOME\/bin:$SPARK_HOME\/bin\"\nWORKDIR ${SPARK_HOME}\n\nRUN wget -q -P $SPARK_RAPIDS_DIR https:\/\/repo1.maven.org\/maven2\/com\/nvidia\/rapids-4-spark_2.12\/0.1.0\/rapids-4-spark_2.12-0.1.0.jar\nRUN wget -q -P $SPARK_RAPIDS_DIR https:\/\/repo1.maven.org\/maven2\/ai\/rapids\/cudf\/0.14\/cudf-0.14-cuda10-1.jar\nENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}\/cudf-0.14-cuda10-1.jar\nENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}\/rapids-4-spark_2.12-0.1.0.jar\n<\/pre>\n<\/div>\n
<p>We can verify that the new environment has been successfully built by inspecting the Revisions section and making sure that the active environment is the most recent one.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/environment-built.png\" alt=\"Showing Domino's Revisions tab with successfully built version of the GPU compute environment.\" class=\"wp-image-7454\" width=\"758\" height=\"311\"><\/figure>\n<\/div>\n
<p>Now that we have a Spark environment with the RAPIDS accelerator in place, we need to create a Workspace environment \u2013 an environment that will host the IDE that we\u2019ll use to interact with Spark.<\/p>\n<p>The process of creating a custom PySpark workspace environment is fully covered in the <a href=\"https:\/\/docs.dominodatalab.com\/en\/4.4\/reference\/spark\/on_demand_spark\/Configuring_prerequisites.html#pyspark-compute-environment-advanced-custom-hadoop-client-libraries\">Domino official documentation<\/a>. The process is similar to how we built the Spark environment above; the key differences are that we use a Domino base image (instead of Bitnami) and that we also need to configure pluggable workspaces tools. 
The latter enables access to the web-based tools inside the compute environment (e.g. JupyterLab).<\/p>\n<p>To build the workspace environment, we create a new Compute Environment (Spark 3.0.0 RAPIDS Workspace Py3.6) using dominodatalab\/base:Ubuntu18_DAD_Py3.6_R3.6_20200508 as the base image, and we add the following contents to the Dockerfile instructions section:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: bash; gutter: false; title: ; notranslate\" title=\"\">\n# SPARK 3.0.0 RAPIDS WORKSPACE DOCKERFILE\n\nRUN mkdir -p \/opt\/domino\n\n### Modify the Hadoop and Spark versions below as needed.\nENV HADOOP_VERSION=3.2.1\nENV HADOOP_HOME=\/opt\/domino\/hadoop\nENV HADOOP_CONF_DIR=\/opt\/domino\/hadoop\/etc\/hadoop\nENV SPARK_VERSION=3.0.0\nENV SPARK_HOME=\/opt\/domino\/spark\nENV PATH=\"$PATH:$SPARK_HOME\/bin:$HADOOP_HOME\/bin\"\n\n### Install the desired Hadoop-free Spark distribution\nRUN rm -rf ${SPARK_HOME} &amp;&amp; \\\n    wget -q https:\/\/archive.apache.org\/dist\/spark\/spark-${SPARK_VERSION}\/spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    tar -xf spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    rm spark-${SPARK_VERSION}-bin-without-hadoop.tgz &amp;&amp; \\\n    mv spark-${SPARK_VERSION}-bin-without-hadoop ${SPARK_HOME} &amp;&amp; \\\n    chmod -R 777 ${SPARK_HOME}\/conf\n\n### Install the desired Hadoop libraries\nRUN rm -rf ${HADOOP_HOME} &amp;&amp; \\\n    wget -q http:\/\/archive.apache.org\/dist\/hadoop\/common\/hadoop-${HADOOP_VERSION}\/hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    tar -xf hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    rm hadoop-${HADOOP_VERSION}.tar.gz &amp;&amp; \\\n    mv hadoop-${HADOOP_VERSION} ${HADOOP_HOME}\n\n### Setup the Hadoop libraries classpath and Spark related envars for proper init in Domino\nRUN echo \"export SPARK_HOME=${SPARK_HOME}\" &gt;&gt; \/home\/ubuntu\/.domino-defaults\nRUN echo \"export HADOOP_HOME=${HADOOP_HOME}\" &gt;&gt; \/home\/ubuntu\/.domino-defaults\nRUN echo \"export HADOOP_CONF_DIR=${HADOOP_CONF_DIR}\" &gt;&gt; \/home\/ubuntu\/.domino-defaults\nRUN echo \"export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${HADOOP_HOME}\/lib\/native\" &gt;&gt; \/home\/ubuntu\/.domino-defaults\nRUN echo \"export PATH=$PATH:${SPARK_HOME}\/bin:${HADOOP_HOME}\/bin\" &gt;&gt; \/home\/ubuntu\/.domino-defaults\nRUN echo \"export SPARK_DIST_CLASSPATH=\\\"$(hadoop classpath):${HADOOP_HOME}\/share\/hadoop\/tools\/lib\/*\\\"\" &gt;&gt; ${SPARK_HOME}\/conf\/spark-env.sh\n\n### Complete the PySpark setup from the Spark distribution files\nWORKDIR $SPARK_HOME\/python\nRUN python setup.py install\n\n### Optionally copy spark-submit to spark-submit.sh to be able to run from Domino jobs\nRUN spark_submit_path=$(which spark-submit) &amp;&amp; \\\n    cp ${spark_submit_path} ${spark_submit_path}.sh\n\nENV SPARK_RAPIDS_DIR=\/opt\/sparkRapidsPlugin\nRUN wget -q -P $SPARK_RAPIDS_DIR https:\/\/repo1.maven.org\/maven2\/com\/nvidia\/rapids-4-spark_2.12\/0.1.0\/rapids-4-spark_2.12-0.1.0.jar\nRUN wget -q -P $SPARK_RAPIDS_DIR https:\/\/repo1.maven.org\/maven2\/ai\/rapids\/cudf\/0.14\/cudf-0.14-cuda10-1.jar\nENV SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}\/cudf-0.14-cuda10-1.jar\nENV SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}\/rapids-4-spark_2.12-0.1.0.jar\n<\/pre>\n<\/div>\n<p>Notice that we also add the RAPIDS accelerator at the end and set a number of environment variables to make the plugin readily available in the preferred IDE (e.g. JupyterLab). 
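<\/p>
<p>Since the notebook sessions later pass these jar locations to Spark, it can be useful to sanity-check them from inside a running workspace. A minimal sketch (the fallback paths mirror the ones baked in by the Dockerfile above; the variable names are ours, purely illustrative):<\/p>

```python
import os

# SPARK_CUDF_JAR and SPARK_RAPIDS_PLUGIN_JAR are exported by the workspace
# Dockerfile; the defaults below mirror the paths baked into the image.
cudf_jar = os.environ.get(
    "SPARK_CUDF_JAR", "/opt/sparkRapidsPlugin/cudf-0.14-cuda10-1.jar")
plugin_jar = os.environ.get(
    "SPARK_RAPIDS_PLUGIN_JAR",
    "/opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.1.0.jar")

# extraClassPath entries are ':'-separated on Linux.
extra_class_path = ":".join([plugin_jar, cudf_jar])
print(extra_class_path)
```

<p>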
We also add the following mapping to the Pluggable Workspaces Tools section in order to make tools such as Jupyter and JupyterLab available through the Domino UI.<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: yaml; gutter: false; title: ; notranslate\" title=\"\">\njupyter:\n  title: \"Jupyter (Python, R, Julia)\"\n  iconUrl: \"\/assets\/images\/workspace-logos\/Jupyter.svg\"\n  start: [ \"\/var\/opt\/workspaces\/jupyter\/start\" ]\n  httpProxy:\n    port: 8888\n    rewrite: false\n    internalPath: \"\/{{ownerUsername}}\/{{projectName}}\/{{sessionPathComponent}}\/{{runId}}\/{{#if pathToOpen}}tree\/{{pathToOpen}}{{\/if}}\"\n    requireSubdomain: false\n  supportedFileExtensions: [ \".ipynb\" ]\njupyterlab:\n  title: \"JupyterLab\"\n  iconUrl: \"\/assets\/images\/workspace-logos\/jupyterlab.svg\"\n  start: [ \"\/var\/opt\/workspaces\/Jupyterlab\/start.sh\" ]\n  httpProxy:\n    internalPath: \"\/{{ownerUsername}}\/{{projectName}}\/{{sessionPathComponent}}\/{{runId}}\/{{#if pathToOpen}}tree\/{{pathToOpen}}{{\/if}}\"\n    port: 8888\n    rewrite: false\n    requireSubdomain: false\nvscode:\n  title: \"vscode\"\n  iconUrl: \"\/assets\/images\/workspace-logos\/vscode.svg\"\n  start: [ \"\/var\/opt\/workspaces\/vscode\/start\" ]\n  httpProxy:\n    port: 8888\n    requireSubdomain: false\nrstudio:\n  title: \"RStudio\"\n  iconUrl: \"\/assets\/images\/workspace-logos\/Rstudio.svg\"\n  start: [ \"\/var\/opt\/workspaces\/rstudio\/start\" ]\n  httpProxy:\n    port: 8888\n    requireSubdomain: false\n<\/pre>\n<\/div>\n<p>After the workspace and Spark environments are made available, everything is in place for launching GPU-accelerated Spark clusters. All we need to do at this point is to go to an arbitrary project and define a new Workspace. We can name the workspace On Demand Spark, select the Spark 3.0.0 RAPIDS Workspace Py3.6 environment, and mark JupyterLab as the desired IDE. 
The selected hardware tier for the workspace can be relatively small, as most of the heavy lifting will be carried out by the Spark cluster.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/new-spark-workspace.png\" alt=\"The Launch New Workspace screen in Domino. Environment is set to Spark 3.0.0 Workspace, IDE is set to JupyterLab.\" class=\"wp-image-7455\" width=\"509\" height=\"427\"><\/figure>\n<\/div>\n<p>On the Compute Cluster screen, we select Spark, set the number of executors that we want Domino to create for the cluster, and select hardware tiers for the Spark executors and Spark driver. We need to make sure that these hardware tiers have Nvidia GPUs if we are to benefit from using the RAPIDS accelerator.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/spark-cluster-configuration.png\" alt=\"Compute Cluster tab of the Launch New Workspace dialog, showing 2 executors, GPU HW tier for the executors, and GPU HW tier for the spark master. Compute environment is set to Spark 3.0.0 GPU\" class=\"wp-image-7456\" width=\"509\" height=\"427\"><\/figure>\n<\/div>\n<p>Once the cluster is up and running, we will be presented with an instance of JupyterLab. The workspace will also feature an extra tab \u2013 Spark Web UI \u2013 which provides access to the web interface of the running Spark application and allows us to monitor and inspect the relevant job executions.<\/p>\n<p>We can then create a notebook with a minimal example to smoke test the configuration. 
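<\/p>
<p>The GPU settings in the session configuration that follows encode a simple bit of arithmetic: with 6 executor cores sharing one GPU, a per-task GPU amount of 0.15 lets six tasks hold a slice of the GPU concurrently. An illustrative check (the variable names are ours, not Spark configuration keys):<\/p>

```python
import math

executor_cores = 6      # cores on the executor hardware tier in this walkthrough
task_gpu_amount = 0.15  # the value used for spark.task.resource.gpu.amount

# Number of tasks the single GPU admits at this fraction:
tasks_by_gpu = math.floor(1 / task_gpu_amount)
# Spark can schedule no more concurrent tasks than either cores or GPU share allow:
concurrent_tasks = min(executor_cores, tasks_by_gpu)
print(concurrent_tasks)  # 6: every core stays busy while sharing the one GPU
```

<p>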
First, we establish a connection to the on-demand cluster and create an application:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; gutter: false; title: ; notranslate\" title=\"\">\nfrom pyspark.sql import SparkSession\n\nspark = SparkSession.builder \\\n                    .config(\"spark.task.cpus\", 1) \\\n                    .config(\"spark.driver.extraClassPath\", \"\/opt\/sparkRapidsPlugin\/rapids-4-spark_2.12-0.1.0.jar:\/opt\/sparkRapidsPlugin\/cudf-0.14-cuda10-1.jar\") \\\n                    .config(\"spark.executor.extraClassPath\", \"\/opt\/sparkRapidsPlugin\/rapids-4-spark_2.12-0.1.0.jar:\/opt\/sparkRapidsPlugin\/cudf-0.14-cuda10-1.jar\") \\\n                    .config(\"spark.executor.resource.gpu.amount\", 1) \\\n                    .config(\"spark.executor.cores\", 6) \\\n                    .config(\"spark.task.resource.gpu.amount\", 0.15) \\\n                    .config(\"spark.rapids.sql.concurrentGpuTasks\", 1) \\\n                    .config(\"spark.rapids.memory.pinnedPool.size\", \"2G\") \\\n                    .config(\"spark.locality.wait\", \"0s\") \\\n                    .config(\"spark.sql.files.maxPartitionBytes\", \"512m\") \\\n                    .config(\"spark.sql.shuffle.partitions\", 10) \\\n                    .config(\"spark.plugins\", \"com.nvidia.spark.SQLPlugin\") \\\n                    .appName(\"MyGPUAppName\") \\\n                    .getOrCreate()\n\n<\/pre>\n<\/div>\n<p>Note that we keep parts of the configuration dynamic, as these values vary based on the specific GPU hardware tier that is running the execution.\u00a0<\/p>\n<ul>\n<li>spark.task.cpus \u2013 the number of cores to allocate for each task<\/li>\n<li>spark.task.resource.gpu.amount \u2013 the number of GPUs per task. Note that this can be a decimal, and it can be set in line with the number of CPUs available on the executor hardware tier. 
In this test, we set it to 0.15, which is slightly under 1\/6 (6 CPUs sharing a single GPU)<\/li>\n<li>spark.executor.resource.gpu.amount \u2013 the number of GPUs available in the hardware tier (we have 1 V100 here)<\/li>\n<\/ul>\n<p>After the application is initialised and connected to the cluster, it appears in the Spark Web UI section of the workspace:<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" width=\"1024\" height=\"684\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/spark-web-ui-1024x684.png\" alt=\"Spark Web UI tab showing 2 workers, and the MyGPUAppName application using 12 cores and 1 gpu per executor\" class=\"wp-image-7457\"><\/figure>\n<\/div>\n<p>We can then run a simple outer join task that looks like this:<\/p>\n<div class=\"wp-block-syntaxhighlighter-code \">\n<pre class=\"brush: python; gutter: false; title: ; notranslate\" title=\"\">\ndf1 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, \"a\" * x)).toDF()\ndf2 = spark.sparkContext.parallelize(range(1, 100)).map(lambda x: (x, \"b\" * x)).toDF()\ndf = df1.join(df2, how=\"outer\")\ndf.count()\n<\/pre>\n<\/div>\n<p>After the count() action completes, we can inspect the DAG for the first job (for example) and clearly see that Spark is using GPU-accelerated operations (e.g. GpuColumnarExchange, GpuHashAggregate, etc.)<\/p>\n<div class=\"wp-block-image is-style-default\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/gpu-accelerated-dag.png\" alt=\"Spark DAG visualisation showing 2 stages with standard operations replaced by GPU-accelerated operations (e.g. 
GpuHashAggregate, GPUColumnarExchange etc.)\" class=\"wp-image-7458\" width=\"565\" height=\"627\"><\/figure>\n<\/div>\n<h3>Summary<\/h3>\n<p>In this post, we showed that configuring an on-demand Apache Spark cluster with the RAPIDS Accelerator and GPU backends is a fairly straightforward process in Domino. Besides the benefits of not having to deal with the underlying infrastructure, reducing costs through on-demand provisioning, and the out-of-the-box reproducibility provided by the Domino platform, this setup also significantly reduces processing times, making data science teams more efficient and enabling them to achieve higher model velocity.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2021\/03\/spark-benchmark.png\" alt=\"Benchmark plot showing 3.8x speed-up and 50% cost savings on ETL workloads.\" class=\"wp-image-7459\" width=\"693\" height=\"209\"><\/figure>\n<\/div>\n<p>A benchmark published by <a href=\"https:\/\/nvidia.github.io\/spark-rapids\/\">Nvidia<\/a> shows a 3.8x speed-up and a 50% cost reduction for an ETL workload executed on the FannieMae Mortgage Dataset (~200GB) using V100 GPU instances.<\/p>\n
<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blog.dominodatalab.com\/on-demand-spark-clusters-with-gpu-acceleration\/<\/p>\n","protected":false},"author":0,"featured_media":8170,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8169"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8169"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8169\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8170"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8169"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8169"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8169"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}