{"id":882,"date":"2020-09-01T14:04:50","date_gmt":"2020-09-01T14:04:50","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/01\/showcasing-the-benefits-of-software-optimizations-for-ai-workloads-on-intel-xeon-scalable-platforms\/"},"modified":"2020-09-01T14:04:50","modified_gmt":"2020-09-01T14:04:50","slug":"showcasing-the-benefits-of-software-optimizations-for-ai-workloads-on-intel-xeon-scalable-platforms","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/01\/showcasing-the-benefits-of-software-optimizations-for-ai-workloads-on-intel-xeon-scalable-platforms\/","title":{"rendered":"Showcasing the Benefits of Software Optimizations for AI Workloads on Intel\u00ae Xeon\u00ae Scalable Platforms"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By Huma Abidi &amp; Haihao Shen, Intel<\/b><\/p>\n<p>Intel\u00ae Xeon\u00ae Scalable platforms provide the foundation for an evolutionary leap forward in data centric innovation. <a href=\"https:\/\/www.intel.com\/content\/dam\/www\/public\/us\/en\/documents\/product-overviews\/dl-boost-product-overview.pdf\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u00ae Deep Learning Boost<\/a> is a technology that has built-in AI acceleration to enhance artificial intelligence inference and training performance. Specifically, for the 3rd Generation Intel Xeon Scalable Processors that was <a href=\"https:\/\/newsroom.intel.com\/wp-content\/uploads\/sites\/11\/2020\/06\/3rd-Gen-Intel-Xeon-Product-Brief.pdf\" rel=\"noopener noreferrer\" target=\"_blank\">announced<\/a> in June 2020, it was the industry\u2019s first x86 support of Brain Floating Point 16-bit (blfoat16) and Vector Neural Network Instructions (VNNI). 
The enhancements in hardware architecture, coupled with software optimizations, deliver up to <b>1.93x<\/b> more performance in AI training and up to <b>1.9x<\/b> more performance in AI inference compared to 2<sup>nd<\/sup> Generation Intel Xeon Scalable processors<sup>1<\/sup>. <\/p>\n<p>The focus of this blog is to show that continued software optimizations can boost performance not only for the latest platforms, but also for the installed base from prior generations. For each magnitude of performance increase achieved by new hardware, software optimizations have the potential to multiply those results. However, new hardware platforms come to market roughly every 12-18 months. In between those new platform introductions, we continue to innovate on the software stack to enhance performance even for the current generation of platforms. In addition, open-source software stacks such as TensorFlow and PyTorch are continuously driving innovations and changes into their implementations. Our software stack is aligned to address these rapid releases and accelerate our Intel-optimized versions. We are also at the forefront of enabling new models and use cases published by the AI research community in industry and academia. This means customers can continue to extract value from their current platform investments. <\/p>\n<p>To highlight this point, we measured the performance of popular deep learning workloads. The table below includes image classification (ResNet50 v1.5), natural language processing (Transformer\/BERT), and recommendation systems (Wide &amp; Deep). It is important to note that the workloads were all run on 2<sup>nd<\/sup> Generation Intel Xeon Scalable processors. The only variable across runs is the software version. 
Therefore, the performance gains in the resulting table are attributable purely to software optimizations.<\/p>\n<p>\u00a0<\/p>\n<h3>Results<\/h3>\n<p>\u00a0<br \/><b>Deep Learning Workload Performance<\/b><sup><b>2<\/b><\/sup><\/p>\n<div><img src=\"https:\/\/i.ibb.co\/HgBZKxK\/intel-sw-optimizations-fig.jpg\" alt=\"Results\" width=\"100%\"><\/div>\n<p>\u00a0<\/p>\n<p>These impressive performance gains are due to software optimizations used in Intel-optimized deep learning frameworks (e.g., TensorFlow, PyTorch, and MXNet) as well as the <a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/openvino-toolkit.html\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u00ae Distribution of OpenVINO<sup>TM <\/sup>toolkit<\/a> for deep learning inference. Here are some of our approaches to software optimization: <\/p>\n<ul>\n<li>The Intel\u00ae oneAPI Deep Neural Network Library (oneDNN) is used in all popular deep learning frameworks; its primitives accelerate the execution of typical DNN operators such as Convolution, GEMM, and BatchNorm.\n<\/li>\n<li>Graph fusion is used to reduce memory footprint and save on memory bandwidth. In addition to graph fusion, constant folding (Convolution and BatchNorm) and common subexpression elimination are applied as pre-computation before graph execution.\n<\/li>\n<li>Runtime optimization is achieved through better memory and thread management. Memory management improves cache utilization by reusing memory as much as possible. Thread management allocates thread resources more effectively for workload execution.\n<\/li>\n<\/ul>\n<p>In addition to single-node software optimizations, multi-node software optimizations are also explored to deliver better scale-up\/scale-out efficiency. 
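As a concrete note on the runtime thread management above: with Intel-optimized frameworks, thread placement is typically controlled through OpenMP environment variables, the same ones that appear in the benchmark configurations later in this post. A minimal sketch follows; the values are illustrative (tune `num_threads` to your physical core count), and the `KMP_BLOCKTIME` setting is an additional commonly recommended variable not listed in this post:

```python
import os

def set_intel_runtime_env(num_threads=28):
    """Apply thread-affinity settings commonly used with Intel-optimized
    deep learning frameworks. OpenMP reads these at startup, so call this
    before importing TensorFlow / PyTorch / MXNet."""
    settings = {
        # One OpenMP worker per physical core (28 cores per socket here).
        "OMP_NUM_THREADS": str(num_threads),
        # Pin threads to cores so they do not migrate between caches.
        "KMP_AFFINITY": "granularity=fine,noduplicates,compact,1,0",
        # Short spin-wait after parallel regions often helps DL workloads.
        "KMP_BLOCKTIME": "1",
    }
    os.environ.update(settings)
    return settings

applied = set_intel_runtime_env(28)
```

These variables must be set before the framework process starts (or before the framework is imported), because the OpenMP runtime reads them only once.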
Data, model, and hybrid parallelism are well-known techniques for multi-node training, together with computation\/communication overlap (or pipeline overlap) and novel communication collective algorithms (e.g., ring-based all-reduce).<\/p>\n<p>\u00a0<\/p>\n<h3>How You Can Access Intel\u2019s Software Optimizations<\/h3>\n<p>\u00a0<br \/>Through our work in software, we continuously make performance improvements for deep learning frameworks. There are multiple ways to gain access to Intel\u2019s software optimizations that enhance current generations of Intel hardware and benefit future platforms. <\/p>\n<p>The <a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/oneapi\/ai-analytics-toolkit.html\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u00ae AI Analytics Toolkit<\/a>, powered by oneAPI, gives\u00a0developers, researchers, and data scientists familiar\u00a0Python tools\u00a0to\u00a0accelerate\u00a0each step in the pipeline\u2014training deep neural networks, integrating trained models into applications for inference, and\u00a0executing functions for\u00a0data analytics and machine learning workloads. The toolkit includes:<\/p>\n<p><a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/frameworks.html#pytorch\" rel=\"noopener noreferrer\" target=\"_blank\">PyTorch Optimized for Intel\u00ae Technology<\/a>. This includes Intel optimizations upstreamed to mainline PyTorch as well as the <a href=\"https:\/\/github.com\/intel\/intel-extension-for-pytorch\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u00ae Extension for PyTorch<\/a>, which is intended to improve the out-of-box experience for our customers. <\/p>\n<p><a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/frameworks.html#tensor-flow\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u00ae Optimization for TensorFlow<\/a>. 
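As a brief aside, the ring-based all-reduce collective mentioned in the multi-node discussion above works by splitting each worker's tensor into n chunks that circulate around a logical ring, so every link carries only about 1/n of the data per step. Here is a toy single-process simulation of the idea, assuming nothing beyond NumPy; real implementations run the hops across machines and overlap them with computation:

```python
import numpy as np

def ring_all_reduce(tensors):
    """Toy simulation of ring-based all-reduce over n workers.

    Input: one 1-D array per worker (all the same length).
    Output: one array per worker, each equal to the elementwise sum.
    """
    n = len(tensors)
    # Each worker splits its tensor into n chunks.
    chunks = [np.array_split(np.asarray(t, dtype=np.float64).copy(), n)
              for t in tensors]

    for c in range(n):  # each chunk circulates around the ring independently
        # Reduce-scatter: n-1 hops accumulate every worker's contribution,
        # leaving the complete sum of chunk c on worker (c + n - 1) % n.
        for s in range(n - 1):
            src, dst = (c + s) % n, (c + s + 1) % n
            chunks[dst][c] = chunks[dst][c] + chunks[src][c]
        # All-gather: n-1 more hops broadcast the finished chunk to everyone.
        for s in range(n - 1, 2 * (n - 1)):
            src, dst = (c + s) % n, (c + s + 1) % n
            chunks[dst][c] = chunks[src][c].copy()

    return [np.concatenate(chunks[i]) for i in range(n)]
```

Each chunk makes 2(n-1) hops in total, which is why ring all-reduce's bandwidth cost per worker stays roughly constant as the worker count grows.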
In collaboration with Google, Intel has directly optimized TensorFlow for Intel\u00ae architecture to achieve high performance on Intel\u00ae Xeon\u00ae Scalable processors. Intel also offers AI containers: we publish Docker images of the Intel\u00ae Optimization for TensorFlow on Docker Hub. The following tags are used:<\/p>\n<ul>\n<li>3<sup>rd<\/sup> Gen Intel Xeon Scalable Processors image: intel\/intel-optimized-tensorflow:tensorflow-2.2-bf16-nightly\n<\/li>\n<li>2<sup>nd<\/sup> Gen Intel Xeon Scalable Processors image: intelaipg\/intel-optimized-tensorflow:latest-prs-b5d67b7-avx2-devel-mkl-py3\n<\/li>\n<\/ul>\n<p>\u00a0<br \/><a href=\"https:\/\/github.com\/IntelAI\/models\" rel=\"noopener noreferrer\" target=\"_blank\">Model Zoo for Intel\u00ae Architecture<\/a>. This repository contains\u00a0links to pre-trained models, sample scripts, best practices, and step-by-step tutorials\u00a0for more than 40 popular open-source machine learning models optimized by Intel to run on Intel\u00ae Xeon\u00ae Scalable processors. We are also contributing to the <a href=\"https:\/\/github.com\/tensorflow\/models\/tree\/master\/community\" rel=\"noopener noreferrer\" target=\"_blank\">Google Model Garden<\/a> by adding Intel-optimized models.<\/p>\n<p><a href=\"https:\/\/github.com\/intel\/lp-inference-kit\" rel=\"noopener noreferrer\" target=\"_blank\">The Intel\u00ae Low Precision Inference Toolkit<\/a> accelerates deep learning inference workloads. 
It converts models from FP32 to int8 precision, helping customers rapidly deploy low-precision inference solutions.<\/p>\n<p><a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/distribution-for-python.html\" rel=\"noopener noreferrer\" target=\"_blank\">The Intel\u00ae Distribution for Python<\/a> enables customers to speed up computational packages without code changes.<\/p>\n<p><a href=\"https:\/\/software.intel.com\/content\/www\/us\/en\/develop\/tools\/oneapi\/components\/onednn.html\" rel=\"noopener noreferrer\" target=\"_blank\">Intel\u2019s oneAPI Deep Neural Network Library<\/a> (oneDNN) is an open-source performance library that contains basic building blocks for neural networks optimized for Intel Architecture Processors and Intel\u00ae Processor Graphics\/GPU. oneDNN is the default CPU backend in PyTorch and MXNet binaries and is in the process of being added to TensorFlow.<\/p>\n<p>\u00a0<\/p>\n<h3>Conclusion<\/h3>\n<p>\u00a0<br \/>Intel\u00ae Xeon\u00ae Scalable processors support both complex AI workloads and general-purpose compute workloads. In addition to innovating and releasing AI features in Intel Xeon Scalable processors with each generation, Intel software optimizations take advantage of the hardware features and continue to bring significant performance speedups to popular AI workloads. 
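For readers curious what the FP32-to-int8 conversion mentioned earlier involves at its core, affine quantization maps a floating-point range onto the 256 int8 levels via a scale and a zero point. A minimal NumPy sketch of the technique follows; this illustrates the general idea, not the API of any Intel toolkit:

```python
import numpy as np

def quantize_int8(x):
    """Affine (asymmetric) FP32 -> int8 quantization.
    Returns the quantized array plus the scale and zero point
    needed to map values back to floating point."""
    qmin, qmax = -128, 127
    x_min = min(float(x.min()), 0.0)  # keep 0.0 exactly representable
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 values back to approximate FP32."""
    return (q.astype(np.float32) - zero_point) * scale
```

Production tools go further, calibrating scales per channel from sample data and fusing the quantize/dequantize steps into adjacent operators, but the scale/zero-point mapping above is the underlying mechanism.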
Developers and end customers should stay up to date and take advantage of the Intel-optimized frameworks and software toolkits that are designed to unleash the performance of Intel platforms.<\/p>\n<p>\u00a0<br \/><b>References<\/b><\/p>\n<p>\u00a0<br \/><strong>Notices and Disclaimers<\/strong><\/p>\n<p>Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.\u00a0\u00a0<\/p>\n<p>Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.\u00a0 Any change to any of those factors may cause the results to vary.\u00a0 You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.\u00a0\u00a0 For more complete information visit www.intel.com\/benchmarks.<\/p>\n<p>Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available \u200bupdates.\u00a0 See backup for configuration details.\u00a0 No product or component can be absolutely secure.\u00a0<\/p>\n<p>Intel&#8217;s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. 
Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.\u00a0<\/p>\n<p>Your costs and results may vary.\u00a0<\/p>\n<p>Intel technologies may require enabled hardware, software or service activation.<\/p>\n<p>Configurations: Testing by Intel as of Jul 3<sup>rd<\/sup>, 2020.<\/p>\n<p>ResNet50 v1.5<\/p>\n<ul>\n<li>Jun-18: training, torch v0.4, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python benchmark_.py &#8211;arch resnet50 &#8211;num-iters=20; inference, torch v0.4, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python benchmark_.py &#8211;arch resnet50 &#8211;num-iters=20 \u2013inference\n<\/li>\n<li>Apr-19: training, torch v1.01, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python benchmark_.py &#8211;arch resnet50 &#8211;num-iters=20; inference: MLPerf v0.5 submission + patch: <a href=\"https:\/\/github.com\/pytorch\/pytorch\/pull\/25235\" rel=\"noopener noreferrer\" target=\"_blank\">https:\/\/github.com\/pytorch\/pytorch\/pull\/25235<\/a>, .\/inferencer &#8211;net_conf resnet50 &#8211;log_level 0 &#8211;w 20 &#8211;batch_size 128 &#8211;iterations 1000 &#8211;device_type ideep &#8211;dummy_data true &#8211;random_multibatch false &#8211;numa_id 0 &#8211;init_net_path resnet50\/init_net_int8.pbtxt &#8211;predict_net_path resnet50\/predict_net_int8.pbtxt &#8211;shared_memory_option USE_LOCAL &#8211;shared_weight USE_LOCAL &#8211;data_order NHWC &#8211;quantized true\n<\/li>\n<li>Jun-20: training, <a href=\"https:\/\/github.com\/pytorch\/pytorch\/tree\/gh\/xiaobingsuper\/18\/orig\" rel=\"noopener noreferrer\" target=\"_blank\">https:\/\/github.com\/pytorch\/pytorch\/tree\/gh\/xiaobingsuper\/18\/orig<\/a>, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python benchmark_.py &#8211;arch resnet50 &#8211;num-iters=20; inference, same as 
Apr-19\n<\/li>\n<\/ul>\n<p>Transformer<\/p>\n<ul>\n<li>Apr-19: training, docker intelaipg\/intel-optimized-tensorflow:latest-prs-b5d67b7-avx2-devel-mkl-py3, OMP_NUM_THREADS=28 python .\/benchmarks\/launch_benchmark.py &#8211;framework tensorflow &#8211;precision fp32 &#8211;mode training &#8211;model-name transformer_mlperf &#8211;num-intra-threads 28 &#8211;num-inter-threads 1 &#8211;data-location transformer_data &#8211;random_seed=11 train_steps=100 steps_between_eval=100 params=big save_checkpoints=&#8221;No&#8221; do_eval=&#8221;No&#8221; print_iter=10; inference, docker intelaipg\/intel-optimized-tensorflow:latest-prs-b5d67b7-avx2-devel-mkl-py3, OMP_NUM_THREADS=28 python benchmarks\/launch_benchmark.py &#8211;model-name transformer_lt_official &#8211;precision fp32 &#8211;mode inference &#8211;framework tensorflow &#8211;batch-size 64 &#8211;num-intra-threads 28 &#8211;num-inter-threads 1 &#8211;in-graph fp32_graphdef.pb &#8211;data-location transformer_lt_official_fp32_pretrained_model\/data &#8212; file=newstest2014.en file_out=translate.txt reference=newstest2014.de vocab_file=vocab.txt NOINSTALL=True\n<\/li>\n<li>Jun-20: training, docker i intel\/intel-optimized-tensorflow:tensorflow-2.2-bf16-nightly, OMP_NUM_THREADS=28 python .\/benchmarks\/launch_benchmark.py &#8211;framework tensorflow &#8211;precision fp32 &#8211;mode training &#8211;model-name transformer_mlperf &#8211;num-intra-threads 28 &#8211;num-inter-threads 1 &#8211;data-location transformer_data &#8211;random_seed=11 train_steps=100 steps_between_eval=100 params=big save_checkpoints=&#8221;No&#8221; do_eval=&#8221;No&#8221; print_iter=10; inference, docker intelaipg\/intel-optimized-tensorflow:latest-prs-b5d67b7-avx2-devel-mkl-py3, OMP_NUM_THREADS=28 python benchmarks\/launch_benchmark.py &#8211;model-name transformer_lt_official &#8211;precision fp32 &#8211;mode inference &#8211;framework tensorflow &#8211;batch-size 64 &#8211;num-intra-threads 28 &#8211;num-inter-threads 1 &#8211;in-graph 
fp32_graphdef.pb &#8211;data-location transformer_lt_official_fp32_pretrained_model\/data &#8212; file=newstest2014.en file_out=translate.txt reference=newstest2014.de vocab_file=vocab.txt NOINSTALL=True\n<\/li>\n<\/ul>\n<p>BERT<\/p>\n<ul>\n<li>Apr-19: inference, docker intelaipg\/intel-optimized-tensorflow:latest-prs-b5d67b7-avx2-devel-mkl-py3, OMP_NUM_THREADS=28 python run_squad.py &#8211;init_checkpoint=\/tf_dataset\/dataset\/data-bert-squad\/squad-ckpts\/model.ckpt-3649 &#8211;vocab_file=\/tf_dataset\/dataset\/data-bert-squad\/uncased_L-24_H-1024_A-16\/vocab.txt &#8211;bert_config_file=\/tf_dataset\/dataset\/data-bert-squad\/uncased_L-24_H-1024_A-16\/bert_config.json &#8211;predict_file=\/tf_dataset\/dataset\/data-bert-squad\/uncased_L-24_H-1024_A-16\/dev-v1.1.json &#8211;precision=fp32 &#8211;output_dir=\/root\/logs &#8211;predict_batch_size=32 &#8211;do_predict=True &#8211;mode=benchmark\n<\/li>\n<li>Jun-20: inference, docker i intel\/intel-optimized-tensorflow:tensorflow-2.2-bf16-nightly, OMP_NUM_THREADS=28 python launch_benchmark.py &#8211;model-name bert_large &#8211;precision fp32 &#8211;mode inference &#8211;framework tensorflow &#8211;batch-size 32 &#8211;socket-id 0 &#8211;docker-image intel\/intel-optimized-tensorflow:tensorflow-2.2-bf16-nightly &#8211;data-location dataset\/bert_large_wwm\/wwm_uncased_L-24_H-1024_A-16 <br \/>&#8211;checkpoint \/tf_dataset\/dataset\/data-bert-squad\/squad-ckpts &#8211;benchmark-only &#8211;verbose &#8212; https_proxy=http:\/\/proxy.ra.intel.com:912 http_proxy=http:\/\/proxy.ra.intel.com:911 DEBIAN_FRONTEND=noninteractive init_checkpoint=model.ckpt-3649 infer_option=SQuAD\n<\/li>\n<\/ul>\n<p><a href=\"https:\/\/github.com\/intel\/optimized-models\/tree\/master\/mxnet\/wide_deep_criteo\" rel=\"noopener noreferrer\" target=\"_blank\">Wide&amp;Deep<\/a><\/p>\n<ul>\n<li>Jun-18: training, mxnet 1.3, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python train.py, inference, mxnet 1.3, 
KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 numactl &#8211;physcpubind=0-27 &#8211;membind=0 python inference.py &#8211;accuracy True\n<\/li>\n<li>Apr-19: training, mxnet 1.4, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python train.py, inference, mxnet 1.4, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 numactl &#8211;physcpubind=0-27 &#8211;membind=0 python inference.py &#8211;accuracy True\n<\/li>\n<li>Jun-20: training, mxnet 1.7, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 python train.py, inference, mxnet 1.7, KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0 OMP_NUM_THREADS=28 numactl &#8211;physcpubind=0-27 &#8211;membind=0 python inference.py &#8211;symbol-file=WD-quantized-162batches-naive-symbol.json &#8211;param-file=WD-quantized-0000.params &#8211;accuracy True\n<\/li>\n<\/ul>\n<p>\u00a9 Intel Corporation.\u00a0 Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.\u00a0 Other names and brands may be claimed as the property of others.\u00a0 <\/p>\n<ol>\n<li>Refer to\u00a0<a href=\"https:\/\/software.intel.com\/articles\/optimization-notice\" rel=\"noopener noreferrer\" target=\"_blank\">https:\/\/software.intel.com\/articles\/optimization-notice<\/a>\u00a0for more information regarding performance and optimization choices in Intel software products.\n<\/li>\n<li>See testing configuration details.\u00a0 For more complete information about performance and benchmark results, visit\u00a0<a href=\"http:\/\/www.intel.com\/benchmarks\" rel=\"noopener noreferrer\" target=\"_blank\">www.intel.com\/benchmarks<\/a>.\n<\/li>\n<\/ol>\n<p>\u00a0<br \/><b>Huma Abidi<\/b> is a Senior Director of AI Software Products at Intel, responsible for strategy, roadmaps, requirements, validation and Benchmarking of DL, ML and Analytics Software Products. 
She leads a globally diverse team of engineers and technologists responsible for delivering AI products that enable customers to create AI solutions.<\/p>\n<p><b>Haihao Shen<\/b> is a senior deep learning engineer in Machine Learning Performance (MLP) at Intel. He leads benchmarking for deep learning frameworks and the development of the low-precision optimization tool. He has more than 10 years of experience working on software optimization and verification at Intel. Prior to joining Intel, he received his master\u2019s degree from Shanghai Jiao Tong University.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/showcasing-benefits-software-optimizations-ai-workloads-intel.html<\/p>\n","protected":false},"author":0,"featured_media":883,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/882"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=882"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/882\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/883"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=882"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=882"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=882
"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}