{"id":8223,"date":"2021-04-09T22:26:27","date_gmt":"2021-04-09T22:26:27","guid":{"rendered":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/09\/deep-learning-recommendation-models-dlrm-a-deep-dive\/"},"modified":"2021-04-09T22:26:27","modified_gmt":"2021-04-09T22:26:27","slug":"deep-learning-recommendation-models-dlrm-a-deep-dive","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2021\/04\/09\/deep-learning-recommendation-models-dlrm-a-deep-dive\/","title":{"rendered":"Deep Learning Recommendation Models (DLRM): A Deep Dive"},"content":{"rendered":"<div id=\"post-\">\n   <!-- post_author Nishant Kumar -->  <\/p>\n<p><b>By <a href=\"https:\/\/www.linkedin.com\/in\/nishant-kumar-350043a5\/\" target=\"_blank\" rel=\"noopener\">Nishant Kumar<\/a>, Data Science Professional<\/b>.<\/p>\n<p>Recommendation systems are built to predict what users might like, especially when there are lots of choices available.<\/p>\n<p>The DLRM algorithm was open-sourced by Facebook on March 31, 2019, and is\u00a0part of the popular MLPerf Benchmark.<\/p>\n<p><img class=\"aligncenter size-full wp-image-125421\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/deep-learning-recommendation-model-facebook.jpg\" alt=\"\" width=\"90%\"><\/p>\n<p><em><a href=\"https:\/\/arxiv.org\/pdf\/1906.00091.pdf\" target=\"_blank\" rel=\"noopener\">Deep Learning Recommendation Model Architecture (DLRM)<\/a><\/em><\/p>\n<p><strong>Why should you consider using DLRM?<\/strong><\/p>\n<p>This paper attempts to combine 2 important concepts that are driving the architectural changes in recommendation systems:<\/p>\n<ol>\n<li>From the view of Recommendation Systems, initially, content filtering systems were employed that matched users to products based on their preferences. 
This subsequently evolved to use collaborative filtering where recommendations were based on past user behaviors.<\/li>\n<li>From the view of Predictive Analytics, it relies on statistical models to classify or predict the probability of events based on the given data. These models shifted from simple models such as linear and logistic regression to models that incorporate deep networks.<\/li>\n<\/ol>\n<p>In this paper, the authors claim to succeed in unifying these 2 perspectives in the DLRM Model.<\/p>\n<p>A few notable features:<\/p>\n<ol>\n<li><strong>Extensive use of Embedding Tables<\/strong>: Embedding provides a rich and meaningful representation of the data of the users.<\/li>\n<li><strong>Exploits Multi-layer Perceptron (MLP):\u00a0<\/strong>MLP presents a flavor of Deep Learning. They can well address the limitations presented by the statistical methods.<\/li>\n<li><strong>Model Parallelism:\u00a0<\/strong>Poses less overhead on memory and speeds it up.<\/li>\n<li><strong>Interaction between Embeddings<\/strong>: Used to interpret latent factors (i.e., hidden factors) between feature interactions. An example would be how likely a user who likes comedy and horror movies would like a horror-comedy movie. 
Such interactions play a major role in the working of recommendation systems.<\/li>\n<\/ol>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*maDJlm5iWkQ3VdyT\" width=\"90%\"><\/p>\n<p><strong>LET&#8217;S START<\/strong><\/p>\n<p>\u00a0<\/p>\n<h3>Model Workflow:<\/h3>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*w1I1laI2NeeTEyUBiY9O4w.gif\" width=\"90%\"><\/p>\n<p><em>DLRM Workflow.<\/em><\/p>\n<ul>\n<li>The model uses embedding to process sparse features that represent categorical data and a Multi-layer Perceptron (MLP) to process dense features,<\/li>\n<li>Interacts these features explicitly using the statistical techniques proposed.<\/li>\n<li>Finally, it finds the event probability by post-processing the interactions with another MLP.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>ARCHITECTURE:<\/h3>\n<p>\u00a0<\/p>\n<ol>\n<li>Embeddings<\/li>\n<li>Matrix Factorization<\/li>\n<li>Factorization Machine<\/li>\n<li>Multi-layer Perceptron (MLP)<\/li>\n<\/ol>\n<p>Let\u2019s discuss them in a little detail.<\/p>\n<p><strong>1. 
Embeddings<\/strong><\/p>\n<p><em>Mapping concepts, objects, or items into a vector space is called an embedding.<\/em><\/p>\n<p>E.g.,<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*3Ym44dctxGISlqGQ\" width=\"90%\"><\/p>\n<p>In the context of neural networks, embeddings are low-dimensional, learned, continuous vector representations of discrete variables.<\/p>\n<p><strong>Why should we use Embeddings instead of other options such as lists of sparse items?<\/strong><\/p>\n<ul>\n<li>They reduce the dimensionality of categorical variables and represent categories meaningfully in an abstract space.<\/li>\n<li>Distances between embeddings can be measured in a more meaningful way.<\/li>\n<li>Embedding elements represent sparse features in some abstract space relevant to the model at hand, while integers only represent an ordering of the input data.<\/li>\n<li>Embedding vectors project an\u00a0<strong>n-dimensional item space\u00a0<\/strong>into\u00a0<strong>d-dimensional embedding vectors\u00a0<\/strong>where <em>n &gt;&gt; d<\/em>.<\/li>\n<\/ul>\n<p><strong>2. Matrix Factorization<\/strong><\/p>\n<p>This technique belongs to the class of collaborative filtering algorithms used in recommendation systems.<\/p>\n<p>Matrix Factorization algorithms work by decomposing the user-item interaction matrix into the product of 2 lower-dimensional rectangular matrices.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/738\/0*Gs49CEBwLp6vYZsI\" width=\"90%\"><\/p>\n<p><em>Refer to <\/em><em><u><a href=\"https:\/\/developers.google.com\/machine-learning\/recommendation\/collaborative\/matrix\">https:\/\/developers.google.com\/machine-learning\/recommendation\/collaborative\/matrix<\/a>\u00a0<\/u><\/em><em>for more details.<\/em><\/p>\n<p><strong>3. Factorization Machines (FM)<\/strong><\/p>\n<p>FMs are a good choice for tasks dealing with high-dimensional sparse datasets.<\/p>\n<p>FM is an improved version of MF. 
It is designed to capture interactions between features within high-dimensional sparse datasets economically.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/838\/0*-8Q3H2GSzt1eTRG9\" width=\"90%\"><\/p>\n<p><em>Factorization Machine (FM) Equation.<\/em><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/689\/0*wt1NSe-hgtdJpxYj\" width=\"90%\"><\/p>\n<p>Features of Factorization Machines:<\/p>\n<ul>\n<li>They can estimate interactions in sparse settings because they break the independence of the interaction parameters by factorizing them.<\/li>\n<li>They incorporate second-order interactions into a linear model with categorical data by defining a model of the form:<\/li>\n<\/ul>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/286\/0*jI_TfRWbUw6oRGci\" width=\"229\" height=\"27\"><\/p>\n<p>FMs factorize the second-order interaction matrix into its latent factors (or embedding vectors), as in matrix factorization, which handles sparse data more effectively.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/390\/0*tkFVjfYMiOKsEFce\" width=\"312\" height=\"416\"><em>Different orders of interaction matrices.<\/em><\/p>\n<p><strong>This significantly reduces the complexity of second-order interactions by only capturing interactions between pairs of distinct embedding vectors, yielding linear computational complexity.<\/strong><\/p>\n<p><em>Refer to <\/em><a href=\"https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/fact-machines.html\" target=\"_blank\" rel=\"noopener\"><em>https:\/\/docs.aws.amazon.com\/sagemaker\/latest\/dg\/fact-machines.html<\/em><\/a>.<\/p>\n<p><strong>4. 
Multi-layer Perceptron<\/strong><\/p>\n<p>Finally, a little flavor of Deep Learning.<\/p>\n<p>A Multi-layer Perceptron (MLP) is a class of Feed-Forward Artificial Neural Network.<\/p>\n<p>An MLP consists of at least 3 layers of nodes:<\/p>\n<ul>\n<li>Input layer<\/li>\n<li>Hidden layer<\/li>\n<li>Output layer<\/li>\n<\/ul>\n<p>Except for input nodes, each node is a neuron that uses a nonlinear activation function.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/625\/1*UyndHD1FdTHsAaeid2fn3Q.gif\" width=\"90%\"><\/p>\n<p><em>MLP utilizes supervised learning called Backpropagation for training.<\/em><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/675\/0*XmYBFUWTrWaTACK8\" width=\"90%\"><\/p>\n<p>These methods have been used to capture more complex interactions.<\/p>\n<p>MLPs with sufficient depth and width can fit data to arbitrary precision.<\/p>\n<p>One specific case,\u00a0Neural Collaborative Filtering (NCF), used as part of MLPerf Benchmark, uses an MLP rather than dot product to compute interactions between embeddings in Matrix Factorization.<\/p>\n<p>\u00a0<\/p>\n<h3>DLRM Operators by Framework<\/h3>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/1*KQpKYSBqt75H5JCZaFk1PA.png\" width=\"90%\"><\/p>\n<p>You can find below the overall architecture of open-source recommendation model system. All configurable parameters are outlined in blue. 
The operators used are shown in green.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/808\/0*O9r79-SKw1xsUbqf\" width=\"90%\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1906.03109.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1906.03109.pdf<\/a>.<\/em><\/p>\n<p>Facebook tested 3 models (<a href=\"https:\/\/arxiv.org\/pdf\/1906.03109.pdf\">Source: The Architectural Implications of Facebook\u2019s DNN-based Personalized Recommendation<\/a>).<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/814\/0*DohG5SRSIqv0hYVF\" width=\"90%\"><\/p>\n<p>The model architecture parameters are representative of production-scale recommendation workloads for the\u00a03 example recommendation models, highlighting their diversity in terms of embedding table and FC (fully connected) layer sizes.\u00a0Each parameter (column) is normalized to the smallest instance across all 3 configurations.<\/p>\n<p>\u00a0<\/p>\n<h3>ISSUES<\/h3>\n<p>\u00a0<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/824\/0*P7BuskcDAtf4AOGL\" width=\"90%\"><\/p>\n<ol>\n<li>Memory Capacity Dominated (Input from Network)<\/li>\n<li>Memory Bandwidth Dominated (Processing of Features: Embedding Lookup and MLP)<\/li>\n<li>Communication Based (Interaction between Features)<\/li>\n<li>Compute Dominated\u00a0(Compute\/Run-Time Bottleneck)<\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<h3>1. 
Memory Capacity Dominated<\/h3>\n<p>\u00a0<\/p>\n<p>(Input From Network)<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/793\/0*Gd4cvFMV6jjP5uaB\" width=\"90%\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1906.00091.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1906.00091.pdf<\/a>.<\/em><\/p>\n<p><strong>SOLUTION:\u00a0<\/strong>Parallelism<\/p>\n<p><em>One of the basic and most important steps<\/em><\/p>\n<ul>\n<li>Embeddings contribute the majority of parameters, with several tables each requiring in excess of multiple GBs of memory. This makes it necessary to distribute the model across multiple devices.<\/li>\n<li>MLP parameters are smaller in memory but translate to sizeable amounts of compute.<\/li>\n<\/ul>\n<p>Data parallelism is preferred for MLPs since it enables concurrent processing of samples on different devices and only requires communication when accumulating updates.<\/p>\n<p><strong>Personalization:<\/strong><\/p>\n<p><strong>SETUP:\u00a0<\/strong>The top MLP and the interaction operator require access to part of the mini-batch from the bottom MLP and all of the embeddings. Since model parallelism has been used to distribute embeddings across devices, this requires a\u00a0<strong>personalized all-to-all communication.<\/strong><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/620\/0*Dq8lje6ErCINAsth\" width=\"90%\"><\/p>\n<p><em>Butterfly Shuffle for the all-to-all (Personalized) Communication. Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1906.00091.pdf\">https:\/\/arxiv.org\/pdf\/1906.00091.pdf<\/a>.<\/em><\/p>\n<p>Slices (e.g., 1, 2, 3) are embedding vectors that need to be transferred to target devices for personalization.<\/p>\n<p>Currently, transfers are implemented only as explicit copies.<\/p>\n<p>\u00a0<\/p>\n<h3>2. 
Memory Bandwidth Dominated<\/h3>\n<p>\u00a0<\/p>\n<p>(Processing of Features: Embedding Lookup and MLP)<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*q675vLM0tOyNEjI6\" width=\"90%\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1909.02107.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1909.02107.pdf<\/a>,\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1901.02103.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1901.02103.pdf<\/a>.<\/em><\/p>\n<ul>\n<li>MLP parameters are smaller in memory but translate to sizeable amounts of compute (so the issue will come during compute).<\/li>\n<li>Embedding lookups can cause memory constraints.<\/li>\n<\/ul>\n<p><strong>SOLUTION:\u00a0<\/strong>Compositional Embeddings using Complementary Partitions<\/p>\n<p>Representation of n items in d dimensional vector space can be broadly divided into 2 categories:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/795\/1*3m-qL-IdvmPK0Z_bY9g5Gg.png\" width=\"90%\"><\/p>\n<p>An approach is proposed for generating unique embedding for each categorical feature using Complementary Partitions of category set to generate Compositional Embeddings.<\/p>\n<p><strong>Approaching Memory-Bandwidth Consumption issue:<\/strong><\/p>\n<ol>\n<li>Hashing Trick<\/li>\n<li>Quotient-Remainder Trick<\/li>\n<\/ol>\n<p><strong>HASHING TRICK:<\/strong><\/p>\n<p>Naive approach of reducing embedding table using a simple hash function.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/465\/0*RK0j36Rsaj43QxUk\" width=\"372\" height=\"84\"><\/p>\n<p><em>Hashing Trick.<\/em><\/p>\n<p>It significantly\u00a0reduces the size of the embedding matrix\u00a0from\u00a0O(|S|D) to O(mD)\u00a0since <em>m &lt;&lt; |S|<\/em>.<\/p>\n<p><strong>Disadvantages &#8211;<\/strong><\/p>\n<ul>\n<li>Does not yield a\u00a0<em>Unique 
Embedding\u00a0<\/em>for\u00a0<em>each Category.<\/em><\/li>\n<li>Naively maps multiple categories to the same embedding vector.<\/li>\n<li>Results in\u00a0<em>loss of information<\/em> and hence\u00a0<em>rapid deterioration of model quality.<\/em><\/li>\n<\/ul>\n<p><strong>QUOTIENT-REMAINDER TRICK:<\/strong><\/p>\n<p>Using 2 complementary functions, i.e., the integer quotient and remainder functions, we can produce 2 separate embedding tables and combine them in a way that yields a unique embedding for each category.<\/p>\n<p>It results in memory complexity\u00a0O(D*|S|\/m + mD), a slight increase in memory compared to the hashing trick, but with the added benefit of producing a unique representation.<\/p>\n<p><img loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/465\/0*H_nsmRpcUxiI3U8g\" width=\"372\" height=\"118\"><\/p>\n<p><em>Quotient-Remainder Trick.<\/em><\/p>\n<p><strong>Complementary Partitions<\/strong><\/p>\n<p>In the Quotient-Remainder trick, each operation partitions the set of categories into multiple \u201cbuckets\u201d such that every index in the same \u201cbucket\u201d is mapped to the same vector.<\/p>\n<p>By combining the embeddings from both the quotient and the remainder, one is able to generate a distinct vector for each index.<\/p>\n<p><strong>NOTE: <\/strong>Partitions are complementary when any 2 distinct categories fall into different buckets in at least one of them (what one partition merges, another separates
)<\/p>\n<p>Types based on structure:<\/p>\n<ul>\n<li>Naive Complementary Partition<\/li>\n<li>Quotient-Remainder Complementary Partitions<\/li>\n<li>Generalized Quotient-Remainder Complementary Partitions<\/li>\n<li>Chinese Remainder Partitions<\/li>\n<\/ul>\n<p>Types based on function:<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*z137OIgIHS8X3s34\" width=\"90%\"><\/p>\n<p><em>Types of Complementary Partitions based on function.<\/em><\/p>\n<p><strong><em>Operation-Based Compositional Embeddings:<\/em><\/strong><\/p>\n<p>Assume that the vectors in each embedding table are distinct. If the\u00a0<strong>concatenation\u00a0<\/strong>operation is used, then the compositional embedding of every category is unique.<\/p>\n<p>This approach reduces the memory complexity from O(|S|D) for storing the entire embedding table to O(|P1|D1 + |P2|D2 + \u2026 + |Pk|Dk).<\/p>\n<p>Operation-based embeddings are more complex due to the operations applied.<\/p>\n<p><strong><em>Path-Based Compositional Embeddings:<\/em><\/strong><\/p>\n<p>Each function in a composition is determined based on a unique set of equivalence classes from each partition,<strong>\u00a0yielding a unique \u2018path\u2019 of transformations.<\/strong><\/p>\n<p>Path-based compositional embeddings are expected to give better results with the benefit of lower model complexity.<\/p>\n<p><strong>TRADE-OFF &#8211;\u00a0<\/strong><\/p>\n<p>There\u2019s a catch.<\/p>\n<ul>\n<li>A larger embedding table will yield better model quality but at the cost of increased memory requirements.<\/li>\n<li>Compressing more aggressively will yield smaller models but lead to a reduction in model quality.<\/li>\n<li>Most models degrade exponentially in performance as the number of parameters shrinks.<\/li>\n<li>Both types of compositional embeddings reduce the number of parameters by implicitly enforcing some\u00a0<strong>structure defined by complementary partitions<\/strong>\u00a0in the 
generation of each category\u2019s embedding.<\/li>\n<li><strong>The quality of the model ought to depend on how closely the chosen partitions reflect intrinsic properties of the category set and their respective embeddings.<\/strong><\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>3. Communication Based<\/h3>\n<p>\u00a0<\/p>\n<p>(Interaction between Features)<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*7_dPgZbjlTAxoAH0\" width=\"90%\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/github.com\/thumbe3\/Distributed_Training_of_DLRM\/blob\/master\/CS744_group10.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/thumbe3\/Distributed_Training_of_DLRM\/blob\/master\/CS744_group10.pdf<\/a>.<\/em><\/p>\n<p>DLRM uses\u00a0<strong>model parallelism\u00a0<\/strong>to avoid replicating the whole set of embedding tables on every GPU device and\u00a0<strong>data parallelism<\/strong>\u00a0to enable concurrent processing of samples in FC layers.<\/p>\n<p>MLP parameters are replicated across GPU devices and not shuffled.<\/p>\n<p><strong>What is the problem?<\/strong><\/p>\n<p>Transferring embedding tables across nodes in a cluster is expensive and can become a bottleneck.<\/p>\n<p><strong>Solution:<\/strong><\/p>\n<p>What matters is the interaction between pairs of learned embedding vectors, not the absolute values of the embeddings themselves.<\/p>\n<p><em>The hypothesis is that embeddings can be learned independently on different nodes and still result in a good model.<\/em><\/p>\n<p>This saves network bandwidth by synchronizing only the MLP parameters and learning the embedding tables independently on each of the server nodes.<\/p>\n<p>In order to speed up training,\u00a0<strong>sharding\u00a0<\/strong>of the input dataset\u00a0<strong>across cluster nodes\u00a0<\/strong>has been implemented such that both nodes can\u00a0<strong>process different shards of data concurrently and therefore make more progress than a single node.<\/strong><\/p>\n<p><img 
loading=\"lazy\" class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/446\/0*KXFcMO8GtuQfYhUj\" width=\"357\" height=\"329\"><\/p>\n<p><em>Distributed DLRM.<\/em><\/p>\n<p><strong>The master node collects gradients of MLP parameters<\/strong>\u00a0from the slave node and itself.\u00a0<strong>MLP parameters were synchronized<\/strong>\u00a0by monitoring their values for some of the experiments.<\/p>\n<p><em>The embedding tables<\/em>\u00a0learned differed between the two nodes because they are not synchronized and the nodes work on different shards of the input dataset.<\/p>\n<p><em>Since DLRM uses the interaction of embeddings<\/em>\u00a0rather than the embeddings themselves,\u00a0<em>good models were achievable\u00a0<\/em>even though embeddings were not synchronized across the nodes.<\/p>\n<p>\u00a0<\/p>\n<h3>4. Compute Dominated<\/h3>\n<p>\u00a0<\/p>\n<p>(Compute\/Run-Time Bottleneck)<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*p7aMU0nxGffa0uDc\" width=\"90%\"><\/p>\n<p><em>Source:\u00a0<a href=\"https:\/\/github.com\/pytorch\/FBGEMM\/wiki\/Recent-feature-additions-and-improvements-in-FBGEMM\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/pytorch\/FBGEMM\/wiki\/Recent-feature-additions-and-improvements-in-FBGEMM<\/a>,\u00a0<a href=\"https:\/\/engineering.fb.com\/ml-applications\/fbgemm\/\" target=\"_blank\" rel=\"noopener\">https:\/\/engineering.fb.com\/ml-applications\/fbgemm\/<\/a><\/em><\/p>\n<p>As discussed above,<\/p>\n<ul>\n<li>MLPs account for a large share of the compute load.<\/li>\n<li>Co-locating multiple models on one machine creates performance bottlenecks when running production-scale recommendation models, leading to lower resource utilization.<\/li>\n<\/ul>\n<p>Co-location has a larger impact on\u00a0<a href=\"https:\/\/caffe2.ai\/docs\/operators-catalogue.html\" target=\"_blank\" rel=\"noopener\">SparseLengthsSum<\/a>\u00a0because of its more irregular memory accesses, which exhibit less cache reuse.<\/p>\n<p><strong>SOLUTION: <\/strong>FBGEMM 
(Facebook + General Matrix Multiplication)<\/p>\n<p><strong>Introducing the workhorse of our model<\/strong><\/p>\n<p>It is the default back end that PyTorch uses for quantized inference on x86 servers.<\/p>\n<ul>\n<li>It is specifically optimized for low-precision data, unlike the conventional linear algebra libraries used in scientific computing (which work with FP32 or FP64 precision).<\/li>\n<li>It provides efficient low-precision general matrix-matrix multiplication (GEMM) for small batch sizes and support for accuracy-loss-minimizing techniques such as row-wise quantization and outlier-aware quantization.<\/li>\n<li>It also exploits fusion opportunities to overcome the unique challenges of matrix multiplication at lower precision with bandwidth-bound pre- and post-GEMM operations.<\/li>\n<\/ul>\n<p>A number of improvements to the existing features as well as new features were added in the\u00a0<a href=\"https:\/\/github.com\/pytorch\/FBGEMM\/wiki\/Recent-feature-additions-and-improvements-in-FBGEMM\" target=\"_blank\" rel=\"noopener\">January 2020 release<\/a>.<\/p>\n<p>These include embedding kernels\u00a0<em>(very important to us)<\/em>, JIT\u2019ed sparse kernels, and int64 GEMM for privacy-preserving machine learning models.<\/p>\n<p>A couple of implementation stats:<\/p>\n<ol>\n<li>Reduces DRAM bandwidth usage in recommendation systems by 40%<\/li>\n<li>Speeds up character detection by 2.4x in\u00a0<a href=\"https:\/\/engineering.fb.com\/ai-research\/rosetta-understanding-text-in-images-and-videos-with-machine-learning\/\">Rosetta<\/a> (an ML system for detecting text in images and videos)<\/li>\n<\/ol>\n<p><strong>Computations run as 64-bit matrix multiplication operations<\/strong>, which are widely used in the privacy-preserving field,\u00a0<strong>essentially speeding up privacy-preserving machine learning models.<\/strong><\/p>\n<p>Until now, there has been no good high-performance implementation of 64-bit GEMMs on the current generation of CPUs.<\/p>\n<p>Therefore, 
64-bit GEMMs have been added to FBGEMM. The implementation achieves 10.5 GOPs\/sec on an Intel Xeon Gold 6138 processor with turbo off, which is 3.5x faster than the existing implementation that runs at 3 GOPs\/sec. This is the first iteration of the 64-bit GEMM implementation.<\/p>\n<p>\u00a0<\/p>\n<h3>REFERENCES<\/h3>\n<p>\u00a0<\/p>\n<ol>\n<li>Recommendation series by James Le: It\u2019s really good for building up basics on Recommendation Systems.\u00a0<a href=\"https:\/\/jameskle.com\/writes\/rec-sys-part-1\" target=\"_blank\" rel=\"noopener\">https:\/\/jameskle.com\/writes\/rec-sys-part-1<\/a><\/li>\n<li>Deep Learning Recommendation Model for Personalization and Recommendation Systems\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1906.00091.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1906.00091.pdf<\/a><\/li>\n<li>Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1909.02107.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1909.02107.pdf<\/a><\/li>\n<li>On the Dimensionality of Embeddings for Sparse Features and Data\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1901.02103.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1901.02103.pdf<\/a><\/li>\n<li>The Architectural Implications of Facebook\u2019s DNN-based Personalized Recommendation\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/1906.03109.pdf\" target=\"_blank\" rel=\"noopener\">https:\/\/arxiv.org\/pdf\/1906.03109.pdf<\/a><\/li>\n<li>Recent feature additions and improvements in FBGEMM\u00a0<a href=\"https:\/\/github.com\/pytorch\/FBGEMM\/wiki\/Recent-feature-additions-and-improvements-in-FBGEMM\" target=\"_blank\" rel=\"noopener\">https:\/\/github.com\/pytorch\/FBGEMM\/wiki\/Recent-feature-additions-and-improvements-in-FBGEMM<\/a><\/li>\n<li>Open-sourcing FBGEMM for state-of-the-art server-side inference\u00a0<a href=\"https:\/\/engineering.fb.com\/ml-applications\/fbgemm\/\" target=\"_blank\" 
rel=\"noopener\">https:\/\/engineering.fb.com\/ml-applications\/fbgemm\/<\/a><\/li>\n<\/ol>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/medium.com\/swlh\/deep-learning-recommendation-models-dlrm-a-deep-dive-f38a95f47c2c\" target=\"_blank\" rel=\"noopener\">Original<\/a>. Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><strong>Bio:<\/strong> <a href=\"https:\/\/www.linkedin.com\/in\/nishant-kumar-350043a5\/\" target=\"_blank\" rel=\"noopener\">Nishant Kumar<\/a>\u00a0holds a Bachelor&#8217;s Degree in Computer Science. He has over 4 years of IT industry experience in various domains related to Data Science including Machine Learning, NLP, Recommendation Systems. When he is not playing around with data he spends time Cycling, and cooking. He loves to write technical articles on technical aspects of Data Science and interesting Research in NLP. You can check out articles on <a href=\"https:\/\/nishantkumar94.medium.com\/\" target=\"_blank\" rel=\"noopener\">Medium<\/a>.<\/p>\n<p><b>Related:<\/b><\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2021\/04\/deep-learning-recommendation-models-dlrm-deep-dive.html<\/p>\n","protected":false},"author":0,"featured_media":8224,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8223"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8223"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8223\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/w
p-json\/wp\/v2\/media\/8224"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8223"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}