{"id":732,"date":"2020-08-26T13:10:24","date_gmt":"2020-08-26T13:10:24","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/26\/breaking-privacy-in-federated-learning\/"},"modified":"2020-08-26T13:10:24","modified_gmt":"2020-08-26T13:10:24","slug":"breaking-privacy-in-federated-learning","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/26\/breaking-privacy-in-federated-learning\/","title":{"rendered":"Breaking Privacy in Federated Learning"},"content":{"rendered":"<div id=\"post-\">\n<p>Federated learning is a new way of training a machine learning using distributed data that is not centralized in a server. It works by training a generic (shared) model with a given user\u2019s private data, without having direct access to such data.<\/p>\n<p>For a deeper dive into how this works, I\u2019d encourage you to check out my previous blog post, which provides a high-level overview, as well as an in depth look at Google\u2019s research.<\/p>\n<p>\u00a0<br \/><a href=\"https:\/\/heartbeat.fritz.ai\/introduction-to-federated-learning-40eb122754a2\" rel=\"noopener noreferrer\" target=\"_blank\"><b>Introduction to Federated Learning<\/b><br \/><i>Enabling on-device training, model personalization, and more<\/i><\/a><br \/>\u00a0<\/p>\n<p>Federated learning has the major benefit of building models that are customized based on a user\u2019s private data, which allows for better customization that can enhances the UX. This, as compared to models trained by the data aggregated at a data center that are more generic and may not fit the user quite as well. Federated learning also help save a user\u2019s bandwidth, since they aren\u2019t sending private data to a server.<\/p>\n<p>Despite the benefits of federated learning, there are still ways of breaching a user\u2019s privacy, even without sharing private data. In this article, we\u2019ll review some research papers that discuss how federated learning includes this vulnerability.<\/p>\n<p>The outline of the article is as follows:<\/p>\n<ul>\n<li>Introduction\n<\/li>\n<li>Federated Learning Doesn\u2019t Guarantee Privacy\n<\/li>\n<li>Privacy and Security Issues of Federated Learning\n<\/li>\n<li>Reconstructing Private Data by Inverting Gradients\n<\/li>\n<\/ul>\n<p>Let\u2019s get started.<\/p>\n<p>\u00a0<\/p>\n<h3>Introduction<\/h3>\n<p>\u00a0<br \/>Federated learning was introduced by Google in 2016 in a paper titled\u00a0<a href=\"https:\/\/arxiv.org\/abs\/1602.05629\" rel=\"noopener noreferrer\" target=\"_blank\">Communication-Efficient Learning of Deep Networks from Decentralized Data<\/a>. It\u2019s a new machine learning paradigm that allows us to build machine learning models from private data, without sharing such data to a data center.<\/p>\n<p>The summary of the steps we take to do this is as follows:<\/p>\n<ul>\n<li>A generic model (i.e. neural network) is created at a server. The model will not be trained on the server but on the users\u2019 devices (the majority are mobile devices).\n<\/li>\n<li>The model is sent to the users\u2019 devices where the training occurs. So the same model (i.e. neural network) is trained parallelly on different devices, according to their private data.\n<\/li>\n<li>Just the trained model (i.e. 
This way, a model is trained using private data that never leaves the users' devices. The next figure, from a [post by Jose Corbacho](https://proandroiddev.com/federated-learning-e79e054c33ef), summarizes the previous steps.

![Federated learning workflow](https://i.ibb.co/q0D6DRP/gad-fed-learning-2-0.png)

Even though the data isn't shared with the server, the process is not 100% private: there's still a possibility of obtaining information about the data used to train the network and calculate the gradients. The next section discusses how privacy is not entirely preserved in federated learning.

### Federated Learning Doesn't Guarantee Privacy

Federated learning has some privacy advantages compared to sharing private data with data centers. The benefits also include the ability to build highly customized machine learning models based on the user's data, while avoiding hits to the user's bandwidth from transferring that data to the server.

Undoubtedly, not sharing the data with data centers and keeping it private is an advantage, but there are still some risks, because there remains a way to extract private information about the data.

After the generic model is trained on the user's device, the trained model is sent to the server. Given that the model's parameters are trained on the user's data, there is a chance of getting information about the data from those parameters.

Moreover, joining the user's data with data from other users carries its own risks, as the [Google research](https://arxiv.org/abs/1602.05629) paper notes:

> *Holding even an "anonymized" dataset can still put user privacy at risk via joins with other data.*

Here, the seminal paper on federated learning makes it clear that there are still some risks, and data privacy is not 100% guaranteed. Even if the data is anonymized, it's still vulnerable.

The updates transmitted from the device to the server should be minimal. There's no need to share more than the minimum amount of information required to update the model at the server. Otherwise, there remains the possibility of private data being exposed and intercepted.

The private data is thus vulnerable even without being sent explicitly to the server, because it's possible to restore it from the parameters that were trained on it. In the worst case, when an attacker manages to restore the data, that data should be as anonymous as possible, without revealing identifying information such as the user's name.

For some simple NLP models, it's possible to reveal the words entered by a user from the gradients alone. In this case, if the private data contains information (i.e. words) about the user, then those words could be restored, and privacy would not be preserved.
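To make the NLP example concrete, here is a small, hypothetical illustration (my own, not taken from the papers discussed here) of why gradients can leak which words a user typed. For a toy bag-of-words logistic regression model, the gradient of the loss with respect to the weights is a scaled copy of the input vector, so it is non-zero exactly at the vocabulary entries the user actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 10

# Hypothetical toy model: logistic regression over a bag-of-words vector.
W = rng.normal(size=vocab_size)   # one weight per vocabulary word
x = np.zeros(vocab_size)
x[[2, 5, 7]] = 1.0                # the user's "private" words, as vocabulary indices
y = 1.0                           # label for the user's text

# For the logistic loss, the gradient w.r.t. W is (sigmoid(W @ x) - y) * x,
# which is non-zero exactly at the positions of the words present in the input.
pred = 1.0 / (1.0 + np.exp(-(W @ x)))
grad_W = (pred - y) * x

leaked_word_ids = np.nonzero(grad_W)[0]
print(leaked_word_ids)            # [2 5 7] -- the private word indices are exposed
```

An embedding layer leaks in the same way: only the rows belonging to words that appear in the user's text receive non-zero gradients.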
The original paper on federated learning didn't give a clear example in which the private data could not be deduced, but it did mention a case in which it would be difficult (which implies still possible) to extract information about a user's private data from averaged and summed gradients. The example involves complex networks such as CNNs. Here's what the paper says:

> *The sum of many gradients for a dense model such as a CNN offers a harder target for attackers seeking information about individual training instances.*

In essence, there's no way to 100% prevent an attacker from getting information about the samples used to calculate the gradients of a neural network. The key is making it harder for the attacker to get that information. It's like a [cavern puzzle](http://www.indieretronews.com/2019/06/plutonium-caverns-new-puzzle-game.html), where you should make it as difficult as possible to solve.

This is the case for a convolutional neural network (CNN) because it usually has many layers connecting the input to the output, resulting in a large number of interleaving gradients. These gradients make it difficult (**though attacks are still possible**) to relate inputs to outputs based on the available gradients.

![Image for post](https://i.ibb.co/djhhzd6/gad-fed-learning-2-1.jpg)

To summarize the previous discussion: even if the private data itself is not shared with the server, the gradients of the trained network are, which makes it possible to extract information about the training samples. The paper discussed 2 main measures you should take to maximize privacy:

1. Share the minimum amount of information required to update the generic model at the server.
2. Make the neural network as complex as possible, so that it's difficult to use the available gradients (after being averaged and summed) to extract information about the training samples.

The next section summarizes a paper that discusses some specific privacy and security issues related to federated learning.

### Privacy and Security Issues of Federated Learning

In a recent paper, [Ma, Chuan, et al. "On safeguarding privacy and security in the framework of federated learning." *IEEE Network* (2020)](https://arxiv.org/abs/1909.06512v2), a number of privacy and security issues related to federated learning are discussed.

The paper starts by introducing the basic model of federated learning, shown in the next figure.
This figure shares some similarities with the one from [Jose Corbacho's post](https://proandroiddev.com/federated-learning-e79e054c33ef).

![Basic federated learning model](https://i.ibb.co/H7v6drT/gad-fed-learning-2-2.jpg)

The paper addresses both security and privacy issues in federated learning. The difference is that **security issues** refer to unauthorized or malicious access to, modification of, or denial of data, while **privacy issues** refer to the unintentional disclosure of personal information.

The paper classifies the protection methods into 3 categories:

1. Privacy protection at the client side
2. Privacy protection at the server side
3. Security protection for the federated learning framework

### Privacy Protection at the Client-Side

Regarding privacy protection at the client side, the paper discusses 2 approaches: **perturbation** and **dummy**.

- **Perturbation**: Noise is added to the parameters shared with the server, so that attackers cannot restore the data, or at least cannot identify the user (a minimal sketch follows this list).
- **Dummy**: Alongside the trained model parameters, the client sends some dummy parameters to the server.
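The paper does not tie perturbation to one specific mechanism. A common instantiation, borrowed from differential privacy and shown here only as an assumed example rather than the paper's prescription, is to clip the client's update and add Gaussian noise before it leaves the device:

```python
import numpy as np

def perturb_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a client's model update and add Gaussian noise
    (a perturbation scheme inspired by differential privacy)."""
    rng = rng or np.random.default_rng()
    update = np.asarray(update, dtype=float)

    # Bound the influence of the update by clipping its L2 norm.
    norm = np.linalg.norm(update)
    if norm > clip_norm:
        update = update * (clip_norm / norm)

    # Add noise so the exact (clipped) update is never revealed to the server.
    return update + rng.normal(scale=noise_std, size=update.shape)

# Example: the server only ever sees the noisy, clipped update.
raw_update = np.array([0.8, -2.4, 0.3])
noisy_update = perturb_update(raw_update, clip_norm=1.0, noise_std=0.1)
```

The trade-off is the usual one: stronger noise makes inverting any single client's update harder, but too much noise slows convergence of the averaged global model.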
### Privacy Protection at the Server-Side

Privacy protection at the server side is necessary because, as the paper notes, when the server broadcasts the aggregated parameters to clients for model synchronization, this information may leak to eavesdroppers. The paper mentions two ways to preserve privacy at the server side: **aggregation** and **secure multi-party computation**.

- **Aggregation**: The parameters from different users are combined, making it harder to reveal information about any individual user's data.
- **Secure Multi-Party Computation (SMC)**: The parameters sent from the client to the server are encrypted, so that an attacker will find it difficult to extract information about the user.

### Security Protection for the Federated Learning Framework

After the client trains the model on its private data, the model is sent to the server. At this point, an attacker might make changes to the model so that it behaves in their favor. For example, the attacker might control the labels assigned to images with certain features.

The paper suggests 2 ways to secure the design of a federated learning pipeline: **homomorphic encryption** and a **back-door defender**.

- **Homomorphic Encryption**: The model parameters are encrypted so that an attacker finds them difficult to interpret and thus cannot change them in a targeted way.
- **Back-door Defender**: A mechanism for detecting malicious users who try to access the generic model and push updates that change its behavior to suit their needs.

The next section provides a quick summary of a paper that reconstructs images by inverting gradients.

### Reconstructing Private Data by Inverting Gradients

According to a recent research paper, [Geiping, Jonas, et al. "Inverting Gradients — How easy is it to break privacy in federated learning?" arXiv preprint arXiv:2003.14053 (2020)](https://arxiv.org/abs/2003.14053), sharing only the gradients and not the private data still uncovers private information about that data. Federated learning has therefore not entirely achieved one of its goals, which is keeping users' data private.

As mentioned in the previous section, one thing that makes it harder for an attacker to get information about private data is the existence of many gradients, like those in CNNs. The main contribution of [this paper](https://arxiv.org/abs/2003.14053) is to reconstruct high-quality images from the gradients of the neural network. Successfully doing that means privacy is not guaranteed even if only the parameters, and not the data, are shared with the server.

The paper proves that the input to a fully connected layer can be reconstructed independently of the network architecture. Even if the gradients are averaged over a number of iterations, this does not protect the user's privacy.

The paper shows that it is possible to recover much of the information available in the original data. Its key findings are summarized in the following points:

1. Reconstruction of input data from gradient information is possible for realistic deep architectures with both trained and untrained parameters.
2. With the right attack, there is little "defense-in-depth": deep networks are as vulnerable as shallow networks.
3. The input to any fully connected layer can be reconstructed analytically, independent of the remaining network architecture (see the sketch after this list).
4. Dishonest-and-curious servers (which may adapt the architecture or parameters maliciously) excel at information retrieval, and the dishonesty can be as subtle as permuting some network parameters.
5. Federated averaging confers no security benefit compared to federated SGD.
6. Reconstruction of multiple, separate input images from their averaged gradient is possible in practice, even for a batch of 100 images.
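Finding 3 can be checked in a few lines. For a fully connected layer y = Wx + b, backpropagation gives ∂L/∂W = (∂L/∂y)xᵀ and ∂L/∂b = ∂L/∂y, so dividing any row of ∂L/∂W by the corresponding entry of ∂L/∂b recovers the layer's input exactly. The sketch below (my own minimal illustration, not the paper's full optimization-based attack) verifies this on random data.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)                  # the "private" input to the fully connected layer
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
y = W @ x + b                           # forward pass of the layer (for context only)

# Whatever the rest of the network and the loss are, backprop produces some dL/dy,
# and the gradients shared with the server are dL/dW = outer(dL/dy, x), dL/db = dL/dy.
dL_dy = rng.normal(size=3)
dL_dW = np.outer(dL_dy, x)
dL_db = dL_dy

# Attack: divide any row of dL/dW by the matching entry of dL/db to recover x.
i = np.argmax(np.abs(dL_db))            # any row with a non-zero bias gradient works
x_reconstructed = dL_dW[i] / dL_db[i]

print(np.allclose(x, x_reconstructed))  # True: the layer's input is recovered exactly
```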
The next figure, taken from the paper, shows an image and its reconstruction. The original image is reconstructed with high quality (with little degradation) from the gradients shared with the server.

![Original image and its reconstruction from gradients](https://i.ibb.co/SwxHbYc/gad-fed-learning-2-3.jpg)

### Conclusion

This article discussed some of the privacy and security issues in federated learning by summarizing 3 papers. It's clear that it's immensely challenging to preserve a user's privacy, even when only the gradients produced by training the global model (e.g. a neural network) are shared. Even though the data used to compute the local updates isn't shared, it is possible to reconstruct that data.

**Bio: [Ahmed Gad](https://www.linkedin.com/in/ahmedfgad/)** received his B.Sc. degree in information technology, with an excellent with honors grade, from the Faculty of Computers and Information (FCI), Menoufia University, Egypt, in July 2015. For being ranked first in his faculty, he was recommended to work as a teaching assistant at one of the Egyptian institutes in 2015, and then, in 2016, as a teaching assistant and researcher in his faculty. His current research interests include deep learning, machine learning, artificial intelligence, digital signal processing, and computer vision.

[Original](https://heartbeat.fritz.ai/breaking-privacy-in-federated-learning-77fa08ccac9a). Reposted with permission.