{"id":8023,"date":"2020-12-29T00:26:33","date_gmt":"2020-12-29T00:26:33","guid":{"rendered":"https:\/\/healinglifespan.com\/data-science\/2020\/12\/29\/density-based-clustering\/"},"modified":"2020-12-29T00:26:33","modified_gmt":"2020-12-29T00:26:33","slug":"density-based-clustering","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/12\/29\/density-based-clustering\/","title":{"rendered":"Density-Based Clustering"},"content":{"rendered":"<div>\n<h6>Original content by Manojit Nandi \u2013 Updated by Josh Poduska.<\/h6>\n<p>Cluster Analysis is an important problem in data analysis. Data scientists use clustering to identify malfunctioning servers, group genes with similar expression patterns, and perform various other applications.<\/p>\n<p>There are many families of data clustering algorithms, and you may be familiar with the most popular one: k-means. As a quick refresher, k-means determines k centroids in the data and clusters points by assigning them to the nearest centroid.<\/p>\n<p>While k-means is easy to understand and implement in practice, the algorithm has no notion of outliers, so all points are assigned to a cluster even if they do not belong in any. In the domain of anomaly detection, this causes problems as anomalous points will be assigned to the same cluster as \u201cnormal\u201d data points. The anomalous points pull the cluster centroid towards them, making it harder to classify them as anomalous points.<\/p>\n<p>In this blog post, I will cover a family of techniques known as density-based clustering. Compared to centroid-based clustering like k-means, density-based clustering works by identifying \u201cdense\u201d clusters of points, allowing it to learn clusters of arbitrary shape and identify outliers in the data. In particular, I will:<\/p>\n<ul>\n<li>Discuss the highly popular DBSCAN algorithm.<\/li>\n<li>Use the denpro R package. This library contains an efficient implementation of the Level Set Tree clustering algorithm.<\/li>\n<\/ul>\n<p>As always, the code <a href=\"https:\/\/try.dominodatalab.com\/u\/nmanchev\/clustering\/overview\">can be found on the Domino platform<\/a>. You can browse the project without signing up. \u201cClustering.ipynb\u201d under Files is a good place to start. You\u2019ll need to <a href=\"https:\/\/www.dominodatalab.com\/try\">sign up<\/a> to run the code for free in Domino.<\/p>\n<h2><a id=\"post-1475-_u3dp70dgqyag\"><\/a><strong>Preliminary: \u025b-Balls and neighborhood density<\/strong><\/h2>\n<p>Before we can discuss density-based clustering, we first need to cover a topic that you may have seen in a topology course: \u025b-neighborhoods.<\/p>\n<p>The general idea behind \u025b-neighborhoods is given a data point, we want to be able to reason about the data points in the space around it. Formally, for some real-valued \u025b &gt; 0 and some point p, the \u025b-neighborhood of p is defined as the set of points that are at most distance \u025b away from p.<\/p>\n<p>If you think back to geometry, the shape in which all points are equidistant from the center is the circle. In 2D space, the \u025b-neighborhood of a point p is the set of points contained in a circle of radius \u025b, centered at p. In 3D space, the \u025b-neighborhood is a sphere of radius \u025b, centered at p, and in higher dimensional space, the \u025b-neighborhood is just the N-sphere of radius \u025b, centered at p.<\/p>\n<p>Let\u2019s consider an example to make this idea more concrete.<\/p>\n<p>I have scattered 100 data points in the interval [1,3]X[2,4]. Let\u2019s pick the point (3,2) to be our point p.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-7244\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image.png\" width=\"918\" height=\"638\"><\/p>\n<p>First, let\u2019s consider the neighborhood of p with radius 0.5 (\u025b = 0.5), the set of points that are distance 0.5 away from p.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-7245\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-1.png\" width=\"901\" height=\"624\"><\/p>\n<p>The opaque green oval represents our neighborhood, and there are 31 data points in this neighborhood. Since I scattered 100 data points and 31 are in the neighborhood, this means that a little under one-third of the data points are contained within the neighborhood of p with radius 0.5.<\/p>\n<p>Now, let\u2019s change our radius to 0.15 (\u025b = 0.15) and consider the resulting smaller neighborhood.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-7246\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-2.png\" width=\"900\" height=\"635\"><\/p>\n<p>We have shrunk the neighborhood, so now only 3 data points are contained within it. By decreasing \u025b from 0.5 to 0.15 (a 70% reduction), we have decreased the number of points in our neighborhood from 31 to 3 (a 90% reduction).<\/p>\n<p>Now that we have defined what we mean by a \u201cneighborhood\u201d, we\u2019ll introduce the next important concept: the notion of a \u201cdensity\u201d for a neighborhood (we\u2019re building up to describing \u201cdensity-based clustering, after all).<\/p>\n<p>In a grade-school science class, children are taught that density = mass\/volume. Let\u2019s use this idea of mass divided by volume to define density at some point p. If we consider some point p and its neighborhood of radius \u025b, we can define the mass of the neighborhood as the number of data points (or alternatively, the fraction of data points) contained within the neighborhood, and the volume of the neighborhood is volume of the resulting shape of the neighborhood. In the 2D case, the neighborhood is a circle, so the volume of the neighborhood is just the area of the resulting circle. In the 3D and higher dimensional case, the neighborhood is a sphere or n-sphere, so we can calculate the volume of this shape.<\/p>\n<p>For example, let\u2019s consider our neighborhood of p = (3,2) of radius 0.5 again.<\/p>\n<p><img loading=\"lazy\" class=\"wp-image-7247\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-3.png\" width=\"904\" height=\"635\"><\/p>\n<p>The mass is the number of data points in the neighborhood, so mass = 31. The volume is the area of the circle, so volume = \u03c00.25 = \u03c0\/4. Therefore, our local density approximation at *p = (3,2) is calculated as density = mass\/volume = 31\/(\u03c0\/4) = 124\/\u03c0 ~= 39.5.<\/p>\n<p>This value is meaningless by itself, but if we calculate the local density approximation for all points in our dataset, we could cluster our points by saying that points that are nearby (contained in the same neighborhood) and have similar local density approximations belong in the same cluster. If we decrease the value of \u025b, we can construct smaller neighborhoods (less volume) that would also contain fewer data points. Ideally, we want to identify highly dense neighborhoods where most of the data points are contained in these neighborhoods, but the volume of each of these neighborhoods is relatively small.<\/p>\n<p>While this is not exactly what either DBSCAN or the Level Set Tree algorithm does, it forms the general intuition behind density-based clustering.<\/p>\n<p>To recap, we discussed the \u025b-neighborhoods and how they allow us to reason about the space around a particular point. Then we defined a notion of density at a particular point for a particular neighborhood. In the next section, I\u2019ll discuss the DBSCAN algorithm where the \u025b-ball is a fundamental tool for defining clusters.<\/p>\n<h2><a id=\"post-1475-_a2fr9f2javq0\"><\/a><strong>DBSCAN<\/strong><\/h2>\n<p>DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most well-known density-based clustering algorithm, <a href=\"http:\/\/citeseerx.ist.psu.edu\/viewdoc\/summary?doi=10.1.1.71.1980\">first introduced in 1996 by Ester et. al<\/a>. Due to its importance in both theory and applications, this algorithm is one of three algorithms awarded the <a href=\"https:\/\/www.kdd.org\/awards\/view\/2014-sikdd-test-of-time-award-winners\">Test of Time Award<\/a> at the KDD conference in 2014.<\/p>\n<p>Unlike k-means, DBSCAN does not require the number of clusters as a parameter. Rather it infers the number of clusters based on the data, and it can discover clusters of arbitrary shape (for comparison, k-means usually discovers spherical clusters). As I said earlier, the \u025b-neighborhood is fundamental to DBSCAN to approximate local density, so the algorithm has two parameters:<\/p>\n<ul>\n<li><em>\u025b<\/em>: The radius of our neighborhoods around a data point <em>p<\/em>.<\/li>\n<li><em>minPts<\/em>: The minimum number of data points we want in a neighborhood to define a cluster.<\/li>\n<\/ul>\n<p>Using these two parameters, DBSCAN categories the data points into three categories:<\/p>\n<ul>\n<li><em>Core Points<\/em>: A data point <em>p<\/em> is a <em>core point<\/em> if <strong>Nbhd<\/strong>(<em>p<\/em>,<em>\u025b<\/em>) [\u025b-neighborhood of <em>p<\/em>] contains at least <em>minPts<\/em> ; |<strong>Nbhd<\/strong>(<em>p<\/em>,<em>\u025b<\/em>)| &gt;= <em>minPts<\/em>.<\/li>\n<li><em>Border Points: A data point *q<\/em> is a <em>border point<\/em> if <strong>Nbhd<\/strong>(<em>q<\/em>, <em>\u025b<\/em>) contains less than <em>minPts<\/em> data points, but <em>q<\/em> is <em>reachable<\/em> from some <em>core point<\/em> <em>p<\/em>.<\/li>\n<li><em>Outlier<\/em>: A data point <em>o<\/em> is an <em>outlier<\/em> if it is neither a core point nor a border point. Essentially, this is the \u201cother\u201d class.<\/li>\n<\/ul>\n<p>These definitions may seem abstract, so let\u2019s cover what each one means in more detail.<\/p>\n<h3><a id=\"post-1475-_dgaihd8ggmpe\"><\/a><strong>Core Points<\/strong><\/h3>\n<p><em>Core Points<\/em> are the foundations for our clusters are based on the density approximation I discussed in the previous section. We use the same \u025b to compute the neighborhood for each point, so the volume of all the neighborhoods is the same. However, the number of other points in each neighborhood is what differs. Recall that I said we can think of the number of data points in the neighborhood as its <em>mass<\/em>. The volume of each neighborhood is constant, and the mass of neighborhood is variable, so by putting a threshold on the minimum amount of mass needed to be a <em>core point<\/em>, we are essentially setting a minimum density threshold. Therefore, core points are data points that satisfy a minimum density requirement. Our clusters are built around our <em>core points<\/em> (hence the <em>core<\/em> part), so by adjusting our <em>minPts<\/em> parameter, we can fine-tune how dense our clusters\u2019 cores must be.<\/p>\n<h3><a id=\"post-1475-_3qeclt1p04bg\"><\/a><strong>Border Points<\/strong><\/h3>\n<p><em>Border Points<\/em> are the points in our clusters that are not core points. In the definition above for <em>border points<\/em>, I used the term <em>density-reachable<\/em>. I have not defined this term yet, but the concept is simple. To explain this concept, let\u2019s revisit our neighborhood example with epsilon = 0.15. Consider the point <em>r<\/em> (the black dot) that is outside of the point <em>p<\/em>\u2018s neighborhood.<\/p>\n<p><img loading=\"lazy\" width=\"719\" height=\"520\" class=\"wp-image-7248\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-4.png\"><\/p>\n<p>All the points inside the point <em>p<\/em>\u2018s neighborhood are said to be <em>directly reachable<\/em> from <em>p<\/em>. Now, let\u2019s explore the neighborhood of point <em>q<\/em>, a point <em>directly reachable<\/em> from <em>p<\/em>. The yellow circle represents <em>q<\/em>\u2018s neighborhood.<\/p>\n<p><img loading=\"lazy\" width=\"719\" height=\"515\" class=\"wp-image-7249\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-5.png\"><\/p>\n<p>Now while our target point <em>r<\/em> is not our starting point <em>p<\/em>\u2018s neighborhood, it is contained in the point <em>q<\/em>\u2018s neighborhood. This is the idea behind density-reachable: If I can get to the point <em>r<\/em> by jumping from neighborhood to neighborhood, starting at a point <em>p<\/em>, then the point <em>r<\/em> is density-reachable from the point <em>p<\/em>.<\/p>\n<p><img loading=\"lazy\" width=\"725\" height=\"523\" class=\"wp-image-7250\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-6.png\"><\/p>\n<p>As an analogy, we can think of <em>density-reachable<\/em> points as being the \u201cfriends of a friend\u201d. If the <em>directly-reachable<\/em> of a <em>core point p<\/em> are its \u201cfriends\u201d, then the <em>density-reachable<\/em> points, points in the neighborhood of the \u201cfriends\u201d of <em>p<\/em>, are the \u201cfriends of its friends\u201d. One thing that may not be clear is <em>density-reachable<\/em> points are not limited to just two adjacent neighborhood jumps. As long as you can reach the point doing \u201cneighborhood jumps\u201d, starting at a <em>core point p<\/em>, that point is <em>density-reachable<\/em> from <em>p<\/em>, so \u201cfriends of a friend of a friend \u2026 of a friend\u201d are included as well.<\/p>\n<p>It is important to keep in mind that this idea of <em>density-reachable<\/em> is dependent on our value of \u025b. By picking larger values of \u025b, more points become <em>density-reachable<\/em>, and by choosing smaller values of \u025b, less points become <em>density-reachable<\/em>.<\/p>\n<h3><a id=\"post-1475-_ttwys5nuakym\"><\/a><strong>Outliers<\/strong><\/h3>\n<p>Finally, we get to our \u201cother\u201d class. <em>Outliers<\/em> are points that are neither <em>core points<\/em> nor are they close enough to a cluster to be <em>density-reachable<\/em> from a <em>core point<\/em>. <em>Outliers<\/em> are not assigned to any cluster and, depending on the context, may be considered anomalous points.<\/p>\n<p>Now that I have covered all the preliminaries, we can finally talk about how the algorithm works in practice.<\/p>\n<h3><a id=\"post-1475-_r8fh7qwnn9p9\"><\/a><strong>Application<\/strong><\/h3>\n<p>DBSCAN is implemented in the popular Python machine learning library <a href=\"http:\/\/scikit-learn.org\/stable\/\">Scikit-Learn<\/a>, and because this implementation is scalable and well-tested, I will be using it to demonstrate how DBSCAN works in practice.<\/p>\n<p>The steps to the DBSCAN algorithm are:<\/p>\n<ul>\n<li>Pick a point at random that has not been assigned to a cluster or been designated as an <em>outlier<\/em>. Compute its neighborhood to determine if it\u2019s a <em>core point<\/em>. If yes, start a cluster around this point. If no, label the point as an <em>outlier<\/em>.<\/li>\n<li>Once we find a <em>core point<\/em> and thus a cluster, expand the cluster by adding all <em>directly-reachable<\/em> points to the cluster. Perform \u201cneighborhood jumps\u201d to find all <em>density-reachable<\/em> points and add them to the cluster. If an <em>outlier<\/em> is added, change that point\u2019s status from <em>outlier<\/em> to <em>border point<\/em>.<\/li>\n<li>Repeat these two steps until all points are either assigned to a cluster or designated as an <em>outlier<\/em>.<\/li>\n<\/ul>\n<p>Thanks to Scikit-Learn\u2019s easy-to-use API, we can implement DBSCAN in only a couple lines of code:<\/p>\n<blockquote>\n<pre>from sklearn.cluster import DBSCAN<\/pre>\n<\/blockquote>\n<p>To test out DBSCAN, I\u2019m going to use a <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Wholesale+customers\">dataset consisting of annual customer data for a wholesale distributor<\/a>.<\/p>\n<p>The dataset consists of 440 customers and has 8 attributes for each of these customers. I will use the Pandas library to load the .csv file into a DataFrame object:<\/p>\n<blockquote>\n<pre>import pandas as pd\ndata = pd.read_csv(\"data\/wholesale.csv\")\n#Drop non-continuous variables\ndata.drop([\"Channel\", \"Region\"], axis = 1, inplace = True)<\/pre>\n<\/blockquote>\n<p>After dropping two fields that identify the customer, we can examine the first few rows of this dataset:<\/p>\n<p><img loading=\"lazy\" width=\"554\" height=\"208\" class=\"wp-image-7251\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-7.png\"><\/p>\n<p>So we can visualize the data, I\u2019m going to use only two of these attributes:<\/p>\n<ul>\n<li>Groceries: The customer\u2019s annual spending (in some monetary unit) on grocery products.<\/li>\n<li>Milk: The customer\u2019s annual spending (in some monetary unit) on milk products.<\/li>\n<\/ul>\n<blockquote>\n<pre>data = data[[\"Grocery\", \"Milk\"]]\ndata = data.to_numpy().astype(\"float32\", copy = False)<\/pre>\n<\/blockquote>\n<p>Because the values of the data are in the thousands, we are going to normalize each attribute by scaling it to 0 mean and unit variance.<\/p>\n<blockquote>\n<pre>stscaler = StandardScaler().fit(data)\ndata = stscaler.transform(data)<\/pre>\n<\/blockquote>\n<p>Now, let\u2019s visualize the normalized dataset:<\/p>\n<p><img loading=\"lazy\" width=\"749\" height=\"566\" class=\"wp-image-7252\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-8.png\"><\/p>\n<p>As you can see, there is positive correlation between grocery purchases and milk product purchases. There is a cluster centered about the mean milk purchase (milk = 0) and the mean groceries purchase (groceries = 0). In addition, there are a handful of outliers pertaining to customers who buy more groceries or milk products compared to other customers.<\/p>\n<p>With DBSCAN, we want to identify this main cluster of customers, but we also want to flag customers with more unusual annual purchasing habits as outliers.<\/p>\n<p>We will construct a DBSCAN object that requires a minimum of 15 data points in a neighborhood of radius 0.5 to be considered a core point.<\/p>\n<p>dbsc = DBSCAN(eps = .5, min_samples = 15).fit(data)<\/p>\n<p>Next, we can extract our cluster labels and outliers to plot our results.<\/p>\n<blockquote>\n<pre>labels = dbsc.labels_\ncore_samples = np.zeros_like(labels, dtype = bool)\ncore_samples[dbsc.core_sample_indices_] = True<\/pre>\n<\/blockquote>\n<p><img loading=\"lazy\" width=\"754\" height=\"578\" class=\"wp-image-7253\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-9.png\"><\/p>\n<p>Lining up with our intuition, the DBSCAN algorithm was able to identify one cluster of customers who buy about the mean grocery and mean milk product purchases. In addition, it was able to flag customers whose annual purchasing behavior deviated too heavily from other customers.<\/p>\n<p>Because the outliers corresponded to customers with more extreme purchasing behavior, the wholesale distributor could specifically target these customers with exclusive discounts to encourage larger purchases.<\/p>\n<p>As a baseline, let\u2019s run k-means with two clusters on this dataset. The big blue dot represents the centroid for the black cluster, and the big gold dot represents the centroid for the white cluster:<\/p>\n<p><img loading=\"lazy\" width=\"723\" height=\"511\" class=\"wp-image-7254\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-10.png\"><\/p>\n<p>While the white clusters appear to capture most of the outliers, the cluster basically captures customers who purchase relatively more goods. If we designate the white cluster as the \u201canomalous\u201d cluster, then we basically flag any customer who purchases a lot of milk or groceries. If you were the wholesale distributor, then you would be calling your better customers, the ones whom you make more money from, anomalies.<\/p>\n<h3><a id=\"post-1475-_yupnimxrf1xy\"><\/a><strong>DBSCAN \u2013 Toy Example<\/strong><\/h3>\n<p>For a more contrived but impressive illustration of DBSCAN\u2019s capabilities, let\u2019s consider a toy example informally known as the \u201chalf-moons\u201d dataset where each data point belongs to one of the two \u201chalf-moons\u201d.<\/p>\n<blockquote>\n<pre>from sklearn.datasets import make_moons\n#moons_X: Data, moon_y: Labels\nmoons_X, moon_y = make_moons(n_samples = 2000)<\/pre>\n<\/blockquote>\n<p><img loading=\"lazy\" width=\"732\" height=\"535\" class=\"wp-image-7255\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-11.png\"><\/p>\n<p>This dataset is aesthetically pleasing, but it has no outliers. To rectify this problem, I am going to add random noise to 1% of the data using the following code:<\/p>\n<blockquote>\n<pre>def add_noise(X,y, noise_level = 0.01):\n#The number of points we wish to make noisy\namt_noise = int(noise_level*len(y))\n#Pick amt_noise points at random\nidx = np.random.choice(len(X), size = amt_noise)\n#Add random noise to these selected points\nnoise = np.random.random((amt_noise, 2) ) -0.5\nX[idx,:] += noise\nreturn X<\/pre>\n<\/blockquote>\n<p>Now the dataset has noticeable outliers.<\/p>\n<p><img loading=\"lazy\" width=\"730\" height=\"510\" class=\"wp-image-7256\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-12.png\"><\/p>\n<p>We can run DBSCAN on the data to get the following results. The algorithm successfully discovers both \u201chalf-moons\u201d and identifies almost every noisy data point as an outlier.<\/p>\n<p><img loading=\"lazy\" width=\"751\" height=\"546\" class=\"wp-image-7257\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-13.png\"><\/p>\n<p>In contrast, k-Means performs poorly on this dataset. The algorithm can not successfully discover the \u201chalf-moon\u201d structure, nor can it distinguish the noisy data points from the original data points.<\/p>\n<p><img loading=\"lazy\" width=\"731\" height=\"527\" class=\"wp-image-7258\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-14.png\"><\/p>\n<h2><\/h2>\n<h2><a id=\"post-1475-_p75a6ud7b4xh\"><\/a><strong>denpro<\/strong><\/h2>\n<p>The <a href=\"http:\/\/cseweb.ucsd.edu\/~dasgupta\/papers\/tree.pdf\">Level Set Tree clustering algorithm<\/a> was developed in 2010 by Chaudhuri and Dasgupta, two computer scientist at <a href=\"https:\/\/ucsd.edu\/\">UC San Diego<\/a>. For this section, I will be using the <a href=\"https:\/\/denpro.co.uk\/\">denpro<\/a> R package, developed by professor and author, Jussi Klemela. This library contains an efficient implementation of the Level Set Tree clustering algorithm. See the R file, denpro.R, in the <a href=\"https:\/\/try.dominodatalab.com\/u\/nmanchev\/clustering\/overview\">Domino project<\/a>.<\/p>\n<h3><a id=\"post-1475-_g05wpbwllkws\"><\/a><strong>Level Sets<\/strong><\/h3>\n<p>While DBSCAN is built on \u025b-balls as its foundation, Level Set Tree clustering is built on, as the name suggests, level sets.<\/p>\n<p>If we have some mathematical function <em>f<\/em> and some constant <em>c<\/em>, then the level set <em>L<\/em><sub>c<\/sub> is all the values of <em>x<\/em> such that f(x) = c. Formally, in mathematical set builder notation.<\/p>\n<p><em>L<\/em><sub>c<\/sub>(f) = { x in<strong> X<\/strong>| f(x) = c}<\/p>\n<p>For example, if f(x) = x<sup>2<\/sup> and <em>c<\/em> = 4, then <em>L<\/em><sub>c<\/sub>(f) = {-2, 2} because f(-2) = (-2)<sup>2<\/sup> = 4 and f(2) = 2<sup>2<\/sup> = 4.<\/p>\n<p><img loading=\"lazy\" width=\"722\" height=\"508\" class=\"wp-image-7259\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-15.png\"><\/p>\n<p>To build the Level Set Trees, we are going to use a special class of <em>level sets<\/em> called \u03bb-<em>upper level sets<\/em>. For some constant \u03bb and some function <em>f<\/em>, the \u03bb-<em>upper level set<\/em> of <em>f<\/em>, <em>L<\/em><sub>\u03bb<\/sub>(f), is defined as:<\/p>\n<p><em>L<\/em><sub>\u03bb<\/sub>(f) = { x in X| f(x) &gt;= \u03bb}<\/p>\n<p>Essentially, this is where a function is greater than or equal to some constant \u03bb. Returning back to our previous example of f(x) = x<sup>2<\/sup> and \u03bb = 4, the \u03bb-<em>upper level set<\/em> of <em>f<\/em> would be (-\u221e,-2]U[2,\u221e).<\/p>\n<p><img loading=\"lazy\" width=\"726\" height=\"515\" class=\"wp-image-7260\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-16.png\"><\/p>\n<p>Now, I\u2019m sure you are wondering how we can use this mathematical object to perform clustering. The basic idea is that we assume our data is generated according to some probability density function <em>f<\/em>, so if we pick \u03bb to a constant in the interval [0,1], our \u03bb-<em>upper level sets<\/em> correspond to regions of the data that are denser than our threshold \u03bb.<\/p>\n<p>Our goal is to use the threshold parameter \u03bb to identify <em>high-density clusters<\/em> that correspond to connected regions of the data with similar densities.<\/p>\n<p>Now, let\u2019s talk about the <em>Tree<\/em> part of the algorithm. Let\u2019s say we start with a high initial value of \u03bb, say \u03bb = 0.7, and compute the resulting \u03bb-<em>upper level set<\/em> of <em>f<\/em>. We end up with a set of points such that f(x) &gt;= 0.7 for each point in the <em>upper level set<\/em>. Now, let\u2019s decrease the value of \u03bb to 0.5 and compute the resulting \u03bb-<em>upper level set<\/em>. Now by the transitive property, all the points in the \u03bb = 0.7-<em>upper level set<\/em> are also members of the \u03bb = 0.5-<em>upper level set<\/em> because if f(x) &gt;= 0.7 and 0.7 &gt;= 0.5, then f(x) &gt;= 0.5.<\/p>\n<p>Generalizing this pattern, if \u03bb1 &lt; \u03bb2, then <em>L<\/em>\u03bb2 (f) is a subset of <em>L<\/em>\u03bb1 (f). Essentially, if we were to iterate through an increasing sequence of \u03bb\u2019s, we would create an \u03bb-upper<em> level set<\/em> that is a subset of the previous upper level sets. This is the tree structure: The root represents \u03bb = 0.0 (all the points), and each branch represents a higher value of \u03bb and thus a subset of the points from the previous level.<\/p>\n<h3><a id=\"post-1475-_9g0gc8usnswj\"><\/a><strong>kNN Density Estimation<\/strong><\/h3>\n<p>Since we don\u2019t know the true probability density function <em>f<\/em> that generates the data, we are going to need to estimate it. Previously, we used our \u025b-Balls to estimate density by computing the \u201cmass\u201d to \u201cvolume\u201d ratio of our neighborhoods.<\/p>\n<p>Here we use a similar method called k-Nearest Neighbor Density Estimation to estimate the underlying probability density function <em>f<\/em>. With the \u025b-Balls method, we first fixed the volume of the neighborhood and then counted the number of points within the neighborhood to determine the mass. With the k-Nearest Neighbor Density Estimation method, we\u2019re going to do the opposite; first we fix the mass to be some number <em>k<\/em>, then we compute the volume necessary to get <em>k<\/em> points within our ball.<\/p>\n<p>In machine learning, one of the most basic classification algorithms is k-Nearest Neighbors (k-NN) classification. A quick refresher on k-NN classification:<\/p>\n<ul>\n<li>Pick some integer <em>k<\/em> and some test datapoint <em>x<\/em><sub>i<\/sub>.<\/li>\n<li>Retrieve the <em>k<\/em> points in your training dataset that are closest to <em>x<\/em><sub>i<\/sub>.<\/li>\n<li>Get the majority label of these <em>k<\/em> points and assign that label to <em>x<\/em><sub>i<\/sub><\/li>\n<\/ul>\n<p>With k-NN density estimation, we are going to do something similar. Like I said above, we want <em>k<\/em> points in our neighborhood, so we are going to find the volume of the smallest neighborhood that contains exactly <em>k<\/em> points.<\/p>\n<p>To estimate the density of a given point using k-NN density estimation, we are going to find the distance to the K<sup>th<\/sup> nearest point, <em>d<\/em><sub>k<\/sub>, and use this as the radius of our neighborhood. By doing this, we get a neighborhood around our point with exactly <em>k<\/em> other points in it.<\/p>\n<p>The mathematical equation for the k-Nearest Neighbor estimator is included below:<\/p>\n<p><img loading=\"lazy\" width=\"165\" height=\"43\" class=\"wp-image-7261\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-17.png\"><\/p>\n<p>In this equation, <em>k<\/em> is the number of points we want in our neighborhood, <em>x<\/em><sub>i<\/sub> is our given point, <em>n<\/em> is the number of points in the dataset, <em>v<\/em><sub>d<\/sub> is the volume of the d-dimensional Euclidean ball, and <em>r<\/em><sub>k<\/sub>(<em>x<\/em><sub>i<\/sub>) is the distance to the k<sup>th<\/sup> nearest point.<\/p>\n<p>For 2-dimensional data, the volume of the 2-dimensional Euclidean ball is \u03c0R<sup>2<\/sup> where R = 1, so <em>v<\/em><sub>d<\/sub> = \u03c0, in this case.<\/p>\n<p>The following code computes the k-NN density estimate for 2-dimensional data using the <a href=\"https:\/\/cran.r-project.org\/web\/packages\/TDA\/index.html\">TDA<\/a> R package:<\/p>\n<blockquote>\n<pre>## kernel density estimator\ninstall.packages('TDA')\nlibrary(TDA)\nk &lt;- 500\nKNN &lt;- knnDE(crater_xy, crater_xy, k)\nqplot(x, y, data = crater, size = KNN, colour = KNN, main = 'KNN Density Estimate on Crater Dataset') +\nscale_colour_continuous(low = \"black\", high = \"red\")<\/pre>\n<\/blockquote>\n<p>To demonstrate how k-NN density estimation, let\u2019s consider a toy dataset informally called the \u201ccrater\u201d dataset. The dataset consists of a very dense core (\u201cimpact\u201d of the crater) with a less dense ring surrounding the core.<\/p>\n<p><img loading=\"lazy\" width=\"700\" height=\"350\" class=\"wp-image-7262\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-18.png\"><\/p>\n<p>If you look at the image above, you\u2019ll notice a tightly bundled set of data points in the center, and the points appear to get less bundled as we move away from the center. There are 2,000 red data points, and 1,000 blue data points, for reference.<\/p>\n<p>The code I used to create this toy dataset is provided below:<\/p>\n<blockquote>\n<pre>makeCraters &lt;- function(inner_rad = 4, outer_rad = 4.5,\ndonut_len = 2, inner_pts = 1000,\nouter_pts = 500){\n#Make the inner core\nradius_core &lt;- inner_rad*runif(inner_pts)\ndirection_core &lt;- 2*pi*runif(inner_pts)\n#Simulate inner core points\ncore_x &lt;- radius_core*cos(direction_core)\ncore_y &lt;- radius_core*sin(direction_core)\ncrater &lt;- data.frame(x=core_x, y=core_y, label='core')\n#Make the outer ring\nradius_ring &lt;- outer_rad + donut_len*runif(outer_pts)\ndirection_ring &lt;- 2*pi*runif(outer_pts)\n#Simulate ring points\nring_x &lt;- radius_ring*cos(direction_ring)\nring_y &lt;- radius_ring*sin(direction_ring)\ncrater_ring &lt;- data.frame(x=ring_x, y=ring_y, label='ring')\n\ncrater &lt;- rbind(crater, crater_ring)\nreturn(crater)\n}\ncrater &lt;- makeCraters()\ncrater_xy &lt;- crater[,c(1,2)]\nrequire(ggplot2)\nqplot(x, y, data = crater, colour = label, main = 'Crater Dataset')<\/pre>\n<\/blockquote>\n<p>From the above image, we should expect for the points near the center to have higher k-NN density estimates and the points in the outer ring to have lower k-NN density estimates.<\/p>\n<p>We run the k-NN density estimate code I included above on this \u201ccrater\u201d dataset and plot the points where larger and darker points correspond to higher density values.<\/p>\n<blockquote>\n<pre>## kernel density estimator\ninstall.packages('TDA')\nlibrary(TDA)\nk &lt;- 500\nKNN &lt;- knnDE(crater_xy, crater_xy, k)\nqplot(x, y, data = crater, size = KNN, colour = KNN) +\nscale_colour_continuous(low = \"black\", high = \"red\")<\/pre>\n<\/blockquote>\n<p><img class=\"wp-image-72 63\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-19.png\"><\/p>\n<p>As expected, the larger and darker points are bundled in the center, and as we move farther away from the center, the points become lighter and smaller. By choosing appropriate values of \u03bb, we can create an upper-level set consisting of only the darker points near the center, corresponding to a highly dense cluster.<\/p>\n<h3><\/h3>\n<h3><a id=\"post-1475-_lo0c8bopxxui\"><\/a><strong>Application<\/strong><\/h3>\n<p>The <strong>denpro<\/strong> library is available through CRAN, so we can install it with<\/p>\n<blockquote>\n<pre>install.packages('denpro')\nlibrary(denpro)<\/pre>\n<\/blockquote>\n<p>We will be applying the Level Set Tree algorithm to the \u201ccrater\u201d dataset from the previous section. Rather than only trying to cluster the data into the core and the outer ring, the hierarchical nature of the Level Set Tree algorithm will find \u201cdenser\u201d sub-clusters in the outer ring.<\/p>\n<p>We start by running pcf.kern which computes a multivariate kernel estimate and gives the output as a piecewise constant function object. It takes inputs of h, the smoothing parameters, and N, the grid dimensions for the kernel estimate. It uses a standard Gaussian kernel by default.<\/p>\n<blockquote>\n<pre>pcf&lt;-pcf.kern(crater_xy,h=0.5,N=c(32,32)) # kernel estimate<\/pre>\n<\/blockquote>\n<p>Next, we build the level set tree and then prune it.<\/p>\n<blockquote>\n<pre>lst&lt;-leafsfirst(pcf) # level set tree\ntd&lt;-treedisc(lst,pcf,ngrid=60) # pruned level set tree<\/pre>\n<\/blockquote>\n<p>Now, we plot the tree, its volume graph, and its barycenter plot which is a location plot of a shape tree.<\/p>\n<blockquote>\n<pre>plottree(td, col='blue') # plot tree\nplotvolu(td) # volume plot\nplotbary(td) # location plot<\/pre>\n<\/blockquote>\n<p><img loading=\"lazy\" width=\"700\" height=\"350\" class=\"wp-image-7264\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-20.png\"><img loading=\"lazy\" width=\"700\" height=\"350\" class=\"wp-image-7265\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-21.png\"><img loading=\"lazy\" width=\"700\" height=\"350\" class=\"wp-image-7266\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-22.png\"><\/p>\n<p>From the dendrogram, we see that the dataset was clustered into a main group followed by denser exterior clusters.<\/p>\n<p>We can now plot the data with these cluster labels.<\/p>\n<blockquote>\n<pre>cd&lt;-colors2data(crater_xy,pcf,lst)\nplot(crater_xy,col=cd$datacolo)<\/pre>\n<\/blockquote>\n<p><img loading=\"lazy\" width=\"700\" height=\"350\" class=\"wp-image-7267\" src=\"https:\/\/blog.dominodatalab.com\/wp-content\/uploads\/2015\/09\/word-image-23.png\"><\/p>\n<p>The Level Set Tree was able to successfully cluster most of the inner core as a single cluster and then identified \u201cdense\u201d sub-clusters in the outer ring.<\/p>\n<p>With Level Set Trees, we can cluster in datasets where the data exhibits large differences in the density. Whereas DBSCAN just flags outliers, Level Set Trees attempt to discover some cluster-based substructure in these outliers. In market segmentation, this could be useful in detecting an emerging market. In data security, researchers could identify new malware in which the small sample of recently infected computers behave differently from the rest of the data, but similar to each other.<\/p>\n<h2><a id=\"post-1475-_el8f7cmlq38u\"><\/a><strong>Distance Functions<\/strong><\/h2>\n<p>In the preliminary section about \u025b-neighborhoods, I said these neighborhoods exhibit circular or spherical shapes. When we use the standard Euclidean distance metric, sometimes called the <em>l<\/em><sub>2<\/sub> metric, our neighborhoods display this spherical shape. Why does this happen?<\/p>\n<p>Let\u2019s consider two points <em>p<\/em><sub>1<\/sub> = (x<sub>1<\/sub>, y<sub>1<\/sub>), <em>p<\/em><sub>2<\/sub> = (x<sub>2<\/sub>, y<sub>2<\/sub>) in 2D space. I am using points in 2D space for simplicity, but these distance metrics can be generalized to higher-dimensional spaces.<\/p>\n<p>The Euclidean distance for these two points is <em>d(p<sub>1<\/sub>,p<sub>2<\/sub>)<\/em> = [(x<sub>1<\/sub> \u2013 x<sub>2<\/sub>)<sup>2<\/sup> + (y<sub>1<\/sub> \u2013 y<sub>2<\/sub>)<sup>2<\/sup>]<sup>0.5<\/sup>.<\/p>\n<p>If we fix the distance to be less than \u025b, then we get the following inequality:<\/p>\n<p><em>d(p<sub>1<\/sub>, p<sub>2<\/sub>)<\/em> = [(x<sub>1<\/sub> \u2013 x<sub>2<\/sub>)<sup>2<\/sup> + (y<sub>1<\/sub> \u2013 y<sub>2<\/sub>)<sup>2<\/sup>]<sup>0.5<\/sup> &lt; \u025b<\/p>\n<p>Squaring both sides yields:<\/p>\n<p>(x<sub>1<\/sub> \u2013 x<sub>2<\/sub>)<sup>2<\/sup> + (y<sub>1<\/sub> \u2013 y<sub>2<\/sub>)<sup>2<\/sup> &lt; \u025b<sup>2<\/sup><\/p>\n<p>The above inequality should look familiar to the general equation of a circle of radius <em>r<\/em> centered at the point (c<sub>1<\/sub>, c<sub>2<\/sub>): (x \u2013 c<sub>1<\/sub>)<sup>2<\/sup> + (y- c<sub>2<\/sub>)<sup>2<\/sup> = r<sup>2<\/sup>.<\/p>\n<p>Thanks to the Euclidean distance metric, when we compute <strong>Nbhd<\/strong>(<em>p<\/em>, \u025b) , we end up with a spherical neighborhood centered at <em>p<\/em> with radius \u025b.<\/p>\n<p>However, if we were to use a different distance metric, we would end up with neighborhoods of different shapes. For example, if we used the Manhattan distance, or <em>l<\/em><sub>1<\/sub> metric, where the distance between two points is <em>d(p<\/em><sub>1<\/sub><em>, p<\/em><sub>2<\/sub><em>)<\/em> = |x<sub>1<\/sub> \u2013 x<sub>2<\/sub>| + |y<sub>1<\/sub> \u2013 y<sub>2<\/sub>| (|.| is the absolute value), then neighborhoods will display a rectangular shape.<\/p>\n<p>By changing the distance metric used, we can change the shape of neighborhoods. By changing the shape of neighborhoods, we can discover different sets of clusters. Depending on the problem, it may be beneficial to use a distance metric other than the Euclidean distance metric to discover different types of clusters.<\/p>\n<h2><a id=\"post-1475-_yafnmidlmp2n\"><\/a><strong>Conclusion<\/strong><\/h2>\n<p>Density-based clustering methods are great because they do not specify the number of clusters beforehand. Unlike other clustering methods, they incorporate a notion of outliers and are able to \u201cfilter\u201d these out. Finally, these methods can learn clusters of arbitrary shape, and with the Level Set Tree algorithm, one can learn clusters in datasets that exhibit wide differences in density.<\/p>\n<p>However, I should point out that these methods are somewhat harder to tune compared to parametric clustering methods like k-means. Things like the epsilon parameter for DBSCAN or the parameters for the Level Set Tree are less intuitive to reason about compared to the number of clusters parameter for k-means, so it\u2019s more difficult to pick good initial parameter values for these algorithms.<\/p>\n<p>Finally, for readers who are interested in general clustering methodology for different types of problems, <a href=\"https:\/\/www.crcpress.com\/Data-Clustering-Algorithms-and-Applications\/Aggarwal-Reddy\/9781466558212\">this book on data clustering<\/a> is a handy reference.<\/p>\n<p><!-- relpost-thumb-wrapper --><!-- close relpost-thumb-wrapper -->    <\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/blog.dominodatalab.com\/topology-and-density-based-clustering\/<\/p>\n","protected":false},"author":0,"featured_media":8024,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8023"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=8023"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/8023\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/8024"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=8023"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=8023"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=8023"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}