{"id":63,"date":"2020-08-04T12:06:28","date_gmt":"2020-08-04T12:06:28","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/08\/04\/the-machine-learning-field-guide\/"},"modified":"2020-08-04T12:06:28","modified_gmt":"2020-08-04T12:06:28","slug":"the-machine-learning-field-guide","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/08\/04\/the-machine-learning-field-guide\/","title":{"rendered":"The Machine Learning Field Guide"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/twitter.com\/kamwithk_\" target=\"_blank\" rel=\"noopener noreferrer\">Kamron Bhavnagri<\/a>, Up and coming machine learning engineer\/data scientist<\/b>.<\/p>\n<p>We all start with either a dataset or a goal in mind. Once\u00a0<a href=\"https:\/\/www.kamwithk.com\/zero-to-hero-data-collection-through-web-scraping-ck78o0bmg08ktd9s1bi7znd19\" target=\"_blank\" rel=\"noopener noreferrer\">we&#8217;ve found, collected or scraped our data<\/a>, we pull it up, and witness the overwhelming sight of merciless cells of numbers, more numbers, categories, and maybe some words! A naive thought crosses our mind, to use our machine learning prowess to deal with this tangled mess&#8230; but a quick search reveals the host of tasks we&#8217;ll need to consider before\u00a0<em>training a model<\/em>!<\/p>\n<p>Once we overcome the shock of our unruly data, we look for ways to battle our formidable nemesis. We start by trying to get our data into Python. It is relatively simple on paper, but the process can be slightly&#8230;\u00a0<em>involved<\/em>. Nonetheless, a little effort was all that was needed (lucky us).<\/p>\n<p>Without wasting any time, we begin\u00a0<em>data cleaning<\/em>\u00a0to get rid of the bogus and expose the beautiful. Our methods start simple &#8211; observe and remove. It works a few times, but then we realise&#8230; it really doesn&#8217;t do us justice! 
To deal with the mess, though, we find a powerful tool to add to our arsenal: charts! With our graphs, we can get a feel for our data, the patterns within it, and where things are missing. We can\u00a0<em>interpolate<\/em>\u00a0(fill in) or remove missing data.<\/p>\n<p>Finally, we approach our highly anticipated challenge, data modelling! With a little research, we find out which tactics and models are commonly used. It is a little difficult to decipher which one we should use, but we still manage to get through it and figure it all out!<\/p>\n<p>We can&#8217;t finish a project without doing something impressive, though. So, a final product, website, app, or even a report will take us far! We know first impressions are important, so we fix up the GitHub repository and make sure everything&#8217;s well documented and explained. Now we are\u00a0<em>finally able to show off our hard work to the rest of the world<\/em>!<\/p>\n<p>\u00a0<\/p>\n<h3>Chapter 1 &#8211; Importing Data<\/h3>\n<p>\u00a0<\/p>\n<p>Data comes in all kinds of shapes and sizes, and so the process we use to get everything into code often varies.<\/p>\n<blockquote>\n<p><em>Let&#8217;s be real, importing data seems easy, but sometimes&#8230; it&#8217;s a little pesky.<\/em><\/p>\n<\/blockquote>\n<p>The hard part about data cleaning isn&#8217;t the coding or theory, but instead our preparation! When we first start a new project and download our dataset, it can be tempting to open up a code editor and start typing&#8230; but this won&#8217;t do us any good. If we want to get a head start, we need to prepare ourselves for the best and worst parts of our data. To do this, we&#8217;ll need to start basic, by manually inspecting our spreadsheet\/s. 
Once we understand the basic format of the data (filetype along with any particularities) we can move on to getting it all into Python.<\/p>\n<p>When we&#8217;re lucky and just have one spreadsheet we can use the Pandas\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\" target=\"_blank\" rel=\"noopener noreferrer\">read_csv<\/a>\u00a0function (letting it know where our data lies):<\/p>\n<div>\n<pre>import pandas as pd\r\ndata = pd.read_csv(\"file_path.csv\")\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>In reality, we run into far more complex situations, so look out for:<\/p>\n<ul>\n<li>The file starts with unneeded information (which we need to skip)<\/li>\n<li>We only want to import a few columns<\/li>\n<li>We want to rename our columns<\/li>\n<li>Data includes dates<\/li>\n<li>We want to combine data from multiple sources into one place<\/li>\n<li>Data can be grouped together<\/li>\n<\/ul>\n<blockquote>\n<p><em>Although we&#8217;re discussing a range of scenarios, we normally only deal with a few at a time.<\/em><\/p>\n<\/blockquote>\n<p>Our first few problems (importing specific parts of our data\/renaming columns) are easy enough to deal with using a few parameters, like the number of rows to skip, the specific columns to import and our column names:<\/p>\n<div>\n<pre>pd.read_csv(\"file_path.csv\", skiprows=5, usecols=[0, 1], names=[\"Column1\", \"Column2\"])\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Whenever our data is spread across multiple files, we can combine them using the Pandas\u00a0<em><a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.concat.html\" target=\"_blank\" rel=\"noopener noreferrer\">concat<\/a>\u00a0<\/em>function. 
The\u00a0<em>concat\u00a0<\/em>function combines a list of\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.html\" target=\"_blank\" rel=\"noopener noreferrer\">DataFrame<\/a>s together:<\/p>\n<div>\n<pre>my_spreadsheets = [pd.read_csv(\"first_spreadsheet.csv\"), pd.read_csv(\"second_spreadsheet.csv\")]\r\ndata = pd.concat(my_spreadsheets, ignore_index=True)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>We pass\u00a0<em>concat\u00a0<\/em>a list of spreadsheets (which we import just like before). The list can, of course, be built in any way (so a fancy list comprehension or a casual list of every file both work just as well), but just remember that\u00a0<strong>we need dataframes, not filenames\/paths<\/strong>!<\/p>\n<p>If we don&#8217;t have a CSV file, Pandas still works! We can just\u00a0<em>swap out<\/em>\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_csv.html\" target=\"_blank\" rel=\"noopener noreferrer\">read_csv<\/a>\u00a0for\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_excel.html\" target=\"_blank\" rel=\"noopener noreferrer\">read_excel<\/a>,\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.read_sql.html\" target=\"_blank\" rel=\"noopener noreferrer\">read_sql<\/a>, or another\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/io.html\">option<\/a>.<\/p>\n<p>After all the data is inside a Pandas dataframe, we need to double-check that our data is\u00a0<em>formatted correctly<\/em>. In practice, this means checking each series&#8217; datatype and making sure none of them are generic objects. We do this to ensure that we can utilize Pandas&#8217; built-in functionality for numeric, categorical, and date\/time values. To look at this, just run\u00a0<em>DataFrame.dtypes<\/em>. 
If the output seems reasonable (i.e., numbers are numeric, categories are categorical, etc.), then we should be fine to move on. However, this normally is not the case, and we need to change our datatypes! This can be done with Pandas\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.astype.html\" target=\"_blank\" rel=\"noopener noreferrer\">DataFrame.astype<\/a>. If this doesn&#8217;t work, there should be another Pandas function for that specific conversion:<\/p>\n<div>\n<pre>data[\"Rating\"] = data[\"Rating\"].astype(\"category\")\r\ndata[\"Number\"] = pd.to_numeric(data[\"Number\"])\r\n# Parse a single date column...\r\ndata[\"Date\"] = pd.to_datetime(data[\"Date\"])\r\n# ...or assemble the date from separate component columns\r\ndata[\"Date\"] = pd.to_datetime(data[[\"Year\", \"Month\", \"Day\", \"Hour\", \"Minute\"]])\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>If we need to analyse separate groups of data (e.g., maybe our data is divided by country), we can use Pandas\u00a0<strong>groupby<\/strong>. We can use\u00a0<strong>groupby<\/strong>\u00a0to select particular data, and to run functions on each group separately:<\/p>\n<div>\n<pre>data.groupby(\"Country\").get_group(\"Australia\")\r\ndata.groupby(\"Country\").mean(numeric_only=True)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><em>Other more niche tricks like multi\/hierarchical indices can also be helpful in specific scenarios but are trickier to understand and use.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Chapter 2 &#8211; Data Cleaning<\/h3>\n<p>\u00a0<\/p>\n<p>Data is useful, data is necessary. However, it\u00a0<em>needs to be clean and to the point<\/em>! 
If our data is everywhere, it simply won&#8217;t be of any use to our machine learning model.<\/p>\n<blockquote>\n<p><em>Everyone is driven insane by missing data, but there&#8217;s always a light at the end of the tunnel.<\/em><\/p>\n<\/blockquote>\n<p>The easiest and quickest way to go through data cleaning is to ask ourselves:<\/p>\n<blockquote>\n<p><em>What features within our data will impact our end-goal?<\/em><\/p>\n<\/blockquote>\n<p>By end-goal, we mean whatever variable we are working towards predicting, categorising or analysing. The point of this is to narrow our scope and not get bogged down in useless information.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/cdn.hashnode.com\/res\/hashnode\/image\/upload\/v1591933470258\/ZETFabnur.jpeg?auto=format&amp;q=60\" width=\"90%\"><\/p>\n<p>Once we know what our primary objective features are, we can try to find patterns, relations, missing data, and more. An easy and intuitive way to do this is graphing! Quickly use Pandas to sketch out each variable in the dataset, and try to see where everything fits into place.<\/p>\n<p>Once we have identified potential problems or trends in the data, we can try and fix them. In general, we have the following options:<\/p>\n<ul>\n<li>Remove missing entries<\/li>\n<li>Remove full columns of data<\/li>\n<li>Fill in missing data entries<\/li>\n<li>Resample data (i.e., change the resolution)<\/li>\n<li>Gather more information<\/li>\n<\/ul>\n<p>To go from identifying missing data to choosing what to do with it, we need to consider how it affects our end-goal. With missing data, we remove anything which doesn&#8217;t seem to have a major influence on the end result (i.e., we couldn&#8217;t find a meaningful pattern) or where there just seems\u00a0<em>too much missing to derive value<\/em>. 
Sometimes we also decide to remove very small amounts of missing data (since it&#8217;s easier than filling it in).<\/p>\n<p>If we&#8217;ve decided to get rid of information, Pandas\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.DataFrame.drop.html\" target=\"_blank\" rel=\"noopener noreferrer\">DataFrame.drop<\/a>\u00a0can be used. It removes columns or rows from a dataframe. It is quite easy to use, but remember that\u00a0<strong>Pandas does not modify\/remove data from the source dataframe by default<\/strong>, so we must either assign the result back to a variable or specify\u00a0<em>inplace=True<\/em>. It may be useful to note that the\u00a0<em>axis\u00a0<\/em>parameter specifies whether rows or columns are being removed.<\/p>\n<p>When not removing a full column, or particularly targeting missing data, it can often be useful to rely on a few nifty Pandas functions. For removing null values,\u00a0<em>DataFrame.dropna\u00a0<\/em>can be utilized. Do keep in mind though that, by default,\u00a0<em>dropna\u00a0<\/em>removes every row that contains at least one missing value. However, setting the parameter\u00a0<em>how\u00a0<\/em>to\u00a0<em>all\u00a0<\/em>(drop only rows where every value is missing) or setting a threshold (<em>thresh<\/em>, the minimum number of non-null values a row needs in order to be kept) makes it less aggressive.<\/p>\n<p>If we&#8217;ve got small amounts of irregular missing values, we can fill them in several ways. The simplest is\u00a0<em>DataFrame.fillna<\/em>, which sets the missing values to some preset value. The more complex, but flexible option is interpolation using\u00a0<em>DataFrame.interpolate<\/em>. Interpolation lets us choose the\u00a0<em>method<\/em>\u00a0used to fill in each null value. These include the previous\/next value, linear, and time (the last two estimate values from the surrounding data). 
When working with time series, time is a natural choice; otherwise, choose based on how much data is being interpolated and how complex it is.<\/p>\n<div>\n<pre>data[\"Column\"] = data[\"Column\"].fillna(0)\r\ndata[[\"Column\"]] = data[[\"Column\"]].interpolate(method=\"linear\")\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><em>As seen above, interpolate is applied to a dataframe containing only the columns with missing data<\/em>\u00a0(non-numeric columns can otherwise cause an error).<\/p>\n<p>Resampling is useful whenever we see regularly missing data or have multiple sources of data using different timescales (like ensuring measurements in minutes and hours can be combined). It can be slightly difficult to intuitively understand resampling, but it is essential when you average measurements over a certain timeframe. For example, assuming the dataframe has a date\/time index, we can get monthly values by specifying that we want to get the mean of each month&#8217;s values:<\/p>\n<div>\n<pre>data.resample(\"M\").mean()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>The\u00a0&#8220;M&#8221;\u00a0stands for month and can be replaced with\u00a0&#8220;Y&#8221;\u00a0for the year and other options (recent Pandas versions prefer\u00a0&#8220;ME&#8221;\u00a0and\u00a0&#8220;YE&#8221;).<\/p>\n<p>Although the data cleaning process can be quite challenging, if we remember our initial intent, it becomes a far more logical and straightforward task! 
If we still don&#8217;t have the needed data, we may need to go back to phase one and collect some more.\u00a0<em>Note that missing data indicates a problem with data collection, so it&#8217;s useful to carefully consider and note down where it occurs.<\/em><\/p>\n<p><em>For completion, the Pandas\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.Series.unique.html\" target=\"_blank\" rel=\"noopener noreferrer\">unique<\/a>\u00a0and\u00a0<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/reference\/api\/pandas.Series.value_counts.html\" target=\"_blank\" rel=\"noopener noreferrer\">value_counts<\/a>\u00a0functions are useful to decide which features to straight-up remove and which to graph and research further.<\/em><\/p>\n<p>\u00a0<\/p>\n<h3>Chapter 3 &#8211; Visualisation<\/h3>\n<p>\u00a0<\/p>\n<p>Visualisation sounds simple, and it is, but it&#8217;s hard to&#8230;\u00a0<em>not overcomplicate<\/em>. It&#8217;s far too easy for us to consider plots as a chore to create. Yet, these bad boys do one thing very, very well &#8211; intuitively demonstrate the inner workings of our data! Just remember:<\/p>\n<blockquote>\n<p><em>We graph data to find and explain how everything works.<\/em><\/p>\n<\/blockquote>\n<p>Hence, when stuck for ideas, or not quite sure what to do, we basically can always fall back on\u00a0<strong>identifying useful patterns and meaningful relationships<\/strong>. 
It may seem iffy, but it is really useful.<\/p>\n<blockquote>\n<p><em>Our goal isn&#8217;t to draw fancy hexagon plots, but instead to picture what is going on, so\u00a0<\/em><em>absolutely anyone<\/em><em>\u00a0can simply interpret a complex system!<\/em><\/p>\n<\/blockquote>\n<p>A few techniques are undeniably useful:<\/p>\n<ul>\n<li>Resampling when we\u00a0<em>have too much data<\/em>\n<\/li>\n<li>Secondary axis when plots have different scales<\/li>\n<li>Grouping when our data can be split categorically<\/li>\n<\/ul>\n<p>To get started graphing, simply use Pandas\u00a0<strong>.plot()<\/strong>\u00a0on any series or dataframe! When we need more, we can delve into Matplotlib, Seaborn, or an interactive plotting library.<\/p>\n<div>\n<pre>data.plot(x=\"column 1 name\", y=\"column 2 name\", kind=\"bar\", figsize=(10, 10))\r\ndata.plot(x=\"column 1 name\", y=\"column 3 name\", secondary_y=True)\r\ndata.hist()\r\ndata.groupby(\"group\").boxplot()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>About 90% of the time, this basic functionality will suffice (<a href=\"https:\/\/pandas.pydata.org\/pandas-docs\/stable\/user_guide\/visualization.html#plot-formatting\" target=\"_blank\" rel=\"noopener noreferrer\">more info here<\/a>), and where it doesn&#8217;t, a search should reveal how to\u00a0<em>draw particularly exotic graphs<\/em>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/cdn.hashnode.com\/res\/hashnode\/image\/upload\/v1591932777077\/oKrkIQYS9.jpeg?auto=format&amp;q=60\" width=\"90%\"><\/p>\n<p>\u00a0<\/p>\n<h3>Chapter 4 &#8211; Modelling<\/h3>\n<p>\u00a0<\/p>\n<p><strong>A Brief Overview<\/strong><\/p>\n<p>Now finally, for the fun stuff &#8211; deriving results. It seems\u00a0<em>so simple to train a scikit-learn model, but no one goes into the details<\/em>! So, let&#8217;s be honest here, not all datasets or models are equal.<\/p>\n<p>Our approach to modelling will vary widely based on our data. 
There are three especially important factors:<\/p>\n<ul>\n<li>\n<strong>Type<\/strong> of problem<\/li>\n<li>\n<strong>Amount<\/strong> of data<\/li>\n<li>\n<strong>Complexity<\/strong> of data<\/li>\n<\/ul>\n<p>Our type of problem comes down to whether we are trying to predict a class\/label (called\u00a0<em>classification<\/em>), a value (called\u00a0<em>regression<\/em>), or to group data (called\u00a0<em>clustering<\/em>). If we are trying to train a model on a dataset where we already have examples of what we&#8217;re trying to predict, then we call our model\u00a0<em>supervised<\/em>; if not,\u00a0<em>unsupervised<\/em>. The amount of available data and its complexity indicate how simple a model will suffice.\u00a0<em>Data with more features (i.e., columns) tends to be more complex<\/em>.<\/p>\n<blockquote>\n<p><em>The point of interpreting complexity is to understand which models are\u00a0<\/em><em>too good or too bad for our data.<\/em><\/p>\n<\/blockquote>\n<p>A model&#8217;s\u00a0<em>goodness of fit<\/em>\u00a0informs us of this! If a model struggles to interpret our data (too simple), we can say it\u00a0<em>underfits<\/em>, and if it is completely overkill (too complex), we say it\u00a0<em>overfits<\/em>. We can think of it as a spectrum from learning nothing to memorising everything. We need to\u00a0<em>strike a balance<\/em>, to ensure our model is\u00a0<strong>able to\u00a0<\/strong><strong><em>generalise<\/em><\/strong><strong>\u00a0our conclusions<\/strong>\u00a0to new information. This is typically known as the bias-variance tradeoff.\u00a0<em>Note that complexity also affects model interpretability.<\/em><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/cdn.hashnode.com\/res\/hashnode\/image\/upload\/v1591931791416\/qtb6eievP.png?auto=format&amp;q=60\" width=\"90%\"><\/p>\n<p><strong>Complex models take substantially more time to train<\/strong>, especially with large datasets. 
So, upgrade that computer, run the model overnight, and chill for a while!<\/p>\n<p><strong>Preparation<\/strong><\/p>\n<p><em><strong>Splitting up data<\/strong><\/em><\/p>\n<p>Before training a model, it is important to note that we will need some dataset to test it on (so we know how well it performs). Hence, we often divide our dataset into\u00a0<strong>separate training and testing sets<\/strong>. This allows us to test\u00a0<em>how well our model can generalise to new unseen data<\/em>. This works as long as our data is reasonably representative of the real world.<\/p>\n<p>The actual amount of test data doesn&#8217;t matter too much, but 80% train and 20% test is often used.<\/p>\n<p>In Python with scikit-learn, the\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.train_test_split.html\" target=\"_blank\" rel=\"noopener noreferrer\">train_test_split<\/a>\u00a0function does this:<\/p>\n<div>\n<pre>from sklearn.model_selection import train_test_split\r\ntrain_data, test_data = train_test_split(data, test_size=0.2)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>Cross-validation is where a dataset is split into several folds (i.e., subsets or portions of the original dataset), with each fold taking a turn as the validation set while the model trains on the rest. This tends to be more robust and\u00a0<em>resistant to overfitting<\/em>\u00a0than using a single test\/validation set! Several scikit-learn functions help with\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/cross_validation.html\" target=\"_blank\" rel=\"noopener noreferrer\">cross-validation<\/a>. However, it&#8217;s normally done straight through a grid or random search (discussed below).<\/p>\n<div>\n<pre>from sklearn.model_selection import cross_val_score\r\nscores = cross_val_score(model, input_data, output_data, cv=5)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><em><strong>Hyperparameter tuning<\/strong><\/em><\/p>\n<p>There are some settings our model cannot learn from the data itself, and so we\u00a0<em>set certain hyperparameters<\/em>. 
These vary model to model, but we can either find optimal values through manual trial and error or a simple algorithm like grid or random search. With grid search, we try every combination of the candidate values (brute force), and with random search, random values from within some distribution\/selection. Both approaches typically use cross-validation.<\/p>\n<p><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.GridSearchCV.html\" target=\"_blank\" rel=\"noopener noreferrer\">Grid search<\/a>\u00a0in scikit-learn works through a\u00a0<strong>parameters<\/strong>\u00a0dictionary. Each entry key represents the hyperparameter to tune, and the value (a list or tuple) is the selection of values to choose from:<\/p>\n<div>\n<pre>from sklearn.model_selection import GridSearchCV\r\nfrom sklearn.svm import SVC\r\n\r\nparameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}\r\nmodel = SVC()\r\ngrid = GridSearchCV(model, param_grid=parameters)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>After we&#8217;ve created the grid, we can use it to train the models, and extract the best score and hyperparameters:<\/p>\n<div>\n<pre>grid.fit(train_input, train_output)\r\nbest_score, best_params = grid.best_score_, grid.best_params_\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>The important thing here is to remember that we need to\u00a0<strong>train on the training and not testing data<\/strong>. Even though cross-validation is used to compare the models, we&#8217;re ultimately trying to get the best fit on the training data and will proceed to test the best model on the testing set afterward:<\/p>\n<div>\n<pre>test_predictions = grid.predict(test_input)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.model_selection.RandomizedSearchCV.html\" target=\"_blank\" rel=\"noopener noreferrer\">Random search<\/a>\u00a0in scikit-learn works similarly but is slightly more complex as we need to know what type of distribution each hyperparameter takes in. 
Although it, in theory,\u00a0<em>can yield the same or better results faster<\/em>, that changes from situation to situation.\u00a0<em>For simplicity, it is likely best to stick to a grid search.<\/em><\/p>\n<p><strong>Model Choices<\/strong><\/p>\n<p><em><strong>Using a model<\/strong><\/em><\/p>\n<p>With scikit-learn, it&#8217;s as simple as finding our desired model name and then just creating a variable for it. Check the links to the documentation for further details! For example,<\/p>\n<div>\n<pre>from sklearn.svm import SVR\r\nsupport_vector_regressor = SVR()\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p><em><strong>Basic Choices<\/strong><\/em><\/p>\n<ul>\n<li>Linear\/Logistic Regression<\/li>\n<\/ul>\n<p><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/classes.html#module-sklearn.linear_model\" target=\"_blank\" rel=\"noopener noreferrer\">Linear regression<\/a>\u00a0tries to\u00a0<em>fit a straight line<\/em>\u00a0to our data. It is the most basic and fundamental model. There are several variants of linear regression, like lasso and ridge regression (which are regularisation methods to prevent overfitting). Polynomial regression can be used to fit curves of higher degrees (like parabolas and other curves). Logistic regression is another variant that can be used for classification.<\/p>\n<p>Just like with linear\/logistic regression,\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/svm.html\" target=\"_blank\" rel=\"noopener noreferrer\">support vector machines (SVMs)<\/a>\u00a0try to fit a line or curve to data points. However, with SVM the aim is to maximise the distance between a boundary and the points closest to it (instead of getting the line\/curve to go through each point).<\/p>\n<p>The main advantage of support vector machines is their ability to\u00a0<em>use different kernels<\/em>. A kernel is a function that calculates the similarity between two data points. These kernels allow for both linear and non-linear data while staying decently efficient. 
The kernels map the input into a higher-dimensional space, where a separating boundary becomes apparent. This process is typically not feasible for large numbers of features. A neural network or another model will then likely be a better choice!<\/p>\n<p>All the buzz is always about deep learning and\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/neural_networks_supervised.html\" target=\"_blank\" rel=\"noopener noreferrer\">neural networks<\/a>. They are complex, slow, and resource-intensive models that can be used for complex data. Yet, they are extremely useful when encountering large unstructured datasets.<\/p>\n<p>When using a neural net, make sure to watch out for overfitting. An easy way to do so is to track how the error changes over time (known as learning curves).<\/p>\n<p>Deep learning is an extremely rich field, so there is far too much to discuss here. In fact, scikit-learn is a machine learning library with limited deep learning capabilities (compared to\u00a0<a href=\"https:\/\/pytorch.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">PyTorch<\/a>\u00a0or\u00a0<a href=\"https:\/\/www.tensorflow.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">TensorFlow<\/a>).<\/p>\n<p><a href=\"https:\/\/scikit-learn.org\/stable\/modules\/tree.html\" target=\"_blank\" rel=\"noopener noreferrer\">Decision trees<\/a>\u00a0are simple and quick ways to model relationships. They are basically a\u00a0<em>tree of decisions<\/em>\u00a0that help decide on what class or label a datapoint belongs to. Decision trees can be used for regression problems too. Although simple, to avoid overfitting, several hyperparameters must be chosen. These all, in general, relate to how deep the tree is and how many decisions are to be made.<\/p>\n<p>We can group unlabeled data into several\u00a0<em>clusters<\/em>\u00a0using\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/clustering.html#k-means\" target=\"_blank\" rel=\"noopener noreferrer\">k-means<\/a>. 
Normally the number of clusters present is a chosen hyperparameter.<\/p>\n<p>K-means works by trying to optimize (reduce) some criterion (i.e., function) called inertia. It can be thought of as trying to minimize the distance from a set of\u00a0<em>centroids<\/em>\u00a0to each data point.<\/p>\n<p><strong>Ensembles<\/strong><\/p>\n<p>Random forests are combinations of multiple decision trees trained on random subsets of the data (bootstrapping). This process is called bagging and allows random forests to obtain a good fit (low bias and low variance) with complex data.<\/p>\n<p>The rationale behind this can be likened to democracy.<\/p>\n<blockquote>\n<p><em>One voter may vote for a bad candidate, but we&#8217;d hope that the majority of voters make informed, positive decisions.<\/em><\/p>\n<\/blockquote>\n<p>For\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestRegressor.html\" target=\"_blank\" rel=\"noopener noreferrer\">regression<\/a>\u00a0problems, we average each decision tree&#8217;s outputs, and for\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.RandomForestClassifier.html\" target=\"_blank\" rel=\"noopener noreferrer\">classification<\/a>, we choose the most popular one. This\u00a0<em>might not always work, but we generally assume it will<\/em>\u00a0(especially with large datasets with multiple columns).<\/p>\n<p>Another advantage with random forests is that insignificant features shouldn&#8217;t negatively impact performance because of the democratic-like bootstrapping process!<\/p>\n<p>Hyperparameter choices are the same as those for decision trees but with the number of decision trees as well. 
For the reasons above, more trees generally mean less overfitting!<\/p>\n<p><em>Note that random forests use random subsets of both rows (sampled with replacement) and columns!<\/em><\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/cdn.hashnode.com\/res\/hashnode\/image\/upload\/v1591933623422\/ih9FpjpMj.jpeg?auto=format&amp;q=60\" width=\"90%\"><\/p>\n<p><strong>Gradient Boosting<\/strong><\/p>\n<p>Ensemble models like AdaBoost or\u00a0<a href=\"https:\/\/xgboost.readthedocs.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">XGBoost<\/a>\u00a0work by stacking one model on top of another. The assumption here is that each successive weak learner will correct for the flaws of the previous one (hence called boosting). Hence, the combination of models should provide the advantages of each model without its potential pitfalls.<\/p>\n<p>The iterative approach means each previous model&#8217;s performance affects the next model, and better models are given a higher priority. Boosted models often perform slightly better than bagging models (such as random forests), but are also slightly more likely to overfit. The scikit-learn library provides AdaBoost for\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.AdaBoostClassifier.html\" target=\"_blank\" rel=\"noopener noreferrer\">classification<\/a>\u00a0and\u00a0<a href=\"https:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.ensemble.AdaBoostRegressor.html\" target=\"_blank\" rel=\"noopener noreferrer\">regression<\/a>.<\/p>\n<p>\u00a0<\/p>\n<h3>Chapter 5 &#8211; Production<\/h3>\n<p>\u00a0<\/p>\n<p>This is the last but potentially most important part of the process. 
We&#8217;ve put in all this work, and so we need to go the distance and\u00a0<strong>create something impressive<\/strong>!<\/p>\n<p>There are a variety of options.\u00a0<a href=\"https:\/\/www.streamlit.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Streamlit<\/a>\u00a0is an exciting option for data-oriented websites, and tools like Kotlin, Swift, and Dart can be used for Android\/iOS development. JavaScript with frameworks like VueJS can also be used for extra flexibility.<\/p>\n<p><em>After trying most of these, I honestly would recommend sticking to\u00a0Streamlit, because it is so much easier than the others!<\/em><\/p>\n<p>Here it is important to start with a vision (the simpler the better) and try to find out which parts are most important. Then try and specifically work on those. Continue until completion! For websites, a hosting service like\u00a0<a href=\"https:\/\/www.heroku.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Heroku<\/a>\u00a0will be needed, so the rest of the world can see the amazing end-product of all our hard work.<\/p>\n<p>Even if none of the options above suit the scenario, a report or article covering what we&#8217;ve done, what we&#8217;ve learned, and any suggestions, along with a well-documented GitHub repository, is indispensable!\u00a0<em>Make sure that readme file is up to date.<\/em><\/p>\n<p><a href=\"https:\/\/www.kamwithk.com\/machine-learning-field-guide-ckbbqt0iv025u5ks1a7kgjckx\">Original<\/a>. 
Reposted with permission.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/08\/machine-learning-field-guide.html<\/p>\n","protected":false},"author":0,"featured_media":65,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/63"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=63"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/63\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/65"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=63"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=63"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=63"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}