{"id":1851,"date":"2020-09-23T14:50:01","date_gmt":"2020-09-23T14:50:01","guid":{"rendered":"https:\/\/data-science.gotoauthority.com\/2020\/09\/23\/how-i-consistently-improve-my-machine-learning-models-from-80-to-over-90-accuracy\/"},"modified":"2020-09-23T14:50:01","modified_gmt":"2020-09-23T14:50:01","slug":"how-i-consistently-improve-my-machine-learning-models-from-80-to-over-90-accuracy","status":"publish","type":"post","link":"https:\/\/wealthrevelation.com\/data-science\/2020\/09\/23\/how-i-consistently-improve-my-machine-learning-models-from-80-to-over-90-accuracy\/","title":{"rendered":"How I Consistently Improve My Machine Learning Models From 80% to Over 90% Accuracy"},"content":{"rendered":"<div id=\"post-\">\n<p><b>By <a href=\"https:\/\/towardsdatascience.com\/@terenceshin\" target=\"_blank\" rel=\"noopener noreferrer\">Terence Shin<\/a>, Data Scientist | MSc Analytics &amp; MBA student<\/b>.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*uXPBOzIEcIJX8MdI\" width=\"90%\"><\/p>\n<p><em>Photo by\u00a0<a class=\"ci jh kt ku kv kw\" href=\"https:\/\/unsplash.com\/@jrarce?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Ricardo Arce<\/a>\u00a0on\u00a0<a class=\"ci jh kt ku kv kw\" href=\"https:\/\/unsplash.com\/s\/photos\/target?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText\" target=\"_blank\" rel=\"noopener nofollow noreferrer\">Unsplash<\/a>.<\/em><\/p>\n<p>If you\u2019ve completed a few data science projects of your own, then you probably realized by now that achieving an accuracy of 80% isn\u2019t too bad! But in the real world, 80% won\u2019t cut it. 
In fact, most companies that I\u2019ve worked for expect a minimum accuracy (or whatever metric they\u2019re looking at) of at least 90%.<\/p>\n<p>Therefore, I\u2019m going to talk about 5 things that you can do to significantly improve your accuracy.\u00a0<strong>I highly recommend that you read all five points thoroughly<\/strong>\u00a0because there are a lot of details that I\u2019ve included that most beginners don\u2019t know.<\/p>\n<p>By the end of this, you should understand that many more variables than you might think play a role in dictating how well your machine learning model performs.<\/p>\n<p>With that said, here are 5 things that you can do to improve your machine learning models!<\/p>\n<p>\u00a0<\/p>\n<h3>1. Handling Missing Values<\/h3>\n<p>\u00a0<\/p>\n<p>One of the biggest mistakes I see is how people handle missing values, and it\u2019s not necessarily their fault. A lot of material on the web says that you typically handle missing values through\u00a0<strong>mean imputation<\/strong>, replacing null values with the mean of the given feature, but this usually isn\u2019t the best method.<\/p>\n<p>For example, imagine we have a table showing age and fitness score, and imagine that an eighty-year-old has a missing fitness score. If we took the average fitness score from an age range of 15 to 80, then the eighty-year-old would appear to have a much higher fitness score than they actually should.<\/p>\n<p>Therefore, the first question you want to ask yourself is\u00a0<strong>why<\/strong>\u00a0the data is missing to begin with.<\/p>\n<p>Next, consider other methods of handling missing data, aside from mean\/median imputation:<\/p>\n<ul>\n<li>\n<strong>Feature Prediction Modeling<\/strong>: Referring back to my example regarding age and fitness scores, we can model the relationship between age and fitness score and then use the model to find the expected fitness score for a given age. 
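As a concrete illustration, here is a minimal sketch of this idea using a simple linear regression (the column names and numbers are invented for the example, not taken from any real dataset):

```python
# Hedged sketch: regression-based imputation for the age/fitness example.
# The column names ("age", "fitness_score") and values are illustrative only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "age":           [15, 25, 40, 60, 80],
    "fitness_score": [90, 85, 70, 55, np.nan],  # the eighty-year-old is missing
})

# Fit the age -> fitness relationship on the rows where the score is known.
known = df.dropna(subset=["fitness_score"])
model = LinearRegression().fit(known[["age"]], known["fitness_score"])

# Fill each missing score with the model's prediction for that age,
# instead of the overall mean (which would overstate the 80-year-old's score).
missing = df["fitness_score"].isna()
df.loc[missing, "fitness_score"] = model.predict(df.loc[missing, ["age"]])
```

If I'm not mistaken, scikit-learn's `IterativeImputer` generalizes this idea by modeling each feature as a function of the others. 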
This can be done via several techniques, including regression, ANOVA, and more.<\/li>\n<li>\n<strong>K-Nearest Neighbours Imputation<\/strong>: Using KNN imputation, the missing data is filled in with a value from another, similar sample. For those who don\u2019t know, similarity in KNN is determined using a distance function (e.g., Euclidean distance).<\/li>\n<li>\n<strong>Deleting the row<\/strong>: Lastly, you can delete the row. This is not usually recommended, but it is acceptable when you have an\u00a0<strong>immense<\/strong>\u00a0amount of data to start with.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>2. Feature Engineering<\/h3>\n<p>\u00a0<\/p>\n<p>The second way you can significantly improve your machine learning model is through feature engineering. Feature engineering is the process of transforming raw data into features that better represent the underlying problem that one is trying to solve. There\u2019s no specific way to go about this step, which is what makes data science as much of an art as it is a science. That being said, here are some things that you can consider:<\/p>\n<ul>\n<li>Converting a DateTime variable to extract just the day of the week, the month of the year, etc.<\/li>\n<li>Creating bins or buckets for a variable (e.g., for a height variable, you could have 100\u2013149 cm, 150\u2013199 cm, 200\u2013249 cm, etc.).<\/li>\n<li>Combining multiple features and\/or values to create a new one. For example, one of the most accurate models for the Titanic challenge engineered a new variable called \u201cIs_women_or_child\u201d, which was True if the person was a woman or a child and False otherwise.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>3. Feature Selection<\/h3>\n<p>\u00a0<\/p>\n<p>The third area where you can vastly improve the accuracy of your model is feature selection, which is choosing the most relevant\/valuable features of your dataset. 
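To make this concrete, here is a minimal, hedged sketch of ranking features by a random forest's importances (the dataset is scikit-learn's built-in breast cancer data, chosen purely for illustration):

```python
# Hedged sketch: ranking features by a random forest's feature_importances_.
# The dataset choice is illustrative only; any tabular dataset would work.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Quickly fit a forest just to read off which features it found useful.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort, most useful first.
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

In scikit-learn, these impurity-based importances are normalized to sum to 1, so they can be read as relative shares of predictive usefulness. 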
Too many features can cause your algorithm to overfit, and too few features can cause your algorithm to underfit.<\/p>\n<p>There are two main methods that I like to use to help with selecting features:<\/p>\n<ul>\n<li>\n<strong>Feature importance<\/strong>: Some algorithms, like random forests or XGBoost, allow you to determine which features were the most \u201cimportant\u201d in predicting the target variable\u2019s value. By quickly creating one of these models and checking its feature importances, you\u2019ll get an understanding of which variables are more useful than others.<\/li>\n<li>\n<strong>Dimensionality reduction<\/strong>: One of the most common dimensionality reduction techniques, Principal Component Analysis (PCA), takes a large number of features and uses linear algebra to reduce them to a smaller number of features.<\/li>\n<\/ul>\n<p>\u00a0<\/p>\n<h3>4. Ensemble Learning Algorithms<\/h3>\n<p>\u00a0<\/p>\n<p>One of the easiest ways to improve your machine learning model is to simply choose a better machine learning algorithm. If you don\u2019t already know what ensemble learning algorithms are, now is the time to learn about them!<\/p>\n<p><strong>Ensemble learning<\/strong>\u00a0is a method in which multiple learning algorithms are used in conjunction. The purpose of doing so is to achieve higher predictive performance than you could with any individual algorithm by itself.<\/p>\n<p>Popular ensemble learning algorithms include random forests, XGBoost, gradient boosting, and AdaBoost. To explain why ensemble learning algorithms are so powerful, I\u2019ll give an example with random forests:<\/p>\n<p>Random forests involve creating multiple decision trees using bootstrapped datasets of the original data. The model then selects the mode (the majority vote) of the predictions from all of the decision trees. What\u2019s the point of this? 
By relying on a \u201cmajority wins\u201d model, it reduces the risk of error from any individual tree.<\/p>\n<p><img class=\"aligncenter size-large\" src=\"https:\/\/miro.medium.com\/max\/875\/0*6PTjwrvixQU1Hyp_.png\" width=\"90%\"><\/p>\n<p>For example, if we created just one decision tree, the third one, it would predict 0. But if we relied on the mode of all 4 decision trees, the predicted value would be 1. This is the power of ensemble learning!<\/p>\n<p>\u00a0<\/p>\n<h3>5. Adjusting Hyperparameters<\/h3>\n<p>\u00a0<\/p>\n<p>Lastly, something that is not often talked about, but is still very important, is adjusting the hyperparameters of your model. This is where it\u2019s essential that you clearly understand the ML model that you\u2019re working with; otherwise, it can be difficult to understand what each hyperparameter does.<\/p>\n<p>Take a look at all of the hyperparameters for Random Forests:<\/p>\n<div>\n<pre>class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, \r\nmin_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', \r\nmax_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, \r\noob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, \r\nclass_weight=None, ccp_alpha=0.0, max_samples=None)\r\n\r\n<\/pre>\n<\/div>\n<p>\u00a0<\/p>\n<p>For example, it would probably be a good idea to understand what min_impurity_decrease does: a node is split only if the split decreases the impurity by at least this value, so raising it makes the trees split less aggressively, which can help when your model is overfitting! \ud83d\ude09<\/p>\n<p><a href=\"https:\/\/towardsdatascience.com\/how-i-consistently-improve-my-machine-learning-models-from-80-to-over-90-accuracy-6097063e1c9a\" target=\"_blank\" rel=\"noopener noreferrer\">Original<\/a>. 
Reposted with permission.<\/p>\n<p>\u00a0<\/p>\n<p><b>Related:<\/b><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>https:\/\/www.kdnuggets.com\/2020\/09\/improve-machine-learning-models-accuracy.html<\/p>\n","protected":false},"author":0,"featured_media":1852,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1851"}],"collection":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"replies":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/comments?post=1851"}],"version-history":[{"count":0,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/posts\/1851\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media\/1852"}],"wp:attachment":[{"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/media?parent=1851"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/categories?post=1851"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/wealthrevelation.com\/data-science\/wp-json\/wp\/v2\/tags?post=1851"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}