From e64d7ab5596300ad28e7c39f7838d94ab96ea6a9 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Nu=C3=B1o=20Sempere?=
Date: Sat, 12 Oct 2019 19:34:51 +0200
Subject: [PATCH] Update readme.md

---
 maths-prog/MachineLearningDemystified/readme.md | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/maths-prog/MachineLearningDemystified/readme.md b/maths-prog/MachineLearningDemystified/readme.md
index b2bf061..827dce3 100644
--- a/maths-prog/MachineLearningDemystified/readme.md
+++ b/maths-prog/MachineLearningDemystified/readme.md
@@ -11,7 +11,7 @@ Otherwise, the current files in this directory are:
 - Algorithms: Naïve Bayes (Bernoulli & Gaussian), Nearest Neighbours, Support Vector Machines, Decision Trees, Random Forests (and Extrarandom forests), and multilayer perceptron (simple NN).
 - [AlgorithmsRegression.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/AlgorithmsRegression.py). I try to predict the healthcare costs of a particular individual, using all the features in the dataset.
 - Algorithms: Linear Regression, Lasso, Nearest Neighbours Regression, LinearSVR, SVR with different kernels, Tree regression, Random forest regression (and extra-random forest regression), and multilayer perceptron regression (simple NN).
-- [Clustering.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/Clustering.py). I then studied some of the most common clustering algorithms. The area seems almost pre-Aristotelian. Clustering algorithms get the task to *[send a message to Garcia](https://courses.csail.mit.edu/6.803/pdf/hubbard1899.pdf)*, and they undertake the task, no questions asked.
+- [Clustering.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/Clustering.py). I then studied some of the most common clustering algorithms. The area seems almost pre-Aristotelian. Clustering algorithms get the task to *[send a message to Garcia](https://courses.csail.mit.edu/6.803/pdf/hubbard1899.pdf)*, and they undertake it, no questions asked. Heroically. I also take the opportunity here to create some visualizations with the seaborn library.
 - Algorithms: KMeans, Affinity Propagation, Mean Shift, Spectral Clustering, Agglomerative Clustering, DBSCAN, Birch, Gaussian Mixture.
 
 ## Thoughts on sklearn
@@ -23,3 +23,17 @@ The exercise proved highly, highly instructive, because sklearn is really easy t
 It came as a surprise to me that understanding and implementing the algorithm were two completely different steps.
 
 ## Some visualizations and findings about the dataset.
+
+- Those who have 4+ children get charged less by insurance, and smoke less.
+![](children-charge-smoking.png)
+
+- The disaggregation by age seems interesting, because there are roughly three prongs: 1) normal people who don't smoke, 2) those who get charged more, also made up of people who don't smoke, and 3) those who get charged a lot, which comprises only smokers. The Gaussian Mixture and K-Means algorithms do better than most others at discriminating between these three groups, and made me realize the difference.
+
+![](GaussianMixture-age.png)
+![](GaussianMixture-smoker_numeric.png)
+
+![](age_charge_smoking.png)
+![](AgglomerativeClustering-age.png)
+
+- BMI is interesting, because there seems to be a line at BMI = 30, almost as if someone used that threshold to decide how much to charge, or what to diagnose. Normally, we'd expect something more continuous.
+![](AgglomerativeClustering-age.png)
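For context on the clustering comparison the README describes, here is a minimal, hypothetical sketch (not the repo's Clustering.py) comparing KMeans and GaussianMixture, two of the algorithms listed, on synthetic data standing in for the dataset's features:

```python
# Hypothetical sketch (not the repository's Clustering.py): compare two
# of the clustering algorithms mentioned in the README on synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three well-separated blobs, standing in for the three "prongs"
# the README observes in the age/charges data.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Both estimators are asked for three groups; they "send the message
# to Garcia" and label every point, no questions asked.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

print(len(set(kmeans_labels)), len(set(gmm_labels)))
```

On well-separated synthetic blobs both methods recover three clusters; on the real insurance data, per the README, Gaussian Mixture and K-Means separated the three charge groups better than most of the other algorithms tried.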