# Machine Learning Demystified
Several friends encouraged me to apply for a Data Scientist position at ID Insights, an organization I greatly admire, in a role I would be passionate about. I decided to apply. Beforehand, I familiarized myself thoroughly with numpy, pandas and sklearn, three of the most important libraries for machine learning in Python.

I used a dataset from Kaggle: [Health Care Cost Analysis](https://www.kaggle.com/flagma/health-care-cost-analysys-prediction-python/data), referenced as "insurance.csv" throughout the code. The reader will also have to change the variable "directory" to fit their needs.

Otherwise, the current files in this directory are:
- [CleaningUpData.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/CleaningUpData.py). I couldn't work with the dataset directly, so I tweaked it somewhat.
- [AlgorithmsClassification.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/AlgorithmsClassification.py). As a first exercise, I try to predict whether the medical bills of a particular individual are higher than the mean of the dataset. Some algorithms, like Naïve Bayes, are not really suitable for regression, but are great for predicting classes. After the first couple of examples, I wrap everything in a function.
    - Algorithms: Naïve Bayes (Bernoulli & Gaussian), Nearest Neighbours, Support Vector Machines, Decision Trees, Random Forests (and extra-random forests), and multilayer perceptron (simple NN).
- [AlgorithmsRegression.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/AlgorithmsRegression.py). I try to predict the healthcare costs of a particular individual, using all the features in the dataset.
    - Algorithms: Linear Regression, Lasso, Nearest Neighbours Regression, LinearSVR, SVR with different kernels, Tree regression, Random forest regression (and extra-random forest regression), and multilayer perceptron regression (simple NN).
- [Clustering.py](https://github.com/NunoSempere/nunosempere.github.io/blob/master/maths-prog/MachineLearningDemystified/Clustering.py). I then studied some of the most common clustering algorithms. The area seems almost pre-Aristotelian. Clustering algorithms are handed the task to *[send a message to Garcia](https://courses.csail.mit.edu/6.803/pdf/hubbard1899.pdf)*, and they undertake it, no questions asked.
    - Algorithms: KMeans, Affinity Propagation, Mean Shift, Spectral Clustering, Agglomerative Clustering, DBSCAN, Birch, Gaussian Mixture.
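
The classification exercise described above (label each individual by whether their cost exceeds the dataset mean, then wrap the fit-and-score steps in a function) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in AlgorithmsClassification.py: it uses synthetic data from `make_regression` in place of "insurance.csv", and the helper name `evaluate`, the train/test split, and the choice of four classifiers are all hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned dataset (assumption: NOT the Kaggle data).
X, cost = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=0)

# Binary target: is this individual's cost above the mean of the dataset?
y = (cost > cost.mean()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def evaluate(classifier):
    """Fit on the training split and return accuracy on the test split."""
    classifier.fit(X_train, y_train)
    return classifier.score(X_test, y_test)


# Every sklearn classifier exposes the same fit/score interface,
# so one helper covers all of them.
classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {name: evaluate(clf) for name, clf in classifiers.items()}
for name, accuracy in results.items():
    print(f"{name}: {accuracy:.3f}")
```

To try this on the real data, `X` and `cost` would be replaced by the feature columns and the cost column of the cleaned CSV; the rest of the sketch should carry over unchanged.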
## Thoughts on sklearn
The exercise proved highly, highly instructive, because sklearn is really easy to use, and the [documentation](https://scikit-learn.org/stable/) is also extremely nice. The following captures my current state of mind:

![](https://data36.com/wp-content/uploads/2018/06/machineLearning.png)

It came as a surprise to me that understanding an algorithm and implementing it were two completely different steps.

## Some visualizations and findings about the dataset