nunosempere.github.io/maths-prog/MachineLearningDemystified
2019-10-09 21:07:30 +02:00

Machine Learning Demystified

Several friends encouraged me to apply for a Data Scientist position at ID Insights, an organization I greatly admire; it is a role I would be passionate about. Unfortunately, they require Python, and I'm most proficient with R. I decided to apply anyway, but first I familiarized myself thoroughly with numpy, pandas and sklearn, three of the most important libraries for machine learning in Python.

I used a dataset from Kaggle, Health Care Cost Analysis, referenced as "insurance.csv" throughout the code. To run the scripts, the reader will also have to change the variable "directory" so that it points to the folder where they saved that file.
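
For readers unsure what changing "directory" amounts to, here is a minimal sketch of how the file could be loaded with pandas. The helper name `load_insurance` and the default value of `directory` are my own placeholders for illustration, not necessarily what the scripts in this folder do.

```python
import os

import pandas as pd

# Placeholder: replace with the folder where you saved insurance.csv.
directory = "."


def load_insurance(directory, filename="insurance.csv"):
    """Read the Kaggle file from `directory` into a pandas DataFrame."""
    path = os.path.join(directory, filename)
    return pd.read_csv(path)


# Only attempt to load the file if it is actually present.
if os.path.exists(os.path.join(directory, "insurance.csv")):
    df = load_insurance(directory)
    print(df.head())  # first five rows, as a quick sanity check
```

Building the path with `os.path.join` keeps the snippet independent of the operating system's path separator, which matters if the same "directory" value is shared between machines.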

The current files in this directory are:

  • CleaningUpData.py. I couldn't work with the dataset directly, so I tweaked it somewhat.
  • AlgorithmsClassification.py. As a first exercise, I try to predict whether the medical bills of a particular individual are higher than the mean of the dataset. Some algorithms, like Naïve Bayes, are not really suitable for regression, but are great for predicting classes.
  • AlgorithmsRegression,py. I try to predict the healthcare costs of a particular individual, using all the features in the dataset.
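
The two exercises above can be sketched roughly as follows. This is a toy illustration on synthetic data, not the code in the files themselves: the column names ("age", "bmi", "smoker", "charges"), the helper names, and the choice of `GaussianNB` and `LinearRegression` as representative models are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for the cleaned dataset (column names are assumed)."""
    rng = np.random.default_rng(seed)
    age = rng.integers(18, 65, size=n)
    bmi = rng.normal(30, 6, size=n)
    smoker = rng.integers(0, 2, size=n)
    noise = rng.normal(0, 2000, size=n)
    charges = 250 * age + 300 * bmi + 20000 * smoker + noise
    return pd.DataFrame(
        {"age": age, "bmi": bmi, "smoker": smoker, "charges": charges}
    )


def above_mean_label(charges):
    """Classification target: 1 if the bill exceeds the dataset mean, else 0."""
    return (charges > charges.mean()).astype(int)


def fit_both(df, seed=0):
    """Fit one classifier and one regressor; return their test-set scores."""
    X = df.drop(columns="charges")
    y_reg = df["charges"]
    # The mean is taken over the whole dataset, as described in the list above.
    y_clf = above_mean_label(y_reg)
    X_tr, X_te, yr_tr, yr_te, yc_tr, yc_te = train_test_split(
        X, y_reg, y_clf, test_size=0.25, random_state=seed
    )
    clf = GaussianNB().fit(X_tr, yc_tr)        # predicts the class
    reg = LinearRegression().fit(X_tr, yr_tr)  # predicts the cost itself
    # score() is accuracy for the classifier and R^2 for the regressor.
    return clf.score(X_te, yc_te), reg.score(X_te, yr_te)


accuracy, r2 = fit_both(make_toy_data())
print(f"classification accuracy: {accuracy:.2f}, regression R^2: {r2:.2f}")
```

Both models share the same `fit` / `score` interface, so swapping in another sklearn estimator is usually a one-line change.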

Thoughts on sklearn

The exercise proved highly instructive: sklearn is really easy to use, and its documentation is extremely nice. The following captures my current state of mind: