Setting up a Local Environment for Machine Learning

Before we can get started with Machine Learning, we need to make sure we have the right environment set up. Here’s a list of things you can do to get started coding ML:

  • Install Git
  • Install Python / Anaconda
  • Get comfortable with the Command Line

Install Git

Git is a version control system (VCS) that lets you track and manage changes to a project. Git isn’t the only VCS in existence; there are others, such as Mercurial, Subversion, and CVS, but Git happens to be the most popular. Frankly, I wouldn’t advise you to code a fully-fledged project without some form of version control. You can learn more about Git (including installation steps) in the article Git Fundamentals.

Install Python / Anaconda

Python is lowkey one of the coolest programming languages in the world. It is simple and easy to learn, yet powerful enough to accomplish almost anything. Python is the language of choice for many Data Science and Machine Learning projects, but it is not the only language you can use for ML: C++, JavaScript, Java, R, Julia, Scala, and many other languages have been applied successfully in machine learning applications. Python also has a large community and a number of libraries that implement the routine tasks ML engineers need for their work.
You can follow this link for instructions on installing Python on your local environment: https://cloud.google.com/python/setup

If you want to do any serious ML coding, I would advise installing Anaconda. It is a tool that greatly eases the management of packages and dependencies. In fact, Anaconda comes bundled with Python and all the basic packages you need for data science and ML development. To install Anaconda, follow the instructions in this link.

Once you complete installation, you can verify the list of installed packages by entering the following command:

$ conda list

At the very least, you should have the following packages installed:

  • NumPy: a package for fast array processing. Instead of using for loops to operate on each element of an array, NumPy provides universal functions that perform vectorized operations. Unlike regular for loops, which the Python interpreter executes one step at a time, these operations run pre-compiled C routines over the array elements, which makes them much faster.
  • Pandas: a very powerful package for data analysis and extraction of information from datasets. A number of operations available in Pandas are similar to what we already do with SQL databases. Pandas exposes methods for manipulating and querying datasets.
  • Matplotlib: useful for data visualization and plotting of graphs.
  • Scikit-Learn: ships with ready-made implementations of machine learning algorithms and helper classes that ease machine learning development.
  • Jupyter: an interactive web-based development environment for coding Jupyter notebooks. Jupyter notebooks can be shared easily and also support interactive, live coding.
  • IPython: an interactive shell (REPL console) where you can run snippets of Python code and get quick feedback. Jupyter is built on IPython. In fact, IPython provides a Python kernel for Jupyter.
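
To make the NumPy point above a bit more concrete, here is a minimal sketch that squares the same numbers twice: once with an explicit for loop and once with a universal function. The function names squares_loop and squares_vectorized are made up for this illustration, and the snippet assumes NumPy is already installed.

```python
import numpy as np


def squares_loop(values):
    """Square each element with an explicit Python for loop."""
    result = []
    for v in values:
        result.append(v * v)
    return result


def squares_vectorized(values):
    """Square every element with one call to a NumPy universal function."""
    # np.square is applied to the whole array at once in compiled code.
    return np.square(np.asarray(values))


data = list(range(5))
print(squares_loop(data))        # a plain Python list
print(squares_vectorized(data))  # a NumPy array holding the same values
```

Both versions give the same numbers. The difference only becomes noticeable on large arrays, where the vectorized call avoids running the loop body in the interpreter for every element.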

Without Anaconda, you will have to install each of these packages manually.
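
If you are not sure which of these packages are already available in your environment, a quick check from Python itself can help. This is only a sketch: missing_packages is a hypothetical helper, and the import names in REQUIRED (for example sklearn for Scikit-Learn and jupyter_core for Jupyter) are my assumptions about how these packages are usually imported, so adjust them if your setup differs.

```python
from importlib.util import find_spec

# Import names are assumptions; some differ from the package's display name
# (e.g. Scikit-Learn is imported as "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn", "jupyter_core", "IPython"]


def missing_packages(names):
    """Return the names that Python cannot locate for import."""
    # find_spec returns None when a top-level module is not installed,
    # without actually importing the module.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All listed packages were found.")
```

Anything reported as not found would need to be installed before you continue.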

Get comfortable with the Command Line

The command line is where you get to run commands and do all kinds of cool things. A developer should really know their way around the command line. So if you don’t already have a grasp of basic Linux or Windows commands, there are tons of resources online. You can learn as you go.

To fire up your Jupyter environment, navigate to the folder of your choice on the command line and enter the command:

$ jupyter notebook

This will start up a server for Jupyter notebooks in that folder and then you can edit your notebooks from your browser.
You can code your notebooks in the Jupyter environment itself, or you can use a particularly helpful VS Code extension I recently discovered, which lets you code notebooks in VS Code. Whatever works for you!

Other Resources to get you started with ML

Kaggle is one of the largest communities of data scientists and ML enthusiasts, more like a social network for data scientists. You could sign up on the platform and be an active member; there are all kinds of people doing interesting things on there.

Jake VanderPlas’ Python Data Science Handbook is another great resource. I used it when I was starting out in ML. It helped with learning to work with packages like NumPy, Pandas, and Matplotlib. 

Another great book I would recommend is Ethem Alpaydin’s Introduction to Machine Learning. A friend of mine recommended it to me, and I found it a great resource for understanding the mathematics behind a number of machine learning techniques.

And then there’s Python Machine Learning by Sebastian Raschka & Vahid Mirjalili. It is useful for beginners with a strong math background and gives a fine mix of theory and application.

There are loads of courses on tutorial sites that can also help you get started. I found two courses on Udemy that are great for beginners, especially if you prefer video tutorials to ebooks:

  • Machine Learning A-Z: Hands On by Kirill Eremenko
  • Deep Learning A-Z: Hands On by Kirill Eremenko

Machine Learning is a really wide and deep field. It can feel intimidating sometimes, but don’t you fret. Take your time to understand the basics and build a strong foundation, so that as you progress you won’t struggle with the more advanced concepts.