Fundamentals of Data Science for Developers

In part 1 of this series we covered the foundational skills a developer needs to get started with, understand, and apply data science, machine learning, and AI to their everyday work.
In part 2 of the guide we will explore the language of data science and outline the various types of machine learning. The goal of this part is to enable developers to effectively communicate with data scientists and understand the core concepts behind machine learning models.
Models
Data scientists create machine learning models. But what exactly is a model?
As a developer, I find it useful to think of a model as an algorithm. An algorithm takes input parameters, performs a set of operations on them, and returns one or more output values. Machine learning and traditional programming represent two fundamentally different approaches to solving problems by creating algorithms.
In traditional programming, a developer explicitly writes code to instruct a computer on how to perform a specific task. The program consists of a set of rules and logic defined by the developer and it operates strictly according to these predefined rules so that the output is deterministic. Because of the deterministic nature of the programmed algorithms, the developer can write tests to assert the behaviour of the program. Techniques such as Test Driven Development can be used to ensure the algorithm conforms to the expected behaviour whilst promoting good design and maintainability.
Models as Algorithms
In machine learning, the data scientist creates an algorithm (or model) that learns the relationship between inputs and outputs from the provided data during the training process. The model identifies patterns and relationships within the data to make predictions or decisions without explicit instructions from a developer. Because the output is based on learned patterns and can vary with different input data, machine learning models are probabilistic. The behaviour of a model may not be entirely predictable and can change with new data. Testing machine learning models must therefore also be probabilistic, and generally involves validation against a known dataset and comparison of performance metrics against agreed thresholds.
Extending the analogy further, you can think of the training process of machine learning as being equivalent to the compilation step in traditional programming. During this process the internal state of the model is created by the data. Model validation is the equivalent of unit testing - where we verify the model behaves correctly. And finally, inference is equivalent to running or executing the generated algorithm.
To confuse things slightly, algorithm is also a term used in machine learning to describe the technique used to convert data into a model. There are many machine learning algorithms, such as linear and logistic regression, random forests, gradient boosting, k-nearest neighbours, support vector machines, and deep learning.
To avoid this confusion, it can be helpful to think of a machine learning algorithm as being like a compiler. A compiler translates source code to create an executable, whereas a machine learning algorithm translates data (the equivalent of the source code) to create a model (the executable).
When to use Machine Learning
Programming and machine learning both create algorithms, so when should machine learning be used instead of traditional programming techniques?
As a rule of thumb, machine learning can and should be used if:
- The problem involves complex pattern recognition that is difficult to describe with rules
- The environment is dynamic and the system needs to adapt to new data
- You have access to sufficiently large datasets that can be used for training
- Probabilistic predictions are acceptable and valuable
Data
Different machine learning tasks need to process different types of data. Data science generally groups data into two categories: structured and unstructured.
Unstructured data is information that is raw and unorganised. Text, images, video and audio are classified as unstructured.
Structured data is information that is organised. You can think of structured data as a table (i.e. a relational database table, spreadsheet or CSV file). It adheres to a pre-defined schema with fixed fields, where each row is a sample and each column represents a distinct value or measurement for that sample.
These definitions are not clear cut. It is possible to have unstructured data embedded within structured data. Examples of this include text or image fields within an otherwise structured database table.
Data Encoding
Computers inherently understand and process information in numerical form. To be able to train a machine learning model, data must be converted into numerical representations. There are many ways to encode data. Different data types require different encoding techniques to ensure that the information is accurately captured and understood by the algorithms.
A structured data table may contain values that are either continuous or categorical.
Continuous data can take on any value within a defined range, is measured on a continuous scale, and is represented in code as a numeric type (int, float, double, etc.). Examples include age, salary, or house price. Continuous data can be directly used by machine learning algorithms.
Categorical data consists of discrete values that fall into distinct categories or groups, such as gender, country, or product types. In code, categorical data might be represented by an enum type or a set of constant string values. Categorical data can be either ordinal or nominal.
In ordinal categorical data, the categories have a meaningful order. For example, a rating may have the values "bad", "OK", and "good": good is better than OK, and OK is better than bad. For this type of data, label encoding is used to transform the values into numbers that can then be used by machine learning algorithms. This can be done manually by assigning a number to each unique value (bad=1, OK=2, good=3) or by using the LabelEncoder from scikit-learn, as sketched below.
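Here is a minimal sketch of both approaches, assuming scikit-learn is installed; the ratings list is invented for the example.

```python
# A minimal sketch of label encoding for ordinal data.
from sklearn.preprocessing import LabelEncoder

ratings = ["bad", "OK", "good", "OK", "bad"]

# Manual mapping makes the intended order explicit.
order = {"bad": 1, "OK": 2, "good": 3}
print([order[r] for r in ratings])  # [1, 2, 3, 2, 1]

# LabelEncoder assigns integers automatically, but note that it sorts
# the categories ("OK"=0, "bad"=1, "good"=2) rather than respecting
# their ordinal meaning, so the manual mapping is often safer here.
encoder = LabelEncoder()
print(encoder.fit_transform(ratings))  # [1 0 2 0 1]
```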
In nominal categorical data, the categories have no meaningful order: in a list of colours containing the values "red", "green" and "blue", no one colour can be said to be better than another and all have equal value. For this type of data, one-hot encoding is used to transform the values into numbers. One-hot encoding creates a binary value for each category, so N unique categories are encoded with N values. A common variant, dummy encoding, drops one of these values: only N-1 are needed, because the dropped category is implied when all the others are zero. In our colour example, we can use the two values "R" and "G" to encode the three unique colours: Red = (R=1, G=0), Green = (R=0, G=1), Blue = (R=0, G=0). Encoding data in this manner can be done using the OneHotEncoder from scikit-learn or the get_dummies function from pandas, as sketched below.
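The following sketch shows both libraries, assuming scikit-learn and pandas are installed; the colour column is invented for the example.

```python
# A minimal sketch of one-hot (dummy) encoding for nominal data.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colours = pd.DataFrame({"colour": ["red", "green", "blue", "green"]})

# pandas: drop_first=True gives the N-1 (dummy) encoding described
# above; omit it to keep one column per category.
print(pd.get_dummies(colours["colour"], drop_first=True))

# scikit-learn: drop="first" achieves the same (sparse_output requires
# scikit-learn >= 1.2).
encoder = OneHotEncoder(drop="first", sparse_output=False)
encoded = encoder.fit_transform(colours[["colour"]])
print(encoder.get_feature_names_out(), encoded, sep="\n")
```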
Unstructured image data can be represented as numerical matrices where each pixel's intensity is a number. For colour images, three matrices representing the RGB channels are used.
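For example, here is a minimal sketch using the Pillow and NumPy libraries; "photo.jpg" is a placeholder path.

```python
# A minimal sketch of an image as numerical matrices.
import numpy as np
from PIL import Image

pixels = np.asarray(Image.open("photo.jpg"))  # placeholder path

# A colour image becomes an array of shape (height, width, 3): one
# matrix per RGB channel, each entry a 0-255 intensity value.
print(pixels.shape, pixels.dtype)
red = pixels[:, :, 0]  # the matrix of red intensities
```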
Transformation of unstructured text data is a complicated subject area worthy of its own article in this series. Typically, text is tokenised, i.e. split into smaller tokens such as characters, words, or subwords. Tokens may then be one-hot encoded, passed through other machine learning models to create word embeddings, classified using sentiment analysis, or processed by other classical encoding techniques such as TF-IDF.
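As a small taste, here is a sketch of TF-IDF encoding, assuming scikit-learn is installed; the two documents are invented for the example.

```python
# A minimal sketch of TF-IDF text encoding.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

# TfidfVectorizer tokenises each document into words and weights each
# token by how distinctive it is across the corpus.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())
print(matrix.toarray().round(2))
```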
Data Labelling
Some machine learning tasks need their training data to be labelled, i.e. both the input data and the corresponding correct output value are required.
For structured data the expected output value is usually included directly in the training data as an extra column in the table of the dataset.
For unstructured data, which is usually represented as files, the output labels are typically implicitly encoded in the folder structure (e.g. for a dataset intended to be used to categorise images as cats or dogs, the images of cats may be in a "cats" folder and dogs in a "dogs" folder). Alternatively, the mapping of files to expected outputs may be encoded in a separate structured data file.
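For example, here is a minimal sketch of reading labels from a folder structure; the data/cats and data/dogs paths are hypothetical.

```python
# A minimal sketch of deriving labels from a folder structure.
from pathlib import Path

samples = [
    (path, path.parent.name)  # the parent folder name is the label
    for path in sorted(Path("data").glob("*/*.jpg"))
]
for path, label in samples[:5]:
    print(path, "->", label)
```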
Types of Machine Learning
Now that we understand that machine learning models are algorithms trained from data, let's look at the types of problems machine learning excels at solving.
Supervised Learning
Supervised learning is a type of machine learning where the model is trained on a labelled dataset. In this approach, each training example consists of an input and the corresponding correct output (label). The goal is for the model to learn a mapping from inputs to outputs so that it can make accurate predictions on new, unseen data.
Supervised learning encompasses both classification and regression tasks.
Classification is where a model is used to predict a category (see the sketch after this list). Examples of classification tasks include:
- Predicting if an image is a cat or a dog.
- Determining if an email is spam (or ham).
- Deducing which passengers survived the Titanic shipwreck based on attributes such as age, gender, ticket price and cabin number.
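A minimal classification sketch, assuming scikit-learn is installed; the bundled iris dataset stands in for a real problem.

```python
# A minimal sketch of supervised classification.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labelled examples, then predict categories for unseen data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```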
Regression is where a model is used to predict a continuous value (see the sketch after this list). Examples of regression tasks include:
- Estimating house prices based on attributes such as location, number of rooms, and overall square footage.
- Predicting a person's age based on an image of their face.
- Calculating the rating of a film given a set of textual reviews.
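A minimal regression sketch, assuming scikit-learn is installed; the bundled California housing dataset stands in for a real house-price problem.

```python
# A minimal sketch of supervised regression.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on labelled examples, then predict a continuous value (price).
model = LinearRegression()
model.fit(X_train, y_train)
print("R^2 score:", model.score(X_test, y_test))
```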
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model is trained on unlabelled data, without any specific outputs or target variables to guide the learning process. The goal of unsupervised learning is to find patterns, relationships, or structures within the data.
Unsupervised learning tasks are varied. Most can be classified as clustering, dimensionality reduction, or autoregression tasks.
Clustering involves grouping similar data points together based on their features (see the sketch after this list). Examples of clustering tasks include:
- Segmenting customers into groups based on purchasing behaviour, demographics, or browsing patterns.
- Partitioning an image into segments or clusters of pixels that represent different objects or regions within the image.
- Detecting fraud by identifying anomalies, outliers or unusual data points that do not fit into any of the established clusters.
- Building recommender systems that group users or items based on similarity to provide personalised recommendations.
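A minimal clustering sketch, assuming scikit-learn is installed; the synthetic blobs stand in for real customer features.

```python
# A minimal sketch of unsupervised clustering with k-means.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# No labels are provided: k-means discovers the four groups itself.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)
print(clusters[:10])
```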
Dimensionality reduction is a technique in machine learning and data analysis that is used to reduce the number of features (variables, attributes) in a dataset while preserving as much of the relevant information as possible. This process simplifies the dataset, making it easier to visualise, understand, and analyse, while often improving the performance of machine learning models. Dimensionality reduction techniques include principal component analysis, linear discriminant analysis, and t-distributed stochastic neighbour embedding.
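A minimal dimensionality reduction sketch using principal component analysis, assuming scikit-learn is installed; iris stands in for a higher-dimensional dataset.

```python
# A minimal sketch of dimensionality reduction with PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Project the four original features down to two components while
# retaining as much variance as possible.
pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape, pca.explained_variance_ratio_)
```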
Autoregressive models predict the next element in a sequence based on previous elements (see the sketch after this list). Autoregression is arguably the most important learning algorithm today, as many large language models (LLMs) are fundamentally based on autoregressive principles. Examples of autoregression tasks include:
- Predicting the probability distribution of the next word given a sequence of words or tokens.
- Modelling and predicting financial time series data such as stock prices, exchange rates, or interest rates.
- Forecasting future temperatures, precipitation, or other weather-related metrics based on past weather data.
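A minimal autoregression sketch, assuming NumPy and scikit-learn are installed; the noisy sine wave is an invented time series.

```python
# A minimal sketch of autoregression: predict the next value in a
# sequence from the previous ones.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 20, 200)) + rng.normal(0, 0.1, 200)

# Build lagged training pairs: the previous 5 values predict the next.
lags = 5
X = np.array([series[i : i + lags] for i in range(len(series) - lags)])
y = series[lags:]

model = LinearRegression().fit(X, y)
print("next value:", model.predict(series[-lags:].reshape(1, -1)))
```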
Comparison of Supervised and Unsupervised Learning
Supervised and unsupervised learning are the two main types of machine learning. We know that these techniques differ in their data requirements: supervised learning requires labelled data, whereas unsupervised learning can use unlabelled data. But if we consider the major types of tasks these techniques can perform, there are some similarities.
Both supervised classification and unsupervised clustering aim to assign labels to data points. In classification, the labels are predefined and known, whereas in clustering, the labels (clusters) are discovered from the data itself.
Both supervised regression and unsupervised autoregression aim to predict continuous values. Regression predicts a value based on input features, while autoregression predicts a value based on previous values.
Other Types of Machine Learning
Supervised and unsupervised learning are not the only types of machine learning. Two other types of machine learning are semi-supervised learning and reinforcement learning.
Semi-supervised learning is a type of machine learning that falls between supervised and unsupervised learning. It involves training a model on a dataset that includes both labelled and unlabelled data. The goal is to leverage the large amount of unlabelled data, along with the smaller amount of labelled data, to improve learning accuracy. The small set of labelled data provides initial guidance, and the larger set of unlabelled data then enhances learning and improves model performance. This can reduce the need for extensive labelling, which can be expensive and time-consuming.
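A minimal semi-supervised sketch, assuming scikit-learn is installed; most of the iris labels are deliberately hidden (marked -1) to mimic a mostly-unlabelled dataset.

```python
# A minimal sketch of semi-supervised learning with self-training.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.8] = -1  # hide ~80% of the labels

# The classifier trains on the few labelled points, then iteratively
# pseudo-labels the confident unlabelled ones and retrains.
model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)
print("accuracy on full labels:", model.score(X, y))
```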
Reinforcement learning is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximise some notion of cumulative reward. Unlike supervised learning, which relies on a dataset of labelled examples, reinforcement learning is based on the interaction between the agent and the environment, where the agent learns from the consequences of its actions.
Reinforcement learning will be examined more closely in a future post in this series.
Next Steps
In the next article in the series (coming soon) we will explore the machine learning lifecycle and compare it to software delivery in order to understand how to build effective and reliable machine learning solutions.