Machine Learning Algorithms (not exhaustive)




Machine Learning (ML), a subfield of artificial intelligence (AI), involves the development of statistical algorithms that "can learn from data and generalize to unseen data and thus perform tasks without explicit instructions." Recent advancements in AI are largely driven by neural networks. 

Here are some of the key ML algorithms, categorized into their respective subfields, to aid in understanding and selecting the appropriate algorithm for a given problem.

Main Branches of Machine Learning

Machine Learning is broadly divided into two main areas:

Supervised Learning
Supervised learning occurs when a dataset contains both independent variables (features/input variables) and a dependent variable (target/output variable) that is to be predicted. The algorithm is trained on a "training data set where we know the True Values for the output variable also called labels."

Analogy: "Showing a little kid what a typical cat looks like and what a typical dog looks like and then giving it a new picture and asking it what animal it sees."

Examples: Predicting house prices based on features like square footage, location, and year of construction; categorizing an object as a cat or a dog based on height, weight, ear size, and eye color.

Subcategories:
Regression: Predicts a continuous numeric target variable.

Example: Predicting house prices based on various features, determining relationships (e.g., square footage is proportional to price, age has no influence).

Classification: Assigns a discrete categorical label (class) to a data point.

Example: Assigning "spam" or "no spam" to an email; categorizing emails into "junk," "primary," "social," "promotions," and "updates."

Unsupervised Learning
Unsupervised learning encompasses problems where "no truth about the data is known," meaning there are no labels or target variables to predict. The goal is to find underlying structures or patterns within the data.

Analogy: "Giving a kid with no idea of what cats and dogs are a pile of pictures of animals and asking it to group by similarity without any further instructions."

Examples: Sorting emails into unspecified categories that can later be inspected and named; finding inherent groupings in customer data.

Key Supervised Learning Algorithms
Linear Regression
Concept: The "mother of all machine learning algorithms," it aims to find a linear relationship between input and output variables.

Mechanism: "Fit a linear equation to the data by minimizing the sum of squares of the distances between data points and the regression line." This minimizes the average distance between real data and the predictive model.

Example: Predicting a person's height based on shoe size, where "for every one unit of shoe size increase the person will be on average 2 inches taller."

Complexity: Can be extended to multi-dimensional data by including additional features (e.g., gender, age, ethnicity for shoe size prediction). Many advanced algorithms, including neural networks, are extensions of this basic idea.
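A minimal sketch of the shoe-size example, assuming synthetic data and scikit-learn's LinearRegression (neither is specified in the source):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): shoe size -> height in inches
rng = np.random.default_rng(0)
shoe_size = rng.uniform(5, 13, size=(100, 1))
height = 55 + 2.0 * shoe_size[:, 0] + rng.normal(0, 1.5, size=100)

# Fit slope and intercept by minimizing the sum of squared residuals
model = LinearRegression().fit(shoe_size, height)

print(model.coef_[0])            # roughly 2 inches per unit of shoe size
print(model.predict([[10.5]]))   # predicted height for shoe size 10.5
```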

Logistic Regression
Concept: A variant of linear regression used for classification tasks, predicting a categorical output variable.

Mechanism: Instead of fitting a line, it fits a "sigmoid function" to the data.

Output: The equation provides the "probability of a data point falling into a certain class given the value of the input variable."

Example: Predicting the gender of a person based on height and weight, where the output might be an 80% likelihood that an adult with a height of 180 cm is male.
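A hedged sketch of the height/weight example, assuming made-up data and scikit-learn's LogisticRegression; the exact probability depends entirely on the synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed): height (cm) and weight (kg) -> 0 = female, 1 = male
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([165, 62], [6, 7], size=(100, 2)),
               rng.normal([178, 80], [7, 9], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Probability that a 180 cm, 80 kg adult belongs to class 1 (male)
print(clf.predict_proba([[180, 80]])[0, 1])
```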

K-Nearest Neighbors (KNN)
Concept: A "non-parametric" algorithm (no equations or model parameters are fitted) used for both regression and classification.

Mechanism: For a new data point, it predicts the target based on the average (regression) or majority class (classification) of its 'K' nearest neighbors.

Hyperparameter K: 'K' is a hyperparameter; choosing the right 'K' is crucial.
Small 'K' (e.g., 1-2) can lead to overfitting (good on training data, poor on unseen data).
Large 'K' (e.g., 1000) can lead to underfitting (poor overall fit).
Optimal 'K' depends on the problem and often requires methods like cross-validation.

Examples:
Classification: A person's gender is the same as the majority of their five closest neighbors in weight and height.

Regression: A person's weight is the average weight of their three closest neighbors in height and chest circumference.
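A sketch of the classification example, including a simple cross-validation loop for choosing 'K'; the synthetic data and the use of scikit-learn's KNeighborsClassifier are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed): height (cm) and weight (kg) -> gender label
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([165, 62], [6, 7], size=(100, 2)),
               rng.normal([178, 80], [7, 9], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Try several values of K and keep the one with the best cross-validated accuracy
for k in (1, 5, 25, 75):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"K={k}: accuracy={score:.2f}")
```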

Support Vector Machine (SVM)
Concept: A supervised algorithm primarily for classification, also applicable to regression. It draws a "decision boundary between data points that separates data points of the training set as well as possible."

Mechanism: Aims to find the decision boundary (line in 2D, hyperplane in higher dimensions) that "separates the classes with the largest margin possible," maximizing space between classes. This improves generalization and robustness to noise/outliers.

Support Vectors: Data points "that sit on the edge of the margin" are called support vectors; knowing these is sufficient for classification, making it memory efficient.

Strengths: Performs well on high-dimensional data.

Kernel Functions: Crucially, SVMs use "kernel functions" (e.g., linear, polynomial, RBF, sigmoid) to identify "highly complex nonlinear decision boundaries." These implicitly transform the original features into new, more complex ones (e.g., deriving BMI as weight divided by height squared), a process called "implicit feature engineering."
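A small sketch showing how the kernel choice changes the decision boundary, assuming scikit-learn's SVC and its make_moons toy dataset (an illustrative choice, not from the source):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line in the original features
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # RBF kernel yields a nonlinear boundary

print("linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
print("number of support vectors:", rbf_svm.support_vectors_.shape[0])
```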

Naive Bayes Classifier
Concept: A simple classifier based on Bayes' theorem, named "naive" due to a key assumption.
Mechanism: Calculates the probability of certain words appearing in different classes (e.g., spam vs. non-spam emails). It then classifies new data by "multiplying the different probabilities of all words in the email together."

"Naive" Assumption: Assumes "the probabilities of the different words appearing are independent of each other."

Strengths: Computationally efficient and effective for many use cases like "spam classification and other text-based classification tasks."
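A minimal spam-classification sketch, assuming a tiny made-up corpus and scikit-learn's CountVectorizer combined with MultinomialNB:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up corpus (assumed) just to show the word-count -> probability pipeline
emails = ["win a free prize now", "free money click now",
          "meeting agenda for monday", "lunch with the project team"]
labels = ["spam", "spam", "not spam", "not spam"]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(emails, labels)

print(clf.predict(["free prize meeting"]))
print(clf.predict_proba(["free prize meeting"]))   # per-class probabilities
```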

Decision Trees
Concept: A fundamental algorithm forming the basis of more complex models. It's "a series of yes no questions that allow us to partition a data set in several Dimensions."

Mechanism: Aims to create "Leaf nodes at the bottom of the tree that are as pure as possible," meaning splits are chosen to minimize misclassified data points within resulting groups.
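A short sketch of a single decision tree, assuming scikit-learn and its built-in iris dataset; export_text prints the yes/no questions the tree learned:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Iris flowers: 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)

# Limit depth so the printed tree stays readable; splits are chosen to keep leaves pure
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))
```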

Ensemble Methods (Combining Decision Trees):
Bagging (Bootstrap Aggregating): Trains multiple models on different subsets of the training data.

Random Forest: A famous bagging method where "many decision trees vote on the classification of your data by majority vote." Randomness comes from randomly excluding features for different trees, preventing overfitting and increasing robustness by decorrelating trees. Powerful for classification and regression.

Boosting: Trains models sequentially, where "each model focuses on fixing the errors made by the previous model." Combines a series of weak models into a strong one.

Characteristics: Often achieves higher accuracies than random forests but is "more prone to overfitting" and slower to train due to its sequential nature.
Examples: AdaBoost, Gradient Boosting, XGBoost.
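A sketch comparing a bagging ensemble (random forest) with a boosting ensemble, assuming scikit-learn and its built-in breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Bagging: many decorrelated trees vote (random subsets of data and features)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
# Boosting: trees are built sequentially, each correcting the previous ones' errors
gb = GradientBoostingClassifier(random_state=0)

print("random forest:", cross_val_score(rf, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(gb, X, y, cv=5).mean())
```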

Neural Networks
Concept: The "reigning king of AI," neural networks take implicit feature engineering to an advanced level, automatically designing complex features without human guidance.

Challenge with Traditional Methods: For complex tasks like digit classification (e.g., handwritten '1's), pixel intensities alone are insufficient because variations exist. Humans understand abstract features (vertical line, no crossing lines for '1'), but computers don't.

Mechanism (Implicit Feature Engineering):
Perceptron (Simplest Form): A multi-feature regression task.

Hidden Layers: By adding "additional layers of unknown variables between the input and output variables," neural networks learn "hidden features." For instance, a hidden feature might represent a "horizontal line" even if never explicitly defined.

Deep Learning: Involves "even more layers," resulting in "very complex hidden features" that can represent abstract information (e.g., "there is a face in the picture").

Learning Process: The network is trained to predict the final target as accurately as possible; the exact meaning of hidden features often remains unknown, but they lead to good predictions.
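A minimal sketch of digit classification with a small feed-forward network, assuming scikit-learn's MLPClassifier and its 8x8 digits dataset (the source does not prescribe a library or architecture):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 handwritten digits; each pixel intensity is one input feature
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers learn intermediate "hidden features" from the raw pixels
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=0)
mlp.fit(X_train, y_train)

print("test accuracy:", mlp.score(X_test, y_test))
```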

Unsupervised Learning Algorithms

Clustering
Concept: Identifies underlying structures in data when no specific target variable is available. The goal is to "find unknown clusters just by looking at the overall structure of the data."

Distinction from Classification:
Classification: Knows target classes and has labeled training data.

Clustering: No labels, aims to discover natural groupings based on similarity.

K-Means Clustering: The "most famous clustering algorithm."

Hyperparameter K: Represents "the number of clusters you are looking for." Finding the right 'K' is problem-dependent and often involves trial and error.

Mechanism (see the sketch after this list):
1. Randomly select 'K' cluster centers.
2. Assign all data points to the closest cluster center.
3. Recalculate the cluster centers based on the assigned data points.
4. Repeat steps 2 and 3 until the cluster centers stabilize.

Other Clustering Algorithms: Hierarchical clustering and DBSCAN can find clusters of arbitrary shape and don't require specifying 'K' beforehand, but they are not detailed here.
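A minimal K-Means sketch, assuming synthetic blob data and scikit-learn's KMeans; the fitted cluster centers are the result of the assign/recalculate loop described above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, unlabeled data (assumed): three blobs of points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.5, size=(50, 2))
               for center in ([0, 0], [5, 5], [0, 5])])

# K = 3 clusters; centers are reassigned and recalculated until they stabilize
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)    # stabilized cluster centers
print(kmeans.labels_[:10])        # cluster assignments for the first 10 points
```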

Dimensionality Reduction
Concept: Reduces "the number of features or dimensions of your data set keeping as much information as possible."

Mechanism: Finds correlations between existing features and removes redundant dimensions without significant information loss.

Purpose: Can simplify data for better interpretability, serve as a pre-processing step to make supervised learning algorithms more efficient and robust, and reveal relationships within features.

Example (Image Recognition): Recognizing an airplane without needing a high-resolution image by reducing the number of pixels.

Principal Component Analysis (PCA): A prominent dimensionality reduction algorithm.

Example: Predicting fish types based on length, height, color, teeth. If length and height are highly correlated, PCA can combine them into a single "shape feature."

Mechanism: Finds directions (principal components) in which "most variance in the data set is retained." The first principal component explains the most variance and becomes a new feature. Subsequent principal components are orthogonal and explain less variance; those that "don't contribute much to the variance" can be excluded.
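A sketch of the correlated-features idea, assuming synthetic fish measurements and scikit-learn's PCA; the first component should capture the shared "shape" information in length and height:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumed): fish length and height are highly correlated, color is not
rng = np.random.default_rng(0)
length = rng.normal(30, 5, size=200)
height = 0.4 * length + rng.normal(0, 0.5, size=200)
color = rng.normal(0, 1, size=200)
X = np.column_stack([length, height, color])

pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)   # first component captures most of the variance

# Keep only the components that matter: a combined "shape" feature plus color
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)
```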


