I find new and interesting algorithms and forget about them all the time! So from now on I’m going to save them all here.

DBSCAN

DBSCAN, aka Density-Based Spatial Clustering of Applications with Noise, is a clustering algorithm that identifies clusters as densely packed regions, in other words, groups of points that each have many close neighbors.

DBSCAN has several advantages over alternatives like k-means clustering and hierarchical clustering:

  • It determines the number of clusters on its own, rather than requiring you to specify it up front.
  • It can find clusters of arbitrary shape, not just circular ones.
  • It’s robust to outliers, which it simply labels as noise.
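
To make the density rule concrete, here’s a minimal DBSCAN sketch in C++ (the 2-D Point struct, the eps/min_pts names, and the toy data are my own; a real implementation would use a spatial index instead of this brute-force neighbor search):

#include <cmath>
#include <cstdio>
#include <vector>

struct Point { double x, y; };

// All points within eps of point i (brute force; a k-d tree would be faster).
static std::vector<int> region_query(const std::vector<Point>& pts, int i, double eps)
{
    std::vector<int> nbrs;
    for (int j = 0; j < (int)pts.size(); j++)
        if (std::hypot(pts[i].x - pts[j].x, pts[i].y - pts[j].y) <= eps)
            nbrs.push_back(j);
    return nbrs;
}

// Returns a label per point: -1 = noise, 0, 1, 2, ... = cluster id.
std::vector<int> dbscan(const std::vector<Point>& pts, double eps, int min_pts)
{
    const int UNVISITED = -2, NOISE = -1;
    std::vector<int> label(pts.size(), UNVISITED);
    int cluster = 0;
    for (int i = 0; i < (int)pts.size(); i++) {
        if (label[i] != UNVISITED) continue;
        std::vector<int> nbrs = region_query(pts, i, eps);
        if ((int)nbrs.size() < min_pts) { label[i] = NOISE; continue; } // not dense
        label[i] = cluster;
        // Grow the cluster: every dense neighbor contributes its own neighbors.
        for (size_t k = 0; k < nbrs.size(); k++) {
            int j = nbrs[k];
            if (label[j] == NOISE) label[j] = cluster; // border point: claim it
            if (label[j] != UNVISITED) continue;
            label[j] = cluster;
            std::vector<int> more = region_query(pts, j, eps);
            if ((int)more.size() >= min_pts)
                nbrs.insert(nbrs.end(), more.begin(), more.end());
        }
        cluster++;
    }
    return label;
}

int main()
{
    // Two made-up blobs plus one stray point that should come back as noise.
    std::vector<Point> pts = {{0,0},{0.2,0.1},{0.1,0.3},{5,5},{5.1,5.2},{4.9,5.1},{9,0}};
    std::vector<int> label = dbscan(pts, 0.5, 2);
    for (size_t i = 0; i < pts.size(); i++)
        std::printf("point %zu -> cluster %d\n", i, label[i]);
}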

Apriori Algorithm

The Apriori Algorithm is an association rule learning algorithm, most commonly used to find frequent itemsets: groups of items that appear together most often across a set of transactions.

To give an example, suppose we had a database of customer purchases at a grocery store. The Apriori Algorithm can be used to determine which pairs or groups of items are most frequently purchased together.

There are two main measures: support and confidence. Support is the fraction of transactions in which an itemset occurs, while confidence is the conditional probability that one item was purchased given that one or more other items were purchased.
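
Here’s a small C++ sketch of just that support/confidence bookkeeping on item pairs (the baskets are made up, and a full Apriori implementation would also prune candidate itemsets below a minimum support before growing larger ones):

#include <cstdio>
#include <iterator>
#include <map>
#include <set>
#include <string>
#include <vector>

int main()
{
    // Hypothetical baskets; in practice these come from the purchase database.
    std::vector<std::set<std::string>> baskets = {
        {"bread", "butter", "milk"}, {"bread", "butter"},
        {"bread", "milk"}, {"butter", "milk"}, {"bread", "butter"}};

    std::map<std::string, int> item_count;
    std::map<std::pair<std::string, std::string>, int> pair_count;
    for (const auto& basket : baskets) {
        for (const auto& item : basket) item_count[item]++;
        for (auto i = basket.begin(); i != basket.end(); ++i)
            for (auto j = std::next(i); j != basket.end(); ++j)
                pair_count[{*i, *j}]++;
    }

    const double n = baskets.size();
    for (const auto& [pair, count] : pair_count) {
        double support = count / n;                                 // how often the pair occurs
        double confidence = (double)count / item_count[pair.first]; // P(second | first)
        std::printf("%s -> %s  support=%.2f  confidence=%.2f\n",
                    pair.first.c_str(), pair.second.c_str(), support, confidence);
    }
}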

Holt-Winters Exponential Smoothing

Holt-Winters Exponential Smoothing, aka Triple Exponential Smoothing, is a popular forecasting technique for time series data that exhibits both a trend and seasonality.

It’s called triple exponential smoothing because it smooths three components of the series: the level, the trend, and the seasonality.

The benefits of this method of forecasting over alternatives like ARIMA are:

  • It’s simple to understand and implement.
  • It’s fairly accurate when the trend and seasonal pattern are stable.
  • And it’s computationally inexpensive.
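
As an illustration, here’s a minimal additive Holt-Winters sketch in C++ (the initialization is deliberately crude, and the smoothing parameters alpha, beta, and gamma are illustrative rather than tuned):

#include <cstdio>
#include <vector>

// Additive Holt-Winters: alpha, beta, and gamma smooth the level, trend,
// and seasonal components; m is the season length.
std::vector<double> holt_winters(const std::vector<double>& x, int m,
                                 double alpha, double beta, double gamma,
                                 int horizon)
{
    // Crude initialization from the first season.
    double level = 0.0;
    for (int i = 0; i < m; i++) level += x[i] / m;
    double trend = (x[m] - x[0]) / m;
    std::vector<double> seasonal(x.size());
    for (int i = 0; i < m; i++) seasonal[i] = x[i] - level;

    for (size_t t = m; t < x.size(); t++) {
        double prev_level = level;
        level = alpha * (x[t] - seasonal[t - m]) + (1 - alpha) * (level + trend);
        trend = beta * (level - prev_level) + (1 - beta) * trend;
        seasonal[t] = gamma * (x[t] - level) + (1 - gamma) * seasonal[t - m];
    }

    // Forecast: extrapolate the trend and repeat the last observed season.
    std::vector<double> forecast(horizon);
    for (int h = 1; h <= horizon; h++)
        forecast[h - 1] = level + h * trend + seasonal[x.size() - m + (h - 1) % m];
    return forecast;
}

int main()
{
    // Toy series: upward trend plus a period-4 seasonal bump (made up).
    std::vector<double> x = {10, 14, 12, 8, 12, 16, 14, 10, 14, 18, 16, 12};
    for (double f : holt_winters(x, 4, 0.5, 0.3, 0.3, 4))
        std::printf("%.2f\n", f);
}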

Matrix Factorization

Matrix factorization algorithms are a type of collaborative filtering algorithm commonly used to build recommendation systems.

The idea behind collaborative filtering is that it predicts the interests of a given user based on the interests of other, similar users. Computing those predictions directly from user-to-user similarities is known as a memory-based approach; matrix factorization is instead a model-based approach, where a learned model is used to predict users’ ratings of unrated items.
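
Here’s a minimal C++ sketch of the model-based idea: factor the ratings matrix into user and item vectors with stochastic gradient descent (the ratings, the factor count k, the learning rate, and the regularization strength are all made-up illustrative values):

#include <cstdio>
#include <cstdlib>
#include <vector>

// A rating triple: user r.user gave item r.item the score r.value.
struct Rating { int user, item; double value; };

int main()
{
    std::vector<Rating> ratings = {{0,0,5},{0,1,3},{1,0,4},{1,2,1},{2,1,2},{2,2,5}};
    const int users = 3, items = 3, k = 2;
    const double lr = 0.05, reg = 0.02;

    // User factors P and item factors Q, initialized to small random values.
    std::vector<std::vector<double>> P(users, std::vector<double>(k));
    std::vector<std::vector<double>> Q(items, std::vector<double>(k));
    for (auto& row : P) for (double& v : row) v = 0.1 * (rand() / (double)RAND_MAX);
    for (auto& row : Q) for (double& v : row) v = 0.1 * (rand() / (double)RAND_MAX);

    for (int epoch = 0; epoch < 200; epoch++) {
        for (const Rating& r : ratings) {
            double pred = 0;
            for (int f = 0; f < k; f++) pred += P[r.user][f] * Q[r.item][f];
            double err = r.value - pred;
            for (int f = 0; f < k; f++) { // gradient step on both factor vectors
                double pu = P[r.user][f], qi = Q[r.item][f];
                P[r.user][f] += lr * (err * qi - reg * pu);
                Q[r.item][f] += lr * (err * pu - reg * qi);
            }
        }
    }

    // Predicted rating for a (user, item) pair with no observed rating.
    double pred = 0;
    for (int f = 0; f < k; f++) pred += P[0][f] * Q[2][f];
    std::printf("predicted rating for user 0, item 2: %.2f\n", pred);
}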

Levenshtein Distance

The Levenshtein distance is a simple algorithm used to determine the similarity between two strings.

Specifically, it is equal to the minimum number of single-character edits (substitutions, insertions, deletions) required to change one word into another.

For example, the Levenshtein distance between “taco” and “eggs” is 4. The Levenshtein distance between “cross” and “crossword” is also 4. Intuitively, it’s odd that these pairs are ranked the same, which shows this algorithm’s limitations.
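
The computation itself is a classic dynamic program; here’s a short C++ sketch that reproduces the two distances above:

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// d[i][j] is the minimum number of edits to turn the first i characters
// of a into the first j characters of b.
int levenshtein(const std::string& a, const std::string& b)
{
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (size_t i = 0; i <= a.size(); i++) d[i][0] = (int)i; // delete all of a
    for (size_t j = 0; j <= b.size(); j++) d[0][j] = (int)j; // insert all of b
    for (size_t i = 1; i <= a.size(); i++)
        for (size_t j = 1; j <= b.size(); j++) {
            int sub = (a[i-1] == b[j-1]) ? 0 : 1;
            d[i][j] = std::min({d[i-1][j] + 1,        // deletion
                                d[i][j-1] + 1,        // insertion
                                d[i-1][j-1] + sub});  // substitution (or match)
        }
    return d[a.size()][b.size()];
}

int main()
{
    std::printf("%d\n", levenshtein("taco", "eggs"));       // 4
    std::printf("%d\n", levenshtein("cross", "crossword")); // 4
}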

And so, two better string similarity algorithms that I recommend looking into are the Trigram and Jaro-Winkler algorithms.

Epsilon-Greedy Algorithm

The Epsilon-Greedy Algorithm is a simple approach to the multi-armed bandit problem, which is a representation of the exploration vs exploitation dilemma.

The idea behind the problem is that there are k different alternatives, each returning a different reward, but you don’t know any of the rewards in advance. And so, you start by exploring the alternatives, and as time goes by, you face a tradeoff between exploring more options and exploiting the highest-paying option found so far.

With the Epsilon-Greedy Algorithm, a random alternative is chosen a fraction, $\varepsilon$, of the time. The rest of the time $(1-\varepsilon)$, the alternative with the highest known payout (reward) is chosen. $\varepsilon$ is a parameter that you have to set.
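
Here’s a small C++ simulation of the rule on a made-up three-armed Bernoulli bandit (the arm probabilities and $\varepsilon = 0.1$ are arbitrary choices):

#include <cstdio>
#include <random>
#include <vector>

int main()
{
    std::mt19937 rng(42);
    std::vector<double> true_mean = {0.3, 0.5, 0.7}; // hidden from the agent
    std::vector<double> est(3, 0.0);                 // estimated payout per arm
    std::vector<int> pulls(3, 0);
    const double eps = 0.1;
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::uniform_int_distribution<int> pick(0, 2);

    for (int t = 0; t < 10000; t++) {
        int arm;
        if (unif(rng) < eps) {
            arm = pick(rng);                         // explore: random arm
        } else {
            arm = 0;                                 // exploit: best estimate so far
            for (int a = 1; a < 3; a++) if (est[a] > est[arm]) arm = a;
        }
        double reward = (unif(rng) < true_mean[arm]) ? 1.0 : 0.0; // Bernoulli payout
        pulls[arm]++;
        est[arm] += (reward - est[arm]) / pulls[arm]; // incremental mean update
    }
    for (int a = 0; a < 3; a++)
        std::printf("arm %d: %d pulls, estimated payout %.3f\n", a, pulls[a], est[a]);
}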

Better solutions include the upper confidence bound solution and Bayesian Thompson sampling.

Harris Corner Detector

The Harris Corner Detector is an operator used in computer vision algorithms to identify corners in an image. This is important for image processing and computer vision because corners are distinctive, stable features. The detector slides a small window over the image and measures how the gradient changes when the window shifts in different directions:

  • In a flat region, there is no gradient change (intensity change) in any direction.
  • In an edge region, there is no gradient change along the direction of the edge.
  • Only in a corner region is there a gradient change in all directions.

Thus, this response is computed across the image to determine where the corners are.
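
Here’s a bare-bones C++ sketch of the Harris response on a tiny synthetic image (the 3x3 window, the constant k = 0.04, and the toy image are illustrative; real implementations smooth the structure tensor with a Gaussian and threshold the response rather than taking a single maximum):

#include <cstdio>
#include <vector>

int main()
{
    // Toy 10x10 grayscale image: a bright square whose corner sits at (5, 5).
    const int N = 10;
    std::vector<std::vector<double>> img(N, std::vector<double>(N, 0.0));
    for (int y = 5; y < N; y++)
        for (int x = 5; x < N; x++) img[y][x] = 1.0;

    const double k = 0.04;
    double best_r = -1e9; int best_x = 0, best_y = 0;
    for (int y = 2; y < N - 2; y++) {
        for (int x = 2; x < N - 2; x++) {
            // Accumulate the structure tensor over a 3x3 window.
            double sxx = 0, syy = 0, sxy = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    double ix = (img[y+dy][x+dx+1] - img[y+dy][x+dx-1]) / 2; // central差 gradients
                    double iy = (img[y+dy+1][x+dx] - img[y+dy-1][x+dx]) / 2;
                    sxx += ix * ix; syy += iy * iy; sxy += ix * iy;
                }
            // Harris response: R = det(M) - k * trace(M)^2.
            double r = (sxx * syy - sxy * sxy) - k * (sxx + syy) * (sxx + syy);
            if (r > best_r) { best_r = r; best_x = x; best_y = y; }
        }
    }
    std::printf("strongest corner response at (%d, %d)\n", best_x, best_y);
}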

You want to know the coolest SQL hack I learned this week?

In SQL, we often need to search character fields using the percent (%) wildcard. When I put the wildcard at the end of the search string (String%) the query can seek on the index, but if I put it at the front (%String) it has to scan the entire index.

This significantly increases your run time!

Example:

SELECT *
FROM table
WHERE column1 LIKE 'Some_string%'

vs.

SELECT *
FROM table
WHERE column1 LIKE '%Some_string'

We say that option 1 is Sargable (Search ARGument ABLE), and option 2 is not Sargable, meaning option 2 could not leverage the index on the column.

To make leading-wildcard searches Sargable and significantly decrease your run time, store and index the reversed string (in SQL Server, for instance, as a persisted computed column on REVERSE(column1)) and flip the pattern so the wildcard lands at the end:

SELECT *
FROM table
WHERE REVERSE(column1) LIKE REVERSE('Some_string') + '%'

And boom, with the reversed index in place, your query runs much, much faster!

Newton-Horner

The Newton-Horner method finds a root of a polynomial using Newton’s method, with Horner’s scheme (synthetic division) used to evaluate both the polynomial and its derivative at each iterate. Here it is in C++.

#include <cmath>
#include <iostream>

// Horner's Method: evaluate a polynomial, given its coefficients from
// highest degree to lowest, at the point x
double horner(double poly_coeff[], int poly_deg, double x)
{
    double b = poly_coeff[0]; // Equation (3)
    // Compute value of defined polynomial using Horner's method
    for (int i=1; i<=poly_deg; i++)
    {
        b = b*x + poly_coeff[i]; // Equation (4)
    }
    return b;
}

// Newton and Horner's Method

double *newton( double poly_coeff[], int poly_deg, int nmax, double x )
{
    double *p, *q; // Synthetic-division work arrays
    const double TOL = 1e-8;
    double error = 1;
    int i, iter = 0;
    p = new double [poly_deg+1]; // Allocate memory based on the degree
    q = new double [poly_deg+1];

    while( iter < nmax && error > TOL )
    {
        // First synthetic division: p[poly_deg] is the polynomial's value at x
        p[0] = poly_coeff[0];
        for( i=1; i<=poly_deg; i++ )
        {
            p[i] = poly_coeff[i] + p[i-1]*x;
        }
        // Second division on the quotient: q[poly_deg-1] is the derivative's value at x
        q[0] = p[0];
        for( i=1; i<poly_deg; i++ )
        {
            q[i] = p[i] + q[i-1]*x;
        }
        double x_old = x; // Store previous iterate to calculate error
        x = x_old - p[poly_deg]/q[poly_deg-1]; // Newton's method step
        error = fabs(x - x_old);
        iter++;
    }
    std::cout << "The root of the polynomial = " << x << std::endl;
    std::cout << "The process took " << iter << " iterations." << std::endl;
    delete[] q; // q is only needed internally
    return p;   // caller owns p (the deflated coefficients) and must delete[] it
}
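
A quick usage sketch (the cubic and the starting guess are made up for illustration):

int main()
{
    double coeff[] = {1, -6, 11, -6}; // x^3 - 6x^2 + 11x - 6, roots 1, 2, and 3
    double *deflated = newton(coeff, 3, 100, 2.5);
    delete[] deflated; // newton() hands ownership of p back to the caller
    return 0;
}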