Many supervised and unsupervised machine learning models, such as K-Nearest Neighbors and K-Means, rely on the distance between data points to predict an output. The metric we use to compute these distances therefore plays an important role in these models.
A distance metric uses a distance function that quantifies the relationship between each pair of elements in the dataset.
A good distance metric can significantly improve the performance of classification, clustering, and information retrieval.
In this blog, we will look in depth at two distance metrics used in machine learning models, Euclidean and Manhattan distance, and at how they help in machine learning modelling.
Euclidean Distance Metric:
Euclidean Distance represents the shortest distance between two points.
The “Euclidean Distance” between two objects is the distance you would expect in “flat” or “Euclidean” space; it’s named after Euclid, who worked out the rules of geometry on a flat surface.
This is often the “default” distance used in, for example, K-nearest neighbors (classification) or K-means (clustering) to find the “k closest points” to a particular sample point. The “closeness” is defined by the difference (“distance”) along the scale of each variable, which is then converted to a similarity measure.
Euclidean distance is only one of many available options for measuring the distance between two vectors or data objects. However, many classification algorithms, as mentioned above, use it either to train the classifier or to decide the class membership of a test observation, and clustering algorithms (e.g. K-means, K-medoids) use it to assign data objects to clusters.
Mathematically, it’s calculated using Pythagoras’ theorem. The square of the total distance between two objects is the sum of the squares of the distances along each perpendicular co-ordinate.
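To make this concrete, here is a minimal sketch in plain Python. For two points p and q with n coordinates, the distance is sqrt((p1 - q1)^2 + ... + (pn - qn)^2). The function name `euclidean_distance` and the points A and B are illustrative assumptions, not values taken from the figure in this post.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Sum the squared difference along each coordinate, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical points chosen for illustration only.
A = (1, 2)
B = (4, 6)
print(euclidean_distance(A, B))  # 5.0, since sqrt(3^2 + 4^2) = sqrt(25)
```

The same function works in any number of dimensions, because it only loops over matching coordinates.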
Manhattan Distance Metric:
Manhattan Distance is the sum of absolute differences between points across all the dimensions.
Manhattan distance is a metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. Put simply, in two dimensions it is the absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.
It is also known as Manhattan length, rectilinear distance, L1 distance or L1 norm, Minkowski’s L1 distance, taxi-cab metric, or city block distance.
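A minimal sketch of the same idea in plain Python follows. The function name `manhattan_distance` and the points A and B are assumptions made for illustration, not values taken from the figure in this post.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Add up how far apart the points are along each axis.
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical points chosen for illustration only.
A = (1, 2)
B = (4, 6)
print(manhattan_distance(A, B))  # 7, since |1 - 4| + |2 - 6| = 3 + 4
```

For the same pair of points the Manhattan distance (7) is larger than the straight-line distance (5), because it only allows movement along the axes.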
Applications of the Manhattan distance metric include:
- Regression analysis: It is used in least absolute deviations regression, which fits a straight line to a given set of points by minimizing the sum of the absolute residuals.
- Compressed sensing: In solving an underdetermined system of linear equations, the regularisation term for the parameter vector is expressed in terms of Manhattan distance. This approach appears in the signal recovery framework called compressed sensing.
- Frequency distribution: It is used to assess the differences in discrete frequency distributions.
We’ll calculate the Euclidean and Manhattan distances for the example given below, which should give an intuition for both.
Consider the figure given below.
For both distance calculations, our aim is to find the distance between the points A and B.
Let’s first use the Euclidean approach to calculate the distance AB.
Now, consider the Manhattan approach for the same points.
What we have seen so far is the mathematical approach to finding Euclidean and Manhattan distances.
Let’s now jump into the practical side and see how we can implement both of them in Python code for machine learning, using the popular scikit-learn (sklearn) library.
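The original code for this step is not reproduced here, so what follows is only a sketch of one way to do it, assuming scikit-learn and NumPy are installed. It uses `euclidean_distances` and `manhattan_distances` from `sklearn.metrics.pairwise`, and the `metric` parameter of `KNeighborsClassifier`. The points A and B and the toy training data are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances, manhattan_distances
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical points; scikit-learn expects 2D arrays of shape (n_samples, n_features).
A = np.array([[1, 2]])
B = np.array([[4, 6]])

# Each call returns a distance matrix; [0, 0] picks the single A-to-B entry.
euclid_ab = euclidean_distances(A, B)[0, 0]
manhattan_ab = manhattan_distances(A, B)[0, 0]
print(euclid_ab, manhattan_ab)  # approximately 5.0 and 7.0

# Toy training data (an assumption for this sketch): two well-separated classes.
X = np.array([[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

# The same classifier can be told which distance metric to use.
knn_euclid = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
knn_manhattan = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X, y)

pred_e = knn_euclid.predict(np.array([[1, 2]]))
pred_m = knn_manhattan.predict(np.array([[9, 8]]))
print(pred_e, pred_m)
```

On data this cleanly separated both metrics give the same neighbours. On real datasets the choice of metric can change which points count as "closest", so it is worth treating it as a hyperparameter.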
Apart from these two, there are other popular distance metrics:
- Hamming Distance: Used to calculate the distance between binary vectors, as the number of positions at which they differ.
- Minkowski Distance: Generalization of Euclidean and Manhattan distance; order 1 gives Manhattan distance and order 2 gives Euclidean distance.
- Cosine distance: Based on cosine similarity, which measures the cosine of the angle between two vectors of an inner product space. Cosine distance is one minus the cosine similarity.
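The three metrics in the list above can be sketched in plain Python as follows. The function names and sample vectors are illustrative assumptions, and each function assumes its two inputs have the same length.

```python
import math


def hamming_distance(u, v):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))


def minkowski_distance(p, q, order):
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(a - b) ** order for a, b in zip(p, q)) ** (1 / order)


def cosine_distance(u, v):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


# Hypothetical inputs chosen for illustration only.
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # 2 positions differ
print(minkowski_distance((1, 2), (4, 6), 1))          # Manhattan case
print(minkowski_distance((1, 2), (4, 6), 2))          # Euclidean case
print(cosine_distance((1, 0), (0, 1)))                # perpendicular vectors
```

Note that some libraries report Hamming distance as a fraction of differing positions rather than a count, so check the convention before comparing numbers.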