Jaccard Similarity Algorithm

Jaccard similarity, also known as the Jaccard coefficient or Jaccard index, is a statistical measure of the similarity between two sets. It is used in applications such as natural language processing, document clustering, and collaborative filtering. The value is the number of elements in the intersection of the two sets (the elements they share) divided by the number of elements in their union (the distinct elements across both sets). It ranges from 0 to 1, where 0 means the sets share no elements and 1 means they are identical.

The measure is simple to compute and easy to interpret. It ignores how often an element occurs, which makes it well suited to binary or categorical data. For the same reason, it is a poor fit for continuous data or for analyses where element frequency matters. Despite these limitations, it remains a popular choice for comparing sets because it is straightforward to implement.
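The ratio described above can be sketched in a few lines with Python's built-in set operators. This is a minimal illustration rather than the implementation listed on this page: the function name `jaccard` is hypothetical, and returning 1.0 for two empty inputs is an assumption, since the description does not define that case.

```python
def jaccard(a, b):
    """Intersection size divided by union size for two iterables."""
    # Converting to sets collapses duplicates, so element frequency is ignored.
    a, b = set(a), set(b)
    if not a and not b:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


# 3 shared elements (c, d, e) out of 8 distinct elements overall.
print(jaccard({"a", "b", "c", "d", "e"}, {"c", "d", "e", "f", "h", "i"}))  # 0.375

# Repeated elements do not change the result: both inputs reduce to {a, b}.
print(jaccard(["a", "a", "b"], ["a", "b", "b", "b"]))  # 1.0
```

The second call shows the frequency limitation in practice: two collections with very different element counts still score 1.0 because only membership is compared.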
"""
The Jaccard similarity coefficient is a commonly used indicator of the
similarity between two sets. Let U be a set and let A and B be subsets of U.
The Jaccard index (or similarity) is then defined as the ratio of the number
of elements in their intersection to the number of elements in their union.

Inspired by Wikipedia and
the book Mining of Massive Datasets [MMDS 2nd Edition, Chapter 3]

https://en.wikipedia.org/wiki/Jaccard_index
https://mmds.org

Jaccard similarity is widely used with MinHashing.
"""


def jaccard_similariy(setA, setB, alternativeUnion=False):
    """
    Finds the Jaccard similarity between two sets.
    Essentially, it's intersection over union.

    The alternative way to calculate this is to take the union as the sum of
    the number of items in the two sets. This leads to the Jaccard similarity
    of a set with itself being 1/2 instead of 1. [MMDS 2nd Edition, Page 77]

    Parameters:
        :setA (set,list,tuple): A non-empty set, list, or tuple
        :setB (set,list,tuple): A non-empty set, list, or tuple
        :alternativeUnion (boolean): If True, use the sum of the number of
        items in both inputs as the union

    Output:
        (float) The Jaccard similarity between the two sets.

    Examples:
    >>> setA = {'a', 'b', 'c', 'd', 'e'}
    >>> setB = {'c', 'd', 'e', 'f', 'h', 'i'}
    >>> jaccard_similariy(setA,setB)
    0.375

    >>> jaccard_similariy(setA,setA)
    1.0

    >>> jaccard_similariy(setA,setA,True)
    0.5

    >>> setA = ['a', 'b', 'c', 'd', 'e']
    >>> setB = ('c', 'd', 'e', 'f', 'h', 'i')
    >>> jaccard_similariy(setA,setB)
    0.375
    """

    if isinstance(setA, set) and isinstance(setB, set):

        intersection = len(setA.intersection(setB))

        if alternativeUnion:
            union = len(setA) + len(setB)
        else:
            union = len(setA.union(setB))

        return intersection / union

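    if isinstance(setA, (list, tuple)) and isinstance(setB, (list, tuple)):

        intersection = [element for element in setA if element in setB]

        if alternativeUnion:
            # The alternative union is already a count, so divide by it directly
            # instead of calling len() on an int.
            return len(intersection) / (len(setA) + len(setB))

        # list(setA) allows setA to be a tuple; tuple + list raises TypeError.
        union = list(setA) + [element for element in setB if element not in setA]

        return len(intersection) / len(union)

    # Mixed or unsupported input types would otherwise silently return None.
    raise TypeError("setA and setB must both be sets, or both be lists/tuples")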


if __name__ == "__main__":

    setA = {"a", "b", "c", "d", "e"}
    setB = {"c", "d", "e", "f", "h", "i"}
    print(jaccard_similariy(setA, setB))
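
The module docstring notes that Jaccard similarity is widely used with MinHashing. As a rough sketch of that connection, the code below estimates the similarity from short signatures instead of the full sets. It is not taken from the module or from MMDS: the helper names (`seeded_hash`, `minhash_signature`, `estimate_jaccard`), the use of SHA-256 as a seeded hash, and the default of 256 hash functions are all assumptions chosen for illustration, and the inputs are assumed to be non-empty.

```python
import hashlib


def seeded_hash(seed, item):
    """Hypothetical helper: a deterministic 64-bit hash of (seed, item)."""
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def minhash_signature(items, num_hashes=256):
    """Keep the smallest hash value of the collection under each seed."""
    distinct = set(items)  # assumed non-empty; min() fails on an empty set
    return [
        min(seeded_hash(seed, item) for item in distinct)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two equal-length signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


set_a = {"a", "b", "c", "d", "e"}
set_b = {"c", "d", "e", "f", "h", "i"}
estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
print(estimate)  # an approximation of the exact value 0.375
```

For each seed, the two sets share the same minimum exactly when the overall minimum of their union falls in the intersection, so the fraction of matching positions approximates intersection over union. More hash functions tighten the estimate at the cost of longer signatures.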
