Fingerprint Grouping Mechanism

This page describes the way in which the fingerprint grouping mechanism works

General Description

This mechanism changes the way in which the assessment models are uploaded to the Assessment DB:

When a new execution is inserted, the fingerprint for that execution will be grouped into the best group possible (if there isn’t a sufficiently appropriate group, a new group will be created).
These groups of fingerprints are represented through the FingerprintGroup table: The Fingerprint table now has a FingerprintGroupId that references the Id of the FingerprintGroup to which the Fingerprint belongs.
Each Fingerprint Group belongs to a Customer: Each group can be associated to its Customer by using the CustomerId column in the FingerprintGroup table.
Some Fingerprint Groups are not assigned to a customer (this happens mostly because there was no license associated to the execution).
Finally, the fingerprints that are in the same group are at least 80% similar, according to the FingerprintComparison mechanism

There are some considerations that need to be taken into account when interpreting data related to the Fingerprint Groups. Please read the Considerations section that is included after the Implementation section

Implementation

Component Diagram (with Steps)

Sequence Diagram

Implementations of IFingerprintLabellingAlgorithm

The interface IFingerprintLabellingAlgorithm implements a single method:

public interface IFingerprintLabellingAlgorithm
{
    long? LabelFingerprint(Fingerprint.Fingerprint fingerprint, IEnumerable<FingerprintGroup> fingerprintGroups);
}

The algorithm receives the Fingerprint that is going to be labelled, and an enumerable with all the known fingerprints. It returns a number with the id of the FingerprintGroup in which the fingerprint should be.

AverageSimilarityLabellingAlgorithm

This is the current algorithm used for fingerprint labelling. The procedure is equivalent to this:

For each existing FingerprintGroup:
- If the fingerprint "fits" in the group:
  - The average similarity is computed for the fingerprint, and the fingerprint is considered "elegible" (the average similarity of the similarity between the new fingerprint and each fingerprint in the group).
- If the average similarity is more than the max average similarity seen till now:
  - The group becomes the best group at the moment
After visiting each FingerprintGroup, the Id of the best FingerprintGroup is returned.
- If no elegible group was found
  - Create a new FingerprintGroup in the database and return the Id for that group
- Else
  - The Id of the best FingerprintGroup is returned

Considerations

A research was conducted to understand how well the mechanism behaved. There are some considerations that is important to mention:

The possibility of false positives and false negatives exists:
- False Positives happen when two Fingerprints in the same Fingerprint Group actually belong to different codebases. In these cases, the main problem is that the FingerprintComparison module (that was implemented before this mechanism was even designed) detects a similarity of more than 80% between the two Fingerprints, even if they belong to different codebases. There are some actions we could take to improve the performance of this Fingerprint Comparison module. These will be mentioned in the Future Work section.
- False Negatives happen when two Fingerprints F and G are in different Fingerprint Groups and are actually similar, even though they did not end up in the same Fingerprint Group. This can happen because there is a fingerprint H in the group where F is, and the similarity between G and H is not enough, which prevented the fingerprint G from being included in the same group as F and H. In these cases, these could be count not as false negatives but rather as cases in which the codebase associated to the fingerprint G is the same codebase associated to F and H, but it has been modified so much that it does not resemble H. This is the expected behavior of the fingerprint comparison mechanism: a fingerprint should not match a previous version of the same codebase if the changes have been significant.
  - Smaller codebases tend to be less similar after some minor changes.
  - Larger codebases are more resistant to changes.

Future Work

The way in which the fingerprint is compared and calculated could be extended to lower the probability of False positives
In order to lower the probability of False positives, some other actions can be evaluated and taken, such as:
- Including the total size of the project in the comparison
- Including some other metadata to decide how to label each fingerprint

PreviousFree licenses definition NextData Analysis Vision

Last updated 2 years ago