Mechanism to Group Fingerprints

Initial Research

Methodology

Three different algorithms were used to group the fingerprints found in the Assessment DB (Test):

  • DBSCAN

  • Two variants of an algorithm that maps connected components of a similarity graph to groups of fingerprints (see the sketch after this list)

    • Algorithm A: The variant that creates an edge between two fingerprints whenever the two fingerprints are similar enough

    • Algorithm B: The variant that creates an edge between two fingerprints whenever the two fingerprints are similar enough and both of them are also similar enough to the fingerprint that originally seeded the component
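
To make the difference concrete, here is a minimal sketch of the two edge criteria. Similarity() is a hypothetical helper returning the mean profile similarity in [0, 1] (the real computation goes through FingerprintComparator), and 0.80 is the minimum-similarity hyper-parameter described below.

// Sketch only: Similarity() and the Fingerprint type are placeholders.
bool EdgeA(Fingerprint a, Fingerprint b) =>
    Similarity(a, b) >= 0.80;                 // Algorithm A: pairwise similarity suffices

bool EdgeB(Fingerprint a, Fingerprint b, Fingerprint seed) =>
    Similarity(a, b) >= 0.80                  // Algorithm B: pairwise similarity...
    && Similarity(a, seed) >= 0.80            // ...and both fingerprints must also be
    && Similarity(b, seed) >= 0.80;           // similar to the component's seed fingerprint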

Hyper-parameters for DBSCAN

  • Minimum number of data points in a cluster: 1

  • How to compute the distance between fingerprints: it is computed with the FingerprintComparator class, which returns a list of values representing the percentage of similarity between each pair of profiles. The distance is 1 - x, where x is the mean of those profile similarities (see the sketch after this list).

  • Epsilon: 0.2 (the counterpart of the 80% minimum similarity used for the other algorithms)
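
As a sketch, the distance can be expressed as follows. The Compare() call is an assumption about the FingerprintComparator API (a list of per-profile similarity percentages in 0-100); the actual signature may differ.

using System.Collections.Generic;
using System.Linq;

// Sketch: distance = 1 - mean(profile similarities), so epsilon = 0.2
// corresponds to the 80% minimum similarity used by the other algorithms.
double Distance(Fingerprint a, Fingerprint b)
{
    IEnumerable<double> similarities = new FingerprintComparator().Compare(a, b); // assumed API
    return 1.0 - similarities.Average() / 100.0;
}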

Hyper-parameters for Algorithms A and B

  • Minimum similarity required to state that two fingerprints are similar: 80%

Results

For each algorithm, the following data was gathered:

  • Consistent Groups: Groups in which all the fingerprints are similar to each other.

  • Groups Created: The number of groups/clusters created by the algorithm.

Due to the way the three algorithms work, the following property holds for any two clusters A and B created by the algorithms: there is at least one fingerprint in A that does not match some fingerprint in B.
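
For reference, the "consistent" check above amounts to an all-pairs comparison. A minimal sketch, again with a hypothetical Similarity() helper returning the mean profile similarity in [0, 1]:

// Sketch: a group is "consistent" when every pair of fingerprints in it
// meets the 80% similarity threshold. Similarity() is a placeholder for
// the FingerprintComparator-based computation.
bool IsConsistent(IReadOnlyList<Fingerprint> group)
{
    for (int i = 0; i < group.Count; i++)
        for (int j = i + 1; j < group.Count; j++)
            if (Similarity(group[i], group[j]) < 0.80)
                return false;
    return true;
}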

| Algorithm | Consistent Groups | Groups Created | Details |
| --- | --- | --- | --- |
| DBSCAN | 129 | 136 | Most of the groups created were consistent. |
| Algorithm A | 1 | 2 | Almost all of the fingerprints were grouped together in a single cluster, which was not consistent. |
| Algorithm B | 131 | 137 | The only difference from DBSCAN is that this algorithm split one cluster that DBSCAN left whole (and that was not consistent), producing two consistent groups. |

Conclusions

  • Both DBSCAN and Algorithm B produce "good" clusters. Their quality could be better understood using a metric that evaluates clustering without knowing the "ground truth" (Performance Metrics for Clustering).

  • Other hyper-parameters could be tested.

  • Algorithm A did not work at all, probably because the similarity between two fingerprints is not transitive: for example, if A is 85% similar to B and B is 85% similar to C, A and C can still fall well below the threshold, yet all three end up in the same component, so almost everything chains into a single cluster.

  • Algorithm B is equivalent to the incremental computation that was proposed before.

Main Research

Methodology

  1. The algorithms for fingerprint grouping were tested multiple times with different parameters

    1. The groups that were created were analyzed

    2. The behavior of the fingerprint comparison mechanism was also analyzed

  2. An experiment was conducted during which two codebases were modified to see how these changes affected the similarity between fingerprints

    1. A large codebase

    2. A small codebase

Results

Analysis of fingerprint grouping and fingerprint comparison mechanisms

  • There are false positives and false negatives.

  • False positives: two fingerprints for codebases that are not similar show a high similarity percentage.

  • False negatives: two fingerprints for similar codebases show a low similarity percentage.

  • We were able to analyze the groups generated.

    • The fingerprints taken from Generic Scanner are consistent: executions that were grouped together report the same number of files, except in a few cases.

  • We were not able to deeply analyze the executions associated with Snow Convert, due to the lack of file-level information.

    • We were able to compare the number of files and the size of the codebase, and these seem fine in most cases.

Experiment: Changing codebases and analyzing the similarity of the fingerprints for each version

  • Small codebase

    • First analysis [7361]

    • A new comment [7362] (decrease of less than 0.1%)

      • 99.9% with original

      • 99.9% with previous one

    • A multi-line comment [7363] (decrease of 8%)

      • 91% with original

      • 92% with previous one

    • Duplicate file [7364] (decrease of 11%)

      • 81% with original

      • 89% with previous one

    • Delete file [7365] (decrease of 20%)

      • 60% with original

      • 80% with previous one

  • Large codebase

    • First analysis [7367]

    • Large comment [7368] (decrease of less than 0.1%)

      • 99.9% with original

      • 99.9% with previous one

    • Duplicate file [7369] (decrease of less than 0.1%)

      • 99.9% with original

      • 99.9% with previous one

    • Delete folder [7370] (decrease of 15%)

      • 84% with original

      • 85% with previous one

    • Delete two files [7371] (decrease of 1%)

      • 84% with original

      • 99% with previous one

Conclusions

Experiment: Changing codebases and analyzing the similarity of the fingerprints for each version

  • Small codebases are more affected by changes than large codebases

False positives and false negatives

  • We talked with Andrés Rojas about the false positives and false negatives, and he understands that this is a first approach.

  • The false positives should be avoided by using fingerprint extensibility in the future.

  • The false positives appear much less often when the grouping takes the customers into account.

  • The false negatives are mainly related to small codebases: small changes affect these codebases a lot.

    • We could maybe "duplicate" the files in cases where the codebase is small. We might have to do tests with this eventually.

General

  • We will be using the algorithm that, given a Fingerprint and the Customer Id of the customer who uploaded the assessment model:

    • Analyzes the average similarity for each fingerprint group linked to that customer

      • Average Similarity: the average of the similarities between the new fingerprint and each fingerprint in the group

    • From the eligible groups, chooses the one with the maximum average similarity

      • The eligible groups are those in which each fingerprint is at least 80% similar to the new fingerprint

    • If there are no eligible groups, a new group is created containing only the new fingerprint

  • This algorithm can easily be swapped out in the code for the Assessment API

    • It is only necessary to implement an IFingerprintLabellingAlgorithm.

    • The FingerprintDataManager class uses an instance of IFingerprintLabellingAlgorithm; the concrete type of this instance can be changed to the implementation you want to use (see the sketch below).
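
For example, a minimal sketch of the wiring (the constructor shape is illustrative; FingerprintDataManager may receive the instance differently):

// Sketch: swapping the algorithm only changes the injected instance.
IFingerprintLabellingAlgorithm algorithm = new AverageSimilarityLabellingAlgorithm();
var dataManager = new FingerprintDataManager(algorithm); // illustrative constructor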

Diagrams

Component Diagram (with Steps)

Sequence Diagram

Implementations of IFingerprintLabellingAlgorithm

The interface IFingerprintLabellingAlgorithm declares a single method:

public interface IFingerprintLabellingAlgorithm
{
    long? LabelFingerprint(Fingerprint.Fingerprint fingerprint, IEnumerable<FingerprintGroup> fingerprintGroups);
}

The algorithm receives the Fingerprint that is going to be labelled, and an enumerable with all the known fingerprint groups. It returns the Id of the FingerprintGroup to which the fingerprint should belong.

AverageSimilarityLabellingAlgorithm

This is the algorithm currently used for fingerprint labelling. The procedure is equivalent to the following (a code sketch follows the list):

  • For each existing FingerprintGroup:

    • If the fingerprint "fits" in the group (it is at least 80% similar to every fingerprint already in the group), the group is considered "eligible" and its average similarity is computed (the mean of the similarities between the new fingerprint and each fingerprint in the group).

    • If that average similarity is greater than the maximum seen so far, the group becomes the best group so far.

  • After visiting every FingerprintGroup:

    • If no eligible group was found, a new FingerprintGroup is created in the database and its Id is returned.

    • Otherwise, the Id of the best FingerprintGroup is returned.
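
A minimal sketch of that procedure is shown below. It assumes FingerprintGroup exposes an Id and its Fingerprints, and it uses two hypothetical helpers: Similarity(), which would wrap FingerprintComparator, and CreateGroup(), which would insert the new group and return its Id.

using System;
using System.Collections.Generic;
using System.Linq;

public class AverageSimilarityLabellingAlgorithm : IFingerprintLabellingAlgorithm
{
    private const double MinSimilarity = 0.80;

    public long? LabelFingerprint(Fingerprint.Fingerprint fingerprint,
                                  IEnumerable<FingerprintGroup> fingerprintGroups)
    {
        long? bestGroupId = null;
        double bestAverage = double.MinValue;

        foreach (FingerprintGroup group in fingerprintGroups)
        {
            // The fingerprint "fits" only if it is at least 80% similar
            // to every fingerprint already in the group.
            if (!group.Fingerprints.All(f => Similarity(fingerprint, f) >= MinSimilarity))
                continue;

            // Eligible group: keep it if its average similarity is the best so far.
            double average = group.Fingerprints.Average(f => Similarity(fingerprint, f));
            if (average > bestAverage)
            {
                bestAverage = average;
                bestGroupId = group.Id;
            }
        }

        // No eligible group was found: create a new group with only this fingerprint.
        return bestGroupId ?? CreateGroup(fingerprint);
    }

    // Hypothetical helper: mean of the FingerprintComparator profile similarities.
    private double Similarity(Fingerprint.Fingerprint a, Fingerprint.Fingerprint b)
        => throw new NotImplementedException();

    // Hypothetical helper: persists a new FingerprintGroup and returns its Id.
    private long CreateGroup(Fingerprint.Fingerprint fingerprint)
        => throw new NotImplementedException();
}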
