Rapid Scan Report

Proposed Report Structure

#1 Summary & Terminology

Files Inventory Summary (pie chart):
1. Identified Text Files
2. Identified Binary Files
3. Unidentified Files
Identified Technologies
1. Pie Chart (files by technology)
  - Configure units? (files, bytes, lines)
  - For more information, see the "1.1 Identified Text Files" reports below.

All terminology must be clear: identified technologies & file types (glossary?)
"You can customize RapidScan" => link to the "How RapidScan Works" section below.

#2 Index of Plots/Datasets

Files Inventory:
1. Identified Text Files
  1. Configurable Bar Chart & Dataset
    - Params = Units (Files, Lines, Bytes) & Group by: (Extension, Technology)
    - Separate lines by category: Content, Comment, Blank
  2. Content Lines vs Comment Lines Scatter Plot
2. Identified Binary Files
  1. Configurable Bar Chart & Dataset
    - Params = Units (Files, ~~Lines~~, Bytes) & Group by: (Extension, Technology)
3. Unidentified Files
  1. Configurable Bar Chart & Dataset
    - Params = Units (Files, Lines, Bytes)
    - Group by Extension only
Keywords Inventory
1. Configurable Keywords list by technology
  - Params = Technology

#3 Marketing Section 1

#4 Actual Plots/Datasets

Plots/datasets following the same structure defined in #2

#5 Marketing Section 2

#6 How RapidScan Works

How it works
How to customize
How to make deeper analysis of specific technologies
Links to more info and contacts

Dimensions

File Inventory

Measures

Files
Bytes
Lines

Type of Extension

Known Binary
- Grouped by Technology
- Not grouped by Technology
Known Text
- Grouped by Technology
- Not grouped by Technology
Unknown
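
As an illustration of how the measures (Files, Bytes, Lines) could be aggregated over the extension types above, here is a minimal sketch. The extension tables and the function names `classify` and `summarize` are hypothetical and do not reflect RapidScan's actual configuration; a technology of `None` stands for "not grouped by technology".

```python
from collections import defaultdict
from pathlib import PurePath

# Hypothetical extension tables; the real RapidScan configuration may differ.
# A value of None means the extension is known but not tied to a technology.
KNOWN_TEXT = {".java": "Java", ".py": "Python", ".txt": None}
KNOWN_BINARY = {".class": "Java", ".png": None}


def classify(path):
    """Return (type, technology) for a file path based on its extension."""
    ext = PurePath(path).suffix.lower()
    if ext in KNOWN_TEXT:
        return "Known Text", KNOWN_TEXT[ext]
    if ext in KNOWN_BINARY:
        return "Known Binary", KNOWN_BINARY[ext]
    return "Unknown", None


def summarize(files):
    """Aggregate Files/Bytes/Lines per (type, technology).

    `files` is an iterable of (path, size_in_bytes, line_count) tuples.
    """
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "lines": 0})
    for path, size, lines in files:
        key = classify(path)
        totals[key]["files"] += 1
        totals[key]["bytes"] += size
        totals[key]["lines"] += lines
    return dict(totals)
```

For binary files, callers would pass a line count of 0, since Lines is not a meaningful unit there.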

Keywords

Columns: technology, extension, file, keyword.

Can be grouped by technology, extension.

Only for known text.
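
The grouping described above could look roughly like this. The sample rows and the function name `count_keywords` are made up for illustration; only the four column names come from these notes.

```python
from collections import Counter

# Hypothetical rows with the columns: technology, extension, file, keyword.
rows = [
    ("Java", ".java", "Main.java", "if"),
    ("Java", ".java", "Main.java", "for"),
    ("Java", ".java", "Util.java", "if"),
    ("Python", ".py", "app.py", "if"),
]


def count_keywords(rows, group_by="technology"):
    """Count keyword occurrences grouped by 'technology' or 'extension'."""
    index = {"technology": 0, "extension": 1}[group_by]
    # Key each count by (group value, keyword).
    return Counter((row[index], row[3]) for row in rows)
```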

Analysis Ideas

File clustering by keywords
Rate of branching keywords by file
Comparisons with other known real-life applications
- Create database of known apps
- Update database of known apps
- Comparisons
  - Rates of content vs comment lines
  - Distribution of keywords
  - Distribution of branching keywords
Keywords Sequence Analysis?
Folder Tree Localization Patterns
- Distributions in the tree branches of technologies, extensions, etc.

Next Steps

Collect data for data comparison
1. Decide how we collect data: Manually
2. Collect Data from Projects that have already been analyzed (Brandon/Monica)
3. Collect Data from new Projects (GitHub)
4. Store Data for the projects
  1. XML seems like ~~the best~~ an option (XLSX is not very readable, CSV would have to be split into different files)
    We want to store summary data
  2. We will store a couple of CSVs with the information from all the other projects
  3. How can we store the information about which keywords are branching keywords?
    This could be in the configuration (it is important to add a disclaimer explaining that changing the configuration of existing languages/technologies can make the report less accurate).
    In this case, KeywordCounts.csv would have an additional column, isBranching, which can be true or false
New Charts/Changes in Existing Charts
1. Add trend in content vs comment lines scatter plot
2. Add branching keywords vs content lines scatter plot
  1. Add trend
3. Modify Keyword Bar Chart. Add options with a new combo box:
  1. Normal
  2. My Keyword Usage vs Average Keyword Usage (Keyword Usage = # of times a keyword appears / # of times any keyword appears)
  3. My Keyword Density vs Average Keyword Density (Keyword Density = # of times a keyword appears / # of content lines for the technology)
4. Box Plot
  1. We also considered Violin Plot and Distribution Plot
Consider if we can update the database with new scans from RapidScan
1. How can we make sure the data for a project is updated only once?
2. How can we make sure the extensions are being analyzed correctly? For example, a .java file might, for some reason, not actually contain Java.
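
To make the Keyword Usage and Keyword Density definitions from the list above concrete, here is a minimal sketch that computes both from a KeywordCounts.csv-style input with the isBranching column. The other column names (`technology`, `keyword`, `count`), the sample data, and the function name `keyword_metrics` are assumptions made for this example.

```python
import csv
import io

# Hypothetical KeywordCounts.csv content; every column name other than
# isBranching is an assumption made for this sketch.
SAMPLE = """technology,keyword,count,isBranching
Java,if,30,true
Java,for,10,true
Java,return,60,false
"""


def keyword_metrics(csv_text, content_lines):
    """Return {keyword: (usage, density, is_branching)} for one technology.

    usage   = times the keyword appears / times any keyword appears
    density = times the keyword appears / content lines for the technology
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    total = sum(int(r["count"]) for r in rows)
    metrics = {}
    for r in rows:
        count = int(r["count"])
        usage = count / total if total else 0.0
        density = count / content_lines if content_lines else 0.0
        metrics[r["keyword"]] = (usage, density, r["isBranching"] == "true")
    return metrics
```

The "average" side of each comparison would come from applying the same calculation to the stored data of the other projects.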

Trends

We can use linear or logistic regression. We should not show a trend if its fit is not good enough (as measured by the coefficient of determination, R²) or if there are very few data points.
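
A sketch of the linear case, with the rule for hiding weak trends built in, might look like this. The thresholds `min_points=10` and `min_r2=0.5` and the function name `linear_trend` are illustrative assumptions, not agreed values.

```python
def linear_trend(xs, ys, min_points=10, min_r2=0.5):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r2), or None when the trend should be hidden:
    too few points, no variation in x, or R^2 below min_r2.
    The default thresholds are illustrative assumptions.
    """
    n = len(xs)
    if n != len(ys) or n < min_points:
        return None
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        return None  # all x values are equal, so no line can be fitted
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared correlation.
    # If y is constant, the horizontal line fits exactly, so treat R^2 as 1.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    if r2 < min_r2:
        return None
    return slope, intercept, r2
```

For example, `linear_trend(content_lines, comment_lines)` would return `None` for a scatter plot with fewer than ten projects, so no trend line would be drawn.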

Box Plots

We could generate plots for these variables:

Content Lines (by Technology)
Comment Lines (by Technology)
Control Flow Keywords (by Technology)
Comment Lines / Content Lines (by Technology)
Control Flow Keywords / Content Lines (by Technology)
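
As a sketch of the data behind such plots, the following groups the Comment Lines / Content Lines ratio by technology and computes the five-number summary a box plot is drawn from. The function names and the input tuple layout are assumptions; quartiles use the "inclusive" method of Python's `statistics.quantiles`, and other quartile conventions give slightly different values.

```python
import statistics


def ratios_by_technology(projects):
    """Group comment/content ratios by technology.

    `projects` is an iterable of (technology, content_lines, comment_lines)
    tuples, one per scanned project (a hypothetical layout).
    """
    grouped = {}
    for technology, content, comment in projects:
        if content > 0:  # skip entries where the ratio is undefined
            grouped.setdefault(technology, []).append(comment / content)
    return grouped


def box_plot_stats(values):
    """Return min, quartiles, and max for a list of at least two values."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"min": data[0], "q1": q1, "median": median, "q3": q3, "max": data[-1]}
```

Usage would be along the lines of `{tech: box_plot_stats(r) for tech, r in ratios_by_technology(projects).items() if len(r) >= 2}`.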

