Rapid Scan Report

Proposed Report Structure

#1 Summary & Terminology

  1. Files Inventory Summary (pie chart):

    1. Identified Text Files

    2. Identified Binary Files

    3. Unidentified Files

  2. Identified Technologies

    1. Pie Chart (files by technology)

      • Configure units? (files, bytes, lines)

      • For more information see "1.1 Identified Text Files" reports below.

  • All terminology must be clear: identified technologies & file types (glossary?)

  • "You can customize RapidScan" => link to how RapidScan Works below.

#2 Index of Plots/Datasets

  1. Files Inventory:

    1. Identified Text Files

      1. Configurable Bar Chart & Dataset

        • Params = Units (Files, Lines, Bytes) & Group by (Extension, Technology)

        • Lines split by category: Content, Comment, Blank

      2. Content Lines vs Comment Lines Scatter Plot

    2. Identified Binary Files

      1. Configurable Bar Chart & Dataset

        • Params = Units (Files, Lines, Bytes) & Group by (Extension, Technology)

    3. Unidentified Files

      1. Configurable Bar Chart & Dataset

        • Params = Units (Files, Lines, Bytes)

        • Group by Extension only

  2. Keywords Inventory

    1. Configurable Keywords list by technology

      • Params = Technology

#3 Marketing Section 1

#4 Actual Plots/Datasets

Plots/datasets following the same structure defined in #2

#5 Marketing Section 2

#6 How RapidScan Works

  1. How it works

  2. How to customize

  3. How to make deeper analysis of specific technologies

  4. Links to more info and contacts

Dimensions

File Inventory

Measures

  • Files

  • Bytes

  • Lines

Type of Extension

  • Known Binary

    • Grouped by Technology

    • Not grouped by Technology

  • Known Text

    • Grouped by Technology

    • Not grouped by Technology

  • Unknown

Keywords

Columns: technology, extension, file, keyword.

Can be grouped by technology, extension.

Only for known text.
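
A minimal sketch of the grouping described above, assuming the keywords dataset is a list of (technology, extension, file, keyword) rows. The sample rows and the `count_keywords` helper are hypothetical and only illustrate the two grouping options.

```python
from collections import Counter

# Hypothetical rows using the columns listed above:
# (technology, extension, file, keyword). Known text files only.
rows = [
    ("Java", ".java", "src/Main.java", "if"),
    ("Java", ".java", "src/Main.java", "for"),
    ("Java", ".java", "src/Util.java", "if"),
    ("Python", ".py", "tools/run.py", "if"),
]


def count_keywords(rows, group_by="technology"):
    """Count keyword occurrences per (group, keyword) pair.

    group_by is either "technology" or "extension".
    """
    column = {"technology": 0, "extension": 1}[group_by]
    return Counter((row[column], row[3]) for row in rows)


by_technology = count_keywords(rows, "technology")
by_extension = count_keywords(rows, "extension")
```

Grouping by file would work the same way by adding a third entry to the column map.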

Analysis Ideas

  • File clustering by keywords

  • Rate of branching keywords by file

  • Comparisons with other known real life applications

    • Create database of known apps

    • Update database of known apps

    • Comparisons

      • Rates of content vs comment lines

      • Distribution of keywords

      • Distribution of branching keywords

  • Keywords Sequence Analysis?

  • Folder Tree Localization Patterns

    • Distributions in the tree branches of technologies, extensions, etc.

Next Steps

  1. Collect data for data comparison

    1. Decide how we collect data: Manually

    2. Collect Data from Projects that have already been analyzed (Brandon/Monica)

    3. Collect Data from new Projects (github)

    4. Store Data for the projects

      1. XML seems like the best option (XLSX is not very readable; CSV would have to be split into different files)

        1. We want to store summary data

      2. We will store a couple of CSVs with the information from all the other projects

      3. How can we store the information about which keywords are branching keywords?

        1. This could be in the configuration (it is important to add a disclaimer explaining that changing the configuration of existing languages/technologies can make the report less accurate).

          1. In this case, KeywordCounts.csv gets an additional column, isBranching, which can be true or false

  2. New Charts/Changes in Existing Charts

    1. Add trend in content vs comment lines scatter plot

    2. Add branching keywords vs content lines scatter plot

      1. Add trend

    3. Modify Keyword Bar Chart. Add options with a new combo box:

      1. Normal

      2. My Keyword Usage vs Average Keyword Usage (Keyword Usage = # of times a keyword appears / # of times any keyword appears)

      3. My Keyword Density vs Average Keyword Density (Keyword Density = # of times a keyword appears / # of content lines for the technology)

    4. Box Plot

      1. We also considered Violin Plot and Distribution Plot

  3. Consider if we can update the database with new scans from RapidScan

    1. How can we make sure the data for a project is updated only once?

    2. How can we make sure the extensions are being analyzed correctly? For example, we may have a .java file that is actually not related to Java (for some reason)
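
The Keyword Usage and Keyword Density definitions in the chart options above can be sketched as follows. This is only an illustration: the function names, the sample counts, and the rule that a keyword missing from a project counts as 0 in the average are all assumptions, not decisions made in these notes.

```python
def keyword_usage(keyword_counts):
    """Keyword Usage = # of times a keyword appears / # of times any keyword appears."""
    total = sum(keyword_counts.values())
    if total == 0:
        return {}
    return {kw: n / total for kw, n in keyword_counts.items()}


def keyword_density(keyword_counts, content_lines):
    """Keyword Density = # of times a keyword appears / # of content lines."""
    if content_lines <= 0:
        return {}
    return {kw: n / content_lines for kw, n in keyword_counts.items()}


def average_metric(per_project):
    """Average a metric over projects (assumption: a missing keyword counts as 0)."""
    if not per_project:
        return {}
    keywords = set().union(*per_project)
    return {
        kw: sum(p.get(kw, 0.0) for p in per_project) / len(per_project)
        for kw in keywords
    }


# Hypothetical counts for one technology in the scanned project.
my_counts = {"if": 30, "for": 10, "return": 60}
my_usage = keyword_usage(my_counts)
my_density = keyword_density(my_counts, content_lines=400)

# Hypothetical "known apps" database with two projects.
known_usage = [
    keyword_usage({"if": 50, "for": 50}),
    keyword_usage({"if": 25, "return": 75}),
]
avg_usage = average_metric(known_usage)
```

The bar chart would then plot `my_usage` next to `avg_usage` (or the density equivalents) per keyword.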

For the trends, we can use linear regression or logistic regression. We should not show a trend if the fit is not good enough (judged by the coefficient of determination) or if the number of data points is very low.
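
A rough sketch of that rule for the linear case only, using ordinary least squares. The thresholds `min_points=10` and `min_r2=0.5` are placeholder assumptions, as is the `linear_trend` name; the actual cut-offs still need to be decided.

```python
def linear_trend(xs, ys, min_points=10, min_r2=0.5):
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, r2), or None when the trend should not be
    shown: too few points, no spread in x, or R^2 below min_r2.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < min_points:
        return None

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all x values identical: the slope is undefined

    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    # Assumption: a perfectly flat dataset is treated as a perfect fit.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot

    if r2 < min_r2:
        return None
    return slope, intercept, r2


# Hypothetical data: content lines (x) vs comment lines (y) per file.
good = linear_trend(list(range(10)), [2 * x + 1 for x in range(10)])
noisy = linear_trend(list(range(10)), [5, 1, 4, 2, 5, 1, 4, 2, 5, 1])
too_few = linear_trend([1, 2], [1, 2])
```

Here `good` yields a trend line, while `noisy` and `too_few` return `None`, so the scatter plot would be drawn without a trend.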

Box Plots

We could generate plots for these variables:

  • Content Lines (by Technology)

  • Comment Lines (by Technology)

  • Control Flow Keywords (by Technology)

  • Comment Lines / Content Lines (by Technology)

  • Control Flow Keywords / Content Lines (by Technology)
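
As an illustration of the ratio variables above, the values behind one box per technology could be computed like this. The per-file tuples and both helper names are hypothetical, and skipping files with zero content lines is an assumption. Quartiles use `statistics.quantiles` with the inclusive method, which is one of several common conventions.

```python
import statistics
from collections import defaultdict

# Hypothetical per-file rows: (technology, content_lines, comment_lines).
files = [
    ("Java", 100, 10),
    ("Java", 100, 20),
    ("Java", 100, 30),
    ("Java", 100, 40),
    ("Java", 100, 50),
    ("Java", 0, 5),  # no content lines: skipped (assumption)
]


def ratios_by_technology(files):
    """Comment Lines / Content Lines per file, grouped by technology."""
    grouped = defaultdict(list)
    for technology, content, comment in files:
        if content > 0:
            grouped[technology].append(comment / content)
    return grouped


def box_stats(values):
    """Five-number summary used to draw a box plot."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}


java_box = box_stats(ratios_by_technology(files)["Java"])
```

The same two steps apply to Control Flow Keywords / Content Lines by swapping the numerator.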
