Rapid Scan Report
Proposed Report Structure
#1 Summary & Terminology
Files Inventory Summary (pie chart):
Identified Text Files
Identified Binary Files
Unidentified Files
Identified Technologies
Pie Chart (files by technology) * Configure units? (files, bytes, lines) * For more information see "1.1 Identified Text Files" reports below.
All terminology must be clear: identified technologies & file types (glossary?)
"You can customize RapidScan" => link to how RapidScan Works below.
#2 Index of Plots/Datasets
Files Inventory:
Identified Text Files
Configurable Bar Chart & Dataset. * Params = Units (Files, Lines, Bytes) & Group by: (Extension, Technology) * Lines broken down by category: Content, Comment, Blank
Content Lines vs Comment Lines Scatter Plot
Identified Binary Files
Configurable Bar Chart & Dataset * Params = Units (Files, Lines, Bytes) & Group by: (Extension, Technology)
Unidentified Files
Configurable Bar Chart & Dataset * Params = Units (Files, Lines, Bytes) * Group by Extension only
Keywords Inventory
Configurable Keywords list by technology
Params = Technology
#3 Marketing Section 1
#4 Actual Plots/Datasets
Plots/datasets following the same structure defined in #2
#5 Marketing Section 2
#6 How RapidScan Works
How it works
How to customize
How to make deeper analysis of specific technologies
Links to more info and contacts
Dimensions
File Inventory
Measures
Files
Bytes
Lines
Type of Extension
Known Binary
Grouped by Technology
Not grouped by Technology
Known Text
Grouped by Technology
Not grouped by Technology
Unknown
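The File Inventory dimensions above can be sketched as a small aggregation. This is only an illustration: the extension tables, the `classify` and `inventory` names, and the choice to skip line counts for known binary files are assumptions, not the actual RapidScan configuration or behavior.

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical lookup tables (extension -> technology, or None when the
# extension is known but not grouped by technology).
KNOWN_TEXT = {".java": "Java", ".py": "Python", ".txt": None}
KNOWN_BINARY = {".png": None, ".jar": "Java"}


def classify(extension):
    """Return (type of extension, technology or None)."""
    ext = extension.lower()
    if ext in KNOWN_TEXT:
        return "Known Text", KNOWN_TEXT[ext]
    if ext in KNOWN_BINARY:
        return "Known Binary", KNOWN_BINARY[ext]
    return "Unknown", None


def inventory(root):
    """Aggregate Files, Bytes, and Lines per (type, technology, extension)."""
    totals = defaultdict(lambda: {"files": 0, "bytes": 0, "lines": 0})
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        kind, technology = classify(path.suffix)
        data = path.read_bytes()
        row = totals[(kind, technology, path.suffix.lower())]
        row["files"] += 1
        row["bytes"] += len(data)
        # Assumption: line counts are skipped for known binary files.
        if kind != "Known Binary":
            row["lines"] += len(data.splitlines())
    return dict(totals)
```

Calling `inventory("path/to/project")` returns one row per combination, which can then be summed by type or by technology for the charts.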
Keywords
Columns: technology, extension, file, keyword.
Can be grouped by technology, extension.
Only for known text.
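As a rough illustration of the keywords dataset, the sketch below groups rows by technology or by extension. The sample rows and the `count_keywords` helper are hypothetical; only the four column names come from the notes above.

```python
from collections import Counter

# Hypothetical rows with the columns: technology, extension, file, keyword.
rows = [
    ("Java", ".java", "src/Main.java", "if"),
    ("Java", ".java", "src/Main.java", "for"),
    ("Java", ".java", "src/Util.java", "if"),
    ("Python", ".py", "tools/build.py", "if"),
]


def count_keywords(rows, group_by="technology"):
    """Count keyword occurrences grouped by 'technology' or 'extension'."""
    index = {"technology": 0, "extension": 1}[group_by]
    counts = Counter()
    for row in rows:
        counts[(row[index], row[3])] += 1
    return counts


by_technology = count_keywords(rows, "technology")
by_extension = count_keywords(rows, "extension")
```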
Analysis Ideas
File clustering by keywords
Rate of branching keywords by file
Comparisons with other known real life applications
Create database of known apps
Update database of known apps
Comparisons
Rates of content vs comment lines
Distribution of keywords
Distribution of branching keywords
Keywords Sequence Analysis?
Folder Tree Localization Patterns
Distributions in the tree branches of technologies, extensions, etc.
Next Steps
Collect data for data comparison
Decide how we collect data: Manually
Collect Data from Projects that have already been analyzed (Brandon/Monica)
Collect Data from new Projects (github)
Store Data for the projects
XML seems like the best option (XLSX is not very readable, CSV would have to be split into different files)
We want to store summary data
We will store a couple of CSVs with the information from all the other projects
How can we store the information about which keywords are branching keywords?
This could be in the configuration (it is important to add a disclaimer explaining that changing the configuration of existing languages/technologies can result in a report that is not as accurate).
In this case, the KeywordCounts.csv has another column isBranching which can be true or false
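A minimal sketch of reading such a file follows. Only the file name KeywordCounts.csv and the isBranching column come from the notes; the other column names (`technology`, `keyword`, `count`), the sample values, and the `branching_totals` helper are assumptions made for illustration.

```python
import csv
import io

# Hypothetical file contents; real column names other than isBranching may differ.
SAMPLE = """technology,keyword,count,isBranching
Java,if,120,true
Java,for,45,true
Java,import,60,false
Python,if,30,true
Python,def,25,false
"""


def branching_totals(csv_file):
    """Sum keyword counts per technology, split into branching and other."""
    totals = {}
    for row in csv.DictReader(csv_file):
        entry = totals.setdefault(row["technology"], {"branching": 0, "other": 0})
        is_branching = row["isBranching"].strip().lower() == "true"
        entry["branching" if is_branching else "other"] += int(row["count"])
    return totals


totals = branching_totals(io.StringIO(SAMPLE))
```

With a real file, `branching_totals(open("KeywordCounts.csv", newline=""))` would take the place of the in-memory sample.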
New Charts/Changes in Existing Charts
Add trend in content vs comment lines scatter plot
Add branching keywords vs content lines scatter plot
Add trend
Modify Keyword Bar Chart. Add options with a new combo box:
Normal
My Keyword Usage vs Average Keyword Usage (Keyword Usage = # of times a keyword appears / # of times any keyword appears)
My Keyword Density vs Average Keyword Density (Keyword Density = # of times a keyword appears / # of content lines for the technology)
Box Plot
We also considered Violin Plot and Distribution Plot
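The Keyword Usage and Keyword Density ratios defined for the keyword bar chart can be sketched as follows. The function names and sample counts are hypothetical, and the averaging rule (a keyword missing from a project counts as 0) is an assumption that the notes do not settle.

```python
def keyword_usage(keyword_counts):
    """Keyword Usage = occurrences of a keyword / occurrences of any keyword."""
    total = sum(keyword_counts.values())
    if total == 0:
        return {keyword: 0.0 for keyword in keyword_counts}
    return {keyword: n / total for keyword, n in keyword_counts.items()}


def keyword_density(keyword_counts, content_lines):
    """Keyword Density = occurrences of a keyword / content lines for the technology."""
    if content_lines == 0:
        return {keyword: 0.0 for keyword in keyword_counts}
    return {keyword: n / content_lines for keyword, n in keyword_counts.items()}


def average_by_keyword(per_project):
    """Average a metric across projects; a missing keyword counts as 0 (assumption)."""
    keywords = set().union(*per_project)
    return {
        keyword: sum(project.get(keyword, 0.0) for project in per_project) / len(per_project)
        for keyword in keywords
    }


# Hypothetical counts for one technology in the scanned project.
counts = {"if": 30, "for": 10, "while": 10}
my_usage = keyword_usage(counts)
my_density = keyword_density(counts, content_lines=200)
```

The "My vs Average" options would then plot `my_usage` next to `average_by_keyword(...)` computed over the stored reference projects.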
Consider if we can update the database with new scans from RapidScan
How can we make sure the data for a project is updated only once?
How can we make sure the extensions are being analyzed correctly? Maybe we have a .java file that is actually not related to java (for some reason)
Trends
We can use linear or logistic regression. We should not show the trend if the fit is not "good" enough (judged by the coefficient of determination, R²) or if the number of data points is very low.
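A minimal sketch of the linear case, with the trend hidden when the fit is weak or the sample is small. The thresholds `min_points=10` and `min_r2=0.5` are placeholder assumptions, not agreed values, and logistic regression is not covered here.

```python
def linear_trend(xs, ys, min_points=10, min_r2=0.5):
    """Fit y = slope * x + intercept by least squares.

    Returns (slope, intercept, r2), or None when the trend should not be shown.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("xs and ys must have the same length")
    if n < min_points:
        return None  # too few data points
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # no variation in x or y, so R^2 is undefined
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit with an intercept, R^2 equals the squared correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    if r2 < min_r2:
        return None  # fit is not good enough
    return slope, intercept, r2
```

For the content vs comment lines scatter plot, `xs` would be content lines and `ys` comment lines per file or per project; a `None` result means the chart is drawn without a trend line.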
Box Plots
We could generate plots for these variables:
Content Lines (by Technology)
Comment Lines (by Technology)
Control Flow Keywords (by Technology)
Comment Lines / Content Lines (by Technology)
Control Flow Keywords / Content Lines (by Technology)
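The values behind one box per technology can be sketched with the standard library. The sample data and the `box_plot_stats` helper are hypothetical, and the sketch covers only the five-number summary, not whiskers or outliers.

```python
import statistics


def box_plot_stats(values):
    """Return the five-number summary for one box (needs at least two values)."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median, "q3": q3, "max": ordered[-1]}


# Hypothetical per-project samples: (technology, content lines, comment lines).
samples = [
    ("Java", 1000, 200),
    ("Java", 500, 50),
    ("Java", 2000, 600),
    ("Python", 800, 80),
    ("Python", 400, 100),
]

# Comment Lines / Content Lines, grouped by technology.
ratios_by_technology = {}
for technology, content, comment in samples:
    if content > 0:
        ratios_by_technology.setdefault(technology, []).append(comment / content)

summary = {
    technology: box_plot_stats(ratios)
    for technology, ratios in ratios_by_technology.items()
    if len(ratios) >= 2
}
```

The same helper applies to the other listed variables by swapping in content lines, comment lines, or control flow keyword counts.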