Abstract:
The cost of a skilled and competent software engineer is high, and it is desirable to
minimize dependency on such costly human resources. One way to reduce such
costs is to automate various software development tasks.
Recent advances in Artificial Intelligence (AI) and the availability of large volumes
of knowledge-bearing data at various software-development-related venues present a ripe
opportunity for developing tools that can automate software development tasks. For instance,
significant latent knowledge is present in the raw or unstructured data associated
with Version Control Systems (VCS) artifacts such as source files, code commit logs, and
defect reports, available in Open Source Software (OSS) repositories.
We have leveraged such knowledge-bearing data and the latest advances in AI and hardware
to create knowledge warehouses and expert systems for the software development
domain. Such tools can help develop applications for performing various software development
tasks such as defect prediction, effort estimation, and code review.
Contributions
We have proposed novel approaches and tools to address the following software development
tasks:
1. Automating Software Development Effort Estimation (SDEE): We propose
an efficient SDEE method for open source software, which provides accurate and
fast effort estimates. Given the description of a newly envisioned software product,
our tool yields an effort estimate for developing it, along with information about
the existing functionally similar software. To derive the effort estimates, we leverage
the developer activity information of software developed in the past. A software
similarity detection model is trained using the Paragraph Vectors Algorithm (PVA)
on various software product descriptions to detect the existing software with similar
functionality. For this method, we develop the SDEE dataset, which comprises the SDEE metrics’
values derived from more than 13000 GitHub software repositories belonging
to 150 different software categories, together with the PVA vector representations of the
product descriptions of the considered set of GitHub repositories.
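The core estimation step of this contribution, retrieving functionally similar past software and aggregating their observed effort, can be sketched as follows. The project names, three-dimensional vectors, and effort values below are illustrative stand-ins; in the actual tool, the vectors would be PVA representations of product descriptions and the effort values would come from the developer-activity-based SDEE metrics.

```python
import numpy as np

# Illustrative stand-ins: in the actual tool these vectors would be PVA
# embeddings of GitHub product descriptions, and the effort values would
# be derived from developer-activity information.
past_projects = {
    # name: (description vector, observed development effort, e.g. person-months)
    "photo-editor": (np.array([0.9, 0.2, 0.1]), 34.0),
    "log-parser":   (np.array([0.1, 0.9, 0.2]), 12.0),
    "paint-app":    (np.array([0.8, 0.3, 0.1]), 28.0),
}

def estimate_effort(query_vec, projects, k=2):
    """Return a similarity-weighted effort estimate and the k most
    functionally similar past projects."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query_vec, vec), name, effort)
                     for name, (vec, effort) in projects.items()),
                    reverse=True)
    top = scored[:k]
    weight_sum = sum(score for score, _, _ in top)
    estimate = sum(score * effort for score, _, effort in top) / weight_sum
    return estimate, [name for _, name, _ in top]

# A query description vector close to the two image-editing projects
estimate, similar = estimate_effort(np.array([0.85, 0.25, 0.1]), past_projects)
```

The estimate falls between the efforts of the retrieved neighbours, weighted toward the more similar one.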
2. Detecting source code defectiveness: We present a novel system to detect
defects in source code and to estimate the attributes of possible defects, such as their
severity. We develop models using 12 different state-of-the-art ML algorithms with
50+ different combinations of their key parameters to perform defect
estimation in various scenarios. The PROCON dataset (see below) was used to
train these models. The best-performing model for each of the considered
defect estimation scenarios and programming languages was
identified and chosen to perform the task.
For this method, we define and develop a dataset of PROgramming CONstruct
(PROCON) metrics. The dataset’s PROCON metrics’ values were
extracted by processing more than 30000 source files taken from 20+ OSS repositories
hosted at GitHub. These source files were written in four major programming languages,
viz., C, C++, Java, and Python.
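The model-selection procedure described above, training several ML algorithms with different parameter settings and keeping the best performer per scenario, can be sketched as below. The synthetic features stand in for per-file PROCON metric values, and the three candidate configurations are a small illustrative subset, not the actual 12 algorithms and 50+ parameter combinations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features standing in for per-file PROCON metric values,
# with a binary defective/clean label (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A small illustrative subset of algorithm/parameter combinations
candidates = {
    "random-forest-100-trees": RandomForestClassifier(n_estimators=100, random_state=0),
    "random-forest-10-trees":  RandomForestClassifier(n_estimators=10, random_state=0),
    "logistic-regression":     LogisticRegression(max_iter=1000),
}

# Score each candidate with 3-fold cross-validation and keep the best
scores = {name: cross_val_score(model, X, y, cv=3).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

In the actual system this selection is repeated per defect estimation scenario and per programming language.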
3. Detecting bloat libraries in a software distributable: We present an obfuscation-resilient
method to detect bloat libraries present in a software distributable. The
novel aspects of our approach are: i) computing a vector representation of a .class file
using a model that we call Jar2Vec, where the Jar2Vec model is trained using the well-known
PVA; ii) before being used for training the Jar2Vec models, each .class file is
converted to a normalized form via semantics-preserving transformations.
To perform this task, we trained 27 different models using different PVA parameter
combinations on > 30000 .class files taken from > 100 different Java libraries
available at MavenCentral.
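The semantics-preserving normalization step can be illustrated as follows. This sketch assumes javap-style disassembly text as input and applies two example transformations, masking constant-pool indices and canonically renaming local-variable slots, so that two compilations of the same code yield the same token stream; the real normalizer would cover many more cases.

```python
import re

def normalize(disasm_lines):
    """Rewrite javap-style disassembly so that semantically identical
    .class files normalize to the same token stream."""
    slot_map = {}  # original local-variable slot -> canonical name
    out = []
    for line in disasm_lines:
        # Mask constant-pool indices, which vary between compilations
        line = re.sub(r"#\d+", "#K", line)
        # Canonically rename local-variable slots in load/store opcodes
        for match in re.finditer(r"(?:[ailfd](?:load|store))_(\d+)", line):
            slot_map.setdefault(match.group(1), f"v{len(slot_map)}")
        line = re.sub(r"((?:[ailfd](?:load|store))_)(\d+)",
                      lambda m: m.group(1) + slot_map[m.group(2)], line)
        out.append(line)
    return out
```

For example, two method bodies that differ only in slot numbering and constant-pool layout, such as ["aload_1", "invokevirtual #12"] and ["aload_2", "invokevirtual #47"], normalize to the same sequence.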
4. Automation of source code review: We propose a code review assistance tool that
helps a programmer generate better and more informed code reviews, backed by
StackOverflow (SO) posts as evidence. To detect similarity in source
code, 57 PVA models were trained for each of the considered programming languages
on the source code present in 188200+ GitHub source files. The best-performing
model for each of the considered programming languages was identified and
chosen to perform the source code similarity detection. To perform the similarity
detection for a given source code sample c, we compare c with the
code samples present in the SOposts dataset (see below).
For this tool, we created the SOposts dataset, which comprises the code, text, and metadata
portions extracted from > 3 million SO posts. It also contains the PVA
vector representations of the source code collected from the SO posts and the sentiment
analysis information of the narrative text present in the posts. To develop our
dataset, we considered source code written in five popular programming languages,
viz., C, C#, Java, JavaScript, and Python.
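The similarity-detection step that backs a review with SO evidence can be sketched as follows: rank the stored posts' code vectors by cosine similarity to the vector inferred for the sample c. The post IDs and three-dimensional vectors below are hypothetical placeholders; in the actual tool, the vectors are the PVA representations stored in the SOposts dataset.

```python
import numpy as np

# Hypothetical PVA vectors for code blocks extracted from SO posts;
# in the actual tool these come from the SOposts dataset.
so_vectors = {
    "so-101": np.array([0.90, 0.10, 0.00]),
    "so-202": np.array([0.10, 0.80, 0.30]),
    "so-303": np.array([0.85, 0.15, 0.05]),
}

def top_matches(query_vec, store, k=2):
    """Rank stored SO posts by cosine similarity to the query code vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(store, key=lambda post_id: cos(query_vec, store[post_id]),
                    reverse=True)
    return ranked[:k]

# Placeholder for the inferred PVA vector of a code sample c
evidence = top_matches(np.array([0.88, 0.12, 0.02]), so_vectors)
```

The returned post IDs serve as the evidence attached to the generated review comments.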
Each of the proposed methods described above has been implemented
as a web-based tool.