Abstract:
The cost of a skilled and competent software engineer is high, and it is desirable to
minimize dependency on such costly human resources. One way to reduce such
costs is to automate various software development tasks.
Recent advances in Artificial Intelligence (AI) and the availability of large volumes
of knowledge-bearing data at various software-development-related venues present a ripe
opportunity for developing tools that can automate software development tasks. For instance,
significant latent knowledge is present in the raw or unstructured data associated
with Version Control Systems (VCS) artifacts such as source files, code commit logs, and
defect reports, available in Open Source Software (OSS) repositories.
We have leveraged such knowledge-bearing data and the latest advances in AI and hardware
to create knowledge warehouses and expert systems for the software development
domain. Such tools can help develop applications for performing various software development
tasks such as defect prediction, effort estimation, and code review.
Contributions
We have proposed novel approaches and tools to address the following software development
tasks:
1. Automating Software Development Effort Estimation (SDEE): We propose
an efficient SDEE method for open source software, which provides accurate and
fast effort estimates. Given the description of a newly envisioned software product,
our tool yields an effort estimate for developing it, along with information about
the existing functionally similar software. To derive the effort estimates, we leverage
the developer activity information of software developed in the past. A software
similarity detection model is trained using the Paragraph Vectors Algorithm (PVA)
on various software product descriptions to detect the existing software with similar
functionality. For this method, we develop the SDEE dataset, which comprises the SDEE metrics’
values derived from more than 13000 GitHub software repositories belonging
to 150 different software categories, together with the PVA vector representations of the
product descriptions of the considered set of GitHub repositories.
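The core estimation step of this contribution, retrieving functionally similar past software and aggregating their observed effort, can be sketched as follows. The project names, three-dimensional vectors, and effort values below are illustrative stand-ins; in the actual tool, the vectors would be PVA representations of product descriptions and the effort values would come from the developer-activity-based SDEE metrics.

```python
import numpy as np

# Illustrative stand-ins: in the actual tool these vectors would be PVA
# embeddings of GitHub product descriptions, and the effort values would
# be derived from developer-activity information.
past_projects = {
    # name: (description vector, observed development effort, e.g. person-months)
    "photo-editor": (np.array([0.9, 0.2, 0.1]), 34.0),
    "log-parser":   (np.array([0.1, 0.9, 0.2]), 12.0),
    "paint-app":    (np.array([0.8, 0.3, 0.1]), 28.0),
}

def estimate_effort(query_vec, projects, k=2):
    """Return a similarity-weighted effort estimate and the k most
    functionally similar past projects."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scored = sorted(((cos(query_vec, vec), name, effort)
                     for name, (vec, effort) in projects.items()),
                    reverse=True)
    top = scored[:k]
    weight_sum = sum(score for score, _, _ in top)
    estimate = sum(score * effort for score, _, effort in top) / weight_sum
    return estimate, [name for _, name, _ in top]

# A query description vector close to the two image-editing projects
estimate, similar = estimate_effort(np.array([0.85, 0.25, 0.1]), past_projects)
```

The estimate falls between the efforts of the retrieved neighbours, weighted toward the more similar one.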
2. Detecting source code defectiveness: We present a novel system to detect
defects in source code and to estimate the attributes of possible defects, such as their
severity. We develop models using 12 different state-of-the-art ML algorithms with
50+ different combinations of their key parameters to perform defect
estimation in various scenarios. The PROCON dataset (see below) was used to
train these models. The best-performing model for each of the considered
defect estimation scenarios and programming languages was
identified and chosen to perform the task.
For this method, we define and develop a dataset of PROgramming CONstruct
(PROCON) metrics. The dataset’s PROCON metrics’ values were
extracted by processing more than 30000 source files taken from 20+ OSS repositories
hosted at GitHub. These source files were written in four major programming languages,
viz., C, C++, Java, and Python.
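The model-selection procedure described above, training several ML algorithms with different parameter settings and keeping the best performer per scenario, can be sketched as below. The synthetic features stand in for per-file PROCON metric values, and the three candidate configurations are a small illustrative subset, not the actual 12 algorithms and 50+ parameter combinations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features standing in for per-file PROCON metric values,
# with a binary defective/clean label (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A small illustrative subset of algorithm/parameter combinations
candidates = {
    "random-forest-100-trees": RandomForestClassifier(n_estimators=100, random_state=0),
    "random-forest-10-trees":  RandomForestClassifier(n_estimators=10, random_state=0),
    "logistic-regression":     LogisticRegression(max_iter=1000),
}

# Score each candidate with 3-fold cross-validation and keep the best
scores = {name: cross_val_score(model, X, y, cv=3).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

In the actual system this selection is repeated per defect estimation scenario and per programming language.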
3. Detecting bloat libraries in a software distributable: We present an obfuscation-resilient
method to detect bloat libraries present in a software distributable. The
novel aspects of our approach are: i) computing a vector representation of a .class file
using a model that we call Jar2Vec, where the Jar2Vec model is trained using the well-known
PVA; ii) before being used for training the Jar2Vec models, each .class file is
converted to a normalized form via semantics-preserving transformations.
To perform this task, we trained 27 different models using different PVA parameter
combinations on > 30000 .class files taken from > 100 different Java libraries
available at MavenCentral.
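The semantics-preserving normalization step can be illustrated as follows. This sketch assumes javap-style disassembly text as input and applies two example transformations, masking constant-pool indices and canonically renaming local-variable slots, so that two compilations of the same code yield the same token stream; the real normalizer would cover many more cases.

```python
import re

def normalize(disasm_lines):
    """Rewrite javap-style disassembly so that semantically identical
    .class files normalize to the same token stream."""
    slot_map = {}  # original local-variable slot -> canonical name
    out = []
    for line in disasm_lines:
        # Mask constant-pool indices, which vary between compilations
        line = re.sub(r"#\d+", "#K", line)
        # Canonically rename local-variable slots in load/store opcodes
        for match in re.finditer(r"(?:[ailfd](?:load|store))_(\d+)", line):
            slot_map.setdefault(match.group(1), f"v{len(slot_map)}")
        line = re.sub(r"((?:[ailfd](?:load|store))_)(\d+)",
                      lambda m: m.group(1) + slot_map[m.group(2)], line)
        out.append(line)
    return out
```

For example, two method bodies that differ only in slot numbering and constant-pool layout, such as ["aload_1", "invokevirtual #12"] and ["aload_2", "invokevirtual #47"], normalize to the same sequence.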
4. Automation of source code review: We propose a code review assistance tool that
helps a programmer generate better and more informed code reviews, backed by
StackOverflow (SO) posts as evidence. To detect similarity in source
code, 57 PVA models were trained for each of the considered programming languages
on the source code present in 188200+ GitHub source files. The best-performing
model for each of the considered programming languages was identified and
chosen to perform the source code similarity detection. To perform the similarity
detection for a given source code sample c, we compare c with the
code samples present in the SOposts dataset (see below).
For this tool, we created the SOposts dataset, which comprises the code, text, and metadata
portions extracted from > 3 million SO posts. It also contains the PVA
vector representations of the source code collected from the SO posts and the sentiment
analysis information of the narrative text present in the posts. To develop our
dataset, we considered source code written in five popular programming languages,
viz., C, C#, Java, JavaScript, and Python.
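The similarity-detection step that backs a review with SO evidence can be sketched as follows: rank the stored posts' code vectors by cosine similarity to the vector inferred for the sample c. The post IDs and three-dimensional vectors below are hypothetical placeholders; in the actual tool, the vectors are the PVA representations stored in the SOposts dataset.

```python
import numpy as np

# Hypothetical PVA vectors for code blocks extracted from SO posts;
# in the actual tool these come from the SOposts dataset.
so_vectors = {
    "so-101": np.array([0.90, 0.10, 0.00]),
    "so-202": np.array([0.10, 0.80, 0.30]),
    "so-303": np.array([0.85, 0.15, 0.05]),
}

def top_matches(query_vec, store, k=2):
    """Rank stored SO posts by cosine similarity to the query code vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(store, key=lambda post_id: cos(query_vec, store[post_id]),
                    reverse=True)
    return ranked[:k]

# Placeholder for the inferred PVA vector of a code sample c
evidence = top_matches(np.array([0.88, 0.12, 0.02]), so_vectors)
```

The returned post IDs serve as the evidence attached to the generated review comments.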
Each of the proposed methods described above has been implemented
as a web-based tool.