Semantic-based Analysis of Rules for Decision Support for Software Maintenance

Decision support systems assist users in making suitable and competent decisions regarding a variety of issues. They embed knowledge about facts and features related to the topics that are subjects of decision-making processes, and use this knowledge to identify the most appropriate alternatives. Representation of knowledge can take multiple forms. Quite often, if-then rules are used, as they are perceived as one of the easiest ways of capturing knowledge. In general, multiple methods can be employed to construct rules: some of them depend fully on experts' knowledge, some on available data, and some on a combination of both. The paper introduces a methodology for identifying the most suitable and representative if-then rules. A semantic-based analysis of these rules is described. All rules are evaluated based on their classification performance, as well as their ability to represent knowledge. This constitutes a step towards an automatic construction of rule-based decision-making systems.

Production and hosting by ISPACS GmbH. Oxford Journal of Intelligent Decision and Data Science 2016 No. 2 (2016) 69-102. Available online at www.ispacs.com/ojids. Volume 2016, Issue 2, Year 2016. Article ID ojids-00007, 34 pages. doi:10.5899/2016/ojids-00007


Introduction
The cost of maintenance activities accounts for more than half of the typical software budget [16,31]. This fact alone means that any effort leading to increased effectiveness of maintenance activities is essential for any software development company. Application of decision-making tools that support maintenance tasks is important, and a good understanding of what makes software more or less maintainable is of profound significance.
There is ongoing research focused on finding a single metric that estimates the level of maintainability of software. The most popular one is the SEI Maintainability Index (MI), built on the Coleman-Oman model [9,30]. The MI is calculated using a combination of common software measures:

$$MI = 171 - 5.2 \ln(\mathit{aveV}) - 0.23\,\mathit{aveV}(g) - 16.2 \ln(\mathit{aveLOC}) + 50 \sin\left(\sqrt{2.4\,\mathit{perCM}}\right) \qquad (1.1)$$

where aveV is the average Halstead Volume per module, aveV(g) is the average extended cyclomatic complexity per module, aveLOC is the average number of lines of code per module, and perCM is the average percentage of comment lines per module. The coefficients of the equation are the result of calibration using data from numerous software systems maintained by Hewlett-Packard [10]. In a major research effort by Hewlett-Packard, the following thresholds for the evaluation of the MI have been determined: MI < 65 for poor maintainability, 65 < MI < 85 for fair maintainability, and 85 < MI for excellent maintainability.

Although the concept of MI is interesting, an approach based on human estimation of maintainability levels is an appealing alternative to a formula-based evaluation. In such a case, programmers or maintenance engineers are asked to evaluate the maintainability of software objects via visual inspection. This kind of estimation is influenced by the person's individual experience and background. Nevertheless, such an evaluation is worth performing in order to better understand which aspects of software programming are critical for deciding about excellent/fair/poor levels of maintainability. In other words, we would like to find out which attributes of software objects are critical or influential for individuals when they identify maintainability levels of software objects, and which attributes should be used to develop decision-making tools. These issues motivated us to focus on a thorough analysis of maintenance data in order to identify the most essential factors defining excellent/fair/poor levels of maintainability. The approach presented in the paper is the result of that activity. The core element of the approach is the extraction of IF-THEN rules from data and a semantic-based analysis of those rules.

http://www.ispacs.com/journals/ojids/2016/ojids-00007/ International Scientific Publications and Consulting Services

The extraction of rules from data is a well-known research topic [2,5,6,7,15,19,22,23,24,26,27,33], and it is not covered in this paper. We rely on existing rule-generating tools. The availability of numerous tools leads to one of the challenges that the proposed approach addresses: dealing with a large number of different rules constructed by a variety of rule-generating tools. Each of these tools 'looks' at data differently, and each of them generates different IF-THEN rules. The proposed approach is designed in such a way that a large variety of rules leads to even more interesting results.

In the paper, we propose a novel methodology for analyzing rules extracted from data. The analysis process targets such issues as good coverage, similarity of rules, and inclusion of rules. Similarity and inclusion are not evaluated at the level of syntax, but at the level of the semantics of rules. This means that we compare rules based on the data points covered by those rules.

We use the proposed method to analyze rules extracted from software maintenance data. These data contain evaluations of maintainability levels of software objects performed independently by three software engineers. In the first experiment, we analyze the obtained data in order to find out how each evaluator 'recognizes' three levels of maintainability: excellent, fair, and poor. Additionally, we compare these findings, identify software attributes that evaluators implicitly recognize as important, and find out whether there is agreement among engineers in identifying objects of different levels of maintainability. In the second experiment, we analyze rules representing a single engineer, constructed using different rule-generating tools, and compare the findings in order to find generic rules describing an individual.

The paper is organized in the following way. Section 2 is dedicated to a very brief description of the analysis of maintenance data. The concepts of similarity and inclusion of rules are presented in Section 3. A detailed description of the maintenance data used here is included in Section 4. Section 5 explains the approach we used to select the most important rules and compare them. Section 6 is dedicated to the description of our system for automatic comparison of rules and identification of the most unique rules. Sections 7 and 8 contain the description of the main rules and all findings related to different levels of maintainability of software objects, obtained for each programmer and for different rule-generating tools, respectively. The conclusions constitute Section 9.

Software Maintenance: Data Models and Measures
The importance of software maintenance is reflected in a number of research activities dedicated to this topic. The two activities related to this paper are the modeling of maintenance data and the development of a maintenance measure.

Data models representing different software maintenance activities have been used for many years [14,21,28]. One of the first papers dedicated to this topic was [20]. It reported on the development and use of several software maintenance models that were built by applying regression analysis, neural networks, and the optimized set reduction method. The variables included in the models were: the cause of a task, the degree of change to the code, the type of operation on the code, and the confidence of the maintainer. An explanation of the effort associated with software changes made to correct faults while software is undergoing development was investigated in [12,13]. In this case, ordinal response models were developed to predict the effort needed to isolate and fix a defect. The model input variables included the extent of change, the type of change, and the internal complexity of the software components undergoing the change, as well as fault locality and characteristics of the software components being changed. A model for estimating adaptive software maintenance effort in person-hours was described in [18]. It was found that a number of metrics, such as the number of lines of code changed and the number of operators changed, are strongly correlated with maintenance effort.

Another line of maintenance-related research focuses on attempts to measure maintainability. Great effort has been put into constructing formulas for describing maintainability. Following the opinion that maintainability is a set of attributes, these formulas describe maintainability as a function of directly measurable attributes. Many researchers have tried to quantify maintainability with different types of measures [1,3,25], of which the most notable is the Maintainability Index, MI [30]. The Halstead source code measures (Halstead Length, Halstead Volume, Halstead Effort), proposed in the seventies [17,29], have also been used for describing maintainability [29,30]. To the best of the authors' knowledge, there is no reported work on using methods of data modeling to describe what maintainability means for humans.
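The Maintainability Index of Eq. 1.1 and its thresholds can be illustrated with a minimal Python sketch. The function and parameter names are hypothetical. The square root inside the sine follows the commonly published form of the index, and the assignment of the exact boundary values 65 and 85 to the higher level is an assumption, since the thresholds in the text are strict inequalities.

```python
import math


def maintainability_index(ave_v, ave_vg, ave_loc, per_cm):
    """Eq. 1.1: ave_v = average Halstead Volume, ave_vg = average extended
    cyclomatic complexity, ave_loc = average lines of code, per_cm = average
    percent of comment lines (all per module)."""
    return (171
            - 5.2 * math.log(ave_v)
            - 0.23 * ave_vg
            - 16.2 * math.log(ave_loc)
            # sqrt is assumed here, following the commonly published form
            + 50 * math.sin(math.sqrt(2.4 * per_cm)))


def maintainability_level(mi):
    """Map an MI value to poor/fair/excellent.
    Assumption: the boundary values 65 and 85 go to the higher level."""
    if mi < 65:
        return "poor"
    if mi < 85:
        return "fair"
    return "excellent"


# Hypothetical module averages, for illustration only.
mi = maintainability_index(ave_v=1000.0, ave_vg=5.0, ave_loc=50.0, per_cm=20.0)
print(round(mi, 2), maintainability_level(mi))
```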

Similarity and Inclusions of Rules
A rule generation process can be performed using a number of approaches and methods. Each of them leads to the generation of different rules. The rules differ in a number of ways. If we assume that an atomic component has the form "the value of attribute a is larger than the number n", then we can identify a variety of differences among rules: rules can have different numbers of atomic components in their antecedents, as well as different attributes and attribute values. All these differences are related to the syntax of rules. We would like to focus on a different kind of comparison of rules: semantic similarity. This similarity 'looks inside' rules and compares them based on the number and identity of the data points that satisfy the antecedent parts of the rules.

Tversky's Similarity Model
The two main approaches for assessing similarity are content models and distance models. Content models conceptualize the characteristics with respect to which objects are similar "as more or less discrete and common elements" [4]. Distance models conceptualize these characteristics "as dimensions on which the objects have some degree of proximity" [4]. Many set-theoretic measures in the content model category are generalized by Tversky's parameterized ratio model of similarity [32]. It expresses similarity between objects as a ratio of the measures of their common and distinctive features:

$$S(X, Y) = \frac{f(X \cap Y)}{f(X \cap Y) + a\, f(X - Y) + b\, f(Y - X)} \qquad (2)$$

X and Y represent sets describing the respective objects x and y. (X ∩ Y) represents the features that objects x and y have in common. (X - Y) represents the features that X has but Y does not. (Y - X) represents the features that Y has but X does not. The function f measures the contribution of any specific feature to the value of similarity between objects. The value f(X) for object x is considered a measure of the overall salience of that object. Factors adding to an object's salience include "intensity, frequency, familiarity, good form, and informational content" [32]. The function f is additive on disjoint sets; set cardinality is one example. The factors a and b are nonnegative, unbounded parameters specifying the importance of the two distinctive-feature components. This measure is normalized, 0 ≤ S(X, Y) ≤ 1. For a = b = 1, S becomes the Jaccard index:

$$S_{Jaccard}(X, Y) = \frac{f(X \cap Y)}{f(X \cup Y)} \qquad (3)$$

With a = 1, b = 0, S becomes the degree of inclusion for X, i.e., the proportion of X overlapping with Y:

$$S_{inclusion}(X, Y) = \frac{f(X \cap Y)}{f(X)} \qquad (4)$$

Similarly, with a = 0, b = 1, S becomes the degree of inclusion for Y, i.e., the proportion of Y overlapping with X. This second parameterization is not necessary, however, since the same value is obtained from Eq. 4 as S-inclusion(Y, X).
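The ratio model and its two special cases can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `tversky` is hypothetical, set cardinality is used as the measure f, and returning 0.0 for an empty denominator is a convention chosen here, not something the model prescribes.

```python
def tversky(x, y, a=1.0, b=1.0, f=len):
    """Tversky ratio model on two sets.
    a, b weight the distinctive features X - Y and Y - X;
    f is a measure that is additive on disjoint sets (cardinality by default)."""
    common = f(x & y)
    denominator = common + a * f(x - y) + b * f(y - x)
    # Convention assumed here: similarity is 0.0 when the denominator is empty.
    return common / denominator if denominator else 0.0


X = {1, 2, 3, 4}
Y = {3, 4, 5}

jaccard = tversky(X, Y, a=1, b=1)         # |X ∩ Y| / |X ∪ Y| = 2 / 5
inclusion_of_x = tversky(X, Y, a=1, b=0)  # |X ∩ Y| / |X|     = 2 / 4
inclusion_of_y = tversky(X, Y, a=0, b=1)  # |X ∩ Y| / |Y|     = 2 / 3
print(jaccard, inclusion_of_x, inclusion_of_y)
```

Swapping the arguments, `tversky(Y, X, a=1, b=0)`, gives the same value as `inclusion_of_y`, which is why a separate parameterization for the inclusion of Y is not needed.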

Tversky's Model for Similarity & Inclusion of Rules
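The Tversky model of similarity presented above can be easily applied to the comparison of rules. In order to compare rules semantically, a contingency table is used. An example of such a table is presented below, Table 1. The number a represents the number of data points that satisfy the antecedents of both rules R1 and R2. The number b is the number of data points that satisfy rule R2 but not rule R1, while the number c is the opposite: the number of points that satisfy rule R1 but not R2. The number d is the number of points that satisfy neither rule.

Table 1. Contingency table for rules R1 and R2.

                      R2 satisfied    R2 not satisfied
R1 satisfied               a                 c
R1 not satisfied           b                 d

The numbers a, b and c can be directly plugged into Eq. 3. The resulting similarity of rules is:

$$S(R1, R2) = \frac{a}{a + b + c} \qquad (5)$$

Tversky's model also allows us to calculate inclusions of rules. With Inc(R1 → R2) denoting the degree to which R1 is included in R2, the formulas take the following forms:

$$Inc(R1 \rightarrow R2) = \frac{a}{a + c} \qquad (6)$$

$$Inc(R2 \rightarrow R1) = \frac{a}{a + b} \qquad (7)$$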
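The counts a, b, c and the resulting measures can be computed directly from the sets of data points covered by two rules. The sketch below is illustrative only: the function names and the point identifiers are hypothetical, and the inclusion formulas are written out under the assumption that inclusion of a rule means the fraction of its covered points that the other rule also covers.

```python
def contingency(covered_r1, covered_r2, all_points):
    """Return (a, b, c, d) for two rules given the sets of points they cover.
    a: covered by both, b: by R2 only, c: by R1 only, d: by neither.
    Assumes both covered sets are subsets of all_points."""
    a = len(covered_r1 & covered_r2)
    b = len(covered_r2 - covered_r1)
    c = len(covered_r1 - covered_r2)
    d = len(all_points) - a - b - c
    return a, b, c, d


def similarity(a, b, c):
    """Semantic similarity of two rules (Jaccard form)."""
    return a / (a + b + c) if (a + b + c) else 0.0


def inclusion_r1_in_r2(a, c):
    """Fraction of the points covered by R1 that R2 also covers."""
    return a / (a + c) if (a + c) else 0.0


def inclusion_r2_in_r1(a, b):
    """Fraction of the points covered by R2 that R1 also covers."""
    return a / (a + b) if (a + b) else 0.0


# Hypothetical data: 10 points, R2 covers a subset of what R1 covers.
points = set(range(10))
r1 = {0, 1, 2, 3, 4, 5}
r2 = {0, 1, 2}

a, b, c, d = contingency(r1, r2, points)
print(a, b, c, d)                 # counts for the contingency table
print(similarity(a, b, c))        # 3 / 6
print(inclusion_r1_in_r2(a, c))   # 3 / 6
print(inclusion_r2_in_r1(a, b))   # 3 / 3
```

In this toy case R2 is fully included in R1 even though the similarity is only 0.5, which is the kind of situation the inclusion measures are meant to expose.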

Data Description and Experimental Setup
The data used in the paper were collected during an experiment conducted at the National Research Council (NRC), Canada. In the experiment, three software engineers independently analyzed software objects of the system EvIdent®. EvIdent® is a user-friendly, algorithm-rich, graphical environment for the detection, investigation, and visualization of novelty in a set of images as they evolve in time or frequency. It is written in Java and C++ and is based upon VIStA, an application-programming interface (API) developed at the National Research Council.
The aim of that evaluation was to estimate how easy it is to perform maintenance tasks on software objects. For each of the 366 software objects, three software engineers, named here 'A', 'D' and 'V', were asked to independently rate the object's maintainability on a scale from 1 to 5 (where 1 means POOR maintainability, 2 means POOR-FAIR maintainability, 3 means FAIR maintainability, 4 means FAIR-EXCELLENT maintainability, and 5 means EXCELLENT maintainability). The engineers determined the maintainability of objects based on their own judgment. At the same time, 64 software metrics were calculated for each software object. As a result, the collected data set consists of 366 data points, each represented by a set of 64 software metrics and three values assigned to the object by the three engineers. Some of the extracted metrics are shown in [...]. The graph in Figure 1 shows the distribution of data points identified as 1, 2, 3, 4 or 5. The graph clearly shows that there are very few data points classified as 1 (POOR maintainability). Consequently, it would be difficult to generate rules for such a small data subset. In other words, rules generated for 1 would be specific to these data points and would lack generality.
In order to create rules that are more general, we have processed the data. The data points were regrouped to make their distribution over the categories (scale values) more uniform. The data points originally classified as 2 were added to group 1, and the data points originally classified as 4 were added to group 3. As a result, the maintainability of software objects is classified as 1, 3 or 5 (where 1 means POOR maintainability, 3 means FAIR maintainability, and 5 means EXCELLENT maintainability). This division into three categories is used in all experiments presented in the paper.
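The regrouping of the five-point scale into three categories amounts to a simple mapping. A minimal sketch follows; the names `REGROUP` and `regroup` and the sample ratings are hypothetical and do not come from the NRC data set.

```python
from collections import Counter

# 2 is merged into 1 (POOR), 4 is merged into 3 (FAIR), 5 stays EXCELLENT.
REGROUP = {1: 1, 2: 1, 3: 3, 4: 3, 5: 5}
LABELS = {1: "POOR", 3: "FAIR", 5: "EXCELLENT"}


def regroup(ratings):
    """Map ratings on the original 1..5 scale to the categories 1, 3, 5."""
    return [REGROUP[r] for r in ratings]


# Hypothetical ratings given by one engineer.
ratings = [1, 2, 2, 3, 4, 4, 5, 5]
grouped = regroup(ratings)
counts = Counter(grouped)
for category in (1, 3, 5):
    print(LABELS[category], counts[category])
```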

Concept
The purpose of data analysis is to discover relations describing phenomena represented by data, and further use this knowledge to support decision-making processes.One of the methods commonly used for data analysis is extraction of IF-THEN rules.In such a scenario, we are interested in obtaining useful rules, i.e., rules that have good coverage, good classification capabilities, and are unique.Importance of these aspects depends on intended utilization of extracted rules.For example, we can be interested in obtaining a diversified set of rules in order to discover multiple views of relationships existing among the same data points.
There are multiple techniques and tools that are used to extract IF-THEN rules from data. Each of these methods 'sees' data differently and generates different rules. The difference between rules is usually judged by visual inspection, which means that the rules are different syntactically. At that point a number of interesting questions can be asked: How are these rules different? Do they 'cover' different data points? Can they be used together, for example, for prediction purposes? The issues mentioned above (finding useful, similar, or different rules) led us to take a deeper look at rules.

This section describes the proposed methodology for semantic-based analysis of rules. We use this approach to identify the most interesting rules based on two characteristics: how good a given rule is, and how unique that rule is among all rules extracted from data. The process of finding good rules can be carried out in a number of different ways. The most common one uses measures representing the performance of rules (see the Appendix for some possible metrics useful for rule performance evaluation). The evaluation metric applied in this paper is a simple measure called the Laplace ratio (Eq. 8, Appendix).
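Eq. 8 is given in the Appendix and is not reproduced in this section. As a hedged illustration, the sketch below assumes the form of the Laplace ratio used by C5.0/See5 for rule confidence, (n - m + 1) / (n + 2), where n is the number of data points covered by a rule and m is the number of covered points it misclassifies. The function name is hypothetical.

```python
def laplace_ratio(n_covered, n_misclassified):
    """Laplace ratio of a rule, assuming the C5.0/See5 form (n - m + 1) / (n + 2)."""
    if not 0 <= n_misclassified <= n_covered:
        raise ValueError("misclassified count must be between 0 and the covered count")
    return (n_covered - n_misclassified + 1) / (n_covered + 2)


# A rule covering 29 points with no errors scores 30 / 31.
print(round(laplace_ratio(29, 0), 3))
# The same coverage with 5 errors scores 25 / 31.
print(round(laplace_ratio(29, 5), 3))
```

Under this form, a rule that covers only a few points cannot reach a high score even when it makes no errors, so the measure rewards coverage as well as accuracy.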
A method that can be applied to find unique rules is proposed in this paper.It is based on comparison of rules using concepts of similarity and inclusion.These two concepts are determined at a semantic level.The term semantic is used here because comparison is performed based on data points covered by rules.It means that values of similarity and inclusion are obtained by applying equations 5, 6, and 7 with numeric values representing a number of data points exclusively and mutually covered by rules.The methodology is described below.

Methodology
The process of semantic-based analysis of rules is divided into a number of steps, which are described in turn.

The very first step (Step 0) involves the use of rule-generating tools. There are multiple tools that can be used to generate IF-THEN rules. There are no special requirements that these tools should satisfy; the only thing needed is the ability to create rules. A description of such tools is outside the scope of this paper. Here, we use two rule-generating tools: C5.0/See5 and 4cRuleBuilder. The C5.0/See5 tool is an updated version of the well-known C4.5 algorithm [27,34]. It is capable of generating classifiers that are expressed as decision trees or sets of IF-THEN rules. The commercially available tool called 4cRuleBuilder [35] is used to build the second data model. This tool uses supervised learning to generate a data model from discrete numerical or nominal data, and has built-in discretization schemes for continuous attributes.

The tasks that constitute the other steps are specific to the proposed approach and are described below. Step 1.1.1 covers activities related to the evaluation of rules based on their ability to perform a proper classification. Any technique that allows us to determine how good a given rule is can be applied (Appendix). The evaluation process is performed on a set of data points. Each rule, created by any rule-generating tool, is checked against that set. Therefore, the comparison and analysis of rules is sound, i.e., all rules 'see' the same data points, and the calculated 'rule goodness' measures are comparable.

After the goodness of rules is determined, the next step is to select the best rules, Step 1.1.2. The term best refers to rules with the highest scores obtained during their evaluation. That step requires setting a parameter: a threshold value has to be identified in order to distinguish the best rules from all other rules. The value of this threshold depends on the type of rule performance measure used in Step 1.1.1 and should be set specifically for it. It is possible that the value has to be adjusted during the analysis process. One possible reason for such an adjustment is concern about the number of best rules. If the threshold value is set too high and all rules have performance values below it, the threshold can be reduced to ensure that some rules are selected for further analysis.
Once the best rules are identified, they are compared with each other in Steps 1.1.3 and 1.1.4. Equations 5, 6 and 7 are used for identifying sets of different and similar rules representing the phenomenon under investigation. The procedure consists of two phases:
• the first phase (Step 1.1.3) is dedicated to identifying rules that are semantically similar to each other (Eq. 5); similarity measures are calculated for all pairs of rules, and the pairs with similarity values larger than a value specified by the user are considered for removal (the threshold for identifying similar rules depends on the application: a user can indicate interest only in perfect similarity (1.0), or can tolerate a deviation of up to 10-20% from perfect similarity);
• the second phase (Step 1.1.4) is focused on selecting the rules that will be removed; this process is based on inclusion measures (Eqs. 6 and 7); for each rule considered for removal in the previous phase, the values of inclusion of this rule in other rules are calculated; rules with the highest values of the inclusion measure are removed (once again, the inclusion threshold is set based on the user's preferences regarding deviation from perfect inclusion (1.0)).
The steps described above (from 1.1.1 to 1.1.4) are repeated for each category, and for each set of generated rules.
Step 2 is a simple process of combining all rules for a single category into one set. Each set contains unique rules that have been selected across the rules constructed for one category by all tools. The next two steps, Step 3.1 and Step 3.2, are repetitions of Steps 1.1.3 and 1.1.4, but this time they are performed for all previously selected rules for a single category, without distinguishing between rule-generating tools.
As the final result, Step 4, we obtain rules (for each category) that are the most unique across all rules constructed by rule-generating tools.
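The selection and pruning steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the threshold values, the rule identifiers, and the covered point sets are all hypothetical, and ties in inclusion are resolved by removing the second rule of a pair.

```python
from itertools import combinations


def select_best(scores, threshold):
    """Step 1.1.2: keep the rules whose goodness score reaches the threshold."""
    return sorted(r for r, s in scores.items() if s >= threshold)


def prune_similar(covered, sim_threshold, inc_threshold):
    """Steps 1.1.3 and 1.1.4: for each sufficiently similar pair of rules,
    drop the rule that is more fully included in the other one.
    covered maps a rule id to the set of data points the rule covers."""
    kept = dict(covered)
    for r1, r2 in combinations(sorted(covered), 2):
        if r1 not in kept or r2 not in kept:
            continue
        x, y = kept[r1], kept[r2]
        if not x or not y:
            continue
        common = len(x & y)
        if common / len(x | y) < sim_threshold:   # similarity of the pair
            continue
        inc1, inc2 = common / len(x), common / len(y)  # r1 in r2, r2 in r1
        victim, inc = (r2, inc2) if inc2 >= inc1 else (r1, inc1)
        if inc >= inc_threshold:
            del kept[victim]
    return kept


# Hypothetical coverage of rules for one category, produced by two tools.
tool_a = {"A1": set(range(38)), "A2": set(range(29))}
tool_b = {"B1": set(range(100, 120)), "B2": set(range(38))}

# Steps 1.1.3 and 1.1.4 per tool, Step 2 merges, Steps 3.1 and 3.2 prune again.
merged = {}
for tool_rules in (tool_a, tool_b):
    merged.update(prune_similar(tool_rules, sim_threshold=0.75, inc_threshold=1.0))
final = prune_similar(merged, sim_threshold=0.75, inc_threshold=1.0)
print(sorted(final))
```

In this toy run, A2 is removed within its own tool because it is fully included in A1, and B2 is removed after merging because it covers exactly the same points as A1.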

Example
To illustrate the proposed methodology, a simple example is presented. [...] 0.649 that is below the threshold, and the rule R9 needs to be discarded. We are left with 4 rules for Class 0, and 4 rules for Class 1. Now, we need to compare rules within each category, Steps 1.1.3 and 1.1.4. In other words, we need to select rules on the basis of their importance and uniqueness for representing each category. In the end we will be left with the best rules for Class 0 and Class 1, respectively. We start the process by building a contingency table for the rules to be compared, Table 3. How to build this table is explained in Section 3.2.

Notation used in Table 3: a is the number of points covered by both rules; b is the number of points covered exclusively by R'; c is the number of points covered by R'' only; S stands for similarity, and Inc for inclusion.
For each of these pairs, we plug the values a, b, c, and d into equations 5, 6, and 7 to calculate the similarity (S(R',R'')) and the inclusions between each pair of rules (Inc(R'->R''), Inc(R''->R')). The results for some pairs are shown in Table 3. The purpose is to keep rules that are important in terms of the defined 'goodness' measure (the Laplace ratio in our case) and uniqueness. The pairs with high semantic similarity values are the ones that will be considered for further analysis. For example, the pair of rules R1 and R3 shows a semantic similarity of about 76%. This means that the chances of one rule being redundant or included in the other are high. The rule inclusion values clearly show that R3 is 100% included in R1. In other words, all the data points that are classified by R3 are also classified by R1. Thus, there will be no coverage degradation if R3 is discarded, because all data points covered by R3 are covered by R1. On the other hand, these two rules are interesting because they are different syntactically and yet cover the same 29 data points. This leads to an interesting observation: if a point has a41>963, then the same point has a3>819 and a39>11253.
For the next pair, we notice 90% inclusion of R4 in R1, even though the similarity of those two rules is not so high. In such a case, it is up to a domain expert to decide whether some loss of coverage is acceptable if R4 is removed. The next two pairs, (R2, R3) and (R2, R4), show fair and low ranges of similarity and inclusion, respectively.

System for Semantic-based Rule Analysis

Motivation and Objectives
Extraction of IF-THEN rules from data is of significant importance for discovering important or interesting relationships existing among different data attributes. Therefore, we decided to develop a system that could perform the whole task automatically, using multiple rule-generating tools, provide high consistency of [...]

The following objectives were taken into account during the design process of the SSACR:
• ability to handle different rule-generating tools; this means the capability to evaluate rules generated by different tools and represented using different formats;
• ability to evaluate different rules (generated by different tools, as above) against the same set of data points, so that a uniform comparison of rules is possible; this is necessary for semantic-based analysis of rules;
• ability to handle different splits of data; this means that generation and evaluation of extracted rules can be done based on different subsets of data;
• ability to alter the actions of the SSACR in a simple way; this means that there is a simple way of modifying the rule analysis process through additions, replacements, and deletions of different stages and steps.

System Architecture and Design
The requirements presented above led us to select an architectural pattern called the blackboard [8] as the architectural framework of the SSACR. That pattern is suited to address issues of information sharing and flexibility in performing different tasks, as well as modifiability and extendibility of the system itself. Blackboard systems were designed to resolve complex Artificial Intelligence (AI) problems. The blackboard system works based on the following metaphor [11]: "Imagine a group of human specialists seated next to a large blackboard. The specialists are working cooperatively to solve a problem, using the blackboard as the workplace for developing the solution. Problem solving begins when the problem and initial data are written onto the blackboard. The specialists watch the blackboard, looking for an opportunity to apply their expertise to the developing solution. When a specialist finds sufficient information to make a contribution, she records the contribution on the blackboard, hopefully enabling other specialists to apply their expertise. This process of adding contributions to the blackboard continues until the problem has been solved".
There are a number of features that make this architectural pattern suitable for solving complex problems. Blackboard systems are flexible in the selection of steps that should be taken to accomplish a specific task. They can be defined as repositories of solutions and contributions to the current problem that are repeatedly updated. The blackboard architecture consists of three major components:
• The specialized modules, which are built by experts to provide the specific expertise needed by application systems.
• The blackboards, which contain all the data, problem statements, updated solutions, and any contributions leading towards a solution. The blackboards are updated and watched continuously.
• The control unit, which controls the flow of problem-solving activities. This can be seen as a way to organize the use of data so that data and intermediate solutions flow among working units in an effective and coherent manner.

When compared with the methodology presented in Section 5, the SSACR has the ability to utilize multiple data splits. This concept has been introduced to address the issue of generalization of rules. Multiple splits allow different subsets of data to be used for generation and evaluation of rules. For example, a dataset can be divided into three subsets: training, validation, and testing. The training subset is used to generate rules, the validation subset is used to evaluate them, and the testing subset is left to evaluate rules that have been selected through the analysis process. Such an approach leads to the selection of more generic rules.

Figure 3 below represents the blackboard architecture of the SSACR. A number of blackboards are used to ease the readability of the solution. The blackboards are split on the basis of the type of information they contain:
• data_blackboard includes the datasets under consideration; every time the system is run for a different dataset, the new data are copied into the data_blackboard, replacing the existing dataset;
• results_blackboard contains the results obtained for a given dataset; it also contains all intermediate results (classification results of all rules) necessary for performing the final semantic-based analysis of rules;
• goal_blackboard is a blackboard that controls all tasks involved in generation, evaluation and analysis of rules;
• main_config contains basic information about data files and their formats, as well as the rule-generating tools that are used.

The Control Unit is in charge of the whole analysis process, and supervises the execution of individual steps and activities. It is responsible for calling different specialized modules in an ordered fashion. It may also be defined as a central control component that evaluates the current state of processing and coordinates the activities of specialized modules [8].

The Specialized Modules form the main structure of the SSACR blackboard architecture. These modules are used collectively to select the best rules among the large number of rules generated by rule-generating tools. They are built as independent units. The major modules of the SSACR are described below:
• write_xyz_data: this module is responsible for reading the datasets (training, validation, and/or test) and writing them into the data_blackboard. A version of this module, with a different name, exists for each dataset. For example, write_val_data deals with the validation data, and similarly write_test_data deals with the test dataset.
• RuleBuilders: this module converts rules into C code. Every rule-generating tool generates rules in a specific format. Therefore, there is a need to translate these different formats into a generic format that will be used by the SSACR. Each RuleBuilder module takes as its input a file with rules generated by a specific rule-generating tool and creates, based on this file, a C program containing these rules as conditional statements. In other words, each RuleBuilder module "translates" IF-THEN rules into a program. The generated C program is then compiled and used as a dynamic library. The main reason for such a solution is efficient evaluation of rules: once they are transformed into a program, they can be used to classify any data points. Therefore, rules are compared based on coverage calculated using the same data sets.
• fire_rules: this module implements one of the important steps in the rule analysis process. The fire_rules module picks a single data point, "takes" a rule from the respective dynamic library, and "fires" the rule. The result is a pair <ruleID, category>: the identifier of the rule that has been fired (<ruleID, …>) and the category predicted by that rule (<…, category>).
• comapare_rules: this module performs the most important step in the process of analysis of rules. Each rule is semantically compared with all the other rules of the same category. They are compared based on their goodness, similarity and inclusion, as discussed in Section 5. Rules that are similar to other rules and rules that are included in others are marked for removal.

The SSACR is implemented in Python 2.4, and Extensible Markup Language (XML) is used as the format for storing data, results, and control information in blackboards.
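The interplay of blackboards, specialized modules, and the control unit can be illustrated with a small sketch. It is not the SSACR code: rules are plain Python predicates here instead of compiled C libraries, the blackboard is a dictionary instead of XML files, and all function names, data values, rule thresholds, and categories are hypothetical.

```python
# Rules as (antecedent, predicted category); stand-ins for the compiled libraries.
RULES = {
    "R1": (lambda p: p["a3"] > 819, 1),
    "R2": (lambda p: p["a41"] > 963, 1),
}


def write_train_data(bb):
    """Specialized module: place the data points on the data blackboard."""
    bb["data"] = [
        {"a3": 900, "a41": 1000},
        {"a3": 850, "a41": 10},
        {"a3": 100, "a41": 50},
    ]


def fire_rules(bb):
    """Specialized module: record (point index, rule id, category) for each firing."""
    bb["results"] = [
        (i, rule_id, category)
        for i, point in enumerate(bb["data"])
        for rule_id, (antecedent, category) in RULES.items()
        if antecedent(point)
    ]


def collect_coverage(bb):
    """Specialized module: derive the set of points covered by each rule."""
    coverage = {}
    for i, rule_id, _category in bb["results"]:
        coverage.setdefault(rule_id, set()).add(i)
    bb["coverage"] = coverage


MODULES = {
    "write_train_data": write_train_data,
    "fire_rules": fire_rules,
    "collect_coverage": collect_coverage,
}


def control_unit(bb):
    """Call the modules in the order listed on the goal blackboard."""
    while bb["goals"]:
        MODULES[bb["goals"].pop(0)](bb)


blackboard = {"goals": ["write_train_data", "fire_rules", "collect_coverage"]}
control_unit(blackboard)
print(blackboard["coverage"])
```

Because every module reads from and writes to the shared blackboard only, a step can be added, replaced, or removed by editing the goal list, which mirrors the modifiability objective stated for the system.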

Maintainability Data -Individual Programmer Approach
The experimental section of the paper describes the analysis of software maintenance data representing individual software engineers. In the first experiment, we extract rules representing (modeling) each engineer and identify commonalities among these rules. These activities illustrate a process of identifying the most suitable rules for transparent (white-box) decision-making processes. All 366 data points that belong to the three categories 1 (POOR), 3 (FAIR), and 5 (EXCELLENT) are used for generation of rules (see Section 4 for details). Two experimental scenarios are defined to generate rules for each category:
a. all-category scenario: rules for all three categories (1, 3, and 5) are generated in one run of the rule-generating tool (the tool used in this section is C5.0/See5); the input is a single data file consisting of data points identified as 1, 3 or 5;
b. two-category scenario: the original data file is used to create three data files; each new file consists of data points that belong to two groups: a group representing one of the categories 1, 3 or 5, and a group of points that do not belong to that category; for example, the data file for category 1 contains just two categories, 1 and 0 (the data points identified as 3 or 5 in the original file become points of category 0).
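The construction of the three data files for the two-category scenario is a one-versus-rest relabeling. A minimal sketch is given below; the function name and the sample labels are hypothetical.

```python
def two_category_labels(labels, categories=(1, 3, 5)):
    """For each category, relabel every point outside that category as 0."""
    return {c: [c if label == c else 0 for label in labels] for c in categories}


# Hypothetical category labels of five software objects.
labels = [1, 3, 5, 3, 1]
files = two_category_labels(labels)
for category, relabeled in files.items():
    print(category, relabeled)
```

Each resulting list would then be paired with the unchanged software metrics and passed to the rule-generating tool in a separate run.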
Once multiple rules are generated for each category and for each software engineer, we select the best rules. We focus on only two categories: Poor Maintainability and Excellent Maintainability. The goal is to identify just two rules for each engineer and each category. Therefore, we do not specify any specific threshold values (Section 5.3). We want to emphasize that these thresholds are flexible, and it is up to a domain expert to decide on their values. Those values depend on the expert's needs and his/her ability to tolerate imperfections in similarity and inclusion of rules.

Software Engineer V
The software engineer V indicated that 58 objects (15.8% of 366) are of Poor Maintainability. The application of both scenarios for generating rules results in two rules for the all-category scenario, and two rules for the two-category scenario. The values of Laplace ratios are in the range of 0.88 to 0.90 for rules generated by the all-category scenario, and 0.95 to 0.97 for rules generated by the two-category scenario.
The initial selection of rules is done solely on the basis of "goodness" of rules. From all these rules, the rules with the Laplace ratio above 0.9 are selected: a single rule V-L-1A from the first set (the letter A indicates that the rule was generated using the all-category scenario), and two rules V-L-1 and V-L-2 from the second set. All three rules are shown below. The inspection of these three rules leads to the conclusion that the rule V-L-1 is the best. It properly classifies 29 points and does not misclassify any points. A contingency table for all three rules is built, Table 4. It can be observed that the rules V-L-1 and V-L-1A are the most similar ones.

High similarity of rules V-L-1 and V-L-1A means that one of them can be redundant and removed from the set of rules.The inclusion measures are calculated for V-L-1 and V-L-1A in reference to the other two rules, Table 5.The values of inclusion measures indicate that the rule V-L-1 is fully included in the rule V-L-1A.This means that the rule V-L-1 can be removed from the set of rules without any degradation of coverage.
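The redundancy argument above can be made concrete with a small sketch. The inclusion of a rule R1 in a rule R2 is taken here as the fraction of data points covered by R1 that are also covered by R2, which matches how the paper words its inclusion results. The similarity measure itself is defined in Section 5 and is not reproduced in this part of the text, so the Jaccard coefficient below is only an assumed stand-in. Function names and the toy coverage sets are hypothetical.

```python
from typing import Set


def inclusion(covered_r1: Set[int], covered_r2: Set[int]) -> float:
    """Fraction of the points covered by R1 that R2 covers as well."""
    if not covered_r1:
        return 0.0
    return len(covered_r1 & covered_r2) / len(covered_r1)


def jaccard_similarity(covered_r1: Set[int], covered_r2: Set[int]) -> float:
    """Assumed stand-in for the similarity measure of Section 5."""
    union = covered_r1 | covered_r2
    if not union:
        return 0.0
    return len(covered_r1 & covered_r2) / len(union)


# Toy coverage sets (identifiers of data points), not taken from the paper.
narrow_rule = {1, 2, 3}
broad_rule = {1, 2, 3, 4, 5, 6}

narrow_in_broad = inclusion(narrow_rule, broad_rule)   # 1.0: fully included
broad_in_narrow = inclusion(broad_rule, narrow_rule)   # 0.5
similarity = jaccard_similarity(narrow_rule, broad_rule)  # 0.5
```

Inclusion is asymmetric: the narrow rule can be dropped without losing coverage because its inclusion in the broad rule is 1.0, while the reverse does not hold.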
Overall, two rules V-L-1A and V-L-2 (rules which are underlined) are used to represent the software engineer V. According to these rules, a software object is of Poor Maintainability if it has the sum of each attribute's value (ATCO) higher than 52, and its Halstead Program Volume (HLVL) is higher than 11,253.

Table 5: Inclusion table for rules V-L-1 and V-L-1A
Inclusion pair Inclusion measure 0.5000

Software Engineer A
The software engineer A identified the smallest number of software objects as of Poor Maintainability. He pointed to only 18 such objects (4.9% of 366). The application of both scenarios resulted in four rules for the all-category scenario, and three for the two-category scenario. The Laplace ratios for these rules are much lower than for the rules representing the software engineer V. The best three rules across all of them are presented below. The semantic similarities are presented in Table 6. As can be observed, all the rules are quite different from each other. The highest value of the measure is obtained for the pair A-L-1 and A-L-2A. The calculations of inclusion measures for those two rules resulted in the elimination of the rule A-L-2A (its inclusion in the rule A-L-1 is 0.7500).
According to the proposed analysis, similarity measures are calculated, Table 7. Following this, inclusion measures of rules D-L-1 and D-L-1A in reference to other rules are obtained. As a result, the rule D-L-1 is removed (its inclusion in D-L-1A is 0.8333).

The simplest of the rules that represent the software engineer D indicates that Poor Maintainability software objects are the ones whose number of remote methods (REMM) exceeds 101. The second rule is more complex: the Poor Maintainability software objects have more than 401 lines of code (LOC), a depth of inheritance (DINH) less than or equal to two, more than one sibling (SIBL), more than 80 methods executed in response to a message received by the class (RFO), and a mean method name length (MNL3) over 13 characters.
The similarity measures for all three pairs indicate that the rules are alike. The rules of the last pair, Table 8, are the most similar. After calculating inclusion measures it becomes obvious that the rule V-H-1 can be removed (its inclusion measure with the rule V-H-2 is 0.9200).
The similarity measures are quite low, indicating that the rules cover quite different objects, Table 9. In order to eliminate one rule, we pick the first pair and, after calculating the inclusion measure, remove the rule A-H-1 (the measure of inclusion in the rule A-H-2 is 0.8529).
The pair D-H-1 and D-H-2 has the highest value of similarity measure (Table 10), and the rule D-H-1 has a high value of inclusion measure (0.9437 with the rule D-H-3).

Comparison of Rules Generated for Software Engineers V, A, and D
The comparison of all rules generated for all programmers shows some similarities. The similarity value calculated for the three rules V-H-2, A-H-2, D-H-3 (0.6462) indicates that these rules are quite similar. This leads to the statement that the three software engineers identified a reasonable number of the same software objects as of Excellent Maintainability. These objects, which are practically the same, are represented by rules that look quite different. This means that each engineer "saw" different aspects of software objects. A better description of Excellent Maintainability objects can be obtained when those three rules are combined.
The second experiment illustrates the ability of the SSACR to utilize multiple data sets and rule-generating tools. The SSACR is applied to process the maintenance data associated with the software engineer V. Two different rule-generating tools are used to extract IF-THEN rules from the data. One of them is C5.0/See5, already known from Sections 5 and 7. The other one is 4cRuleBuilder. The maintainability data were pre-processed and three output categories were identified: POOR, FAIR, and EXCELLENT (Section 4). The approach described here is slightly different from the one presented in Section 7. The whole data set of 366 data points has been split into two subsets: training and testing. The sizes of these sets are 245 and 121 data points, respectively. The training subset is used to generate rules, and the testing subset is utilized during semantic analysis of rules for calculating similarity and inclusion measures. This time, the values of thresholds (Section 5.3) are fixed. The threshold for "goodness" is set to 0.7 for C5.0/See5, and 0.6 for 4cRuleBuilder. In the case of similarity and inclusion, the value of 1.0 indicates an automatic removal of a rule.
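The setup of the second experiment can be summarized in a short configuration sketch. The subset sizes and threshold values come from the text; everything else is an assumption made for illustration, in particular the seeded random shuffle (the paper does not state how the split was drawn) and the names `split_train_test`, `GOODNESS_THRESHOLD`, and `passes_goodness`.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Threshold values stated in the text; the dictionary layout is hypothetical.
GOODNESS_THRESHOLD: Dict[str, float] = {"C5.0/See5": 0.7, "4cRuleBuilder": 0.6}
REMOVAL_LEVEL = 1.0  # similarity/inclusion value that triggers automatic removal


def split_train_test(points: Sequence[int], n_train: int = 245,
                     seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split data points into training and testing subsets.

    A seeded shuffle is assumed here; the paper only gives the subset sizes.
    """
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def passes_goodness(tool: str, laplace_ratio: float) -> bool:
    """Keep a rule only if its Laplace ratio reaches the tool's threshold."""
    return laplace_ratio >= GOODNESS_THRESHOLD[tool]


train, test = split_train_test(range(366))
```

With 366 data points and `n_train=245`, the testing subset contains the remaining 121 points, matching the sizes reported above.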

Rules Generated by C5.0/See5
C5.0/See5 generates 11 rules. Among those 11 rules, five rules are for the category Poor Maintainability, three rules are for the category Fair Maintainability, and the remaining three rules are for the category Excellent Maintainability.
Excellent Maintainability. The rules generated for the category Excellent Maintainability are as follows: Rule C5-M-1: (17, 0, 0.947) IF
Those rules are evaluated using the test dataset. It is important to verify how they perform on previously "unseen" data. The values of the Laplace ratio calculated on test data for the three rules range from 0.882 to 0.956. All three rules have Laplace ratio values above the threshold value; therefore, all of them are considered further for finding the best rules for prediction of Excellent Maintainability. The values of similarity and inclusion measures are presented in Table 11.
The above comparison of rules for Excellent Maintainability clearly shows that the rule C5-M-2 dominates the other two rules. The rule C5-M-1 is completely included in the rule C5-M-2 (Figure 4), and hence can be discarded without any loss of coverage. The rule C5-M-3 is more than 86% included in the rule C5-M-2, and if the analysis is focused on finding a small number of rules with a possible loss of coverage, this rule can also be discarded. The fact that the rule C5-M-2 appears to be the most important and relevant is supported by the high value of its Laplace ratio, 0.956, which is the highest among all three rules. Elimination of the rule C5-M-3 depends on the type of application and the requirements of domain experts. There is a slight loss of coverage if the rule C5-M-3 is ignored; thus domain experts have to critically analyze the possible loss before making any decision regarding elimination of this rule.
Another interesting observation is made when we look at those rules from the point of view of their syntax. These rules are very different syntactically. For example, the first two rules have hardly any attributes in common. The only attribute in common is MNL1 (Maximum Method Name Length), but if we look at the results of comparison of those rules, Figure 4, we find that the rule C5-M-1 is 100 percent included in the rule C5-M-2. All the data points that are covered by the rule C5-M-1 are also covered by the rule C5-M-2. This proves the redundancy of these rules: they cover the same points. Thus, an important observation is that two rules might be very different syntactically, yet based on semantic measures they cover the same data points. In other words, two syntactically different rules provide us with different descriptions of the same data points; this can be useful during a process of knowledge extraction.
The values of similarity and inclusion obtained for both rules are presented in Table 13.
Figure 6 illustrates that the rules are quite different. Neither of them is fully included in the other. The inclusion measures, Table 13, are 0.5000 and 0.8570. Although the measures are relatively high, we do not discard any rule: at the beginning of Section 8 we stated that removal of a rule happens only when inclusion equals 1.0. Thus, both rules should be used to predict Poor Maintainability.
Table 16 is the summary of all rules, generated by both tools (C5.0/See5 and 4cRuleBuilder), that have been found to be unique.

Combined Analysis of Rules generated by C5.0/See5 and 4cRuleBuilder
The SSACR is able to perform one more important task: comparison of rules generated by different tools.
In the presented work, there are two different tools, C5.0/See5 and 4cRuleBuilder. Figure 9 shows the comparison chart of inclusions of C5.0/See5 and 4cRuleBuilder rules. To ease readability, Figure 9 does not show inclusions among all rules. Clearly, we see that the rule C5-M-2, generated by C5.0/See5, dominates, as it covers a good percentage of data points covered by other rules. There are two possible ways of looking at the results. We already know that the rule C5-M-2 dominates all the rules generated by both tools and thus can be used alone to predict Excellent Maintainability. Alternatively, domain experts can use the set of other rules, i.e., C5-M-3, 4C-M-1, 4C-M-2, and 4C-M-3. Comparison of these rules clearly illustrates that this set of rules is very distinctive and the rules are dissimilar to each other. Thus, this set of four rules can be used instead of the rule C5-M-2. These four rules are very different syntactically when compared with the rule C5-M-2, which means that we have a chance to obtain a very different description (in the form of different software attributes) of software objects that belong to the category Excellent Maintainability.
Now, we show the comparison of rules for the category Fair Maintainability. From Table 16, we have two rules generated by C5.0/See5, C5-M-6 and C5-M-7, and three rules generated by 4cRuleBuilder, 4C-M-4, 4C-M-5, and 4C-M-6. The values of similarity and inclusion between these rules are shown in Table 18. The analysis clearly indicates that all these rules are quite different. Only a single case of high similarity and inclusion is observed. The rules C5-M-7 and 4C-M-4 have 70.4% similarity. They also display high levels of inclusion: 95% of data points covered by the rule 4C-M-4 are also covered by the rule C5-M-7. Therefore, 4C-M-4 can be discarded. This leaves the four best rules: C5-M-6, C5-M-7, 4C-M-5, and 4C-M-6.

Conclusions
The paper presents a methodology for semantic-based comparison of rules using measures of similarity and inclusion. Application of these measures gives a chance to look "inside" rules and creates the possibility of selecting good and unique rules representing the analyzed data. Such sets of rules represent the most suitable rules for constructing rule-based decision support systems. A system for analysis of rules based on the proposed methodology has been designed and implemented; it is called the System for Semantic-based Analysis and Comparison of Rules (SSACR). The SSACR offers a mechanism for finding generalized rules based on their semantic comparison. This mechanism provides a better understanding of the rules' performance and thus enables domain experts to choose the rules that interest them the most. The system considers the Laplace ratio as a distinguishing measure of goodness of rules, and identifies a set of rules that can be considered for removal. The proposed methodology is applied to software maintenance data that represent evaluations of maintainability levels of software objects performed independently by three software engineers. The paper presents rules that have been identified as the best and most unique ones. Sets of rules representing software attributes that each software engineer implicitly considered as the most important for a given level of maintainability are shown. The results of the comparison of these rules are also included. The comparison of rules constructed by different rule-generating tools is discussed.

Appendix
One important way of evaluating a rule is to numerically represent its classification capability. In other words, this means calculating the "goodness" of rules. The "goodness" can be calculated in a number of ways using a variety of measures. Some of these measures are directly obtained from the classification process (the number of correctly classified data points, the number of misclassified points), while some are obtained via simple arithmetic calculations, for example accuracy, recall, and precision, as well as confidence and support. There are a number of ways in which the correctness of a rule can be evaluated. The values of accuracy, recall, and precision are calculated based on the confusion matrix. In the confusion matrix for a two-class classification process, let a represent TruePositive, b FalsePositive, c FalseNegative, and d TrueNegative. In such a case the following formulas can be used:
The method used in the paper is based on application of the Laplace ratio. This ratio is easily applicable to evaluating the "goodness" of a single rule. In such a case, the direct measures, i.e., the number of properly classified data points and the number of misclassified data points, are used. The formula used for calculating the value of the ratio is presented below:
$$\mathrm{LaplaceRatio} = \frac{\mathrm{Properly\_Classified\_Points} + 1}{\mathrm{Properly\_Classified\_Points} + \mathrm{Misclassified\_Points} + 2} \tag{10.13}$$
This formula provides a balance between a rule's ability to properly classify data points and its popularity. The popularity relates to the number of data points that satisfy the rule. It is assumed that such behavior of the Laplace ratio is closely related to the human perception of the "goodness" of a rule, which is a combination of the classification ability of the rule and its generality.
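The measures discussed in this appendix can be checked with a few lines of Python. The Laplace ratio follows formula (10.13). The accuracy, precision, and recall functions use the standard textbook definitions over the confusion-matrix entries a, b, c, and d; they are given here as an assumption, since the paper's own formulas are not reproduced in this part of the text. The triples used as inputs are the (properly classified, misclassified, Laplace ratio) annotations attached to rules C5-M-1, C5-M-6, and C5-M-7 in the paper.

```python
def laplace_ratio(correct: int, wrong: int) -> float:
    """Laplace ratio of a single rule, formula (10.13)."""
    return (correct + 1) / (correct + wrong + 2)


# Standard definitions, assumed here; a = TruePositive, b = FalsePositive,
# c = FalseNegative, d = TrueNegative.
def accuracy(a: int, b: int, c: int, d: int) -> float:
    return (a + d) / (a + b + c + d)


def precision(a: int, b: int) -> float:
    return a / (a + b)


def recall(a: int, c: int) -> float:
    return a / (a + c)


# Rule annotations quoted in the paper: (correct, wrong, Laplace ratio).
c5_m_1 = round(laplace_ratio(17, 0), 3)   # 0.947
c5_m_6 = round(laplace_ratio(12, 4), 3)   # 0.722
c5_m_7 = round(laplace_ratio(52, 19), 3)  # 0.726

# Two error-free rules: the one covering more points scores higher.
narrow = laplace_ratio(1, 0)    # 2/3
popular = laplace_ratio(29, 0)  # 30/31
```

The last two values illustrate the balance mentioned above: both rules make no mistakes, yet the rule satisfied by 29 points receives a noticeably higher ratio than the rule satisfied by a single point.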

Figure 1: Distribution of maintainability data

generate sets of rules using different rule-generating tools
1 for each set of rules do
    1.1 for each category (for each unique value from consequences of rules) do
        1.1.1 calculate rule performance measures
        1.1.2 select rules with the highest values of measures
        1.1.3 evaluate similarity of rules
        1.1.4 evaluate inclusion of rules and identify unique rules (the rules that are not similar to others, and the rules that are not included in other rules)
    end
    1.2 gather all unique rules for each category
end
2 combine rules for each category that have been selected from sets of rules generated by different tools
3 for each category do
    3.1 analyze rules (similarity and inclusion)
    3.2 identify unique rules
end
4 gather all unique rules for each category
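The procedure listed above can be approximated by the following Python sketch. It is a simplified illustration under stated assumptions, not the SSACR code: rules are represented only by their category, their classification counts, and the set of data points they cover; goodness is the Laplace ratio; a rule is dropped when its inclusion in an already kept, better rule reaches the removal level; and the separate similarity check is omitted for brevity. All names and the toy rules are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# rule id -> (category, correct, wrong, ids of covered data points)
RuleInfo = Tuple[str, int, int, Set[int]]


def laplace(correct: int, wrong: int) -> float:
    return (correct + 1) / (correct + wrong + 2)


def inclusion(r1: Set[int], r2: Set[int]) -> float:
    """Fraction of the points covered by r1 that r2 covers as well."""
    return len(r1 & r2) / len(r1) if r1 else 0.0


def unique_rules(rules: Dict[str, RuleInfo], goodness: float,
                 removal: float = 1.0) -> Dict[str, List[str]]:
    """Per category: filter by goodness, then drop included rules."""
    by_category: Dict[str, List[str]] = {}
    for rule_id, (category, ok, bad, _) in rules.items():
        if laplace(ok, bad) >= goodness:
            by_category.setdefault(category, []).append(rule_id)

    result: Dict[str, List[str]] = {}
    for category, ids in by_category.items():
        # Visit the best rules first so that weaker duplicates are removed.
        ids.sort(key=lambda r: laplace(rules[r][1], rules[r][2]), reverse=True)
        kept: List[str] = []
        for rule_id in ids:
            covered = rules[rule_id][3]
            if any(inclusion(covered, rules[k][3]) >= removal for k in kept):
                continue  # redundant: its points are covered by a kept rule
            kept.append(rule_id)
        result[category] = kept
    return result


# Toy rules, not taken from the paper.
toy_rules: Dict[str, RuleInfo] = {
    "R1": ("excellent", 17, 0, {1, 2, 3}),
    "R2": ("excellent", 43, 1, {1, 2, 3, 4, 5}),
    "R3": ("fair", 1, 3, {9}),
}
selected = unique_rules(toy_rules, goodness=0.7)
```

Steps 2 to 4 of the procedure correspond to merging the surviving rules of several tools into one dictionary and calling `unique_rules` again on the merged set. Lowering `removal` below 1.0 trades coverage for a smaller rule set, which is the decision the paper leaves to domain experts.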

Figure 4: Rule Inclusion Diagram for rules generated by C5.0/See5 for the category Excellent_Maintainability

Fair Maintainability. For the category Fair Maintainability, there are three rules generated by C5.0/See5 based on the training dataset. Firing these rules on the test dataset and ignoring the ones with Laplace ratio values below 0.70 (the threshold) leads to two rules. They are as follows:
Rule C5-M-6: (12, 4, 0.722) IF MNL1 > 25 & DAC <= 3 THEN category: fair maintainability
Rule C5-M-7: (52, 19, 0.726) IF HLOR > 55 THEN category: fair maintainability
The rules are compared, Table 12 and Figure 5. Rules C5-M-6 and C5-M-7 have low similarity and inclusion values. Based on the above comparison of these two rules, neither of them can be ignored. Both rules are used to predict the category Fair Maintainability on maintainability data.

Figure 9: Comparison of the best rules generated by C5.0/See5 and 4cRuleBuilder
the rule is the relative frequency of rules that contain A and C: (10.11)(10.12)

Table 1: A contingency table for rules R1 and R2

Table 3: Values of Similarity and Inclusion
results, use less time, and handle large datasets. The developed system is called the SSACR, the System for Semantic-based Analysis and Comparison of Rules.

Table 4: Contingency table for Poor_Maintainability rules representing the software engineer V

Pair of rules (R1, R2) Contingency table entries Similarity measure a
The two remaining rules define Poor Maintainability, from the point of view of the software engineer A, in the following way: a software object is of Poor Maintainability if its Halstead Difficulty (HLDF) is above 135, or if it is a GUI object with more than 695 lines of code (LOC), a median number of decisions (MWDC) less than or equal to one, no children (CHLD), and more than one overridden method (DVRM).

Table 6: Contingency table for Poor_Maintainability rules representing the software engineer A
The software engineer D evaluated 35 (9.6% of 366) software objects as of Poor Maintainability. The two rule generation approaches resulted in five rules for the all-category scenario, and two for the two-category scenario. The best rules are shown:

Table 7: Contingency table for Poor_Maintainability rules representing the programmer D

Comparison of Rules Generated for Software Engineers V, A, and D
The analysis of all generated rules has led to the selection of three pairs of rules, with each pair "representing" a single engineer. The next step is a comparison of those three pairs of rules. This comparison should indicate whether there is any common ground used to identify Poor Maintainability objects. The similarity measures are calculated for eight triples. We can conclude, based on the results, that there is very little "agreement" among the three software engineers in identifying Poor Maintainability software objects. The values of similarity measures are very small; in just a few cases the similarity measures are around 0.25.

The rules for Excellent Maintainability for the software engineer V indicate that these software objects have: a number of methods (METH) less than three, a number of decisions (WDC) less than or equal to 11, a maximum method name length (MNL1) less than 22, and a number of methods (WMC2) less than or equal to five; or a maximum method name length (MNL1) less than 24, Halstead Program Length (HLPL) less than 142, and Halstead number of unique operators (HLVR) less than 11.
Table 8: Contingency table for Excellent_Maintainability rules representing the software engineer V

Software Engineer A
The software engineer A identified 248 objects (67.8% of 366) as of Excellent Maintainability. The best rules that represent his decisions are shown:

Table 10: Contingency table for Excellent_Maintainability rules representing the programmer A

Table 11: Contingency table for Excellent_Maintainability rules representing the software engineer V and generated by

Table 12: Contingency table for Fair_Maintainability rules representing the software engineer V and generated by

Table 13: Contingency table for Poor_Maintainability rules representing the software engineer V and generated by
a total of nine rules. One rule is for the category Poor Maintainability, five rules are for the category Fair Maintainability, and three rules are for the category Excellent Maintainability.
Excellent Maintainability. Given below are the three rules for the category Excellent Maintainability generated on training data by 4cRuleBuilder. We include these rules to illustrate how different they are syntactically. Semantic similarity of those rules with the ones generated by C5.0/See5 is analyzed in Section 8.3. All the rules above for the category Excellent Maintainability have high values of the Laplace ratio (the threshold Laplace ratio for 4cRuleBuilder is 0.60), and hence they are compared with each other, Table 14 and Figure 7.
Figure 7: Rule Inclusion Diagram for rules generated by 4cRuleBuilder for the category

Table 15: Contingency table for Excellent_Maintainability rules representing the software engineer V
Figure 8: Rule Inclusion Diagram for rules generated by 4cRuleBuilder for the category Fair_Maintainability
The comparison clearly shows that none of the rules can be discarded. The rules are very different semantically and show very low redundancy.
Poor Maintainability. There is only one rule generated by 4cRuleBuilder for the category Poor Maintainability. The rule is shown below. Additionally, its value of the Laplace ratio is below the threshold. The rule is not considered for further investigation.

Table 16: Summary of Best Rules on Maintainability Data

Until now, we have been comparing rules generated by the same tool for each category. Once the relevant and important rules generated by each tool are found, they can be compared with relevant rules generated by the other tools. The comparison is performed only for the categories Excellent Maintainability and Fair Maintainability. The category Poor Maintainability has only two rules generated by C5.0/See5, and those rules have already been analyzed. All of these rules have high Laplace ratio values and are quite unique (all highly inclusive rules are ignored). The results of the comparison (performed on the test dataset of 121 data points) are presented in Table 17.

Table 17: Contingency table for Excellent_Maintainability rules (the software engineer V)

Table 18: Contingency table for Fair_Maintainability rules representing the software engineer V