Using a blocking approach to preserve privacy in classification rules by inserting dummy transactions

The increasing rate of data sharing among organizations raises the risk of leaking sensitive knowledge, which in turn increases the importance of privacy preservation in the data sharing process. This study focuses on privacy preservation in classification rule mining, a data mining technique. We propose a blocking algorithm for hiding sensitive classification rules. In this solution, rules are hidden by editing a set of transactions that satisfy the sensitive classification rules. The proposed approach also tries to deceive and block adversaries by inserting some dummy transactions. Finally, the solution is evaluated and compared with other available solutions. The results show that limiting the number of attributes edited for each sensitive rule decreases both the number of lost rules and the production rate of ghost rules.


Introduction
Data mining techniques and algorithms used to extract useful, hidden knowledge from databases have their own upsides and downsides. One of the downsides is the disclosure of sensitive information, which jeopardizes the privacy and security of data and can have disastrous consequences for its owners. Privacy Preserving in Data Mining (PPDM) was proposed to address this risk. Applying privacy preserving and data sanitization techniques while still extracting useful new patterns minimizes the possibility of gaining access to classified information. Privacy preservation has been studied for different data mining techniques. The present paper concerns privacy preservation in the mining of classification rules. In the proposed solution, classification rules are first mined from the main database and a number of them are selected as sensitive rules. The sensitive rules are then hidden with a blocking approach, by editing a limited number of attribute values and inserting some dummy transactions into the database. The study aims to hide all the sensitive classification rules while leaving the minimum possible side-effects on the insensitive rules. The fundamental concepts of privacy preserving and related works are presented in Sections 2, 3, and 4. Section 5 presents the proposed algorithm, and Sections 6 and 7 present the discussion and the conclusion.
Production and hosting by ISPACS GmbH (International Scientific Publications and Consulting Services). http://www.ispacs.com/journals/jsca/2017/jsca-00073/

Theoretical backgrounds
Privacy preserving techniques prevent sensitive information from being mined by making changes to the main database. The main purpose of preserving privacy in data mining is to sanitize sensitive knowledge while making the minimum possible changes to the main database, so that the sensitive knowledge can no longer be mined from it. Generally, data sanitization (or data censoring) is carried out in one of the following two ways:
Data Reconstruction: in this method, the main database does not directly go through changes [1].
Data Modification: here, changes are made directly in the main database. This method includes two different techniques:
Definition 2.1. Data Distortion, which is carried out by changing a data value from 1 to 0 or vice versa [2], [3].
Definition 2.2. Data Blocking, which is carried out by replacing a data value with the "?" value [2], [3].
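The difference between the two Data Modification techniques can be illustrated with a minimal sketch. The record, the attribute names, and the helper functions `distort` and `block` are hypothetical and only serve the illustration; they are not taken from the paper.

```python
# Toy binary record: attribute name -> 0/1 value (hypothetical data).
record = {"a": 1, "b": 0, "c": 1}


def distort(rec, attr):
    """Data Distortion: flip the stored binary value (1 -> 0, 0 -> 1)."""
    out = dict(rec)              # work on a copy, keep the original intact
    out[attr] = 1 - out[attr]
    return out


def block(rec, attr, unknown="?"):
    """Data Blocking: replace the stored value with the unknown marker."""
    out = dict(rec)
    out[attr] = unknown
    return out


distorted = distort(record, "a")   # the value of "a" is now wrong (0)
blocked = block(record, "a")       # the value of "a" is now unknown ("?")
```

After distortion the record states a false value for `a`, whereas after blocking it states no value at all, which is the property that matters for databases where wrong values are dangerous.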
One of the downsides of the Data Distortion technique is that it places wrong values in the database. In some databases, such as medical databases, Data Distortion cannot be used because removing some elements can be highly dangerous, and inserting wrong values can lead to terrible consequences [4]. Data sanitization in general, however, has its own problems. Its most common side-effects are hiding failure, lost rules, and ghost rules, which are described in detail in [5], [6]. Various attempts at sanitizing sensitive knowledge have been made so far. Different algorithms have been introduced for hiding sensitive association rules or sensitive itemsets, following approaches such as the heuristic, border-based, exact, or hybrid approach [7][8][9]. This paper concerns the hiding of classification rules. A number of previous studies in this field are presented next.

Related works
Methods introduced for classification rule hiding fall into three general categories: statistical-based, reconstruction-based, and perturbation-based methods. In [10], a reconstruction-based algorithm for hiding sensitive classification rules in categorical databases is presented. To generate a sanitized database, a decision tree including only insensitive rules is built and used to reconstruct the database. Reference [11] presents another reconstruction-based algorithm (titled LSA), which uses insensitive rules mined from the main database to reconstruct the sanitized database. LSA edits the transactions that support both sensitive and insensitive rules so that they support only the insensitive ones, which decreases the side-effects in the sanitized database. Reference [12] includes another algorithm for hiding classification rules. It is based on a data reduction approach and performs the hiding process by completely deleting the selected tuples. The solution proposed in [13] is based on data perturbation: the ROD algorithm hides sensitive classification rules in categorical databases. In ROD, after the sensitive and insensitive rules are separated, the tuples supporting the insensitive rules enter the sanitized database. Then some attributes are replaced in every tuple related to the sensitive rules, so that these tuples look like tuples supporting insensitive rules. One of the obstacles of data perturbation based approaches is the substitution of correct values in the main database with wrong values (changing 1 to 0 and vice versa). As mentioned earlier, in certain settings such as medical databases this substitution may have severe consequences [14]. Hence, a data blocking based approach was introduced in [15]. In this approach, all values of the attributes in sensitive rules are substituted with the unknown value "?" in all transactions supporting the sensitive rules. This method inspired our proposed solution for hiding classification rules. Our solution performs the hiding process by substituting only a limited number of sensitive attributes in the sensitive transactions. It also tries to deceive and block adversaries by inserting dummy transactions that include unknown values.

Problem definition
Suppose that $R_c$ is a classification rule mined from the main database D:

$$R_c : (A_1 = v_1) \wedge (A_2 = v_2) \wedge \dots \wedge (A_k = v_k) \rightarrow (C = c) \qquad (1)$$

The left side of the above rule includes attributes and their values, with $A_1, A_2, \dots, A_k \neq C$, and the right side includes the class attribute and its value. Every classification rule has a degree of support, which is the number of transactions supporting the rule. A transaction in a database supports a classification rule if all attributes of the rule exist in the transaction with the same values as in the rule. Given the above, the problem of classification rule hiding is stated as follows: if R is the set of classification rules mined from the main database D and Rs is the set of sensitive rules (Rs ⊆ R), then the goal is to generate a sanitized database from which the sensitive rules cannot be mined, while as many of the insensitive rules R − Rs as possible can still be mined.
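The notions of a supporting transaction and of degree of support can be made concrete with a small sketch. The dictionary-based rule representation, the function names `supports` and `degree_of_support`, and the four sample rows are assumptions made for illustration; the rows are hypothetical and are not the contents of Table 1.

```python
def supports(transaction, rule):
    """True if every attribute of the rule (left side and class attribute)
    is present in the transaction with the same value as in the rule."""
    conditions = dict(rule["lhs"])
    conditions[rule["class_attr"]] = rule["class_value"]
    return all(transaction.get(attr) == value for attr, value in conditions.items())


def degree_of_support(database, rule):
    """Number of transactions in the database that support the rule."""
    return sum(1 for transaction in database if supports(transaction, rule))


# A rule shaped like the second rule of Table 2.
rule = {
    "lhs": {"years at current work": "medium", "black list": "yes"},
    "class_attr": "approval result",
    "class_value": "NO",
}

# Hypothetical rows, only for illustration.
database = [
    {"gender": "male", "years at current work": "medium", "black list": "yes", "approval result": "NO"},
    {"gender": "female", "years at current work": "medium", "black list": "no", "approval result": "YES"},
    {"gender": "male", "years at current work": "medium", "black list": "yes", "approval result": "YES"},
    {"gender": "female", "years at current work": "medium", "black list": "yes", "approval result": "NO"},
]

support = degree_of_support(database, rule)
```

In this toy data the third row matches the left side of the rule but has a different class value, so it does not count as a supporting transaction.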

Main section: The proposed algorithm (BCR)
To hide sensitive classification rules, the proposed solution substitutes a certain number of attribute values with an unknown value in the transactions supporting the sensitive rules, so that these rules cannot be mined from the final database. The method aims both at hiding the sensitive rules and at editing the database in such a way that adversaries cannot detect the values that existed before editing. For this reason, transactions with unknown values are inserted for a number of sensitive rules. In the BCR method, after the classification rules are mined from the database and the sensitive rules are selected, the sensitive rules are sorted in descending order of their number of attributes. The database is then scanned for every sensitive rule, and the transactions supporting the rule are selected for editing. To hide a sensitive rule, not all attributes of the rule but only a certain number of them are blocked. Formula 2 determines this number, X, where NA is the number of attributes of the sensitive rule.
In the blocking process, the class attribute is always blocked, and the remaining X − 1 attributes to be blocked are chosen randomly from the attributes on the left side of the rule. The blocking process for every sensitive rule is carried out according to the rule's degree of support. In other words, original values are substituted with unknown values in the transactions supporting the sensitive rule until the rule is finally hidden.
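The sorting and blocking steps can be sketched as follows. This is a simplified illustration under several assumptions: the number of blocked attributes `x` is passed in as a parameter rather than computed from Formula 2, every supporting transaction is blocked (the paper stops once the rule is hidden, which would require re-mining), and the rule representation, the function names, and the sample rows are hypothetical.

```python
import random

UNKNOWN = "?"


def supports(transaction, rule):
    """True if the transaction matches the rule's left side and class value."""
    conditions = dict(rule["lhs"])
    conditions[rule["class_attr"]] = rule["class_value"]
    return all(transaction.get(a) == v for a, v in conditions.items())


def sort_sensitive_rules(rules):
    """Descending order by number of attributes (left side plus class attribute)."""
    return sorted(rules, key=lambda r: len(r["lhs"]) + 1, reverse=True)


def block_rule(database, rule, x, rng):
    """Return a copy of the database in which each supporting transaction has
    x values replaced by '?': the class attribute plus x - 1 randomly chosen
    left-side attributes of the rule. Assumes 1 <= x <= len(rule['lhs']) + 1."""
    sanitized = []
    for transaction in database:
        row = dict(transaction)
        if supports(row, rule):
            victims = [rule["class_attr"]] + rng.sample(sorted(rule["lhs"]), x - 1)
            for attr in victims:
                row[attr] = UNKNOWN
        sanitized.append(row)
    return sanitized


short_rule = {"lhs": {"black list": "yes"}, "class_attr": "approval result", "class_value": "NO"}
long_rule = {
    "lhs": {"years at current work": "medium", "black list": "yes"},
    "class_attr": "approval result",
    "class_value": "NO",
}
ordered = sort_sensitive_rules([short_rule, long_rule])

# Hypothetical rows, only for illustration.
database = [
    {"gender": "male", "years at current work": "medium", "black list": "yes", "approval result": "NO"},
    {"gender": "female", "years at current work": "short", "black list": "no", "approval result": "YES"},
]
sanitized = block_rule(database, long_rule, x=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the random choice of left-side attributes reproducible, which is convenient when comparing sanitized databases across runs.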
During the blocking process, dummy transactions are inserted for a number of the sensitive rules.
According to the sorted order of the sensitive rules, $\lceil n/2 \rceil$ dummy transactions are inserted for the first sensitive rule (the one with the highest number of attributes), $\lceil n/2 \rceil - 1$ for the second sensitive rule, $\lceil n/2 \rceil - 2$ for the third sensitive rule, and so on. Insertion starts with the first transaction supporting the sensitive rule, and only a limited number of dummy transactions are inserted: one dummy transaction is inserted for each real transaction in which some values have been changed to unknown, until the limit for the rule is reached. For example, if a sensitive rule has a degree of support of 5, and therefore 5 supporting transactions, but only three dummy transactions can be inserted for it, insertion begins from the first main transaction and the last two transactions receive no dummy transactions. Formula 3 gives the total number of dummy transactions that can be inserted, where n is the number of sensitive rules. The insertion process begins from the first transaction that supports the first sensitive rule. A dummy transaction that includes unknown values and is similar to the main transaction is inserted. In the dummy transaction, the unknown values are displaced among the attributes of the sensitive rule, which prevents the repetition of transactions. Since inserting dummy transactions and displacing unknown values is difficult for sensitive rules with a small number of attributes, the sensitive rules are sorted in descending order of their number of attributes, and dummy transactions are inserted in the same order. Pseudo code for the proposed solution is shown in Figure 1. The following shows some examples of the performance of the proposed solution.
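The decreasing insertion schedule described above can be sketched in a few lines. The sketch assumes that the per-rule limit starts at the ceiling of n/2, with n the number of sensitive rules, and that rules for which the limit has dropped to zero receive no dummy transactions; both points are readings of the text rather than confirmed details. The function names and the transaction identifiers are hypothetical.

```python
import math


def dummy_schedule(num_sensitive_rules):
    """Dummy-transaction limit for each sensitive rule, in sorted order:
    ceil(n/2), ceil(n/2) - 1, ceil(n/2) - 2, ..., never below zero."""
    top = math.ceil(num_sensitive_rules / 2)
    return [max(top - i, 0) for i in range(num_sensitive_rules)]


def transactions_receiving_dummies(supporting_ids, limit):
    """Insertion starts from the first supporting transaction; transactions
    beyond the limit receive no dummy transaction."""
    return supporting_ids[:limit]


schedule = dummy_schedule(5)        # limits for five sorted sensitive rules
single = dummy_schedule(1)          # one sensitive rule, as in the worked example
total = sum(schedule)               # total dummy transactions under this schedule

# A rule with five supporting transactions (hypothetical ids) and a limit of three.
chosen = transactions_receiving_dummies([1, 2, 3, 4, 5], 3)
```

Under this reading, roughly half of the sensitive rules receive no dummy transactions at all, which matches the statement that insertion is performed only for a number of sensitive rules.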

Examples of performance of the proposed algorithm
Table 1 shows the main database. Every transaction in this table has 5 attributes, and Approval result is the class attribute. To mine the classification rules, the RIPPER algorithm was used, which is available in the WEKA 3.7 software as weka.classifiers.rules.JRip. All parameters of the JRip algorithm are left in default mode except the last one, Use Pruning, which has been changed from True to False. Table 2 shows the mined rules. Rule No. 2 is considered to be the sensitive rule. Since only one rule is selected as sensitive, rule sorting is not performed. The transactions that support the sensitive rule are selected by scanning the initial database. In this example, transactions No. 6 and No. 14 support the sensitive rule. The blocking process is then performed according to Formula 2, which gives X = 2: two of the rule's 3 attributes, one of which is the class attribute, are blocked. According to Formula 3, one dummy transaction is inserted for the sensitive rule. This dummy transaction, in which unknown values are substituted, is similar to transaction No. 6. The sensitive rule is now hidden. After running the JRip algorithm again with the same parameters, the only rule mined is rule No. 1. Table 3 shows the sanitized database.

Comparative evaluation of the proposed algorithm
To evaluate the performance of the algorithm, both the proposed BCR algorithm and the ROD algorithm [13] were run on three real databases from the UCI repository. Table 4 shows the specifications of the three databases. Classification rules were mined using the RIPPER algorithm, and two and three sensitive rules were randomly selected. Three factors, namely hiding failure, lost rules, and ghost rules, were considered to evaluate the algorithms. Hiding failure is the number of sensitive rules that can still be mined from the database after the sanitization process. Lost rules is the number of insensitive rules that can no longer be mined from the database after the sanitization process. Ghost rules is the number of rules that do not exist in the initial database but are mined from the sanitized database. The results of running the proposed algorithm are shown in Figures 1, 2 and 3. As the diagrams show, the proposed algorithm hides the sensitive rules with the minimum possible side-effects. Selecting a limited number of attributes for the hiding process causes fewer changes in the database and, consequently, decreases the number of lost rules and the generation rate of ghost rules. Sorting the rules by their number of attributes and inserting dummy transactions in a hierarchical manner, while leaving some sensitive rules free of dummy transactions, makes the blocking process harder to detect and lowers the generation rate of ghost rules.
The above results show that in the ROD algorithm the number of side effects increases as the number of sensitive rules increases, whereas this is not the case for the proposed algorithm. Although, as illustrated in Figure 3, inserting transactions has resulted in ghost rules, the proposed algorithm still shows better results than ROD. This is due to the limits defined for the insertion of dummy transactions.
Because the insertion of dummy transactions is decreasing and hierarchical, the production of ghost rules is kept under control. By making changes in all the attributes that support the non-sensitive rules, the ROD algorithm produces more ghost rules than the proposed method. This problem gets even worse when the sensitive and non-sensitive rules have the same number of attributes. The more attributes the sensitive rules have, the better the results of the proposed algorithm compared to ROD, and the proposed method is also more efficient in maintaining data quality. In general, hiding sensitive rules requires changing only a limited number of attributes, and making more changes only results in more side effects, as mentioned above. Therefore, the results of the proposed algorithm are better than those of ROD in all the cases mentioned above. Another factor contributing to the better results of the proposed method is the difference in how the rules are sorted. Since ROD sorts the rules by their support, the sanitization of data may face problems if there are no non-sensitive rules with lower support.
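The three evaluation factors used in this section reduce to simple set operations once the rule sets mined before and after sanitization are available. The sketch below assumes that rules can be compared for equality as strings; the function name `side_effects` and the rule labels are hypothetical.

```python
def side_effects(original_rules, sensitive_rules, sanitized_rules):
    """Count the three side-effects of a sanitization run.

    original_rules  : rules mined from the initial database
    sensitive_rules : subset of original_rules that had to be hidden
    sanitized_rules : rules mined from the sanitized database
    """
    original = set(original_rules)
    sensitive = set(sensitive_rules)
    sanitized = set(sanitized_rules)
    insensitive = original - sensitive
    return {
        # sensitive rules that can still be mined after sanitization
        "hiding_failure": len(sensitive & sanitized),
        # insensitive rules that can no longer be mined
        "lost_rules": len(insensitive - sanitized),
        # rules that did not exist before sanitization but are mined after it
        "ghost_rules": len(sanitized - original),
    }


# Hypothetical rule labels, only for illustration.
result = side_effects(
    original_rules=["r1", "r2", "r3"],
    sensitive_rules=["r2"],
    sanitized_rules=["r1", "r4"],
)
```

In this toy run the sensitive rule `r2` is hidden, the insensitive rule `r3` is lost, and `r4` appears as a ghost rule.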

The limitation of the proposed algorithm
One of the limitations of the proposed algorithm is its time complexity on large databases. In other words, because transactions are inserted, the larger the database, the more costly the sanitization becomes in the proposed method. Further, because the attributes to be blocked are chosen randomly, if a chosen attribute is present in many different non-sensitive rules, the number of side effects such as ghost rules or lost rules will increase.

Conclusion and future works
The present study introduced an algorithm based on the blocking approach for hiding sensitive classification rules. The proposed solution performs the hiding process as follows. The sensitive rules are sorted in descending order of their number of attributes, and hiding starts from the sensitive rule with the highest number of attributes. To hide a sensitive rule, in addition to the class attribute, the values of a limited number of attributes on the left side of the rule are randomly substituted with the "?" value. The proposed method also inserts some dummy transactions containing unknown values for a certain number of sensitive rules in order to disturb the database. To control the side effects caused by inserting transactions, the insertion is done hierarchically and in descending order, and only for a limited number of sensitive rules and transactions. All the results show that the proposed method is free of hiding failure, and that in the number of lost rules and ghost rules it is more efficient than the algorithm introduced in [15]. Because the proposed algorithm performs the sanitization with fewer adverse effects, it is more successful in maintaining data quality. The proposed solution is capable of further development and can be used with other classification methods such as the nearest neighbor classifier and the decision tree.
As future work, the way the victim attribute is chosen could be changed so that it is based on the non-sensitive rules. This could be combined with the insertion of transactions, so that the insertion is also carried out according to the non-sensitive rules.

Table 2: Classification rules mined from the initial database.
No. | Classification rules
1 | (gender = female) and (years at current work = short) => approval result=NO (3.0/0.0)
2 | (years at current work = medium) and (black list = yes) => approval result=NO (

Figure 3: Comparative evaluation in Car database.

Table 3: The sanitized database.

Table 4: Specifications of databases.