Linguistic Characterization of Natural Data by Applying Intermediate Quantifiers on Fuzzy Association Rules

The paper aims at applying fuzzy natural logic together with the Fuzzy GUHA method to analyse and linguistically characterise scientific data. Fuzzy GUHA is a tool for extracting linguistic association rules from data. The obtained associations are IF-THEN rules composed of evaluative linguistic expressions, which allow quantities to be characterized with vague linguistic terms such as “very small”, “big”, “medium”, etc. Fuzzy GUHA originally provides several numerical indices of rule quality, which may not be easily understandable for domain experts who are not familiar with GUHA association rules. Therefore, we show in this paper that the theory of intermediate quantifiers (a constituent of fuzzy natural logic) can be applied to the results automatically in order to obtain a natural linguistic summarization. We also present an idea of how the theory of generalized Aristotle's syllogisms can be used for a detailed data analysis, and we open the possibility of using fuzzy partial logic for cases where some data are missing or undefined.


Introduction
The main goal of this paper is to put together theoretical results on intermediate quantifiers, which were proposed in several papers (see e.g. [1,2,3,4]), with the Fuzzy GUHA method [5], and to introduce a linguistic characterization of natural data using generalized intermediate quantifiers.
Production and hosting by ISPACS GmbH.
The theory of intermediate quantifiers was introduced by Novák in [3] and is now a constituent of the theory of Fuzzy Natural Logic (FNL), which is a mathematical counterpart of the concept of Natural Logic introduced by Lakoff [6]. This theory is based on Łukasiewicz fuzzy type theory (Ł-FTT) [4], which is one of the existing higher-order fuzzy logics.
Fuzzy GUHA is a special method for the automated search of association rules in numerical data. Generally, the obtained associations are of the form A ∼ B, which means that the occurrence of A is associated with the occurrence of B, where A and B are formulae created from objects' attributes. As proposed by Hájek et al. [5], the original GUHA method allowed only Boolean attributes to be involved. Some parts of their approach were independently re-invented by Agrawal [7] many years later and are also known as the mining of association rules or market basket analysis. A detailed account of the GUHA method, which finds distinct statistically justified associations between attributes of given objects, is given in the book [8]. Fuzzy GUHA is an extension of the classical GUHA method to fuzzy data.
In this paper, we work with associations in the form of IF-THEN rules composed of evaluative linguistic expressions, which allow quantities to be characterized with vague linguistic terms such as "very small", "big", "medium", etc. To measure the interestingness of a rule, many numerical characteristics or indices have been proposed (see [9,10] for an overview). As a supplement to them, we utilize the theory of intermediate quantifiers to characterize the intensity of an association, which allows us to use linguistic characterizations such as "Almost all", "Most", "Some", or "A few". As a result, we may automatically obtain sentences such as the following from numerical bio-statistical data:
• Almost all people who suffer from atopic tetter and live in an area affected by heavy industry and smoke suffer from asthma.
• Most people who smoke and suffer from respiratory diseases also suffer from ischemic disease of leg.
In practice, it is often the case that some data are not available, e.g. due to measurement errors, missing results, or because the respondent is not willing to answer or has no opinion on the given subject. We can completely remove the cases with missing values to obtain clean data, but this can result in an excessive loss of information. Alternatively, we can handle missing values by using fuzzy partial logics, which were proposed by Běhounek and Novák in [11]. They provide a formal apparatus for several types of missing information, such as an "unknown" or an "undefined" (i.e. not meaningful) value. Basically, the semantics of these logics, formed by algebras of truth values, is extended by a special value "*".
The remaining part of the paper is structured as follows. The next section provides a brief insight into the methods of fuzzy natural logic (FNL) and the Fuzzy GUHA method. The main section of this paper is Section 3, where we show how the theory of intermediate quantifiers, which is one of the three theories of fuzzy natural logic, can be applied together with the Fuzzy GUHA method to the analysis of natural data. At the end of the paper, we propose an idea for future work, which will be based on the application of the theory of syllogistic reasoning and the theory of the generalized Aristotle's square of opposition.

Preliminaries
This section presents a short review of special methods of fuzzy natural logic and recalls the Fuzzy GUHA method, which finds distinct associations between attributes of given objects. By a fuzzy set, we mean a function $A : U \to [0, 1]$, where U is a universe and [0, 1] is the support set of some standard algebra of truth values. The set of all fuzzy sets over U is denoted by $\mathcal{F}(U)$. If A is a fuzzy set in U, we will write $A \underset{\sim}{\subset} U$.

Fuzzy natural logic
Fuzzy natural logic (FNL) is designed by means of the formal tools of the fuzzy type theory (FTT), which was thoroughly elaborated in [4]. The main goal of FNL is to create a mathematical model of the specific human thinking that uses natural language. Thus, FNL contains a model of the semantics of natural language as well. FNL is a formal mathematical theory which includes three theories:
• The theory of evaluative linguistic expressions [12];
• The theory of fuzzy IF-THEN rules and approximate reasoning [13,14];
• The formal theory of intermediate quantifiers, generalized syllogisms and the generalized square of opposition [1,15].
FNL has a high potential for applications. In [16], we can find an application of fuzzy natural logic and the fuzzy transform to analyse, forecast and linguistically characterise time series. In [17], linguistic associations are used to drive an ensemble of forecasters, and in [18] to adjust flood predictions. Some of the methods are available as the software package "lfl" (Linguistic Fuzzy Logic) [19] for the open-source R statistical environment [20]. The approach of this paper puts together the results of several papers ([21,22,23,24,25,26]). The primary objective of this paper is to combine the GUHA method, the theory of evaluative linguistic expressions and the theory of intermediate quantifiers, and to present the information acquired from natural data in sentences of natural language.

The theory of evaluative linguistic expressions
In what follows, we assume a special formal theory $T^{Ev}$ in a language $J^{Ev}$ of Ł-FTT, where $T^{Ev}$ formalizes the meaning of evaluative linguistic expressions. These are expressions of natural language such as very small, medium, very big, very short, more or less deep, or quite roughly strong. This theory is a special theory of higher-order fuzzy logic, which was introduced in [12], and it is based on the standard Łukasiewicz MV-algebra of truth values, where ⊗ is the operation of Łukasiewicz conjunction and → is the operation of Łukasiewicz implication.
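The two operations named above have well-known closed forms on [0, 1], namely a ⊗ b = max(0, a + b − 1) and a → b = min(1, 1 − a + b). The following is a minimal sketch of them; the function names are ours and serve only for illustration.

```python
def luk_conjunction(a: float, b: float) -> float:
    """Lukasiewicz conjunction: a (x) b = max(0, a + b - 1)."""
    return max(0.0, a + b - 1.0)


def luk_implication(a: float, b: float) -> float:
    """Lukasiewicz implication (residuum): a -> b = min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)


def luk_negation(a: float) -> float:
    """Negation derived from the implication: not a = a -> 0 = 1 - a."""
    return luk_implication(a, 0.0)
```

The implication is the residuum of the conjunction, i.e. a ⊗ b ≤ c holds exactly when a ≤ (b → c).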
The meaning of an evaluative linguistic expression is constructed using a special formula representing its intension.
The model interprets the intension as a function from a set of possible worlds (in our theory, we prefer to speak of contexts) into a set of fuzzy sets. In every context, the intension determines the corresponding extension, which is a fuzzy set in a specific universe shaped by a horizon that can be moved along the universe. A context in $T^{Ev}$ is understood as a formula $w_{\alpha o}$, whose interpretation is a function $w : [0, 1] \to M_\alpha$. The theory of intermediate quantifiers works only with abstract expressions (e.g. "very small", which does not specify what is actually small). Therefore, such expressions have only a single (abstract) context; as a result, their intension in fact coincides with their extension. The meaning of evaluative expressions is obtained as an interpretation of special formulas in a model of $T^{Ev}$. The core construct consists of three horizons. The left horizon LH is interpreted by a function assigning to each $z_o$ the truth degree of its fuzzy equality with ⊥; similar constructs apply to the right horizon RH and the middle horizon MH.
A ⟨linguistic hedge⟩ represents a class of adverbial modifications that includes the intensifying adverbs such as "very", "roughly", "approximately", or "significantly". The following special linguistic hedges are introduced:
• narrowing hedges, for example, "extremely", "significantly", "very";
• widening hedges, for example, "more or less", "roughly", "very roughly".
We recall that, for example, "very small" is more precise than "small", which is more precise than "roughly small". We introduce the following special linguistic hedges: {Ex, Si, Ve, ML, Ro, QR, VR} (extremely, significantly, very, more or less, roughly, quite roughly, very roughly, respectively), which are ordered as follows:
$$Ex \preceq Si \preceq Ve \preceq \boldsymbol{\nu} \preceq ML \preceq Ro \preceq QR \preceq VR.$$
By $\preceq$ we denote the partial ordering of the hedges; it can be found in [12, page 23]. Here $\boldsymbol{\nu}$ is the empty hedge. Note that the hedges Ex, Si, Ve have a narrowing effect and ML, Ro, QR, VR have a widening effect with respect to the empty hedge. A special role in our theory is played by the formulas $Sm\,\boldsymbol{\Delta}$, $Me\,\boldsymbol{\Delta}$, $Bi\,\boldsymbol{\Delta}$, where the connective $\boldsymbol{\Delta}$ is used as a specific hedge that can be read as the linguistic hedge "utmost" (or, alternatively, "limit"). This construct makes it possible to include classical quantifiers in our theory without the need to introduce them as special cases that differ from the rest of the theory. The interpretation of the special formulas of $T^{Ev}$ in the canonical model and the construction of the extensions of evaluative expressions are shown schematically in Figure 1.

Definition of intermediate quantifiers
Intermediate quantifiers are linguistic expressions such as most, many, almost all, a few, a large part of, etc. The classical theory of generalized quantifiers categorises our intermediate quantifiers as quantifiers of type ⟨1, 1⟩ ([27,28,29]) that are isomorphism-invariant (cf. [30,31,32]). They have the extension property and are conservative. The formal theory of intermediate quantifiers, which uses the fuzzy type theory (a higher-order fuzzy logic), was introduced in [3]. Other authors, such as Hájek, Pereira and others ([33,34,35]), proposed alternative mathematical models of some of these quantifiers. The fundamental idea is that intermediate quantifiers are just the classical quantifiers ∀ or ∃ taken over a modified universe of quantification. The modification is obtained by means of the theory of evaluative linguistic expressions introduced in the previous subsection. This idea is captured by the following definition.
Definition 2.1. An intermediate quantifier of type ⟨1, 1⟩ interpreting the sentence "⟨Quantifier⟩ B's are A" is a formula of one of two forms, based on ∀ or ∃, respectively, in which x represents elements and z, A, B are interpreted as fuzzy sets. The term (µB)z represents a measure of the fuzzy set z w.r.t. B, and Ev is an evaluative expression.
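For orientation, the two forms referred to in Definition 2.1 are usually written in the literature on intermediate quantifiers [3] as shown below. This display is a reconstruction from the symbols named in the definition, not a verbatim copy, and should be checked against [3]; here & denotes the strong (Łukasiewicz) conjunction and $\boldsymbol{\Delta}$ the connective mentioned earlier.

```latex
\begin{align*}
(Q^{\forall}_{Ev}\,x)(B, A) &:= (\exists z)\bigl((\boldsymbol{\Delta}(z \subseteq B) \,\&\, (\forall x)(z\,x \Rightarrow A\,x)) \wedge Ev((\mu B)z)\bigr),\\
(Q^{\exists}_{Ev}\,x)(B, A) &:= (\exists z)\bigl((\boldsymbol{\Delta}(z \subseteq B) \,\&\, (\exists x)(z\,x \wedge A\,x)) \wedge Ev((\mu B)z)\bigr).
\end{align*}
```

Read informally, the first formula says that there is a fuzzy set z contained in B such that every element of z has the property A, and that the size of z relative to B is characterized by the evaluative expression Ev.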
To explain the meaning of this definition, note that the size of the fuzzy set z with respect to B is evaluated by Ev. In a finite model, the measure (µB)z can be defined from the cardinalities $|z| = \sum_{u \in M_\alpha} z(u)$ and $|B| = \sum_{u \in M_\alpha} B(u)$. In this text, we restrict ourselves to the quantifiers "almost all", "most", and "many".
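A minimal sketch of such a measure in a finite model follows. It assumes the relative cardinality |z| / |B| for a fuzzy set z contained in B, the value 1 for an empty B, and 0 when z is not contained in B; these conventions and all names in the code are assumptions made for illustration, since the exact formula is not reproduced in this text.

```python
def cardinality(fuzzy_set: dict) -> float:
    """Sigma-count |A|: the sum of all membership degrees."""
    return sum(fuzzy_set.values())


def is_fuzzy_subset(z: dict, b: dict) -> bool:
    """z is contained in b if z(u) <= b(u) for every element u."""
    return all(degree <= b.get(u, 0.0) for u, degree in z.items())


def measure(b: dict, z: dict) -> float:
    """Assumed measure of z w.r.t. B in a finite model."""
    card_b = cardinality(b)
    if card_b == 0:
        return 1.0  # assumed convention for an empty B
    if not is_fuzzy_subset(z, b):
        return 0.0  # assumed convention when z is not contained in B
    return cardinality(z) / card_b
```

For example, with B = {o1: 1.0, o2: 0.5, o3: 0.5} and z = {o1: 1.0, o2: 0.5}, the sketch gives |z| / |B| = 1.5 / 2.0 = 0.75.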

Fuzzy GUHA
The classical GUHA method [5] deals with data in the form of Table 1, where $o_1, \ldots, o_n$ denote objects, $X_1, \ldots, X_m$ denote independent Boolean attributes, and Z denotes a dependent (explained) Boolean attribute. Finally, the symbols $a_{ij}$ (or $a_i$) ∈ {0, 1} indicate whether an object $o_i$ has the attribute $X_j$ (or Z) or not. The GUHA method aims at searching for associations of the form
$$A(X_1, \ldots, X_p) \simeq B(Z),$$
where A, B are predicates containing only the AND connective and $X_1, \ldots, X_p$, for p ≤ m, are all the variables occurring in A. The predicates A and B are called the antecedent and the consequent, respectively. A detailed description of the algorithm for searching for linguistic associations in numerical data can be found e.g. in Section 4 of [36].
The relationship between the antecedent and the consequent is described by the so-called quantifier ≃. There exist many quantifiers that define the validity of an association in data [8]. For instance, the so-called implicational quantifier is defined as true if the confidence and the support of the rule, computed from the counts a and b below, meet the thresholds γ and r, where γ ∈ [0, 1] is a user-specified degree of confidence and r ∈ [0, 1] is a degree of support. Here a is the number of objects in the data for which both A and B hold, and b is the number of objects for which A holds and B does not ("not B"). The objective of this paper is to replace the classical GUHA implicational quantifier with some of the generalized intermediate quantifiers. Namely, we restrict ourselves to the quantifiers "Almost all" (P), "Most" (T), and "Many" (K); see Definition 2.2. The algorithm for finding the truth value of the quantifier uses the truth degrees of both the antecedent and the consequent evaluated on each object $o_i$ and searches for a suitable z from Definition 2.2 by using a specific optimization technique, whose description is going to be published elsewhere. Contrary to the classical GUHA or fuzzy GUHA quantifiers, the intermediate quantifiers result in a non-crisp truth value from the interval [0, 1].
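The classical implicational quantifier can be illustrated by the following sketch on Boolean data. It assumes that confidence is a / (a + b), that support is a / n with n the number of objects, and that both thresholds are compared with a non-strict inequality; these choices and the function names are our assumptions for illustration and may differ in detail from the definition in [8].

```python
def fourfold_counts(antecedent, consequent):
    """Return (a, b): a counts objects with A and B, b counts A and not B."""
    a = sum(1 for x, y in zip(antecedent, consequent) if x and y)
    b = sum(1 for x, y in zip(antecedent, consequent) if x and not y)
    return a, b


def implicational_quantifier(antecedent, consequent, gamma, r):
    """True if the assumed confidence and support meet gamma and r."""
    n = len(antecedent)
    a, b = fourfold_counts(antecedent, consequent)
    if n == 0 or a + b == 0:
        return False  # the antecedent never holds, so nothing is supported
    confidence = a / (a + b)
    support = a / n
    return confidence >= gamma and support >= r
```

For instance, with eight objects of which four satisfy A and three of these also satisfy B, we get a = 3 and b = 1, hence a confidence of 0.75 and a support of 0.375.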

The Experiment on Real Data
We illustrate the use of our proposal on the famous Edgar Anderson's Iris data set [37,38]. It is a multivariate data set that quantifies the morphological variation of iris flowers of three related species. It contains four numeric columns with sepal and petal length and width, and a categorical column with the iris species (setosa, versicolor, virginica). The analysis was conducted as follows. First, the numeric columns were transformed into membership degrees of fuzzy sets. From each numeric column, 5 fuzzy sets were extracted that model the following linguistic expressions: small, very small, medium, big, and very big. (E.g. the original numerical column "Sepal.Length" was used as a basis for the following fuzzy sets: "Sm.Sepal.Length", "VeSm.Sepal.Length", "Me.Sepal.Length", "Bi.Sepal.Length", and "VeBi.Sepal.Length".) On the transformed data set, fuzzy GUHA from the "lfl" package [19] of the R statistical environment [20] was executed. We searched for all rules with at most four predicates in the antecedent. A detailed analysis of the iris data set is not the purpose of this paper; therefore, we provide only a few examples of the obtained rules, which also illustrate the subsequent use of intermediate quantifiers. Among others, the following rule was obtained from the data set:
Sm.Sepal.Length & Sm.Petal.Length ⇒ Sm.Petal.Width, γ = 0.995, r = 0.132,
which may be interpreted as: "If an object has both sepal and petal of small length, then it has a petal of small width, with support 0.132 and confidence 0.995." In order to provide a linguistic characterization of the intensity of the rule, we then evaluated the membership degrees of the three intermediate quantifiers "Almost all", "Most", and "Many". The above-mentioned rule can thus be interpreted as: "Almost all irises with both sepal and petal of small length have a small petal width."
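The first steps of this process can be sketched as follows. The sketch is only an illustration under stated assumptions: the piecewise-linear shape of "small" is a simple stand-in for the evaluative expressions used in "lfl", the minimum is assumed as the conjunction in the antecedent, and the function names `ramp_down`, `fuzzify_small` and `support_confidence` are hypothetical helpers, not part of the "lfl" package.

```python
def ramp_down(x: float, lo: float, hi: float) -> float:
    """Membership 1 up to lo, 0 from hi on, linear in between."""
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def fuzzify_small(column):
    """Assumed model of 'small': decreasing on the lower half of the range."""
    lo, hi = min(column), max(column)
    mid = (lo + hi) / 2.0
    return [ramp_down(x, lo, mid) for x in column]


def support_confidence(antecedents, consequent):
    """Fuzzy support and confidence with minimum as the assumed conjunction."""
    n = len(consequent)
    # Truth degree of the whole antecedent on each object.
    ant = [min(values) for values in zip(*antecedents)]
    both = [min(a, c) for a, c in zip(ant, consequent)]
    support = sum(both) / n
    confidence = sum(both) / sum(ant) if sum(ant) > 0 else 0.0
    return support, confidence
```

With two fuzzified columns, e.g. `sm_sl = fuzzify_small(sepal_length)` and `sm_pw = fuzzify_small(petal_width)`, the call `support_confidence([sm_sl], sm_pw)` returns the pair of indices for the rule "Sm.Sepal.Length ⇒ Sm.Petal.Width" under the assumptions above.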
For other rules, see Table 2. As can be seen, the strictest quantifier (among the three that were evaluated) is "Almost all". It corresponds roughly to a confidence of around 0.98, although in general there is no functional dependency between the confidence and the truth degree of the quantifier. (The investigation of how the truth degree of the quantifiers relates to the confidence or to other interest measures is left for future work.) Slightly weaker is the quantifier "Most", whose truth degree was still 1 even when the truth degree of "Almost all" decreased to 0.941. The weakest is the quantifier "Many", which has a truth degree of 0.962 even for a rule where the other two quantifiers decreased below 0.23.
The example shows that fuzzy GUHA association rules may be interpreted using intermediate quantifiers, which provide a linguistic description of the intensity of the found relationship. If missing information is present in the data, we can follow the principle of possibility theory and evaluate the quantifiers in the most pessimistic and in the most optimistic case, which may result in an interval of possible membership degrees.
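The pessimistic and optimistic evaluation can be sketched as follows. Since the evaluation of the quantifiers themselves is not described in this text, the sketch uses the fuzzy confidence (with minimum as an assumed conjunction) as a stand-in, and it treats only missing values in the consequent, where replacing a missing degree by 0 and by 1 yields the lower and upper bound. All names are hypothetical.

```python
def confidence(antecedent, consequent):
    """Fuzzy confidence with minimum as the assumed conjunction."""
    both = sum(min(a, c) for a, c in zip(antecedent, consequent))
    total = sum(antecedent)
    return both / total if total > 0 else 0.0


def confidence_interval(antecedent, consequent):
    """Bounds of the confidence when None marks a missing consequent degree."""
    # Pessimistic case: every missing degree is taken as 0.
    pessimistic = [0.0 if c is None else c for c in consequent]
    # Optimistic case: every missing degree is taken as 1.
    optimistic = [1.0 if c is None else c for c in consequent]
    return confidence(antecedent, pessimistic), confidence(antecedent, optimistic)
```

The bounds are valid because the confidence is non-decreasing in each consequent degree; missing antecedent degrees would need a separate treatment, as the confidence is not monotone in them.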
Future work using syllogistic reasoning
In the previous section, we showed that the theory of intermediate quantifiers can be applied to the interpretation of natural data. Similarly, we can analyze natural data from medicine and obtain natural-language expressions such as the following:
• Most people who suffer from atopic tetter and live in an area affected by heavy industry and smoke suffer from asthma.
• Most people who smoke and suffer from respiratory diseases also suffer from ischemic disease of leg.
A very interesting question is: is it possible to infer other new results from linguistic expressions which were derived before?The idea is to use the theory of generalized Aristotle's syllogisms and the theory of generalized Aristotle's square of opposition, which were syntactically and semantically studied in our previous papers (see [1,15,2]).
Using the theory of generalized Aristotle's syllogisms leads to the use of special kinds of consequent conjunction/disjunction syllogisms with two or more premisses, which were studied in [39] and later syntactically and semantically verified in [2].Recall that all the syllogisms from our previous papers are proved on a syntactical level, and hence, by completeness of the fuzzy type theory, the syllogisms are valid in all models.
By syllogistic reasoning, we can infer new results as follows:
Most people who smoke and suffer from respiratory diseases suffer from asthma.
Most people who smoke and suffer from respiratory diseases suffer from ischemic disease of leg.
Some people who smoke and suffer from respiratory diseases suffer from asthma or ischemic disease of leg.
A practical application of syllogistic reasoning may also lie in reducing the size of the set of rules obtained from data. If a rule can be derived from the others, then it is very likely not so important or interesting, because it does not carry any new knowledge. On the other hand, if we obtain from data a rule that is more intense than the one that can be derived from the other rules, it may indicate that such a rule carries some surprising and thus potentially useful knowledge.

Conclusion
In this paper, we have applied a standard data-mining method, namely the Fuzzy GUHA method, together with the theory of intermediate quantifiers, which is one of the three main theories of fuzzy natural logic. We have found linguistic associations in the form of IF-THEN rules composed of evaluative linguistic expressions. The theory of intermediate quantifiers was successfully applied to the resulting rules in order to provide a linguistic description of the intensity of the relationship captured by the rules. We have shown a practical example on a real data set. The main idea for the future is to apply the theory of generalized Aristotle's syllogisms and the theory of the generalized Aristotle's square of opposition, which make it possible to infer new information from the results found before. Alternatively, we may use the syllogisms to prune the generated rule set by removing rules that can be derived from the others and are therefore potentially less useful. Moreover, we would like to extend our approach to data with missing values by applying fuzzy partial logics [11].

Figure 1: Scheme of the construction of the extensions of evaluative expressions. Every extension is a fuzzy set obtained by composing one of the horizons (LH, MH, RH) with the function $\nu_{a,b,c}$ interpreting the hedge $\boldsymbol{\nu}$ in $M^0$ (in the figure, it is turned 90° counterclockwise).
Scheme (2.5): the size of z is evaluated by Ev. Using a specific evaluative linguistic expression, we obtain a definition of a concrete intermediate quantifier. The formula $Bi\,Ex$ signifies that the fuzzy set z is "extremely big" w.r.t. B, the formula $Bi\,Ve$ signifies that z is "very big" w.r.t. B and, finally, $\neg(Sm\,\boldsymbol{\nu})$ means that z is "not small" w.r.t. B.

Table 2: Example of obtained rules together with the degrees of support and confidence and the membership degrees of the quantifiers "Almost all", "Most", and "Many".
Columns: rule, support, confidence, Almost all, Most, Many
Sm.Sepal.Length & Sm.Petal.Length ⇒ Sm.Petal.