The Architecture for Optimizing Semantic Topical Coherence in Gujarati Text Topic Modeling

Information retrieval has attracted considerable attention from the research community. A large body of work addresses text summarization, document classification, query-based document retrieval, text mining, web content analysis, web page classification, and more. Ample research has already been conducted for English, as it is a trade language across the world. However, progress in Gujarati text information retrieval has not followed this tendency. An enormous amount of Gujarati text is available under various titles, and terabytes of text are being made available in the form of newspapers on websites. In this paper, LDA is applied to Gujarati text for modeling topics. The Latent Dirichlet Allocation model uncovers the concealed topics in a collection of documents and gives a better view of what a large collection of text is composed of. This paper aims mainly at two objectives: 1) it targets a corpus of Gujarati-language news articles for topic inference, and 2) it optimizes topic coherence by incorporating a set of relevant words. The document set consists of news articles from a Gujarati newspaper published daily across the state of Gujarat in India.

The probabilistic topic model has received prominent attention from researchers over the last decade. A vast amount of research has already been carried out, and more is in progress. This trend of research interest can be justified, as information and data have been increasing drastically. The topic model can be applied to text summarization, dimensionality reduction, document classification, and information retrieval [21]. It can be used not only for text data processing and analysis but also for image classification and image processing, audio data analysis, and video data. Text analysis tasks such as document classification, document summarization, and dimensionality reduction may present several challenges when the size of the dataset becomes very large [6]. It would not be easy to infer useful patterns or knowledge without such inference algorithms [10]. Ample techniques and methods have been devised for text analysis. This work focuses mainly on two approaches for searching or locating a particular text or document in a large collection: first, keyword searching, and second, theme- or topic-based techniques. The topic-based approach may give the additional benefit of hierarchical searching [6]. The method applied here is a statistical inference method for inferring the topical structure of corpora [11]. There is a wide range of applicability for probabilistic inference methods. Given the variety of domains available, it can be interesting and challenging to discover the major subjects of a collection. Researchers have worked on the scientific domain [17], the health domain [14], patient medical records [9], the relatedness of authors and their research papers [8] [16], short text such as tweets (BTM), and much more. For text data analysis, researchers have also shown interest in multilingual environments [13] [2] [19] [21]. In the domain of text summarization and text mining, the work presented in [4] [12] summarizes large collections of documents. Research has also been done on big data analysis [1] [5] and on sentiment and opinion mining [3]. Another very interesting research wing explores the hierarchical thematic structure of data [20] [7]. In this paper, we model Gujarati text to discover the topical structure of the corpus. We consider a corpus of news articles from a daily newspaper published in the Gujarati language. The paper is organized as follows. Section II introduces topic modeling and its various techniques. Section III emphasizes Gujarati text topic modeling, which is explained further through Gibbs sampling with an example. The architecture and detailed workflow are explained in Section IV. Section V shows the experiments and results. The paper ends with a conclusion.

Topic Modeling
Topic models are built on top of a collection of documents; documents are considered the building blocks of the topic model [11] [10]. There are mainly two assumptions for modeling topics: 1) documents are a mixture of topics, and 2) topics, in turn, are a mixture of words. More specifically, a topic is a probability distribution over words, and a document is a probability distribution over topics. LDA is categorized as a generative probabilistic algorithm. It describes a simple probabilistic method by which a document can be generated. It assumes that a document is a bag of words, where the order of the words does not make any difference. Based on this assumption, in the first step a topic is sampled from the distribution over topics; in the second step, a word is drawn from that topic's distribution over words. The process of inferring topics from a set of documents is the reverse of this [10]. When a document collection is fed to the algorithm for modeling, at the end of the process one obtains K topics, each with a set of words and the probabilities of those words. It is assumed that the corpus is composed of a fixed number K of topics; however, one may select different values of K [15].
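The two-step generative story above can be made concrete with a small sketch that samples a toy document from assumed distributions. All sizes, the vocabulary, and the hyperparameter values below are illustrative, not taken from the paper's corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): K topics, V vocabulary words.
K, V, doc_len = 3, 8, 10
vocab = [f"word{i}" for i in range(V)]

# Assume we are handed a document's topic mixture theta (length K) and
# per-topic word distributions phi (K x V); here they are Dirichlet draws.
alpha, beta = 0.1, 0.01
theta = rng.dirichlet([alpha] * K)          # document-topic proportions
phi = rng.dirichlet([beta] * V, size=K)     # topic-word distributions

document = []
for _ in range(doc_len):
    z = rng.choice(K, p=theta)              # step 1: sample a topic
    w = rng.choice(V, p=phi[z])             # step 2: sample a word from that topic
    document.append(vocab[w])

print(document)
```

Topic inference reverses this process: given only the generated words, it recovers plausible theta and phi.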

A. Latent Dirichlet Allocation
The LDA generative process describes how a data set arises from given topic and word distributions. However, we already have the data, and the topics are to be inferred; therefore, the generative story alone does not work here. The process is about discovering the hidden structure of the data, namely the topics. To achieve this objective, a Bayesian network is used. A Bayesian network is a type of probabilistic graphical model that infers the relationships among random variables. To understand the working of LDA, we start with the assumption that there are M documents in the corpus and N words in each document. The vocabulary consists of V distinct words, and there are K hidden topics in the corpus.
A topic distribution θ is sampled for each document, so there are as many θ as there are documents in the corpus. Each θ is a K-dimensional vector, and each dimension represents the proportion of a topic in that specific document. Similarly, there exists a distribution ϕ over words for each topic; therefore, there are K such ϕ. Each ϕ is a V-dimensional vector where each dimension represents the probability of encountering a word in that particular topic. There are the actual words w in the document, and a topic assignment Z for each word in each document, giving M × N assignments in total. Algorithm 1 shows the LDA generative process.
The topic assignment depends on the topic distribution of the document, and the observed word depends on the topic assignment and on all K topics. These dependencies define LDA [11]; they are set in the statistical assumptions behind the generative process, in the particular mathematical form of the joint distribution.
The generative process of LDA deals with both observed and hidden variables. The hidden variables are the topics, the document-topic distribution, and the topic-word distribution; the words of the documents in the corpus are the only observed variables in LDA. The hidden variables are to be inferred from the observed variables: the conditional distribution is calculated from the joint distribution of the observed and hidden variables [6] [14] [10]:

p(θ, ϕ, z | w) = p(θ, ϕ, z, w) / p(w)   (2.1)

The posterior distribution can be obtained through the prior distribution of the hidden variables, as the Dirichlet distribution is the conjugate prior of the multinomial distribution.

B. Gibbs Sampling
Several approaches can be applied to compute the posterior distribution; variational inference and Markov Chain Monte Carlo (MCMC) are the two main approaches for topical inference. In [18], Blei used variational inference for discovering the correlation of topical structure. There are two frequently used MCMC methods, known as Gibbs sampling and the Metropolis-Hastings algorithm. Gibbs sampling has been applied to discover the relationship between authors and their research interests [8], the topic structure of patient records [9], and topics in scientific research articles [17]. In [22], researchers showed the effect of word embeddings in a topic model with the Metropolis-Hastings method. Here, the Gibbs sampling method is applied for the inference:

P(z_i = j | z_{-i}, w) ∝ (C^{WT}_{w_i,j} + β) / (Σ_w C^{WT}_{w,j} + Vβ) × (C^{DT}_{d_i,j} + α) / (Σ_t C^{DT}_{d_i,t} + Kα)   (2.4)
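A minimal collapsed Gibbs sampler built around the count matrices C^{WT} and C^{DT} of Equation (2.4) can be sketched as follows. The toy corpus, sizes, and hyperparameter values are illustrative; the per-document denominator is constant across topics for a given token, so it is dropped from the unnormalized score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy corpus: documents as lists of word ids; sizes are illustrative only.
docs = [[0, 1, 2, 0], [2, 3, 3, 1], [0, 0, 1, 3]]
K, V = 2, 4
alpha, beta = 0.5, 0.1

# Random initial topic assignment z for every token, plus the two count
# matrices from Eq. (2.4): C_WT (word-topic) and C_DT (document-topic).
z = [[int(rng.integers(K)) for _ in d] for d in docs]
C_WT = np.zeros((V, K))
C_DT = np.zeros((len(docs), K))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        C_WT[w, z[d][i]] += 1
        C_DT[d, z[d][i]] += 1

for sweep in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            C_WT[w, t] -= 1           # exclude the current token's counts,
            C_DT[d, t] -= 1           # as Eq. (2.4) requires
            p = ((C_WT[w] + beta) / (C_WT.sum(axis=0) + V * beta)
                 * (C_DT[d] + alpha))  # unnormalized Eq. (2.4)
            p /= p.sum()               # normalize over topics
            t = int(rng.choice(K, p=p))
            z[d][i] = t
            C_WT[w, t] += 1
            C_DT[d, t] += 1

# Point estimate of the topic-word distributions from the final counts.
phi = (C_WT + beta) / (C_WT.sum(axis=0) + V * beta)
print(phi.T)
```

After enough sweeps, the assignments z approximate samples from the posterior, and phi estimates the topic-word distributions.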

Gujarati Text Topic Modeling
Gujarati is the official language of Gujarat, a state in western India. It is also an official language of the union territories of Daman and Diu and Dadra and Nagar Haveli. Gujarati is a member of the Indo-Aryan branch of the Indo-European language family. According to the Central Intelligence Agency (CIA), 4.5% of the Indian population (1.21 billion according to the 2011 census) speaks Gujarati, which amounts to 54.6 million speakers in India. There are about 65.5 million speakers of Gujarati worldwide, making it the 26th-most-spoken native language in the world. Gujarati was the first language of Mahatma Gandhi, the Father of the Nation of India. Gujarati is one of the 22 official languages and 14 regional languages of India. It is the medium of everyday communication in the Indian state of Gujarat and is used in education, government, business, and the media. The language is widely spoken in expatriate Gujarati communities in the UK and the US, which have their own Gujarati newspapers, magazines, and radio and television programs. Information in Gujarati is accessible to people through newspapers, magazines, and websites. Across India, more than 21 newspapers are published on a daily basis, with Gujarat Samachar, Sandesh, Divya Bhaskar, and Nav Gujarat Samay as major components; alongside the physical copy, the papers are also made available online. More than 50 magazines serve information and knowledge to the wider Gujarati-language community, spanning areas such as politics, agriculture, science, sports, entertainment, spirituality, children's interests, and women's interests. These magazines are published weekly, bi-weekly, or monthly by various publishers. Table I shows the details of daily published Gujarati newspapers. Despite such a large number of speakers and users of the Gujarati language, research in machine learning and information retrieval has not targeted Gujarati text remarkably; most research work in information retrieval is found in the English-language domain. To shift our research task to the Gujarati language, there are two possible approaches. One is to keep English as a pivot language while using parallel corpora in multi-language text analysis; the second is to process the original Gujarati text and apply information retrieval techniques in a mono-language context [13]. The first approach allows us to find to what extent the corpora are similar to each other; it also helps find equability across texts or domains, for instance, how topics span news articles published in Gujarati and English during a similar period of time. The second approach may be applied to analyze the text corpora semantically; in addition, the Gujarati text analysis would not be influenced by the English language, as it would be processed independently. With respect to the above, some challenges are faced in the development of the Gujarati information retrieval task.
1) How can a thematic structure be found in a large collection of Gujarati text documents? 2) How can the alignment of that thematic structure with another corpus of the same nature and duration be found?

The Architecture of the Proposed Method

An architecture for Gujarati text topic modeling is depicted in Figure 2. It has several components, including the generative process of Latent Dirichlet Allocation. The details of each component are shown in the workflow, which indicates how each component interacts with the components directly above and below it. Though preprocessing seems a very common task in information retrieval and text processing, in this work it has a remarkable influence. Since the corpus is Gujarati text, preprocessing must deal not only with Gujarati characters and digits but also with English characters and digits; this is due to the inherent characteristics of the corpus under study. It has been observed that the majority of the Gujarati newspaper articles contain a few English digits, which have to be identified and eliminated before the modeling. In addition, other languages such as Hindi have sometimes been found interleaved in the newspaper, although the articles are composed of Gujarati characters at large. In [23], the authors propose several approaches to increase the interpretability of the topic structure. In this paper, the core part of the architecture is a synonyms database for the Gujarati language, which plays a crucial part once topics are inferred. As an outcome of the inference process, LDA clusters relevant words, and such a cluster is said to be a topic. The most probable words appear at the top, as the words are ordered by their probability of occurrence in that topic. In general, the top N words, rather than all words, are the conceptual representation of the whole cluster. This means that low-probability words are not really part of the topic, or it can be stated that they may be better suited to another topic. Instead of leaving the inferred topics as they are, the low-probability words can be exchanged with other low-probability words in another topic of the same model. This can be achieved using the synonyms word set, which is composed of two headings: one heading contains the word, and the other contains the N most relevant synonyms. First, the method picks the most probable words from a topic and searches for their synonyms in the database. Once a few words are located, it searches for each of them in the other topics; if a word is found, it is exchanged with the selected word of the selected topic. At the end of this processing, each topic may emerge with higher coherence and more meaningful top words than before. Conversely, the task of exchanging words among topics may become more complex with respect to time.
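The exchange procedure described above can be sketched as follows. The synonym table, topic contents, and probability values are hypothetical stand-ins for the Gujarati synonyms database and the inferred topics, chosen only to illustrate the mechanism.

```python
# Hypothetical synonym table: each entry maps a word to its relevant words.
synonyms = {"film": {"movie", "cinema"}, "tax": {"levy", "duty"}}

# Hypothetical topics (word -> probability); the last word of each topic
# has very low probability and plausibly belongs to the other topic.
topics = {
    "t1": {"actor": 0.20, "film": 0.15, "levy": 0.001},
    "t2": {"tax": 0.18, "service": 0.12, "cinema": 0.002},
}

def related(a, b, synonyms):
    # Two words are related if either lists the other as a synonym.
    return b in synonyms.get(a, set()) or a in synonyms.get(b, set())

def reassign_low_prob_words(topics, synonyms, low=0.005):
    """Move each low-probability word to a topic whose high-probability
    words contain one of its synonyms (the exchange step of the
    architecture). Returns the moves that were made."""
    moves = {}
    for src, words in topics.items():
        for w, p in list(words.items()):
            if p >= low:
                continue
            for dst, other in topics.items():
                if dst == src:
                    continue
                if any(related(w, top, synonyms)
                       for top, q in other.items() if q >= low):
                    moves[w] = (src, dst)
    for w, (src, dst) in moves.items():
        topics[dst][w] = topics[src].pop(w)
    return moves

print(reassign_low_prob_words(topics, synonyms))
# → {'levy': ('t1', 't2'), 'cinema': ('t2', 't1')}
```

After the call, 'levy' sits with 'tax' and 'cinema' with 'film', which is the kind of coherence gain the architecture aims for.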

A. The Workflow of the Proposed Method
The workflow of the model is depicted in Figure 3. The preprocessing is done in five sequential steps. It was found that the articles contained both English and Gujarati digits; all tokens composed of digits, or with digits interleaved with characters, were removed first. Care was also taken to remove special symbols, since special symbols and digits generally do not contribute to the topic information. In addition, single-letter words were eliminated for a similar reason. Once the corpus is smooth and processable enough for modeling, topics can be inferred according to the LDA inference process. The topics are then taken for reassignment processing on the basis of the synonym set of Gujarati words; this process takes place as explained above. The topic coherence is measured before and after reassignment, which reveals how much the coherence has improved. For the topic coherence measure, the normalized PMI (NPMI) method is used (reference); many other methods could be used as alternatives for measuring topic coherence [24] [25].
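The NPMI coherence measure mentioned above can be sketched as below: it averages normalized pointwise mutual information over all pairs of a topic's top words, estimating probabilities from document-level co-occurrence. The toy document set is an assumption for illustration.

```python
import math
from itertools import combinations

# Toy document set (word-set per document); illustrative only.
docs = [{"movie", "actor", "song"}, {"movie", "actor"},
        {"tax", "service"}, {"tax", "service", "movie"}]

def npmi_coherence(top_words, docs, eps=1e-12):
    """Average NPMI over all pairs of a topic's top words, using
    document co-occurrence frequencies as probability estimates."""
    n = len(docs)

    def p(*ws):
        # fraction of documents containing all the given words
        return sum(all(w in d for w in ws) for d in docs) / n

    scores = []
    for a, b in combinations(top_words, 2):
        pa, pb, pab = p(a), p(b), p(a, b)
        if pab == 0:
            scores.append(-1.0)           # never co-occur: minimum NPMI
            continue
        pmi = math.log(pab / (pa * pb))
        scores.append(pmi / -math.log(pab + eps))   # normalize to [-1, 1]
    return sum(scores) / len(scores)

print(round(npmi_coherence(["movie", "actor"], docs), 3))
# → 0.415
```

A coherence score closer to 1 means the topic's top words tend to occur together; comparing the score before and after the reassignment step quantifies the improvement.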

Experiments and Results
This section presents the experiments carried out on the corpus. The corpus consists of news articles extracted from an online news portal. First, all images were removed from the corpus, leaving a corpus containing text only.
A. Preprocessing Steps

Preprocessing was done to convert the corpus into a machine-processable form. First, it was required to get rid of the images in the articles; we then applied standard techniques for shaping the data. All the data was tokenized document by document, so that each document in the corpus became purely a bag of words. Once tokenized, we removed all punctuation symbols and stop words, since they do not provide any fruitful information in the topic inference process. After stop words were cleaned from the whole corpus, it was observed that the least frequent words could also be removed for a better result; only words occurring more than 50 times across the corpus were retained. As inference works iteration by iteration, word by word, for each document in the corpus, words then move to a specific topic instead of spreading across many topics. Finally, the corpus was made free of digits; the documents contained not only Gujarati digits but also English digits, and both were removed. The same word may still be found in more than one topic, but it is also common for a word to occur in topic A with high probability and in another topic with very low probability; this is the reason for removing the least frequent words with some lower bound.
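The preprocessing steps above (tokenization; removal of punctuation, digits in either script, stop words, and single-letter tokens; and a corpus-level frequency cut-off) might be sketched as follows. The toy corpus, stop-word list, and the `min_count` threshold are illustrative; the paper's threshold is 50 occurrences.

```python
import re
from collections import Counter

# Toy corpus; the real input would be Gujarati news articles.
raw_docs = ["movie actor 2023 ! movie song a",
            "tax service ૧૨૩ tax , b"]
stop_words = {"a", "b"}

def preprocess(raw_docs, stop_words, min_count=2):
    # Matches any digit (Python's \d already covers Gujarati ૦-૯ in
    # Unicode mode; the explicit range is kept for clarity) or any
    # non-word character (punctuation, special symbols).
    digit_or_symbol = re.compile(r"[\d૦-૯\W]")
    docs = []
    for text in raw_docs:
        toks = [t for t in text.split()
                if not digit_or_symbol.search(t)   # drop digits/symbols
                and len(t) > 1                     # drop single letters
                and t not in stop_words]           # drop stop words
        docs.append(toks)
    # Keep only words frequent enough across the whole corpus.
    counts = Counter(w for d in docs for w in d)
    return [[w for w in d if counts[w] >= min_count] for d in docs]

print(preprocess(raw_docs, stop_words))
# → [['movie', 'movie'], ['tax', 'tax']]
```

The frequency cut-off is applied corpus-wide, not per document, matching the retention rule described above.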

B. Experiments and Results
Table II depicts the topics inferred for the Gujarati news corpus. Only 8 topics are displayed, and for each topic the top 12 probable words were selected as the representation of that particular topic. As LDA does not provide automatic labeling of topics, labels have to be assigned manually. However, LDA does not insist on having a label for each topic, because a topic is considered a probability distribution over words; a meaningful label simply makes the topic easier to interpret. For example, topic 1 can be assigned a label such as 'ચલચચત્ર' (The movie) or 'મનોરં જન' (The entertainment). Similarly, topic 2, topic 6, and topic 7 could be labeled 'કર અને સે વાઓ' (Tax and Services), 'માચિતી' (Information), and 'પ્રદુ ષણ' (Pollution), respectively. Though all topics represent a specific theme, there are some cases in which a mismatch can be found. For example, the word 'સોફ્ટવે ર' (software) does not fit the topic 'ચલચચત્ર' (The movie); it fits the topic 'માચિતી' (Information) better. By the same token, the last word under topic 3 seems more suitable for topic 4.
It can be stated that if words are exchanged according to their semantic relationship with the words of other topics, interpretability increases automatically. It is easily identifiable that the last words of topic 2 and topic 3 are exchangeable for better coherence and interpretability. As shown in the architecture, the basis for exchanging words is a central repository of relevant words, to which each exchange may refer. We can define our own exchange criteria, and they can be influenced by the functional domain; for example, we can consider words having a probability of less than 0.005 for removal from the current topic and allocation to a new topic. Besides this, we present the concept of synonyms of words to be used during the reshuffling of topic allocations. As depicted in Figure 4, we choose each low-probability word w_i from each topic k_i and find the semantically relevant words W_s for that particular word. Afterwards, we find the coherence of each such relevant word with each of the other remaining topics k_j. We also define a threshold value for the coherence of relevance to match with a topic: if the coherence value of a relevant word satisfies the threshold, then the word is included in topic k_j; otherwise, w_i is dropped from consideration for a different topic.

Conclusion
The architecture presented here increases the topical coherence of the topics inferred by Latent Dirichlet Allocation. The workflow was shown to detail the working and interaction of the architectural components. In addition, Gujarati-language text was experimented on for the analysis of its thematic structure, and the results showed a few example topics for Gujarati text. The main goal of the paper is to uplift the topic coherence of the inferred topics, and this can be achieved by incorporating some features of the language under study. To be specific, the concept of synonyms and the high- and low-probability words are considered for the improvements. In the experimental outcome, it was observed that very low-probability words can be exchanged for better interpretability of a topic; however, this step refers to the relevant-word dataset.

Figure 1: Plate notation for the LDA generative process. C^{WT} and C^{DT} are matrices of counts with dimensions W × T and D × T respectively: C^{WT} contains the number of times word w is assigned to topic j, not including the current instance i, and C^{DT} contains the number of times topic j is assigned to some word token in document d, not including the current instance i. Note that Equation (2.4) gives the unnormalized probability; the probability of assigning a word token to topic j is obtained by normalizing this quantity over all topics.

Algorithm 1: The LDA generative process.
Input: dataset, K topics, hyperparameters α and β
Output: topic files, topic-word distribution, document-topic distribution
For all topics k ∈ [1, K] do
    Sample mixture components ϕ_k ~ Dir(β)
End
For all documents m ∈ [1, M] do
    Sample mixture proportion θ_m ~ Dir(α)
    Sample document length N_m ~ Poisson(ξ)
    For all words n ∈ [1, N_m] in document m do
        Sample topic index z_{m,n} ~ Mult(θ_m)
        Sample term w_{m,n} ~ Mult(ϕ_{z_{m,n}})
    End
End

Figure 4: Word swapping for topic coherence optimization.

Table 1: Circulation of leading Gujarati newspapers.