The effect of reducing input variables on the modeling of Tehran-Saveh freeway accidents

Iran is among the countries where the accident rate, driven by inattention to safety rules and related factors, has been rising steadily. Given the demonstrated ability of neural network models to predict accidents, the main purpose of this paper is to use principal component analysis (PCA) to identify the factors influencing the modeling of accidents on the Tehran-Saveh freeway, one of the country's most crowded freeways. To judge how well the independent variables suit the PCA model, the KMO measure is used; variables such as traffic volume, petrol rationing, and environmental conditions serve as independent variables affecting the dependent variable, the accident rate. In addition to comparing the performance of an artificial neural network model against a natural-logarithm regression model of freeway accidents, the study compares the accuracy of both models before and after removing the less important variables. The results indicate that removing the minor factors identified by PCA does not fundamentally change the accuracy of the fitted models, and that average daily traffic volume and average vehicle speed play the most important role in accidents on this freeway.


Introduction
Injuries and heavy losses make road accidents a dangerous phenomenon from every perspective: in health terms they threaten public health and hygiene; socially they devastate families; culturally they destroy the families that raise and guide society; politically they undermine the credibility of governments; and economically they squander resources, including human resources. There is therefore an urgent need to understand this deteriorating situation and to take appropriate action to resolve it.
Preventing the growing waste caused by road accidents, and reducing their number and severity, deserves the same attention, budget, and resources currently devoted to other issues of human health. About 16,000 people worldwide die every day as a result of injuries. Injuries account for 12% of the burden of disease and are the third most common cause of death for ages 1-40. These injuries occur mainly in road accidents; according to the World Health Organization, about 25% of all injury deaths are road deaths. Between 1995 and 2005, the highest number of road deaths occurred in the 16-20 age group.

Neural models have proved useful for forecasting in many fields, and their potential for a variety of applications has been established. Studies show that incorporating statistical concepts into the process of building neural models can improve model performance. A systematic approach to developing neural models is therefore necessary, taking into account factors such as data pre-processing, selection of model inputs, appropriate network architecture, optimization of internal parameters, and model verification. Neural networks are a valuable tool across a wide range of management areas and, as a vital component of most data-mining systems, change the way organizations look at the relationship between their data and their strategy. Previous neural network studies report good performance in accident prediction modeling. For example, Akgüngör and Doğan used neural network and nonlinear regression models to predict the number of accidents and traffic fatalities on Turkish roads and concluded that the neural network model gave better results than the nonlinear regression model
[2]. Abdel-Aty et al. used a probabilistic neural network to predict crashes on a freeway corridor in Orlando and showed that at least 70% of accidents could be predicted correctly by the model [1]. Chang [3] modeled freeway accidents in Taiwan with artificial neural networks and negative binomial models and, after comparing the efficiency of the two, concluded that artificial neural network models are a viable alternative for analyzing freeway accidents. In recent years, accident prediction models have been built with many different methods, but applying these models is not easy when the number of variables is very large.
Principal component analysis can reduce this problem by removing the less important factors from the modeling process, and the present study is based on this approach. Unfortunately, Iran is among the countries with the highest number of deaths and injuries caused by accidents. Studies show that at present more than 25,000 people are killed and more than 100,000 injured in vehicle accidents in our country every year, with annual financial losses of about 4 billion dollars; compared with many countries, Iran is in a very worrying situation [15]. Iran has in recent years become one of the crisis centers: several recent World Bank studies officially consider the traffic safety situation in Iran critical. Despite this, traffic safety studies in the country have received little attention. Principal component analysis, a method frequently used to evaluate a group of correlated variables associated with one or more fields, can reduce this concern. Its most important applications include the analysis of multiple indicators, the measurement and recognition of complex structures, index construction, and data reduction. It is especially useful when the size and composition of a structure are not entirely clear [9]. This study evaluates the effect of reducing the factors affecting Tehran-Saveh freeway accidents on the accuracy of accident prediction models using this method. For this freeway, variables such as average daily traffic, average vehicle speed, percentage of heavy vehicles, month of the accident, petrol rationing, and weather conditions at the time of the accident (coded as warm versus cold seasons) were taken as independent variables, and the number of accidents per
month was studied as the dependent variable and model output. The research focuses on reducing the model's inputs through principal component analysis; its benefits include reducing the time needed for data collection and for building the artificial neural network model. Before applying principal component analysis, the feasibility of the method is checked by calculating the KMO measure. To determine the less important factors, the traffic data of this freeway were fitted with neural network models and a natural-logarithm regression model, and the fits were compared before and after removing the minor variables. Finally, the performance of the artificial neural network model is compared with that of the natural-logarithm model.

Principal component analysis
Principal component analysis (PCA) is a statistical technique that selects, from the primary factors, a smaller number of principal components, so that the less important data are discarded. The method is one of the most valuable results of linear algebra and is used widely, from all forms of neural network analysis to computer graphics, because it offers a simple way to extract the relevant information from a complex dataset. In this method, correlated variables in a multi-dimensional space are summarized into a set of uncorrelated components, each of which is a linear combination of the variables. These uncorrelated components are called the principal components (PCs) and are obtained from the eigenvectors of the covariance or correlation matrix of the variables. In general, PCA is used to reduce the number of variables and to find a meaningful correlation structure among the variables, which is in effect a classification of the variables. Its main advantage is eliminating the collinearity caused by the large number of variables affecting the model. When the KMO test of the data is less than 0.5, the data are not suitable for principal factor analysis; if it is between 0.5 and 0.69, analysis should proceed with caution; and if it is greater than 0.7, the correlations between the data are suitable for proper analysis (Table 1) [10]. With this method, the p primary variables x_1, x_2, ..., x_p are combined to create at most p independent components PC_1, PC_2, ..., PC_p [11]. Each component is given by equation (2.1):

PC_i = a_i1 x_1 + a_i2 x_2 + ... + a_ip x_p    (2.1)

where PC_i is the desired component, a_ij is the coefficient of the j-th primary variable, and x_j is a primary variable. The coefficients a_ij are estimated so that the first component accounts for the maximum variance, the second component accounts for the maximum variance not explained by the first, and the process continues until the last component, which includes the remaining variance.
Principal component analysis is carried out in the following steps:

Step 1: Standardization of the input variables. The input data are standardized so that each variable has mean zero and standard deviation one. The matrix Z of standardized values is obtained from equation (2.3) [5]:

z_ij = (x_ij − x̄_j) / s_j    (2.3)

where x_ij is a data value, x̄_j is the mean of variable j, and s_j is its standard deviation.
Step 2: Calculation of the KMO factor. The KMO index ranges from zero to one and is obtained from equation (2.4):

KMO = Σ_{i≠j} r_ij² / (Σ_{i≠j} r_ij² + Σ_{i≠j} a_ij²)    (2.4)

where r_ij is the simple correlation coefficient between variables i and j and a_ij is the partial correlation coefficient between them. If the sum of the squared partial correlation coefficients between every pair of variables is small compared with the sum of the squared simple correlations, the KMO is close to one, suggesting that the correlations between pairs of variables can be explained by the other variables.
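As a sketch, the KMO of equation (2.4) can be computed from the simple correlations and from the partial correlations obtained via the inverse correlation matrix; the data below are random placeholders, not the freeway data.

```python
import numpy as np

def kmo(X):
    """Overall Kaiser-Meyer-Olkin measure, eq. (2.4).

    X: (n_samples, n_vars) data matrix. Returns a value in [0, 1].
    """
    R = np.corrcoef(X, rowvar=False)           # simple correlations r_ij
    R_inv = np.linalg.inv(R)
    # partial correlations a_ij from the inverse correlation matrix
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    A = -R_inv / d
    np.fill_diagonal(A, 0.0)                   # exclude i == j terms
    np.fill_diagonal(R, 0.0)
    r2, a2 = (R ** 2).sum(), (A ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(0)
# three columns sharing one common factor, plus two pure-noise columns
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(200, 3)),
               rng.normal(size=(200, 2))])
score = kmo(X)
```

With a strong common factor the simple correlations dominate the partials, so the score lands well above the 0.5 suitability threshold of Table 1.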
Step 3: Calculation of the correlation (covariance) matrix of the primary variables. This matrix shows the correlation between each pair of the basic variables used. Each element r_ij represents the correlation between variables i and j and is obtained from equation (2.5).
Step 4: Calculation of the eigenvalues and eigenvectors of the correlation matrix. Solving equations (2.6) and (2.7) yields the eigenvalues and the eigenvector of each eigenvalue. The eigenvector obtained for each eigenvalue supplies the coefficients of the primary variables in the corresponding component. From equation (2.6), where I is the identity matrix, the eigenvalues λ are calculated:

|R − λI| = 0    (2.6)
(R − λI) v = 0    (2.7)
Step 5: Criteria for extracting the number of factors. The eigenvalue criterion, the percentage-of-variance criterion [12], and the cut-off test [7] are used to decide how many of the extracted factors are the most important and enter the modeling process.
Under the eigenvalue criterion, each factor's eigenvalue equals the sum of the squared loadings of the variables on that factor and represents the percentage of the variance of the correlation matrix explained by that factor; the larger the eigenvalue, the more variance the factor explains. The number of factors is therefore determined from the eigenvalues of the factors, and factors whose eigenvalue is greater than one are considered significant.
This criterion is considered reliable when the number of variables is between 20 and 50; if the number of variables is less than 20 it should be applied conservatively, and if it is greater than 50 it tends to extract too many factors. Under the percentage-of-variance criterion, the decision is based on the percentage of variance explained, and the factors with a high explained-variance percentage are extracted; variables whose communality is below 50% have too low a share and are deleted.
Under the cut-off test, factors are retained as long as their common (shared) variance exceeds their specific variance; once the specific variance dominates, no further factors are extracted as significant.
Step 6: Application of a suitable rotation to the matrix of component coefficients. This step is known as principal factor analysis. At this stage, the variables with high coefficients in the derived principal components are selected as the important variables for inclusion in the model [16].
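Steps 1, 3, 4, and 5 above can be sketched with plain linear algebra (the KMO check and the rotation are omitted here). The data are random placeholders; the 0.92 variance threshold is an illustrative choice echoing the percentage-of-variance criterion.

```python
import numpy as np

def pca_steps(X, var_threshold=0.92):
    """PCA on the correlation matrix, following steps 1, 3, 4, and 5."""
    # Step 1: standardize to mean 0, standard deviation 1 (eq. 2.3)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Step 3: correlation matrix of the primary variables
    R = np.corrcoef(Z, rowvar=False)
    # Step 4: eigenvalues/eigenvectors (eqs. 2.6-2.7); eigh since R is symmetric
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep components until the explained-variance threshold is met
    explained = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    scores = Z @ eigvecs[:, :k]                # component scores
    return eigvals, eigvecs, explained, k, scores

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 6))                  # placeholder data, 6 variables
eigvals, eigvecs, explained, k, scores = pca_steps(X)
```

Because the analysis is run on the correlation matrix, the eigenvalues sum to the number of variables, which makes the eigenvalue-greater-than-one criterion easy to apply as well.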
Accident prediction models and their evaluation criteria

Neural Network Model
An artificial neural network is a model that tries to simulate the learning process of the human brain. To design a network and judge its quality, a meaningful performance function must be defined. The commonly used measure of performance is the sum of squared errors over all training patterns, where Y_p is the network output vector and T_p is the target of pattern p; the error of the i-th output unit on the p-th pattern contributes one term to this sum. The network learns by adjusting its weights. Learning in these models, which determines their internal parameters, is based on an error-correction rule that minimizes the mean squared error, so that the relationship between the inputs and the targets is learned. The mean squared error is calculated with formula (3.9):

MSE = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²    (3.9)

where N is the number of output-layer neurons, matching the number of target observations, y_i is the observed value for the i-th record, and ŷ_i is the output value for the i-th record [18]. Neural networks can determine the relationship between the inputs and outputs of a physical system through a network of interconnected nodes [14].

Dividing the input data into separate parts for training and validation during network construction is common [6], [4]. With this method, called stop training (early stopping), the user can apply more complex network architectures without overfitting the problem, because training is stopped by a validation criterion. The reference architecture plays an important role in the training method [8]. In this study, the trainlm learning algorithm together with stop training was used to train the neural network that predicts the number of accidents.
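As a rough illustration of stop training with an MSE performance function, the sketch below trains a tiny one-hidden-layer network with plain gradient descent. The hidden-layer size, learning rate, and synthetic data are all invented for the example, and gradient descent stands in for the Levenberg-Marquardt (trainlm) algorithm used in the paper.

```python
import numpy as np

def train_mlp(X, y, hidden=4, lr=0.05, epochs=2000, patience=50, seed=0):
    """One-hidden-layer MLP trained on MSE with early stopping (a sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    cut = int(0.85 * len(X))                  # hold out 15% for validation
    tr, va = idx[:cut], idx[cut:]
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = 0.0
    def forward(A):
        H = np.tanh(A @ W1 + b1)
        return H, H @ W2 + b2
    best, best_params, wait = np.inf, None, 0
    for _ in range(epochs):
        H, out = forward(X[tr])
        err = out - y[tr].reshape(-1, 1)
        # backpropagate the MSE gradient through both layers
        gW2 = H.T @ err / len(tr)
        gb2 = err.mean()
        dH = (err @ W2.T) * (1 - H ** 2)
        gW1 = X[tr].T @ dH / len(tr)
        gb1 = dH.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
        val_mse = np.mean((forward(X[va])[1].ravel() - y[va]) ** 2)
        if val_mse < best - 1e-6:
            best, wait = val_mse, 0
            best_params = (W1.copy(), b1.copy(), W2.copy(), b2)
        else:
            wait += 1
            if wait >= patience:              # stop: validation error no longer falls
                break
    W1, b1, W2, b2 = best_params
    return lambda A: (np.tanh(A @ W1 + b1) @ W2 + b2).ravel(), best

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=300)
predict, val_mse = train_mlp(X, y)
```

The weights saved at the lowest validation MSE are restored at the end, so a run that overfits late still returns the best early-stopped network.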

Natural-logarithm regression model
Regression measures how the value of one variable changes per unit change in another; the correlation between two variables indicates the dependence and its strength. Regression analysis expresses the nature of this relationship through regression equations (linear or nonlinear). Linear regression studies how a set of independent variables relates to a dependent variable; the task of regression is to explain the variance of the dependent variable, which is done partly through variance estimation, and in this type of regression the values of one variable are estimated from the values of two or more other variables.

The natural-logarithm regression model is based on the assumption that the natural logarithm of the response follows a normal distribution with mean μ and variance σ². It is used when the data are non-negative and the standard deviation of the positively distributed data is relatively large compared with the mean. In this model, the relationship between the expected number of accidents λ_i and the q predictor variables is given by equation (3.10), and in exponential form by equation (3.11):

ln(λ_i) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_q x_q    (3.10)
λ_i = e^{β_0} · e^{β_1 x_1} · e^{β_2 x_2} ⋯ e^{β_q x_q}    (3.11)

The model assumes that the logarithm of the number of accidents follows a normal distribution with mean μ and variance σ². The coefficients β_0, β_1, β_2, …, β_q are linear regression coefficients calculated by the method of least squares. The model is thus a classical multiple linear regression between the logarithm of the dependent variable and the q independent predictor variables.
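Equations (3.10) and (3.11) amount to ordinary least squares on the logarithm of the response. A minimal sketch on synthetic data (all coefficients and sample sizes invented for illustration):

```python
import numpy as np

# Generate positive responses from a known log-linear model, then recover
# the coefficients by least squares on ln(y) (eq. 3.10) and predict with
# the exponential form (eq. 3.11).
rng = np.random.default_rng(3)
X = rng.uniform(0.5, 2.0, size=(200, 2))
beta_true = np.array([0.4, 0.8, -0.5])        # b0, b1, b2 (made up)
y = np.exp(beta_true[0] + X @ beta_true[1:] + 0.05 * rng.normal(size=200))

A = np.column_stack([np.ones(len(X)), X])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
y_hat = np.exp(A @ beta)                      # eq. (3.11): always positive
```

Because the fit happens on the log scale, the back-transformed predictions are guaranteed non-negative, which matches the count nature of accident data.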

Evaluation criteria
One criterion for verifying the validity of the regression and neural network results is the correlation coefficient (R) or its square (R²). R varies between −1 and +1 and R² between 0 and +1; R² is calculated with formula (3.12):

R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²    (3.12)

where y_i is the observed dependent variable, ŷ_i is the dependent variable estimated from the independent variables, and ȳ = (1/n) Σ_i y_i is the mean of the observations. Values close to 1 indicate a better match between the observed data and the estimates. However, because R² depends on the range of the data, it should be used together with other measures; for this reason, the mean squared error was also used in this study to compare the efficiency of the neural network and natural-logarithm regression models in predicting the number of accidents.
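The R² of equation (3.12) and the MSE of equation (3.9) can be computed directly; the numbers below are illustrative only.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 per eq. (3.12): 1 - SSE/SST."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mse(y, y_hat):
    """Mean squared error per eq. (3.9)."""
    return np.mean((y - y_hat) ** 2)

y = np.array([3.0, 5.0, 7.0, 9.0])            # toy observed counts
y_hat = np.array([2.8, 5.1, 7.2, 8.9])        # toy model estimates
r2 = r_squared(y, y_hat)
```

A constant predictor equal to ȳ yields R² = 0, and a perfect predictor yields R² = 1, which is why MSE is the better tie-breaker between models with similar R².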

Study area and data collection
To survey the accident statistics and the affecting parameters, the 146-kilometer Tehran-Saveh freeway was chosen, using the database of the road transportation organization for the years 1386-1392 (2008-2013). The collected data include the average daily traffic (ADT), the monthly average speed, and the percentage of heavy trucks, obtained through traffic counters. The total number of accidents on this freeway during the different months of the year was taken as the dependent variable, and the independent variables for each month include ADT, average speed, percentage of heavy trucks, a season parameter (warm seasons of spring and summer coded zero, cold seasons of autumn and winter coded one), and the effect of petrol rationing (before its introduction coded zero, after it coded one). The variables and indicators used for modeling are shown in Table (2). Because the parameters entering the models do not all have the same dimensions, the data were normalized (standardized) to make them comparable [13]. There are many methods for normalizing input and output data, among them mapping the data into the interval [0, 1] or [−1, 1]. Assuming the underlying distribution is normal, equation (4.13) was used, where X is the primary value of a parameter and X_n is its normalized (standardized) value; the input and output parameters were normalized with this formula. The descriptive statistics of the real and normalized data are given in Table (3). In this study, after classification and mathematical treatment, the data taken from these sources are analyzed with the neural network and regression methods in order to reduce the input variables of accident models for the Tehran-Saveh freeway.
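Equation (4.13) is not legible in this copy; a common normalization that maps data into [0, 1], consistent with the description above, is min-max scaling. The sketch below uses invented placeholder columns (e.g. ADT and speed), not the freeway data.

```python
import numpy as np

def minmax(X):
    """Min-max scaling per column: Xn = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

# placeholder rows: [average daily traffic, average speed]
data = np.array([[12000.0, 70.0],
                 [18000.0, 85.0],
                 [25000.0, 95.0]])
norm = minmax(data)
```

After scaling, every column spans exactly [0, 1], so variables with very different units (vehicles per day versus km/h) contribute comparably to the models.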

Determining the affecting factors using principal component analysis
The preliminary variables for the Tehran-Saveh freeway are: the monthly number of accidents, average daily traffic, percentage of long (heavy) vehicles, average vehicle speed, month and season of the accident, and petrol rationing. These were converted to standardized values. The KMO for all the data is 0.432 (less than 0.5), so principal component analysis cannot be conducted and its results would not be credible. After omitting the month variable, the KMO rises to 0.53, which confirms the necessary correlation among the input variables for principal component analysis. The covariance matrix is calculated with formula (5.14), and the eigenvalues and eigenvectors with equations (5.15) and (5.16).
The eigenvalues λ_h of the data, the share of each principal component λ_h / Σλ_h, and the percentage of the information explained, (λ_h / Σλ_h) × 100, are given in Table (4). Table (5) shows the eigenvectors that form each component (5.17). To form the first component, the standardized value of ADT is multiplied by 0.52, the L.V. (long-vehicle) value by 0.53, the P.R. (petrol rationing) value by 0.802, and the N.A. (number of accidents) value by 0.401; the first principal component is thus specified, and the other components are obtained in the same way. As equation (5.17) shows, ADT and A.S. (average speed) have the highest coefficients in the formation of the first component, which indicates their stronger effect. The first four components explain more than 92 percent of the data variance, so under the percentage-of-variance criterion the first four components are chosen. To determine the principal variables in the accidents of this freeway, the principal factor analysis method is used: the principal variables are those whose coefficient in the formation of at least one of the retained factors is relatively high. The results are shown in Table (6). The threshold is set to 0.8, given the complexity of the accident data. Considering this threshold and Table (6), all variables except petrol rationing have a coefficient greater than 0.8 in the four retained factors. The petrol rationing (P.R.) variable is therefore the least important variable in this study; the reason is that this variable is already reflected in the daily traffic.

Prediction of accidents using natural-logarithm regression
To develop the logarithm regression, the Kolmogorov-Smirnov (K-S) test is used to check the distribution assumption. This nonparametric test compares the distribution of quantitative data with a theoretical distribution: it measures the agreement between the empirical distribution of the observed values and a specified theoretical distribution. If the difference is very large, the test shows that the data do not match that theoretical distribution. If the p-value is less than 5%, the null hypothesis is rejected, that is, the data cannot come from the specified distribution; if it is higher than 5%, the distributional assumption is retained.
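A minimal sketch of the one-sample K-S statistic against a fitted normal distribution, the quantity behind the p-value decision above. The samples are random placeholders; converting the statistic D into a p-value is omitted here.

```python
import numpy as np
from math import erf

def ks_statistic_normal(x):
    """K-S statistic of x against a normal fitted to x's mean and std.

    Returns D = max |F_empirical - F_normal| over the sample.
    """
    x = np.sort(x)
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    # normal CDF via the error function
    cdf = np.array([0.5 * (1.0 + erf((v - mu) / (sigma * np.sqrt(2.0))))
                    for v in x])
    ecdf_hi = np.arange(1, n + 1) / n          # empirical CDF just after x_i
    ecdf_lo = np.arange(0, n) / n              # empirical CDF just before x_i
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

rng = np.random.default_rng(4)
d_normal = ks_statistic_normal(rng.normal(5.0, 1.0, size=500))
d_exp = ks_statistic_normal(rng.exponential(1.0, size=500))
```

A truly normal sample yields a small D, while a skewed sample (like the exponential here) yields a much larger one and would be rejected at the 5% level.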
The β coefficients of the fitted regression model are given in Table (7). The significance of the coefficient of each variable is assessed with the t statistic, and the results are also shown in Table (7): average speed and traffic volume are the most important factors in the accidents on this freeway. These results are confirmed by other studies [17]. Hess and Polak, and Olmstead, make the same point: they report that speed cameras decrease accidents by 18% per month, and that upgrading observation and control further decreases accidents.

Prediction of accidents using an artificial neural network model
Since three input variables and one output variable are considered, the model has three neurons in the input layer and one in the output layer; the number of neurons in the hidden layer is determined by minimizing the mean squared error. In the first step, the data are divided into training (70%), validation (15%), and test (15%) sets. Because changing the number of hidden-layer neurons changes the network, the different network configurations were run and tested repeatedly on the training and validation series, and the MSE and R² of the final model over all the data are compared with those of the regression model.
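The 70/15/15 split described above can be sketched as follows; the 36-row matrix is a placeholder standing in for the monthly freeway records, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(size=(36, 4))               # placeholder: 36 months x 4 variables
idx = rng.permutation(len(data))              # shuffle before splitting
n_train = int(0.70 * len(data))               # 70% training
n_val = int(0.15 * len(data))                 # 15% validation
train = data[idx[:n_train]]
val = data[idx[n_train:n_train + n_val]]
test = data[idx[n_train + n_val:]]            # remaining 15% for testing
```

With only 36 monthly records, the validation and test sets are small (5 and 6 rows here), which is one reason the study compares several hidden-layer sizes rather than trusting a single split.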

The effect of reducing the variables
As mentioned previously, the season and petrol-rationing variables were identified by the principal component analysis as the unimportant variables to omit from the modeling process. The neural network model and the natural-logarithm regression model were both built on 36 months of Tehran-Saveh freeway data and examined in two situations: before and after the reduction of the variables. The accuracy of the fitted models is compared using R² and MSE; the results are shown in Table (9). They show that the variables omitted by the analysis do not have a considerable effect on the freeway accidents, and that omitting them does not change the accuracy of the fitted models. Reducing the input variables of the artificial neural network also reduces the number of training iterations needed to reach an optimal network and thus shortens the time needed to train the neural network.
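The before/after comparison can be sketched for the regression side: fit the log-linear model once with all predictors and once with the two least-important columns dropped, then compare R². The variable names and data below are illustrative stand-ins, not the paper's freeway records.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 72
ADT = rng.uniform(0.2, 1.0, n)        # normalized daily traffic (placeholder)
speed = rng.uniform(0.2, 1.0, n)      # normalized average speed (placeholder)
season = rng.integers(0, 2, n)        # 0 = warm, 1 = cold season
ration = rng.integers(0, 2, n)        # petrol rationing flag
# ln(accidents) depends only on ADT and speed in this synthetic setup
ln_y = 1.0 + 2.0 * ADT + 1.5 * speed + 0.05 * rng.normal(size=n)

def fit_r2(cols):
    """Least-squares fit of ln_y on the given columns; returns R^2."""
    A = np.column_stack([np.ones(n)] + cols)
    beta, *_ = np.linalg.lstsq(A, ln_y, rcond=None)
    resid = ln_y - A @ beta
    sst = (ln_y - ln_y.mean()) @ (ln_y - ln_y.mean())
    return 1.0 - (resid @ resid) / sst

r2_full = fit_r2([ADT, speed, season, ration])
r2_reduced = fit_r2([ADT, speed])
```

As in Table (9), the reduced model loses almost nothing: the extra variables raise R² only by the small amount any added regressor does in a nested least-squares fit.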

Conclusion
The results of the principal component analysis show that the season of the accident and petrol rationing are the least important variables for entering the modeling. The natural-logarithm regression and neural network models were used in two situations, before and after the reduction of the variables; comparing the models shows that keeping the less important variables in the modeling does not have a considerable effect on accuracy. To study the efficiency of the neural network model, its outputs were compared with those of the natural-logarithm regression; the comparison showed that the neural network model performs better than the natural-logarithm regression.

Table 6: Special vectors (eigenvectors) in the PCA

Upgrading observation and control decreases injury accidents by 13.26. The traffic management facilities in Arizona, in the United States, cause a 21 to 30 percent decrease in accidents; on Minnesota freeways, accidents decreased by almost 30%, about 900 accidents. Studies of freeway traffic-management facilities show that using advanced technology to announce the speed limit, lane-control signals, and location information can decrease the rate of accidents effectively.

Table 8 :
The comparison of the natural-logarithm regression model with the neural network model (for normalized data)

Table 9 :
The results of modeling before and after the reduction of the variables