Study on Structural Equation Modeling for Analyzing Data

Structural Equation Modeling (SEM) combines two separate statistical methods: factor analysis, developed in psychology and psychometrics, and simultaneous equation modeling, developed in econometrics. Factor analysis was first introduced by Galton in 1869 and by Pearson (Pearson and Lee, 1904). Spearman (1904) developed a general factor analysis model in research on the structure of mental abilities and stated that the intercorrelations between tests of mental ability can determine general ability factors and special ability factors. SEM joins factor analysis and path analysis into one comprehensive statistical method. Path analysis, the forerunner of structural equations, originated in Sewall Wright's research in the field of biometrics. Wright's contribution was to show that the correlations between variables are related to the parameters of a model described by a path diagram. SEM has two kinds of variables, namely latent variables (exogenous and endogenous) and indicator variables; two equation models, namely the measurement equation model and the structural equation model; and two kinds of errors, one for each equation model. In general, SEM is formed from the relationships between latent variables and their respective indicator variables. Confirmatory Factor Analysis (CFA) is used to test whether the indicator variables are valid indicators of the latent construct. Data analyzed with SEM must meet the SEM assumptions, and model feasibility is tested against the goodness of fit criteria. The stages in SEM analysis are theoretical model development, flowchart drawing, conversion of the flowchart into equation form, input matrix and model parameter estimation technique, model problem identification, evaluation of the model parameter estimates, and model interpretation and modification.


Introduction
Structural Equation Modeling (SEM) was first introduced by Jöreskog in 1970. It is a second-generation multivariate analysis technique that combines factor analysis and path analysis, allowing researchers to test and estimate simultaneously the relationships among latent variables, both exogenous and endogenous, together with their indicator variables. Several estimation methods are available for structural equation models. The most commonly used is the Maximum Likelihood (ML) method, whose estimators are asymptotically unbiased and have minimum variance (Ghozali, 2017).
SEM is a combination of factor analysis and path analysis into one comprehensive statistical method. Path analysis, the forerunner of structural equations, originated in Sewall Wright's research in 1918, 1921, 1934, and 1960 in the field of biometrics. Wright's contribution was to show that the correlations between variables are related to the parameters of a model described by a path diagram. Wright also stated that the resulting equation model can be used to estimate direct effects, indirect effects, and total effects. The first application of path analysis is statistically equivalent to the factor analysis developed by Spearman (Ghozali, 2017).
SEM is a multivariate analysis technique developed to overcome the limitations of earlier analytical models that have been widely used in statistical research (Sasongko et al., 2016).

A. Latent variables
Latent variables are abstract concepts, for example, people's behavior, attitudes, feelings, and motivations. Latent variables can only be observed imperfectly through their effects on the observed variables. SEM has two types of latent variables, namely endogenous latent variables and exogenous latent variables. Exogenous variables appear as independent variables in all equations in the model and are denoted by the Greek letter $\xi$ ("ksi"). Endogenous variables are dependent variables in at least one equation in the model and are denoted by the Greek letter $\eta$ ("eta") (Wijanto, 2008).

B. Indicator variables
Indicator variables, also called observed or measured variables, are variables that can be observed or measured empirically. An observed variable is an effect or measure of a latent variable. In a questionnaire survey, each question represents one observed variable, so a questionnaire with 50 questions yields 50 observed variables. Observed variables related to an exogenous latent variable ($\xi$) are labeled X, while those related to an endogenous latent variable ($\eta$) are labeled Y. In a path diagram, an observed variable is drawn as a square (Wijanto, 2008).

Models in SEM
There are 2 types of models in SEM, namely:

A. Structural Model
The structural model describes the relationships among latent variables. Parameters that relate an endogenous variable to an exogenous variable are denoted by $\gamma$ (gamma), and parameters that relate an endogenous variable to another endogenous variable are denoted by $\beta$ (beta). In SEM, the exogenous latent variables are free to covary with one another, and their covariance matrix is denoted by $\Phi$ (phi) (Wijanto, 2008).
The general structural model can be written as follows. Suppose the random vectors $\eta' = (\eta_1, \ldots, \eta_m)$ and $\xi' = (\xi_1, \ldots, \xi_n)$ are the endogenous and exogenous latent variables, respectively, forming a system of simultaneous linear equations:
$$\eta = B\eta + \Gamma\xi + \zeta \qquad (1)$$
where $B$ and $\Gamma$ are the coefficient matrices of the latent variables and $\zeta' = (\zeta_1, \ldots, \zeta_m)$ is the vector of errors in the structural equations. The element $\beta_{ij}$ of $B$ represents the effect of $\eta_j$ on $\eta_i$, and the element $\gamma_{ij}$ of $\Gamma$ represents the direct effect of $\xi_j$ on $\eta_i$. It is assumed that $\zeta$ is uncorrelated with $\xi$ and that $I - B$ is nonsingular (Jöreskog and Sörbom, 1989).
Because $I - B$ is nonsingular, the structural model can be rewritten by solving for $\eta$:
$$\eta = (I - B)^{-1}(\Gamma\xi + \zeta) \qquad (2)$$
with:
$\eta$ = vector of endogenous variables of size $m \times 1$
$\xi$ = vector of exogenous variables of size $n \times 1$
$B$ = coefficient matrix of the endogenous variables of size $m \times m$
$\Gamma$ = coefficient matrix of the exogenous variables of size $m \times n$
$\zeta$ = vector of errors in the structural equations of size $m \times 1$
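As an illustration of how the structural model yields direct, indirect, and total effects, the following minimal sketch assumes a hypothetical recursive model with two endogenous variables and one exogenous variable, written as $\eta = B\eta + \Gamma\xi + \zeta$. The coefficient values and the helper names (`matmul`, `inverse_2x2`) are invented for the example and do not come from the cited sources. The total effects of $\xi$ on $\eta$ are taken as $(I - B)^{-1}\Gamma$, and the indirect effects as the total effects minus the direct effects in $\Gamma$.

```python
# Hypothetical recursive structural model with m = 2 endogenous and n = 1
# exogenous latent variables: xi1 -> eta1 -> eta2, plus xi1 -> eta2.
# All coefficient values below are made up for illustration.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse_2x2(a):
    """Invert a 2 x 2 matrix; raises if the matrix is singular."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if det == 0:
        raise ValueError("(I - B) is singular; the reduced form does not exist")
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

B = [[0.0, 0.0],
     [0.5, 0.0]]        # beta_21 = 0.5: effect of eta1 on eta2
Gamma = [[0.8],
         [0.3]]         # gamma_11 = 0.8 and gamma_21 = 0.3: direct effects of xi1

# Build I - B element by element.
I_minus_B = [[(1.0 if i == j else 0.0) - B[i][j] for j in range(2)]
             for i in range(2)]

# Total effects of xi on eta: (I - B)^{-1} Gamma
total = matmul(inverse_2x2(I_minus_B), Gamma)

# Indirect effects = total effects - direct effects
indirect = [[total[i][j] - Gamma[i][j] for j in range(1)] for i in range(2)]

print("total effects:   ", total)
print("indirect effects:", indirect)
```

In this example the indirect effect of $\xi_1$ on $\eta_2$ equals $\beta_{21}\gamma_{11}$, the product of the coefficients along the path through $\eta_1$.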

B. Measurement Model
In SEM, each latent variable usually has several measures or indicators. SEM users most often link latent variables to their indicators through measurement models in the form of factor analysis, which is widely used in psychometrics and sociometrics. In the SEM model, each latent variable is modeled as a factor that underlies the related indicators. The factor loadings that relate a latent variable to its indicators are denoted by $\lambda$ (lambda): $\lambda_x$ for X indicators and $\lambda_y$ for Y indicators (Wijanto, 2008). The measurement model equations can be written as follows:
$$x = \Lambda_x \xi + \delta \qquad (3)$$
$$y = \Lambda_y \eta + \varepsilon \qquad (4)$$
with:
$x$ : vector of indicator variables of the exogenous variables
$\Lambda_x$ : matrix of factor loadings ($\lambda_x$), or coefficients that show the relationship between $x$ and $\xi$, of size $q \times n$
$\delta$ : vector of measurement model errors for $x$, of size $q \times 1$
$y$ : vector of indicator variables of the endogenous variables
$\Lambda_y$ : matrix of factor loadings ($\lambda_y$), or coefficients that show the relationship between $y$ and $\eta$, of size $n_y \times m$
$\varepsilon$ : vector of measurement model errors for $y$, of size $n_y \times 1$
Here $q$ and $n_y$ are the numbers of X and Y indicators, respectively.
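The measurement equation can be made concrete with a small simulation. The sketch below assumes a hypothetical single latent variable with unit variance and two indicators generated as $x_i = \lambda_i \xi + \delta_i$; the loadings, error standard deviations, seed, and sample size are arbitrary choices for illustration. Under these assumptions the covariance between the two indicators should be close to $\lambda_1 \lambda_2$, because the indicators are related only through the shared latent variable.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical one-factor measurement model: x_i = lambda_i * xi + delta_i
loadings = [0.8, 0.6]      # lambda_x1, lambda_x2 (made-up values)
error_sd = [0.6, 0.8]      # standard deviations of delta_1, delta_2
n_obs = 20000

x1, x2 = [], []
for _ in range(n_obs):
    xi = random.gauss(0.0, 1.0)   # latent score, variance phi = 1
    x1.append(loadings[0] * xi + random.gauss(0.0, error_sd[0]))
    x2.append(loadings[1] * xi + random.gauss(0.0, error_sd[1]))

def covariance(a, b):
    """Unbiased sample covariance of two equally long lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)

sample_cov = covariance(x1, x2)
implied_cov = loadings[0] * loadings[1] * 1.0   # lambda_1 * lambda_2 * phi

print("sample covariance: ", round(sample_cov, 3))
print("implied covariance:", implied_cov)
```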

A. Structural Error
In general, SEM users do not expect the independent variables to predict the dependent variable perfectly, so a structural error component is usually added to the model and given the Greek symbol $\zeta$ (zeta). To obtain consistent parameter estimates, this structural error is assumed to be uncorrelated with the exogenous variables of the model. However, structural errors can be modeled as correlated with other structural errors (Wijanto, 2008).

B. Measurement Error
In SEM, the indicators or observed variables cannot measure the related latent variables perfectly. To model this imperfection, a measurement error component is added to the model. It is labeled with the Greek letter $\delta$ (delta) for the observed X variables and with the Greek letter $\varepsilon$ (epsilon) for the observed Y variables. The measurement errors $\delta$ may covary with each other, although by default they do not. The covariance matrix of $\delta$ is denoted by $\Theta_\delta$ (theta delta) and is by default a diagonal matrix, while the covariance matrix of $\varepsilon$ is denoted by $\Theta_\varepsilon$ (theta epsilon). When a latent variable is measured by only a single observed variable, the associated measurement error is difficult or impossible to estimate. In that case, the measurement error must be specified before the parameters are estimated, or it can be treated as zero (Wijanto, 2008).

General Form of SEM
Based on the previous explanation, a model from SEM can be drawn up. SEM is formed based on the relationship between latent variables where each latent variable is measured by its respective indicator variables. An example of the general form of SEM can be seen in Figure 1 (Wijanto, 2008).

Confirmatory Factor Analysis (CFA)
Confirmatory Factor Analysis (CFA) is a measurement model that shows how a latent variable is measured by one or more indicator variables (Sasongko et al., 2016). CFA tests whether the indicator variables are valid indicators of the latent construct (Ghozali, 2017). The general CFA model is as follows:
$$X = \Lambda_x \xi + \delta \qquad (5)$$
with:
$X$ : vector of indicator variables, of size $q \times 1$
$\Lambda_x$ : matrix of factor loadings ($\lambda_x$), or coefficients that show the relationship between $X$ and $\xi$, of size $q \times n$
$\xi$ : vector of latent variables, of size $n \times 1$
$\delta$ : vector of measurement errors, of size $q \times 1$
A loading factor value > 0.4 indicates that the indicator can explain its factor (Hair et al., 2010).
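To show what the CFA model implies for the data, the following sketch computes the model-implied covariance matrix of the indicators for a hypothetical single-factor model. It relies on the standard result that $X = \Lambda_x \xi + \delta$, with errors uncorrelated with $\xi$, gives $\Sigma = \Lambda_x \Phi \Lambda_x' + \Theta_\delta$. The function name `implied_covariance` and all numerical values are assumptions made for the example.

```python
def implied_covariance(loadings, phi, error_variances):
    """Sigma = Lambda * phi * Lambda' + Theta_delta for a single factor
    with uncorrelated (diagonal) measurement errors."""
    q = len(loadings)
    return [[loadings[i] * phi * loadings[j]
             + (error_variances[i] if i == j else 0.0)
             for j in range(q)]
            for i in range(q)]

# Hypothetical values: three indicators of one latent variable.
loadings = [0.8, 0.7, 0.6]            # lambda_x1 .. lambda_x3
phi = 1.0                             # variance of the latent variable
error_variances = [0.36, 0.51, 0.64]  # diagonal of Theta_delta

sigma = implied_covariance(loadings, phi, error_variances)
for row in sigma:
    print([round(value, 2) for value in row])
```

Estimation then amounts to choosing the loadings and error variances so that this implied matrix is as close as possible to the sample covariance matrix.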

SEM Assumptions
According to Ferdinand (in Santosa, 2006), the assumptions that must be met in the data collection and processing procedures for SEM modeling are:
a. Sample size. The minimum sample size is 100, with a further ratio of 5 observations for each estimated parameter.
b. Normality. The distribution of the data must be analyzed to see whether the assumption of normality is met.
c. Outliers. Outliers are observations with extreme values, either univariate or multivariate, that arise from a unique combination of characteristics and look very different from the other observations; the data must be examined for them.
d. Multicollinearity and singularity. These are detected from the determinant of the covariance matrix. A very small determinant indicates a multicollinearity or singularity problem.
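Two of these checks are easy to sketch in code. The example below reads the sample size rule as "at least 100 observations and at least 5 per estimated parameter", which is one possible interpretation of assumption (a), and computes the determinant of a covariance matrix for assumption (d). The function names and the example matrices are hypothetical.

```python
def minimum_sample_size(n_parameters, floor=100, per_parameter=5):
    """Larger of the fixed minimum and 5 observations per estimated parameter."""
    return max(floor, per_parameter * n_parameters)

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]   # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0               # exactly singular
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det               # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

print(minimum_sample_size(12))   # few parameters: the floor of 100 applies
print(minimum_sample_size(30))   # many parameters: 5 per parameter applies

# Two nearly collinear variables give a determinant close to zero.
nearly_collinear = [[1.0, 0.999],
                    [0.999, 1.0]]
well_conditioned = [[1.0, 0.3],
                    [0.3, 1.0]]
print(determinant(nearly_collinear), determinant(well_conditioned))
```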

Goodness of Fit Criteria
There are three types of Goodness-of-Fit measures, namely Absolute Fit Indices, Incremental Fit Indices, and Parsimonious fit indices (Haryono and Wardoyo, 2012).

Absolute Fit Measures
According to Wijanto (2008), absolute fit measures determine how well the overall model (structural and measurement models) predicts the correlation and covariance matrices. Of the various absolute fit measures, those most commonly used to evaluate SEM are:

a. Chi-Square
The first GOF statistic, and the only statistical test among the GOF measures, is chi-square ($\chi^2$). Chi-square is used to test how closely the sample covariance matrix $S$ matches the model covariance matrix $\Sigma(\theta)$. The test statistic is
$$\chi^2 = (n - 1)\,F(S, \Sigma(\theta)) \qquad (6)$$
which follows a chi-square distribution with $c - p$ degrees of freedom (df), where $F(S, \Sigma(\theta))$ is the minimum value of the fit function. Here $c = (n_x + n_y)(n_x + n_y + 1)/2$ is the number of non-redundant elements of the variance-covariance matrix of the observed variables, $n_x$ is the number of observed variables x, $n_y$ is the number of observed variables y, $p$ is the number of estimated parameters, and $n$ is the sample size.
A low chi-square value is desired. A low chi-square value means that the null hypothesis is not rejected, that is, the predicted and actual input matrices are not statistically different. Jöreskog and Sörbom (in Wijanto, 2008) say that chi-square should be treated as a measure of goodness of fit or badness of fit rather than as a statistical test. Chi-square is called a badness of fit measure because a higher chi-square value indicates a bad fit, while a small chi-square value indicates a good fit.
Chi-square cannot be used as the only measure of overall model fit, one reason being that it is sensitive to sample size. As the sample size increases, the chi-square value also increases and can lead to model rejection even when the difference between the sample covariance matrix $S$ and the model covariance matrix $\Sigma(\theta)$ is already very small.
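The sensitivity to sample size follows directly from the form of the statistic. The sketch below assumes the chi-square statistic is computed as $(n - 1)F$ and the degrees of freedom as $c - p$, as described above; the fit function value and the parameter counts are invented for illustration.

```python
def chi_square(f_min, n):
    """Chi-square statistic: (n - 1) times the minimized fit function."""
    return (n - 1) * f_min

def degrees_of_freedom(nx, ny, p):
    """df = c - p, with c the number of non-redundant variances and covariances."""
    k = nx + ny
    c = k * (k + 1) // 2
    return c - p

# Hypothetical model: 4 X indicators, 6 Y indicators, 23 estimated parameters.
df = degrees_of_freedom(nx=4, ny=6, p=23)
print("df =", df)

# The same discrepancy F gives a much larger chi-square in a larger sample.
f_min = 0.05
for n in (100, 1000):
    print("n =", n, "chi-square =", round(chi_square(f_min, n), 2))
```

With the discrepancy held fixed, a tenfold increase in the sample size raises the statistic roughly tenfold, which is why a large sample can reject a model whose misfit is small.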

b. Goodness of Fit Index (GFI)
GFI was originally proposed by Jöreskog and Sörbom (in Wijanto, 2008) for estimation with ML and ULS, and was later generalized to other estimation methods by Tanaka and Huba (1985). GFI can be classified as an absolute fit measure because it compares the hypothesized model with no model at all ($\Sigma(0)$). The formula for GFI is as follows:
$$GFI = 1 - \frac{F}{F_0} \qquad (7)$$
where $F$ is the minimum value of the fit function for the hypothesized model and $F_0$ is the minimum value of the fit function when no model is hypothesized. The GFI value ranges from 0 (poor fit) to 1 (perfect fit); GFI ≥ 0.90 indicates good fit, while 0.80 < GFI < 0.90 is often referred to as marginal fit.
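A minimal sketch of the GFI calculation and of the cut-offs quoted above is given below, assuming GFI is computed as one minus the ratio of the hypothesized-model fit function value to the no-model value. The fit function values are hypothetical, and the label "poor fit" for values of 0.80 or below is an assumption added for the example.

```python
def gfi(f_hypothesized, f_no_model):
    """GFI = 1 - F / F0."""
    return 1.0 - f_hypothesized / f_no_model

def classify_fit(value):
    """Apply the 0.90 and 0.80 cut-offs quoted in the text."""
    if value >= 0.90:
        return "good fit"
    if value > 0.80:
        return "marginal fit"
    return "poor fit"   # label assumed for values of 0.80 or below

example = gfi(f_hypothesized=0.12, f_no_model=2.4)
print(round(example, 3), classify_fit(example))
```

The same classification applies to the other indices in this section that share the 0.90 and 0.80 cut-offs.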

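c. Root Mean Square Error of Approximation (RMSEA)
This index was first proposed by Steiger and Lind (in Wijanto, 2008) and is currently one of the most informative indices in SEM. The RMSEA formula is as follows:
$$RMSEA = \sqrt{\frac{\hat{F}_0}{df_h}}, \qquad \hat{F}_0 = \max\left\{F - \frac{df_h}{n - 1},\ 0\right\} \qquad (8)$$
where $F$ is the minimum value of the fit function for the hypothesized model, $df_h$ is its degrees of freedom, and $n$ is the sample size. An RMSEA value < 0.05 indicates close fit, while 0.05 ≤ RMSEA ≤ 0.08 indicates good fit (Browne and Cudeck, 1993).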
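The RMSEA can be computed directly from the chi-square statistic. The sketch below assumes the common form $\sqrt{\max(\chi^2 - df, 0) / (df\,(n - 1))}$, which follows from dividing the population discrepancy estimate by the degrees of freedom when $\chi^2 = (n - 1)F$. The input values are hypothetical.

```python
import math

def rmsea(chi2, df, n):
    """RMSEA from the model chi-square, its degrees of freedom, and sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def describe_rmsea(value):
    """Apply the ranges quoted in the text."""
    if value < 0.05:
        return "close fit"
    if value <= 0.08:
        return "good fit"
    return "outside the ranges quoted above"

value = rmsea(chi2=85.3, df=50, n=200)
print(round(value, 3), describe_rmsea(value))

# When chi-square is below its degrees of freedom, RMSEA is set to zero.
print(rmsea(chi2=40.0, df=50, n=200))
```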

Incremental Fit Indices
Incremental fit measures compare the proposed model with a baseline model, often referred to as the null model or independence model, and with the saturated model. The null model is the model with the worst data-model fit, and the saturated model is the one with the best data-model fit. The null model is a model in which all variables are independent of each other (all correlations between variables are zero), which makes it the most restricted model (Byrne, 1998, in Wijanto, 2008).
The concept of incremental fit places the level of data-model fit between the null model and the saturated model. A model whose fit lies between the null model and the saturated model is called a nested model (Mueller, in Wijanto, 2008). Incremental fit measures therefore represent comparative fit relative to the baseline model: the closer to the saturated model, the better the fit. Of the various measures of incremental fit, those commonly used to evaluate SEM are:

a. Adjusted Goodness of Fit Index (AGFI)
According to Jöreskog and Sörbom (in Wijanto, 2008), AGFI is an extension of GFI, adjusted by the ratio between the degrees of freedom of the null/independence/baseline model and the degrees of freedom of the hypothesized or estimated model. AGFI can be calculated with the following formula:
$$AGFI = 1 - \frac{df_0}{df_h}(1 - GFI) \qquad (9)$$
where $df_0$ is the degrees of freedom of no model and $df_h$ is the degrees of freedom of the hypothesized model. As with GFI, the AGFI value ranges from 0 to 1; AGFI ≥ 0.90 indicates good fit, while 0.80 < AGFI < 0.90 is often referred to as marginal fit.
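
b. Tucker-Lewis Index / Non-Normed Fit Index (TLI/NNFI)
TLI was first proposed as a tool for evaluating factor analysis and was later extended to SEM. It can be calculated with the following formula:
$$TLI = \frac{\chi^2_i/df_i - \chi^2_h/df_h}{\chi^2_i/df_i - 1} \qquad (10)$$
where $\chi^2_i$ is the chi-square of the null/independence model, $\chi^2_h$ is the chi-square of the hypothesized model, $df_i$ is the degrees of freedom of the null model, and $df_h$ is the degrees of freedom of the hypothesized model. The TLI value ranges from 0 to 1; TLI ≥ 0.90 indicates good fit, while 0.80 < TLI < 0.90 is marginal fit.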
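
c. Normed Fit Index (NFI)
In addition to NNFI, Bentler and Bonett (in Wijanto, 2008) also proposed a GOF measure known as NFI. NFI ranges from 0 to 1; NFI ≥ 0.90 indicates good fit, while 0.80 < NFI < 0.90 is often referred to as marginal fit. NFI can be obtained with the following formula:
$$NFI = \frac{\chi^2_i - \chi^2_h}{\chi^2_i} \qquad (11)$$
where $\chi^2_i$ is the chi-square of the null/independence model and $\chi^2_h$ is the chi-square of the hypothesized model.

d. Comparative Fit Index (CFI)
Bentler (in Wijanto, 2008) adds to the inventory of incremental fit measures through CFI, whose value can be calculated with the formula:
$$CFI = 1 - \frac{\max(\chi^2_h - df_h,\ 0)}{\max(\chi^2_h - df_h,\ \chi^2_i - df_i,\ 0)} \qquad (12)$$
where $df_i$ and $df_h$ are the degrees of freedom of the null model and of the hypothesized model. The CFI value ranges from 0 to 1; CFI ≥ 0.90 indicates good fit, while 0.80 < CFI < 0.90 is often referred to as marginal fit.

e. Incremental Fit Index (IFI)
Bollen (in Wijanto, 2008) proposes IFI, whose value can be obtained from:
$$IFI = \frac{\chi^2_i - \chi^2_h}{\chi^2_i - df_h} \qquad (13)$$
The IFI value ranges from 0 to 1; IFI ≥ 0.90 indicates good fit, while 0.80 < IFI < 0.90 is often referred to as marginal fit.

f. Relative Fit Index (RFI)
According to Wijanto (2008), RFI can be calculated using the formula:
$$RFI = 1 - \frac{F_h/df_h}{F_i/df_i} \qquad (14)$$
where $F_h$ is the minimum value of F from the hypothesized model, $F_i$ is the minimum value of F from the null/independence model, and $df_h$ and $df_i$ are the corresponding degrees of freedom.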
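The incremental indices can all be computed from the chi-square values and degrees of freedom of the null and hypothesized models. The following sketch uses the standard published forms of AGFI, TLI, NFI, CFI, IFI, and RFI; because $\chi^2 = (n - 1)F$, the RFI is written here with chi-square values in place of $F$, since the factor $(n - 1)$ cancels in the ratio. The function names and all input values are hypothetical.

```python
def agfi(gfi, df_0, df_h):
    """AGFI = 1 - (df_0 / df_h) * (1 - GFI)."""
    return 1.0 - (df_0 / df_h) * (1.0 - gfi)

def tli(chi2_i, df_i, chi2_h, df_h):
    """Tucker-Lewis index (NNFI)."""
    ratio_i = chi2_i / df_i
    ratio_h = chi2_h / df_h
    return (ratio_i - ratio_h) / (ratio_i - 1.0)

def nfi(chi2_i, chi2_h):
    """Normed fit index."""
    return (chi2_i - chi2_h) / chi2_i

def cfi(chi2_i, df_i, chi2_h, df_h):
    """Comparative fit index."""
    numerator = max(chi2_h - df_h, 0.0)
    denominator = max(chi2_h - df_h, chi2_i - df_i, 0.0)
    if denominator == 0.0:
        return 1.0      # neither model shows misfit beyond its df
    return 1.0 - numerator / denominator

def ifi(chi2_i, chi2_h, df_h):
    """Incremental fit index."""
    return (chi2_i - chi2_h) / (chi2_i - df_h)

def rfi(chi2_i, df_i, chi2_h, df_h):
    """Relative fit index, written with chi-square in place of F."""
    return 1.0 - (chi2_h / df_h) / (chi2_i / df_i)

# Hypothetical results for a model with 10 observed variables.
chi2_i, df_i = 900.0, 45    # null / independence model
chi2_h, df_h = 60.0, 35     # hypothesized model

results = {
    "AGFI": agfi(gfi=0.95, df_0=55, df_h=df_h),
    "TLI": tli(chi2_i, df_i, chi2_h, df_h),
    "NFI": nfi(chi2_i, chi2_h),
    "CFI": cfi(chi2_i, df_i, chi2_h, df_h),
    "IFI": ifi(chi2_i, chi2_h, df_h),
    "RFI": rfi(chi2_i, df_i, chi2_h, df_h),
}
for name, value in results.items():
    print(name, round(value, 3))
```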

Parsimonious Fit Indices
According to Wijanto (2008), models with relatively few parameters (and relatively many degrees of freedom) are said to have high parsimony or efficiency, whereas models with many parameters (and few degrees of freedom) are complex and lack parsimony. Parsimonious fit measures relate the GOF of a model to the number of parameters that must be estimated to achieve that level of fit. Parsimony can thus be defined as obtaining the highest degree of fit for each degree of freedom, which means that higher parsimony is better. Of the various parsimonious fit measures, those commonly used to evaluate SEM are:

a. Parsimonious Normed Fit Index (PNFI)
According to James, Mulaik and Brett (in Wijanto, 2008), PNFI is a modification of NFI that takes into account the number of degrees of freedom used to achieve a level of fit. PNFI is defined as follows:
$$PNFI = \frac{df_h}{df_i} \times NFI \qquad (15)$$
where $df_h$ is the degrees of freedom of the hypothesized model and $df_i$ is the degrees of freedom of the null model. The higher the PNFI value, the better. PNFI is used mainly to compare two or more alternative models that have different degrees of freedom, and no threshold for acceptable fit is recommended. However, when two models are compared, a difference in PNFI of 0.06 to 0.09 indicates a fairly large difference between the models (Haryono and Wardoyo, 2012).

b. Parsimonious Goodness of Fit Index (PGFI)
In contrast to AGFI, which modifies GFI based on the degrees of freedom, PGFI is based on the parsimony of the model (Wijanto, 2008). PGFI adjusts GFI in the following way:
$$PGFI = \frac{df_h}{df_0} \times GFI \qquad (16)$$
where $df_0$ is the degrees of freedom of no model. The PGFI value ranges between 0 and 1, with a higher value indicating a more parsimonious model.
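The comparison of alternative models by PNFI can be sketched as follows, assuming PNFI is the NFI multiplied by the ratio of the hypothesized-model to null-model degrees of freedom, and PGFI is the GFI multiplied by the ratio of the hypothesized-model to no-model degrees of freedom. The two models and their fit values are hypothetical.

```python
def pnfi(nfi, df_h, df_i):
    """PNFI = (df_h / df_i) * NFI."""
    return (df_h / df_i) * nfi

def pgfi(gfi, df_h, df_0):
    """PGFI = (df_h / df_0) * GFI."""
    return (df_h / df_0) * gfi

# Two hypothetical alternative models fitted to the same 10 observed variables.
df_i = 45                                            # null / independence model
pnfi_a = pnfi(nfi=0.93, df_h=35, df_i=df_i)          # simpler model
pnfi_b = pnfi(nfi=0.95, df_h=30, df_i=df_i)          # more complex model
difference = pnfi_a - pnfi_b

print("PNFI A:", round(pnfi_a, 3))
print("PNFI B:", round(pnfi_b, 3))
print("difference:", round(difference, 3))
print("PGFI A:", round(pgfi(gfi=0.95, df_h=35, df_0=55), 3))
```

In this example the more complex model has the higher NFI but the lower PNFI, so the gain in fit does not offset the loss of degrees of freedom.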
In empirical research practice, a researcher does not have to meet all the goodness of fit criteria. According to Hair et al. (2010, in Latan, 2011), using 4 to 5 goodness of fit criteria is considered sufficient to assess the feasibility of a model, provided that each group of criteria (absolute fit indices, incremental fit indices, and parsimonious fit indices) is represented.

SEM Stages
The steps of data analysis using SEM are as follows (Ghozali, 2017):

Theoretical Model Development
The theoretical development of the model is based on an existing and strong theory. The strength of the causal relationship between the two variables does not lie in the analytical method, but in the theoretical justification to support the model analysis (Ghozali, 2017).

Flowchart Drawing
In the second step, the relationship between exogenous and endogenous variables and their measuring variables will be described based on the previous theoretical development of the model (Ghozali, 2017).

Convert Flowchart into Equation
There are 2 equation models in SEM, namely a measurement equation model to express the relationship between latent variables and their indicator variables and a structural equation model to express the relationship between various latent variables (Ghozali, 2017).

Input Matrix and Model Parameter Estimation Technique
SEM only uses the variance/covariance matrix or correlation matrix as input data for the overall estimation it performs. Individual observations can be used, but the input data will be converted into a covariance matrix or correlation matrix before estimation. The covariance matrix is used when the purpose of the analysis is to test a model that already has a theoretical concept. The correlation matrix is used when the purpose of the analysis only wants to see the pattern of relationships between variables (Ghozali, 2017).
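The conversion of individual observations into an input matrix can be sketched as below. The function names and the small data set are hypothetical; the covariance uses the usual $n - 1$ denominator, and the correlation matrix is obtained by dividing each covariance by the product of the two standard deviations.

```python
import math

def covariance_matrix(columns):
    """Sample covariance matrix; each element of `columns` is one variable."""
    n = len(columns[0])
    k = len(columns)
    means = [sum(col) / n for col in columns]
    return [[sum((columns[i][t] - means[i]) * (columns[j][t] - means[j])
                 for t in range(n)) / (n - 1)
             for j in range(k)]
            for i in range(k)]

def correlation_matrix(cov):
    """Rescale a covariance matrix so that every variable has unit variance."""
    k = len(cov)
    sd = [math.sqrt(cov[i][i]) for i in range(k)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(k)] for i in range(k)]

# Hypothetical raw scores on two indicators for five respondents.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

S = covariance_matrix([x1, x2])
R = correlation_matrix(S)
print("covariance matrix: ", S)
print("correlation matrix:", [[round(v, 3) for v in row] for row in R])
```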
Parameter estimation techniques that can be used in SEM are Maximum Likelihood, Generalized Least Square Estimation, and Asymptotically Distribution-Free Estimation (Ghozali, 2017).

Model Problem Identification
According to Ghozali (2017)

Evaluating Model Parameter Estimation
Parameter estimation proceeds in four steps. First, a confirmatory factor analysis (CFA) is carried out on the measurement equation model for the exogenous and endogenous variables. Second, model feasibility is tested using the loading factor values (expected value ≥ 0.5) and the goodness of fit criteria; if the model is not feasible, it is modified by eliminating indicator variables that do not meet the eligibility criteria. Third, the final models from the confirmatory analyses of the exogenous and endogenous variables are combined and the parameters of the combined model are estimated. Fourth, the SEM assumptions are tested, followed by the goodness of fit feasibility test, the validity test, and the reliability test (Ghozali, 2017).

Interpreting Models and Modifying Models
In the final stage, the model is interpreted, and models that do not meet the test requirements are modified. Before modifying the model, it is necessary to examine the loading factor values and the standardized residuals generated by the model. An indicator with a loading factor value greater than or equal to 0.5 meets convergent validity. An indicator with a loading factor value below 0.5 does not meet convergent validity and is eliminated from the model (Ghozali, 2017).
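The elimination rule can be sketched in a few lines, assuming the standardized loadings have already been estimated by SEM software. The indicator names, loading values, and the function name `split_by_loading` are hypothetical.

```python
def split_by_loading(loadings, threshold=0.5):
    """Separate indicators that meet the loading threshold from those that do not."""
    kept = {name: value for name, value in loadings.items() if value >= threshold}
    dropped = {name: value for name, value in loadings.items() if value < threshold}
    return kept, dropped

# Hypothetical standardized loadings for one latent variable.
estimated_loadings = {"x1": 0.82, "x2": 0.47, "x3": 0.66, "x4": 0.50}

kept, dropped = split_by_loading(estimated_loadings)
print("kept:   ", sorted(kept))
print("dropped:", sorted(dropped))
```

After indicators are dropped, the model has to be re-estimated, because the remaining loadings and the fit indices change once the measurement model changes.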

Conclusion
There are 2 kinds of variables, 2 equation models, and 2 kinds of errors in SEM. The variables in SEM are latent variables and indicator variables. The equation models in SEM are the measurement equation model and the structural equation model. The errors in SEM are structural errors and measurement errors. CFA on the measurement model is carried out to test whether the indicator variables are valid indicators of the latent construct. Several goodness of fit criteria can be used in SEM, including GFI, AGFI, RMSEA, and TLI. The SEM steps are as follows: 1) theoretical model development, 2) flowchart drawing, 3) converting the flowchart into equation form, 4) input matrix and model parameter estimation techniques, 5) identifying the model problem, 6) evaluating the model parameter estimates, and 7) interpreting and modifying the model.