Principal Component Regression in Statistical Downscaling with Missing Value for Daily Rainfall Forecasting

Drought is a serious problem that often arises during the dry season. Hydrometeorologically, drought is caused by reduced rainfall over a certain period, so up-to-date measures are needed to address it. This research aims to predict the potential for drought to recur in Kupang City, Indonesia, by developing a rainfall forecasting model. Incomplete daily local climate data for Kupang City are an obstacle to this analysis. The missing values were therefore imputed using the Kalman Filter method with an ARIMA State-Space model. The Kalman Filter with the ARIMA(2,1,1) State-Space model produces the best missing data imputation, with a Root Mean Square Error (RMSE) of 0.930. Rainfall forecasting is carried out using Statistical Downscaling with a Principal Component Regression (PCR) model that takes into account global atmospheric circulation from the General Circulation Model (GCM). The results show that the PCR model performs well, with a Mean Absolute Percent Error (MAPE) of 2.81%. This model is used to predict the daily rainfall of Kupang City from GCM data.


Introduction
Drought in Indonesia is a problem with a significant impact. Because Indonesia is an agricultural country, drought can reduce food crop production, which lowers the national food supply and disrupts economic stability. Drought can be interpreted as a temporary reduction in water supply or moisture to a level significantly below the normal or expected volume for a specified period of time (Murisidi and Sari, 2017; Field et al., 2004; Rozaki et al., 2021). The occurrence of natural disasters is also influenced by geographical conditions (Boesday et al., 2020; Titu-Eki and Kotta, 2021), and Kupang City is one such case. The rainy and dry seasons in Kupang City are closely related to the monsoon pattern, which in turn reduces the formation of rain clouds in Kupang City.
Therefore, a number of measures are needed to deal with drought in Kupang City. One recent approach is to develop a rainfall forecasting model. Accurate and precise rainfall forecasting models are an important part of providing information about future rainfall. Estiningtyas and Wigena (2011) reported that the best Principal Component Regression (PCR) model was able to predict rainfall under El Nino conditions with an average RMSEP of 95.22 and a correlation of 0.66. In addition, several GCM models can approximate the average monthly rainfall, with the largest correlation of 0.497 at Bondan Station (Susandi et al., 2015). Based on these findings, PCR and GCM-based models are considered capable of producing forecasts with good accuracy and low error.
GCM data are a mathematical description of a large number of interactions among the physics, chemistry, and dynamics of the Earth's atmosphere, and they yield a very large amount of data that can be used to make climate forecasts (Farikha et al., 2021). The resolution of GCM data is too low to predict the local climate directly, so an additional technique is needed to make use of them, one of which is statistical downscaling (SD). Statistical downscaling is a statistical model that describes the relationship between data on global-scale units and data on local-scale units within a certain time period (Sachindra et al., 2018). However, this application faces some obstacles. One of them is missing local climate information in the data provided by the Meteorology, Climatology and Geophysics Agency (BMKG).
This study addresses three questions: (1) how to handle missing values in local climate data, (2) how to build a statistical downscaling model using the PCR method for daily rainfall forecasting, and (3) how the resulting daily rainfall forecasts can inform policy making for dealing with drought disasters.

Data Description
In this study, two data sets were used: General Circulation Model (GCM) data and daily rainfall data for Kupang City. The daily rainfall data were obtained from online publications issued by the BMKG through the website http://www.dataonlinebmkg.go.id, and the GCM data were obtained from the website https://cds.climate.copernicus.eu. Both data sets cover the period January 1, 2019 to December 31, 2019. The GCM data are bounded by latitudes 13.25° S to 7.25° S and longitudes 120.75° E to 126.75° E.

ARIMA(p, d, q) and Kalman Filter Imputation
State-space models provide flexibility in extracting features from time series data. They are generally used for prediction, smoothing, and likelihood evaluation, and they offer a suitable framework for incorporating smoothing functions into various time series models to improve predictions. In general, the state-space model is given by equations (1) and (2):

$$y_t = Z \alpha_t + \varepsilon_t, \quad \varepsilon_t \sim N(0, H) \tag{1}$$

$$\alpha_{t+1} = T \alpha_t + \eta_t, \quad \eta_t \sim N(0, Q) \tag{2}$$

where $\alpha_t$ is a state vector of length $m$, $y_t$ is a vector of length $k$, $Z$ is a $k \times m$ matrix, and $T$ is an $m \times m$ matrix. For an ARMA($p$, $q$) process with white noise $a_t$, let $m = \max(p, q + 1)$. Then:

$$y_t = \sum_{j=1}^{m} \phi_j y_{t-j} + a_t + \sum_{j=1}^{m-1} \theta_j a_{t-j} \tag{3}$$

with $\phi_j = 0$ for $j > p$ and $\theta_j = 0$ for $j > q$. The state vector $\alpha_t$ is then defined so that equation (3) takes the form of equations (1) and (2), which gives the state-space form of the ARIMA($p$, $d$, $q$) model. This representation has both computational and conceptual advantages (Kumar and Goyal, 2011). It ensures that the ARIMA model is amenable to the Kalman filter and smoother for estimating model parameters and extracting unobserved components (Zulfi et al., 2018; Ananda and Wahyuni, 2021; Mehta and Sukmawaty, 2021). Estimation and updating of model parameters are part of the Kalman filter. These models are implemented using the "imputeTS" package, version 2.7, with the "StructTS" and "auto.arima" options of the "na.kalman" function in R version 4.1.1.
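To illustrate how a Kalman filter and smoother can fill gaps in a series, the following is a minimal Python sketch. It is not the imputeTS implementation used in this study: it assumes the simplest state-space model (a local level, i.e. a random walk observed with noise) instead of ARIMA(2,1,1), and it fixes the two variances instead of estimating them. The function name `kalman_impute` and its parameters are hypothetical.

```python
def kalman_impute(y, obs_var=1.0, state_var=1.0):
    """Fill None entries of y with smoothed states of a local-level model.

    Assumed model (a stand-in for the ARIMA state-space form):
        y_t      = mu_t + eps_t,  eps_t ~ N(0, obs_var)
        mu_{t+1} = mu_t + eta_t,  eta_t ~ N(0, state_var)
    """
    n = len(y)
    # Start from the first observed value with a very large variance
    # (an approximately diffuse prior).
    a = next(v for v in y if v is not None)
    p = 1e6
    a_pred, p_pred, a_filt, p_filt = [], [], [], []

    # Forward pass: Kalman filter. Missing observations skip the update step.
    for t in range(n):
        a_pred.append(a)
        p_pred.append(p)
        if y[t] is not None:
            gain = p / (p + obs_var)
            a = a + gain * (y[t] - a)
            p = (1.0 - gain) * p
        a_filt.append(a)
        p_filt.append(p)
        p = p + state_var  # predict the next state (the level itself carries over)

    # Backward pass: Rauch-Tung-Striebel smoother.
    a_smooth = list(a_filt)
    for t in range(n - 2, -1, -1):
        c = p_filt[t] / p_pred[t + 1]
        a_smooth[t] = a_filt[t] + c * (a_smooth[t + 1] - a_pred[t + 1])

    # Keep observed values and replace only the missing ones.
    return [a_smooth[t] if y[t] is None else y[t] for t in range(n)]


# Example: one missing day in a short, made-up rainfall series.
filled = kalman_impute([1.0, 2.0, None, 4.0, 5.0])
```

Because the smoother uses observations on both sides of a gap, the imputed value reflects the local trend rather than only the past, which is the property the study relies on when comparing imputed and reference series.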

Principal Component Regression
Rainfall data involve a large number of variables with high multicollinearity. Multicollinearity is a condition in which the independent variables are correlated with one another, so that they are not in fact independent (Sahriman et al., 2014; Yu et al., 1997). Multicollinearity in a multiple regression model inflates the variance of the coefficient estimates, so that the influence of each independent variable cannot be separated. Principal Component Regression (PCR) is an algorithm that reduces multicollinearity in a data set (Nair et al., 2013). In addition, because it usually regresses on only a subset of the principal components, PCR achieves dimension reduction by substantially lowering the effective number of parameters that characterize the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components used in the regression, PCR can lead to efficient prediction of the outcome under the assumed model. In general, the PCR model is given by equation (7):

$$y = \beta_0 + \sum_{j=1}^{K} \beta_j PC_j + \varepsilon \tag{7}$$

where $y$ is the response variable, $\beta_0$ is the intercept, $\beta_j$ is the coefficient for the $j$th component, $PC_j$ is the $j$th principal component, $K$ is the number of components retained, and $\varepsilon$ is the error term.
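The PCR idea can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the R code used in this study: the principal components are extracted from the covariance matrix by power iteration with deflation (a simple substitute for a library eigen-solver), and the names `pcr_fit` and `pcr_predict` are hypothetical. Because the component scores are mutually orthogonal, each regression coefficient can be computed separately.

```python
def pcr_fit(X, y, n_components, n_iter=500):
    """Fit a principal component regression of y on the rows of X.

    n_components must not exceed the rank of the centered X,
    otherwise the power iteration divides by zero.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]

    # Sample covariance matrix of the centered predictors.
    C = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
         for a in range(p)]

    loadings, coefs = [], []
    for _ in range(n_components):
        # Power iteration: converges to the leading eigenvector of C.
        v = [1.0 / (i + 1) for i in range(p)]
        for _ in range(n_iter):
            w = [sum(C[a][b] * v[b] for b in range(p)) for a in range(p)]
            norm = sum(c * c for c in w) ** 0.5
            v = [c / norm for c in w]
        eigval = sum(v[a] * sum(C[a][b] * v[b] for b in range(p))
                     for a in range(p))

        # Component scores and the least-squares coefficient of y on them.
        scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
        beta = (sum(s * t for s, t in zip(scores, yc))
                / sum(s * s for s in scores))
        loadings.append(v)
        coefs.append(beta)

        # Deflation: remove the component just found from C.
        C = [[C[a][b] - eigval * v[a] * v[b] for b in range(p)]
             for a in range(p)]
    return x_mean, y_mean, loadings, coefs


def pcr_predict(model, row):
    """Predict the response for one row of predictors."""
    x_mean, y_mean, loadings, coefs = model
    xc = [row[j] - x_mean[j] for j in range(len(row))]
    return y_mean + sum(b * sum(vj * xj for vj, xj in zip(v, xc))
                        for v, b in zip(loadings, coefs))


# Example with two perfectly collinear predictors (x2 = 2 * x1), made-up data.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [4.0, 7.0, 10.0, 13.0]
model = pcr_fit(X, y, n_components=1)
prediction = pcr_predict(model, [5.0, 10.0])
```

In the example, ordinary least squares on both predictors would be ill-defined because they are perfectly collinear, while a single principal component carries all of the information in them.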

Method Evaluation
Root Mean Square Error (RMSE) measures the error between two corresponding values, in this case the predicted value and the actual value. In general, RMSE is formulated by equation (8):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{8}$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of data points. In addition to RMSE, the model can be validated with the Mean Absolute Percent Error (MAPE). MAPE is used when the size of the forecast variable is an important factor in evaluating the accuracy of the forecast. It indicates how large the forecast error is compared to the actual value of the series. MAPE values can be interpreted against ranges that indicate the ability of a forecasting model. These ranges are given in Table 1, which classifies the ability of the forecasting model as very good, good, feasible, or poor. In general, MAPE is formulated by equation (9):

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \tag{9}$$

where $n$ is the number of data points, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value.
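Both error measures are straightforward to compute. The short Python sketch below uses the standard definitions, RMSE as the square root of the mean squared difference and MAPE as the mean absolute error relative to the actual value, expressed in percent. The function names and the sample numbers are illustrative only and are not taken from the study data.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute percent error, in percent.

    MAPE is undefined when an actual value is zero, so this sketch
    raises an error instead of silently skipping such points.
    """
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Made-up values for illustration.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(rmse(actual, predicted), mape(actual, predicted))
```

Note that RMSE is expressed in the units of the data, whereas MAPE is scale-free, which is why the two are reported together.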

Model Development
This research uses daily average rainfall data for NTT (East Nusa Tenggara) and daily rainfall data for Kupang from the Eltari observation station, produced by BMKG and available from the website http://dataonline.bmkg.go.id/data_iklim. The probability of a missing value in the Kupang daily rainfall data does not depend on observed or unobserved data; that is, the data are missing completely at random (MCAR) (Gill et al., 2007). The NTT daily average rainfall data are used as a reference for the characteristics of the Kupang City daily rainfall data in order to find the best imputation model. Algorithm performance is evaluated under various test scenarios. A forecasting model is then built by adding GCM data, and the best model is used to obtain long-term daily rainfall forecasts for Kupang City. For each test scenario, the following steps are performed.

1. Take the complete NTT daily average rainfall series (ts_complete).
2. Delete NTT average rainfall values at the positions of the missing values and unusual observations in the daily rainfall of Kupang City to obtain a time series with NA (ts_NA).
3. Apply the imputation algorithm to ts_NA to get ts_Imputed.
4. Compare ts_complete and ts_Imputed using an appropriate error measure.
5. Find the smallest error measure. The model with the smallest error is the best model of the imputation algorithm.
6. Apply the best model of the imputation algorithm to the missing values and unusual observations in the daily rainfall of Kupang City.
7. Get GCM data (with no missing values) from the website https://cds.climate.copernicus.eu/ with domain grids from 3 × 3 to 12 × 12.
8. Statistical machine learning: divide the data into two parts, training and testing, and build forecasting models using the training data.
9. Validating and evaluating the learning: test the forecasting model using the testing data.
10. Get the best forecasting model from the best imputation method.
11. Forecast daily rainfall for the future period.
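The imputation test scenario (deleting values from a complete reference series, imputing them, and scoring the result) can be sketched as follows. This is a minimal Python illustration, not the study's R workflow: linear interpolation stands in for the Kalman filter imputation, the error is computed only at the deleted positions (one reasonable reading of the comparison step), and all function names and numbers are hypothetical.

```python
import math


def linear_interpolate(series):
    """Stand-in imputation: fill None by linear interpolation.

    Gaps at either end are filled with the nearest observed value.
    Assumes the series contains at least one observed value.
    """
    observed = [i for i, v in enumerate(series) if v is not None]
    out = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((j for j in observed if j < i), default=None)
        right = min((j for j in observed if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            w = (i - left) / (right - left)
            out[i] = series[left] + w * (series[right] - series[left])
    return out


def evaluate_imputation(ts_complete, missing_idx, impute):
    """Mask ts_complete at missing_idx, impute, and return the RMSE
    between true and imputed values at the masked positions."""
    ts_na = [None if i in missing_idx else v
             for i, v in enumerate(ts_complete)]
    ts_imputed = impute(ts_na)
    sq_err = [(ts_complete[i] - ts_imputed[i]) ** 2 for i in missing_idx]
    return math.sqrt(sum(sq_err) / len(sq_err))


# Made-up reference series and gap positions for illustration.
ts_complete = [0.0, 3.0, 4.0, 2.0, 6.0]
score = evaluate_imputation(ts_complete, {1, 3}, linear_interpolate)
```

Running the same evaluation with several candidate imputation functions and keeping the one with the smallest score mirrors the model selection described in the steps above.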
The variables in this study are the predictor variable (x) and the response variable (y). The predictor variables are the GCM data on the domain grid, and the response variable is the Kupang City rainfall data. The analysis was carried out in R 4.1.1, and the steps of the study are shown in Figure 1.

Handling Missing Value
To handle missing values, we use the ARIMA method and the Kalman Filter. The best imputation method is selected according to its accuracy in imputing the missing values. Table 2 presents the performance of the two methods. The performance of the Kalman Filter with the ARIMA State-Space model can be attributed to the relatively strong relationship between the missing values and the existing data. This best model is used to impute the missing values for Kupang City. The plot of the imputed values for Kupang City shows that they follow the trend of the series (see Figure 2).

PCR Forecasting Models
Determining the optimal domain is necessary to obtain an accurate daily rainfall forecasting model. In this study, the domains considered for optimization consisted of 10 square grid sizes, from 3 × 3 to 12 × 12. The ten grid sizes have 9, 16, 25, 36, 49, 64, 81, 100, 121, and 144 predictor variables, respectively. Therefore, PCR was used in this study. Conceptually, PCR is based on PCA, which reduces the dimension of the predictor variables for each domain size. The new variables obtained from the reduction were used to construct the PCR model. The performance of the simulation model is evaluated by MAPE, and the MAPE values of the grid sizes are compared to find the grid size with the minimum MAPE. The simulation results using training data for each grid size are shown in Table 3. Table 3 shows that the RMSE differs only slightly among the 10 grid sizes, but the 6 × 6 grid produces the minimum MAPE. The grid size with the minimum MAPE is the best domain and is suitable for future daily rainfall forecasting. The 6 × 6 grid has 36 predictor variables. The PCA method was used to extract orthogonal principal components (PCs), which were retained so as to explain more than 93% of the variance. As shown in Figure 3, the first 11 PCs explain more than 93% of the variance, so the forecasting model uses these 11 PCs as its new predictor variables.
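The rule for choosing the number of PCs from a cumulative variance threshold can be written compactly. The Python sketch below is illustrative only: the function name is hypothetical, the eigenvalues are made up rather than taken from Figure 3, and the threshold is treated as "at least" the given share of variance.

```python
def n_components_for_variance(eigenvalues, threshold=0.93):
    """Smallest number of leading PCs whose cumulative share of the
    total variance reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, eigval in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += eigval
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Number of predictor variables for each square grid size n x n.
grid_predictors = {n: n * n for n in range(3, 13)}

# Made-up eigenvalues for illustration.
k = n_components_for_variance([5.0, 3.0, 1.0, 1.0], threshold=0.75)
```

Applied to the 36 eigenvalues of the 6 × 6 domain, a rule of this kind is what yields the 11 PCs reported in the text.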
The SD results presented in Figure 4 are the results of the model test over the testing period. The predictions are able to follow the pattern of the actual data. The model produces a MAPE of 2.81% and an RMSE of 10.81. Therefore, the PCR model with principal components (PCs) can be used to forecast daily rainfall for the future period from December 1, 2020 to November 30, 2021. As shown in Figure 5, the forecasts follow the same pattern found in previous studies based on historical daily rainfall in Indonesia. In the research area, daily rainfall tends to decrease from June to November and to increase from December to May.