Data Mining Implementation Using Naïve Bayes Algorithm and Decision Tree J48 In Determining Concentration Selection

e-ISSN 2721-477X p-ISSN 2722

The computerization of society has substantially improved the ability to generate and collect data from a variety of sources, and large amounts of data now flood almost every aspect of people's lives. AMIK HASS Bandung has an Informatics Management Study Program with three areas of concentration that students select in the fourth semester: Computerized Accounting, Computer Administration, and Multimedia. Concentration selection should be determined precisely from past data, so the academic section needs a pattern or rule with which to predict it. In this work, the data mining techniques used were Naïve Bayes and Decision Tree J48, run with the WEKA tool. The data set consisted of 111 records, evaluated in percentage split mode: 75% was used as training data to build the models and 25% as test data against which both models were tested. Naïve Bayes obtained the highest accuracy, 71.4%, with 20 of the 28 test instances classified correctly. Decision Tree J48 had a lower accuracy of 64.3%, with 18 of the 28 test instances classified correctly. Decision Tree J48 produced 4 patterns or rules for determining concentration selection, which the academic section can use to assist students in choosing a concentration.


Introduction
The rapid development of information technology is undisputed. Along with this development, transaction data of all kinds have come to be handled with information technology, so the computerization of society has substantially improved the ability to generate and collect data from various sources. Vast amounts of data have flooded almost every aspect of people's lives. This explosive growth of stored data has generated an urgent need for new techniques and automated tools that can intelligently turn large amounts of data into useful information and knowledge. This need led to the development of the field of computer science called data mining, with its various applications. Data mining, also popularly referred to as Knowledge Discovery from Data or KDD, is the automatic or practical extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the web, other large information repositories, or data streams (Larose, 2015).
AMIK HASS Bandung is a private college in Bandung. Its Informatics Management Study Program consists of three areas of concentration that students select in the fourth semester: Computerized Accounting, Computer Administration, and Multimedia. Concentration selection determines the field of science and competency that a student will develop, and students choose based on consultation with their respective guardian lecturers. In addition, the academic section evaluates student data in the form of gender, GPA, and class. This activity takes a long time because concentration selection must be determined precisely from past data, so the academic section needs a pattern or rule with which to predict it. Several methods can be applied to this problem at AMIK HASS Bandung. In this work, the data mining techniques used are Naïve Bayes and Decision Tree J48, run with the WEKA tool. Given this background, the purpose of this work is to determine the patterns or rules of concentration selection and to measure how accurately the Naïve Bayes and Decision Tree J48 algorithms predict concentration selection.
Relevant previous work was done by Nematzadeh (2012), who classified researchers as "Expert" or "Novice" based on cognitive factors so that they could obtain the best possible answers. The domain of that work is the academic environment. Its key point is to classify researchers with the Naïve Bayes and Decision Tree J48 techniques and then choose the method with the highest accuracy, to help researchers get the best feedback for their demands in a digital library. Based on the accuracy obtained, the study concluded that web developers can use Naïve Bayes or its updateable variant rather than Decision Tree J48 to classify researchers and help them get the best feedback for their demands in the world of digital libraries (Nematzadeh, 2012).
Further work was carried out by Dimitoglou et al. (2012), who examined the accuracy of data mining and machine learning with the Naïve Bayes and J48 algorithms in predicting the survival of lung cancer patients. The study showed an accuracy rate of about 90% for one of the two algorithms. With such results, a treating doctor could in theory collect medical measurements such as tumor size and location, treatment options, and others, and predict with a fairly high degree of accuracy whether the patient is likely to live for five years or more. Given the high mortality rate of the patients (> 90%), the study can also be used to examine patient survivability over a shorter period, between 12 and 18 months (Dimitoglou et al., 2012).
In addition, much work has applied data mining techniques to education. Merceron and Yacef (2005) presented a case study on mining educational data sets to identify the behavior of failing students and to warn students about the risks before the final exam. Al-Radaideh (2006) applied a decision tree to predict the final grades of students taking a C++ course at Yarmouk University, Jordan. Romero et al. (2008) worked on applying data mining techniques to Moodle course management, covering techniques that have been widely used for e-learning data. Educational data mining work was also carried out by Minaei-Bidgoli et al. (2003). Beikzadeh et al. (2005) used educational data mining for identification and improvement, and observed an improvement in the decision-making process. Waiyamai et al. (2003) used data mining to help develop a new curriculum and to help students choose appropriate courses. Rao et al. (2016) worked on learning models that predict student performance using classification techniques, and presented a comparative performance analysis of the J48, Naïve Bayes, and random forest algorithms.
Mayilvaganan and Kalpanadevi (2014) compared the classification techniques C4.5, AODE, Naïve Bayes, and K-Nearest Neighbor for analyzing and predicting student performance, with the aim of improving students' skills in achieving their end-of-semester goals.
Amra and Maghari (2017) sought hidden knowledge and patterns about student performance by applying two classification algorithms, KNN and Naïve Bayes, to the 2015 secondary school education data set from the ministry of education in the Gaza Strip. The main purpose of the classification is to help the ministry of education improve performance through early prediction of student performance. Teachers can also take appropriate evaluation measures to improve student learning. The experimental results showed that Naïve Bayes was better than K-Nearest Neighbor, achieving the highest accuracy of 93.6% (Amra and Maghari, 2017).
Further work was carried out by Devasia et al. (2016), who developed a web-based application that uses the Naïve Bayes technique to retrieve information contained in a higher education database. An increase in the number of students who do not continue their studies affects the reputation of educational institutions. The experiment was conducted on 700 students described by 19 attributes. The results show that the Naïve Bayes algorithm provides higher accuracy than other methods such as regression, decision tree, and neural network.
The current work differs from the previous work in that it compares the two techniques and predicts students' concentration selection in the fourth semester using the attributes gender, GPA, and class, to help students determine which concentration to take.

Data Mining
Data mining is the process of analyzing data from various perspectives and summarizing it into useful information. Technically, data mining is the process of finding patterns and relationships in large relational databases. Data sources can include databases, data warehouses, the web, other information repositories, or data that flows dynamically into the system. Large-scale information technology has developed transaction systems and analytical systems separately; data mining provides the link between the two. Data mining can find new relationships and patterns in data. It draws on the fields of statistics, machine learning, artificial intelligence, and neural networks (Rao et al., 2016; Han et al., 2012).

Naïve Bayes
Naïve Bayes is a classification method based on probability and statistics, built on the theorem put forward by the British scientist Thomas Bayes (Han et al., 2012). The algorithm applies Bayes' theorem and assumes that all attributes are independent of one another given the value of the class variable. The method requires only a small amount of training data to estimate the parameters needed in the classification process. The Naïve Bayes Classifier (NBC) often works much better than expected in complex real-world situations. Bayes' theorem is a mathematical formula used to determine conditional probability, as in Equation 1 (Saritas and Yasar, 2019).
$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \qquad (1)$$
Description:
$P(C_i \mid X)$ = probability of hypothesis $C_i$ given the fact or record $X$ (posterior probability)
$P(X \mid C_i)$ = probability of record $X$ given hypothesis $C_i$ (likelihood)
$P(C_i)$ = probability of hypothesis $C_i$ before $X$ is observed (prior probability)
$P(X)$ = probability of record $X$ (evidence)
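To make Equation 1 concrete, the following is a minimal Python sketch of a Naïve Bayes classifier over the three nominal attributes used in this work. The six training records are invented purely for illustration and are not the AMIK HASS data; the function names and the Laplace smoothing constant `alpha` are likewise assumptions and do not describe WEKA's implementation.

```python
from collections import Counter, defaultdict

# Hypothetical records (gender, class, GPA band) -> concentration; NOT the study data.
TRAIN = [
    (("Male", "Regular", ">=3"), "Multimedia"),
    (("Male", "Regular", "<3"), "Multimedia"),
    (("Male", "Non-Regular", ">=3"), "Computer Administration"),
    (("Female", "Regular", ">=3"), "Computerized Accounting"),
    (("Female", "Non-Regular", ">=3"), "Computerized Accounting"),
    (("Female", "Regular", "<3"), "Multimedia"),
]


def fit(records):
    """Count class frequencies and per-attribute value frequencies per class."""
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    domains = defaultdict(set)           # attribute index -> observed values
    for features, label in records:
        for i, value in enumerate(features):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def posterior(model, features, alpha=1.0):
    """Return P(Ci | X) for every class Ci: P(X | Ci) * P(Ci) / P(X)."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(Ci)
        for i, value in enumerate(features):
            count = value_counts[(i, label)][value]
            # Smoothed likelihood of one attribute value given the class;
            # multiplying them relies on the independence assumption.
            score *= (count + alpha) / (n_label + alpha * len(domains[i]))
        scores[label] = score
    evidence = sum(scores.values())  # P(X), the normalizing constant
    return {label: s / evidence for label, s in scores.items()}


model = fit(TRAIN)
probs = posterior(model, ("Female", "Regular", ">=3"))
prediction = max(probs, key=probs.get)
print(prediction, probs)
```

The predicted class is simply the one with the largest posterior; because P(X) is the same for every class, it only rescales the scores so that they sum to 1.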

Decision Tree J48
Decision Tree is one of the most intuitive and popular data mining methods, especially because it provides explicit rules for classification and handles heterogeneous data well. The Decision Tree sits on the boundary between predictive and descriptive methods.
In classification, the Decision Tree technique is used to find criteria for dividing the individuals of a population into n specified classes (in many cases n = 2). It starts by selecting the variable that best separates the individuals of each class, producing subpopulations called nodes, each containing the largest possible proportion of individuals of a single class. The same operation is then repeated on each newly obtained node until no further separation of individuals is possible or desired, according to criteria that depend on the tree type. Figure 1 shows a Decision Tree: decision tree induction builds a flowchart-like structure in which each internal (non-leaf) node represents a test on an attribute, each branch corresponds to a test outcome, and each external (leaf) node indicates the predicted class. At each node, the algorithm selects the "best" attribute to partition the data into classes (Ye, 2013). Decision Tree J48 is the WEKA project team's implementation of the C4.5 algorithm.
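As an illustration of how a "best" attribute can be chosen at a node, the sketch below ranks attributes by gain ratio (information gain divided by split information), the criterion C4.5 is known for. It is a simplified sketch, not J48 itself: it omits pruning, C4.5's additional heuristics, and numeric attributes. The eight records are hypothetical and are not the study data.

```python
from collections import Counter
from math import log2

# Hypothetical records: (gender, class, GPA band, concentration); NOT the study data.
ROWS = [
    ("Male", "Regular", "<3", "Multimedia"),
    ("Female", "Regular", "<3", "Multimedia"),
    ("Male", "Non-Regular", "<3", "Multimedia"),
    ("Female", "Non-Regular", "<3", "Multimedia"),
    ("Female", "Regular", ">=3", "Computerized Accounting"),
    ("Female", "Non-Regular", ">=3", "Computerized Accounting"),
    ("Male", "Regular", ">=3", "Multimedia"),
    ("Male", "Non-Regular", ">=3", "Computer Administration"),
]
ATTRIBUTES = {"Gender": 0, "Class": 1, "GPA": 2}
TARGET = 3  # index of the class label


def entropy(values):
    """Shannon entropy, in bits, of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, attr_index):
    """Information gain of splitting on one attribute, divided by split information."""
    labels = [r[TARGET] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr_index], []).append(r[TARGET])
    # Expected entropy of the class label after the split.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([r[attr_index] for r in rows])
    return gain / split_info if split_info > 0 else 0.0


scores = {name: gain_ratio(ROWS, idx) for name, idx in ATTRIBUTES.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

On these toy records the GPA attribute separates the classes best, so it would be tested at the root; the same computation is then repeated on each resulting subset.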

Research Methods
This work tests data on students who chose a concentration in the fourth semester, obtained from the Academic Section and presented in the form of a table. The data are tested twice, once with Naïve Bayes and once with Decision Tree J48, using the machine learning tool WEKA.
The prediction made in this work determines the concentration chosen by a student who will study in the fourth semester, under the following conditions:
a. Gender: male or female.
b. Class: regular or non-regular.
c. GPA: the GPA in the third semester, in the range <3 or >=3.
These three conditions are used to predict which concentration a student will choose, namely Computer Administration, Computerized Accounting, or Multimedia, by learning from past events under various conditions. The data set in this work contains 111 records, evaluated in percentage split mode: 75% is used as training data to build the models and 25% as test data against which the established models are tested. The following is the concentration selection data used in each test, namely:

Results And Discussions
Prediction testing was conducted using two classification techniques, namely Naïve Bayes and Decision Tree J48. The results of testing on the test set are presented here. Figure 2 shows the result of testing the Naïve Bayes classification; Decision Tree J48 was tested by the same method.

Figure 2: Naïve Bayes Test Results
In Figure 2, the test results for the Naïve Bayes classification show an accuracy of 71.4%, which is the ratio of correct predictions to all instances in the test set, and a mean absolute error of 0.3264. The precision (proximity of the data) for the Multimedia class is 71.4%: of all students predicted to choose the Multimedia concentration, 71.4% actually chose it. The Computerized Accounting class has a precision of 76.9%: of all students predicted to choose Computerized Accounting, 76.9% actually chose it. The recall (return score) for the Multimedia class is 76.9%: of all students who actually chose Multimedia, 76.9% were predicted to choose it. The recall for the Computerized Accounting class is 90.9%: of all students who actually chose Computerized Accounting, 90.9% were predicted to choose it.
Figure 3 is the Naïve Bayes confusion matrix. The first row, "10 1 2", indicates that of the Multimedia instances in the test set, 10 were correctly predicted as Multimedia, 1 was incorrectly classified as Computer Administration, and 2 were incorrectly classified as Computerized Accounting. The second row, "3 0 1", indicates that of the Computer Administration instances, 3 were incorrectly classified as Multimedia and 1 was incorrectly classified as Computerized Accounting. The third row, "1 0 10", indicates that of the Computerized Accounting instances, 1 was incorrectly classified as Multimedia and 10 were correctly predicted as Computerized Accounting. Figure 4 shows the predicted results using the Naïve Bayes classification.
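The accuracy and the per-class percentages reported above can be recomputed from the three confusion matrix rows quoted in the text using the standard definitions of precision and recall. The short Python sketch below does this; the column order (Multimedia, Computer Administration, Computerized Accounting) is an assumption, chosen because it is the order that reproduces the reported figures, and the function names are illustrative.

```python
CLASSES = ["Multimedia", "Computer Administration", "Computerized Accounting"]

# Rows are actual classes, columns are predicted classes (values quoted in the text).
NAIVE_BAYES = [
    [10, 1, 2],
    [3, 0, 1],
    [1, 0, 10],
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all test instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Share of instances predicted as class k that really belong to class k."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0


def recall(matrix, k):
    """Share of instances that really belong to class k and were predicted as k."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0


print(f"accuracy = {accuracy(NAIVE_BAYES):.1%}")
for k, name in enumerate(CLASSES):
    p, r = precision(NAIVE_BAYES, k), recall(NAIVE_BAYES, k)
    print(f"{name}: precision = {p:.1%}, recall = {r:.1%}")
```

The diagonal sums to 20 of 28 instances, which gives the 71.4% accuracy. No Computer Administration instance is predicted correctly, so both of its values are zero under these definitions.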
Figure 5 shows the result of testing Decision Tree J48 against the test set, using the same method as for Naïve Bayes. The Decision Tree J48 classification has an accuracy of 64.3%, which is the ratio of correct predictions to all instances in the test set, with a mean absolute error of 0.3257.

Decision Tree J48
The precision (proximity of the data) for the Multimedia class is 100%: every student predicted to choose the Multimedia concentration actually chose it. The Computerized Accounting class has a precision of 76.9%: of all students predicted to choose Computerized Accounting, 76.9% actually chose it. The Computer Administration class has a precision of 30%: of all students predicted to choose Computer Administration, only 30% actually chose it. The recall (return score) for the Multimedia class is 38.5%: of all students who actually chose Multimedia, 38.5% were predicted to choose it. The recall for the Computer Administration class is 75%: of all students who actually chose Computer Administration, 75% were predicted to choose it. The recall for the Computerized Accounting class is 90.9%: of all students who actually chose Computerized Accounting, 90.9% were predicted to choose it. Figure 6 shows the predicted results using the Decision Tree J48 classification.
In the confusion matrix for the Decision Tree J48 classification in Figure 6, the first row, "5 6 2", indicates that of the Multimedia instances in the test set, 5 were correctly predicted as Multimedia, 6 were incorrectly classified as Computer Administration, and 2 were incorrectly classified as Computerized Accounting. The second row, "0 3 1", indicates that of the Computer Administration instances, 3 were correctly predicted as Computer Administration and 1 was incorrectly classified as Computerized Accounting. The third row, "0 1 10", indicates that of the Computerized Accounting instances, 1 was incorrectly classified as Computer Administration and 10 were correctly predicted as Computerized Accounting. Figure 7 shows the predicted results for the 28 test instances using the Decision Tree J48 classification. Figure 8 shows the visualization of the tree formed by the Decision Tree J48 classification model.

Figure 8: Tree Visualization Results
Based on the tree visualization in Figure 8, the patterns or rules formed by the Decision Tree classification are as follows:
1. IF "GPA >= 3" AND "Gender = Male" AND "Class = Regular" THEN "Multimedia"
2. IF "GPA >= 3" AND "Gender = Male" AND "Class = Non-Regular" THEN "Computer Administration"
3. IF "GPA >= 3" AND "Gender = Female" THEN "Computerized Accounting"
4. IF "GPA < 3" THEN "Multimedia"
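The four rules can be written directly as a small function, which is one way the academic section could apply them to a new student. This is only a sketch: the function name, the parameter names, and the exact string values are assumptions introduced for illustration, and the function assumes that gender is either "Male" or "Female" and class is either "Regular" or "Non-Regular".

```python
def predict_concentration(gpa, gender, student_class):
    """Apply the four J48 rules listed above to one student record."""
    if gpa < 3:
        return "Multimedia"               # IF GPA < 3
    if gender == "Female":
        return "Computerized Accounting"  # IF GPA >= 3 AND Gender = Female
    if student_class == "Regular":
        return "Multimedia"               # IF GPA >= 3 AND Male AND Regular
    return "Computer Administration"      # IF GPA >= 3 AND Male AND Non-Regular


print(predict_concentration(3.4, "Male", "Non-Regular"))
```

Because GPA is tested first, students with a GPA below 3 are assigned to Multimedia regardless of gender or class, exactly as in the fourth rule.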

Model Evaluation
After analyzing the results, Table 2 shows the difference between the two algorithms when tested against the data set. Table 3 shows the differences in the average precision (proximity of the data), the average recall (return value), and the time required by the classification process. According to Table 3, the average precision of Naïve Bayes is 63.4%, while that of Decision Tree J48 is higher at 81%; this is the average percentage of students who actually chose a concentration among all students predicted to choose it. The average recall of Naïve Bayes is 71.4%, while that of Decision Tree J48 is lower at 64.3%; this is the average percentage of students predicted to choose a concentration among all students who actually chose it. Building the model takes 0 seconds for Naïve Bayes and 0.02 seconds for Decision Tree J48.
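The averages above are consistent with averaging each class's value weighted by the number of test instances that actually belong to that class. The sketch below recomputes them from the two confusion matrices quoted in the Results section. The weighting scheme is an assumption, adopted because it reproduces the reported figures; the variable and function names are illustrative.

```python
# Rows are actual classes, columns are predicted classes, in the order
# Multimedia, Computer Administration, Computerized Accounting.
NAIVE_BAYES = [[10, 1, 2], [3, 0, 1], [1, 0, 10]]
J48 = [[5, 6, 2], [0, 3, 1], [0, 1, 10]]


def precision(matrix, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0


def recall(matrix, k):
    """Correct predictions of class k divided by all actual instances of class k."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0


def weighted_average(matrix, metric):
    """Average a per-class metric, weighting each class by its actual instance count."""
    total = sum(sum(row) for row in matrix)
    return sum(sum(matrix[k]) / total * metric(matrix, k) for k in range(len(matrix)))


for name, matrix in [("Naive Bayes", NAIVE_BAYES), ("J48", J48)]:
    p = weighted_average(matrix, precision)
    r = weighted_average(matrix, recall)
    print(f"{name}: average precision = {p:.1%}, average recall = {r:.1%}")
```

With this weighting the average recall always equals the overall accuracy, which is why the recall figures of 71.4% and 64.3% match the accuracies reported for the two classifiers. The J48 precision comes out at about 80.9%, which the text reports rounded to 81%.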

Conclusion
Based on the test results using Naïve Bayes and Decision Tree J48 in percentage split mode on the same data set, the following conclusions can be drawn:
1. Decision Tree J48 forms 4 patterns or rules for determining concentration selection, so the academic section can assist students in choosing a concentration.
2. Naïve Bayes obtains the highest accuracy, 71.4%, with 20 of the 28 test instances classified correctly, while the Decision Tree J48 classification has a lower accuracy of 64.3%, with 18 of the 28 test instances classified correctly. The mean absolute error of the Decision Tree J48 classification is nevertheless lower (0.3257 against 0.3264). The smaller the mean absolute error, the better the classification model.