Tuesday, June 4, 2019

Analysis of Attribution Selection Techniques

ABSTRACT

Data mining techniques are the techniques used in the knowledge management process to discover significant knowledge from a large amount of information. Data mining is a form of knowledge discovery necessary for solving problems in a specific domain. Classification is the technique used to detect the classes of unknown data. Neural networks, rule-based methods, decision trees and Bayesian methods are some of the existing methods used for classification. Irrelevant attributes need to be filtered out before any mining technique is applied. Embedded, wrapper and filter techniques are the feature selection techniques used for this filtering. In this paper, we discuss the attribute selection techniques Fuzzy Rough Subset Evaluation and Information Gain Subset Evaluation for selecting attributes from a large number of attributes. As search methods, Best First search is used for fuzzy rough subset evaluation and the Ranker method is applied for information gain evaluation. The decision tree classification techniques ID3 and J48 are used for classification. These techniques are analysed on the Heart Disease data set, and from the results we conclude which technique is best for attribute selection.

1. INTRODUCTION

As the world grows in complexity, overwhelming us with the data it generates, data mining becomes the only hope for elucidating the patterns that underlie it. Manual data analysis becomes tedious as the size of the data grows and the number of dimensions increases, so the process of data analysis needs to be computerised. The term Knowledge Discovery from Data (KDD) refers to the automated process of knowledge discovery from databases.
The KDD process comprises many steps, namely data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation and knowledge presentation. Data mining is one step in the whole process of knowledge discovery and can be described as the process of extracting or mining knowledge from large amounts of data. It is a form of knowledge discovery essential for solving problems in a specific domain. Data mining can also be described as the non-trivial process that automatically collects useful hidden information from the data and presents it in forms such as rules, judgments, patterns and so on. The knowledge extracted by data mining allows the user to find interesting patterns and regularities deeply buried in the data, which helps in the process of decision making.

Data mining tasks can be broadly divided into two categories: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database. Predictive mining tasks perform inference on the current data in order to make predictions. According to their goals, mining tasks can be divided into four main types: class/concept description, association analysis, classification or prediction, and clustering analysis.

2. LITERATURE SURVEY

Data available for mining is raw data. It may be in different formats because it comes from different sources, and it may contain noisy data, irrelevant attributes, missing data and so on.
Data needs to be pre-processed before any kind of data mining algorithm is applied, using the following steps:

Data Integration: If the data to be mined comes from several different sources, it needs to be integrated, which involves removing inconsistencies in attribute names or attribute value names between the data sets of different sources.

Data Cleaning: This step may involve detecting and correcting errors in the data, filling in missing values, etc.

Discretization: When the data mining algorithm cannot cope with continuous attributes, discretization needs to be applied. This step consists of transforming a continuous attribute into a categorical attribute that takes only a few discrete values. Discretization often improves the understandability of the discovered knowledge.

Attribute Selection: Not all attributes are relevant, so attribute selection is required to select, from all the original attributes, a subset that is relevant for mining.

A Decision Tree Classifier consists of a decision tree generated on the basis of instances. The decision tree has two types of nodes: a) the root and the internal nodes, and b) the leaf nodes. The root and the internal nodes are associated with attributes; leaf nodes are associated with classes. Basically, each non-leaf node has an outgoing branch for each possible value of the attribute associated with the node. To determine the class of a new instance using a decision tree, successive internal nodes are visited, beginning with the root, until a leaf node is reached. At the root node and at each internal node, a test is applied. The outcome of the test determines the branch traversed and the next node visited. The class of the instance is the class of the final leaf node.

3. FEATURE SELECTION

Many irrelevant attributes may be present in the data to be mined, so they need to be removed. Also, many mining algorithms do not perform well with large numbers of features or attributes.
Therefore, feature selection techniques need to be applied before any kind of mining algorithm is applied. The main objectives of feature selection are to avoid overfitting, to improve model performance, and to provide faster and more cost-effective models. The selection of optimal features adds an extra layer of complexity to the modelling: instead of just finding optimal parameters for the full set of features, an optimal feature subset has to be found first and the model parameters then have to be optimised.

Attribute selection methods can be broadly divided into filter and wrapper approaches. In the filter approach, the attribute selection method is independent of the data mining algorithm to be applied to the selected attributes, and it assesses the relevance of features by looking only at the intrinsic properties of the data. In most cases a feature relevance score is calculated, and low-scoring features are removed. The subset of features left after feature removal is presented as input to the classification algorithm. The advantages of filter techniques are that they scale easily to high-dimensional data sets, are computationally simple and fast, and are independent of the mining algorithm, so feature selection needs to be performed only once, after which different classifiers can be evaluated.

4. ROUGH SETS

Any set of all indiscernible (similar) objects is called an elementary set. Any union of some elementary sets is referred to as a crisp or precise set; otherwise the set is rough (imprecise, vague). Each rough set has boundary-line cases, i.e., objects which cannot be classified with certainty, by employing the available knowledge, as members of the set or of its complement. Obviously rough sets, in contrast to precise sets, cannot be characterized in terms of information about their elements. With any rough set a pair of precise sets, called the lower and the upper approximation of the rough set, is associated.
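As a rough illustration of these two approximations, the sketch below groups objects into elementary sets and then collects the elementary sets that lie entirely inside a target set (the lower approximation) and those that overlap it at all (the upper approximation). The patient table, the attribute names and the target set are hypothetical toy values chosen for illustration; they are not taken from the Heart Disease data set.

```python
from collections import defaultdict


def elementary_sets(objects, attributes):
    """Group object ids whose values agree on every chosen attribute."""
    groups = defaultdict(set)
    for obj_id, values in objects.items():
        key = tuple(values[a] for a in attributes)
        groups[key].add(obj_id)
    return list(groups.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper, boundary) sets for the target set."""
    lower, upper = set(), set()
    for block in elementary_sets(objects, attributes):
        if block <= target:   # every object of the block is in the target
            lower |= block
        if block & target:    # at least one object of the block is in the target
            upper |= block
    return lower, upper, upper - lower


# Hypothetical toy table: six patients described by two attributes.
patients = {
    "p1": {"chest_pain": "yes", "high_bp": "yes"},
    "p2": {"chest_pain": "yes", "high_bp": "yes"},
    "p3": {"chest_pain": "yes", "high_bp": "no"},
    "p4": {"chest_pain": "yes", "high_bp": "no"},
    "p5": {"chest_pain": "no", "high_bp": "no"},
    "p6": {"chest_pain": "no", "high_bp": "yes"},
}
diseased = {"p1", "p2", "p3"}  # hypothetical target set

lower, upper, boundary = approximations(
    patients, ["chest_pain", "high_bp"], diseased
)
print(sorted(lower))     # ['p1', 'p2']
print(sorted(upper))     # ['p1', 'p2', 'p3', 'p4']
print(sorted(boundary))  # ['p3', 'p4']
```

In this toy case p3 and p4 are indiscernible on the two attributes but only p3 is in the target set, so both fall into the boundary region and the target set is rough with respect to these attributes.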
The lower approximation consists of all objects which surely belong to the set, and the upper approximation contains all objects which possibly belong to the set. The difference between the upper and the lower approximation constitutes the boundary region of the rough set. The rough set approach to data analysis has many important advantages: it provides efficient algorithms for finding hidden patterns in data, identifies relationships that would not be found using statistical methods, allows both qualitative and quantitative data, finds minimal sets of data (data reduction), evaluates the significance of data, and is easy to understand.

5. ID3 DECISION TREE ALGORITHM

A decision tree is a predictive machine-learning model that determines the dependent variable (target value) of a new sample from the different attribute values of the available data. The internal nodes of a decision tree denote the attributes observed in the samples, the branches between the nodes show the possible values of these attributes, and the terminal nodes give the final classification value of the dependent variable. Here this type of decision tree is used for a large data set from the telecommunication industry. In the data set, the dependent variable is the attribute that has to be predicted; its value depends on, and is decided by, the values of all the other attributes. The independent variables are the attributes that predict the value of the dependent variable.

The J48 decision tree classifier follows a simple algorithm. To classify a new item, it first constructs a decision tree from the attribute values of the available training data. Whenever it encounters a set of items (a training set), it identifies the attribute that separates the various instances most clearly. This feature, which tells the most about the data instances so that they can be classified best, is said to have the highest information gain.
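The information gain criterion mentioned here can be sketched as follows: the gain of an attribute is the entropy of the class labels minus the expected entropy after splitting on that attribute, and sorting attributes by this score gives the kind of ordering a ranking filter produces. This is a minimal sketch for nominal attributes only, not WEKA's implementation; the function names and the toy records are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(rows, attribute, target):
    """Entropy of the target minus its expected entropy after the split."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows, attributes, target):
    """Sort attributes by descending information gain."""
    gains = {a: information_gain(rows, a, target) for a in attributes}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy records; not taken from the Heart Disease data set.
rows = [
    {"chest_pain": "yes", "smoker": "yes", "disease": "present"},
    {"chest_pain": "yes", "smoker": "no", "disease": "present"},
    {"chest_pain": "no", "smoker": "yes", "disease": "absent"},
    {"chest_pain": "no", "smoker": "no", "disease": "absent"},
]

ranking = rank_attributes(rows, ["chest_pain", "smoker"], "disease")
for name, gain in ranking:
    print(f"{name}: {gain:.3f}")
# chest_pain: 1.000
# smoker: 0.000
```

In the toy records chest_pain separates the two classes perfectly, so it receives the full one bit of gain, while smoker leaves the class distribution unchanged and receives none. A tree builder in the ID3 style would therefore split on chest_pain first.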
We can assign or predict the target value of a new instance by checking all the respective attributes and their values.

6. J48 DECISION TREE TECHNIQUE

J48 is an open source Java implementation of the C4.5 algorithm in the Weka data mining tool. C4.5 is a program that creates a decision tree based on a set of labeled input data. This algorithm was developed by Ross Quinlan. The decision trees generated by C4.5 can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier.

7. IMPLEMENTATION MODEL

WEKA is a collection of machine learning algorithms for data mining tasks. It contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. For our project the classification tools were used. There was no preprocessing of the data. WEKA has four different modes to work in:

Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands.

Explorer: an environment for exploring data with WEKA.

Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.

Knowledge Flow: presents a data-flow inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data.

For most of the tests, which will be explained in more detail later, the Explorer mode of WEKA is used. But because of the size of some data sets, there was not enough memory to run all the tests this way. Therefore the tests for the larger data sets were executed in the Simple CLI mode to save working memory.

8. IMPLEMENTATION RESULT

The attributes selected by Fuzzy Rough Subset Evaluation using the Best First search method and by Information Gain Subset Evaluation using the Ranker method are as follows.

8.1 Fuzzy Rough Subset Using Best First Search Method

=== Attribute Selection on all input data ===

Search Method:
    Best first.
    Start set: no attributes
    Search direction: forward
    Stale search after 5 node expansions
    Total number of subsets evaluated: 90
    Merit of best subset found: 1

Attribute Subset Evaluator (supervised, Class (nominal): 14 class):
    Fuzzy rough feature selection
    Method: Weak gamma
    Similarity measure: max(min( (a(y)-(a(x)-sigma_a)) / (a(x)-(a(x)-sigma_a)),((a(x)+sigma_a)-a(y)) / ((a(x)+sigma_a)-a(x)) , 0).
    Decision similarity: Equivalence
    Implicator: Lukasiewicz
    T-Norm: Lukasiewicz
    Relation composition: Lukasiewicz
    (S-Norm: Lukasiewicz)
    Dataset consistency: 1.0

Selected attributes: 1,3,4,5,8,10,12 : 7
    0
    2
    3
    4
    7
    9
    11

8.2 Info Gain Subset Evaluation Using Ranker Search Method

=== Attribute Selection on all input data ===

Search Method:
    Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 14 class):
    Information Gain Ranking Filter

Ranked attributes:
    0.208556   13  12
    0.192202    3  2
    0.175278   12  11
    0.129915    9  8
    0.12028     8  7
    0.119648   10  9
    0.111153   11  10
    0.066896    2  1
    0.056726    1  0
    0.024152    7  6
    0.000193    6  5
    0           4  3
    0           5  4

Selected attributes: 13,3,12,9,8,10,11,2,1,7,6,4,5 : 13

8.3 ID3 Classification Result for 14 Attributes

Correctly Classified Instances          266     98.5185 %
Incorrectly Classified Instances          4      1.4815 %
Kappa statistic                           0.9699
Mean absolute error                       0.0183
Root mean squared error                   0.0956
Relative absolute error                   3.6997 %
Root relative squared error              19.2354 %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       52.2222 %
Total Number of Instances               270

8.4 J48 Classification Result for 14 Attributes

Correctly Classified Instances          239     88.5185 %
Incorrectly Classified Instances         31     11.4815 %
Kappa statistic                           0.7653
Mean absolute error                       0.1908
Root mean squared error                   0.3088
Relative absolute error                  38.6242 %
Root relative squared error              62.1512 %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       92.2222 %
Total Number of Instances               270

8.5 ID3 Classification Result for Selected Attributes Using Fuzzy Rough Subset Evaluation

Correctly Classified Instances          270    100      %
Incorrectly Classified Instances          0      0      %
Kappa statistic                           1
Mean absolute error                       0
Root mean squared error                   0
Relative absolute error                   0      %
Root relative squared error               0      %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       25      %
Total Number of Instances               270

8.6 J48 Classification Result for Selected Attributes Using Fuzzy Rough Subset Evaluation

Correctly Classified Instances          160     59.2593 %
Incorrectly Classified Instances        110     40.7407 %
Kappa statistic                           0
Mean absolute error                       0.2914
Root mean squared error                   0.3817
Relative absolute error                  99.5829 %
Root relative squared error              99.9969 %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)      100      %
Total Number of Instances               270

8.7 ID3 Classification Result for Information Gain Subset Evaluation Using Ranker Method

Correctly Classified Instances          270    100      %
Incorrectly Classified Instances          0      0      %
Kappa statistic                           1
Mean absolute error                       0
Root mean squared error                   0
Relative absolute error                   0      %
Root relative squared error               0      %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       33.3333 %
Total Number of Instances               270

8.8 J48 Classification Result for Information Gain Subset Evaluation Using Ranker Method

Correctly Classified Instances          165     61.1111 %
Incorrectly Classified Instances        105     38.8889 %
Kappa statistic                           0.3025
Mean absolute error                       0.31
Root mean squared error                   0.3937
Relative absolute error                  87.1586 %
Root relative squared error              93.4871 %
Coverage of cases (0.95 level)          100      %
Mean rel. region size (0.95 level)       89.2593 %
Total Number of Instances               270

CONCLUSION

The implementation results above show that Fuzzy Rough Subset Evaluation selects fewer attributes (7) than Information Gain Subset Evaluation (13). For the given data set, the J48 decision tree classifier gives an approximate error rate with Fuzzy Rough Subset Evaluation, in contrast to the ID3 decision tree technique, which reports no errors for either evaluation technique. We therefore conclude that, for selecting attributes, the fuzzy technique gives the better result using the Best First search method and the J48 classification method.
