For all search strategies, results are available for WRAcc and |\(z\)-score|; for the beam search, all measures in Table 3 were used. Still, a separation along dimensions is instrumental when analysing the results. By design of the SDMM algorithm, domain values never require sorting during search and are always in ascending order. So, besides analysing the effects of various parameter choices within individual dimensions, this work also offers a systematic analysis of the combined effects of these choices over all examined dimensions, as is relevant in real-world analyses.

This work considers an aspect of subgroup discovery that has been insufficiently addressed by the existing literature: numeric data. Essential is the ability to deal with numeric attributes, whether they concern the target (a regression setting) or the description attributes (by which subgroups are identified). Its local nature allows both identifying small parts of the data that behave exceptionally and fully capturing overlapping patterns of which global models can only identify part of the knowledge. Although numeric attributes have been the subject of a number of recent papers, this work will explain and empirically demonstrate that this coverage has been incomplete, and that superior results can be obtained by a more thorough treatment of numeric attributes. However, the fact that a single parameter choice would show markedly different behaviour in the classification and regression target type settings was unforeseen.

Regression Target. Again, binaries is the clearly preferred choice in all experiments (Tables 14 and 16). The average ranks follow these patterns. Regarding U-scores in Table 16, there are 26 wins for local (13 significant), 1 tie, and 26 wins for global (14 significant). The few differences that do occur are small, such that they do not change the overall findings. Also note that for the ionosphere dataset, binaries qualities are sometimes twice as high as those of nominal. In this context, 3-lbca is pitted against 7-lnca.

It exploits properties of convex and additive quality measures in order to compute the bounded interval that maximises the quality for the target in linear time. This suggests that the very heuristic best selection method is a capable alternative to all when considering result quality, often coming within 1% of the all result. A cursory analysis of average top-10 scores for different datasets and strategies is presented in Fig. 1, which confirms this observation. Table 11 shows the run time statistics for the various settings.

Thus, all splits are binary: they involve either a numeric attribute or a synthetic binary attribute that is treated as numeric. First, the cut points are placed at smaller values, since, on average, the women in the dataset are less tall. As an example, for \(\mathbf {a} = [w,x,y,z]\) and \(B = 2\), line 4 produces \(\ge y\) and \(\le x\), for operators \(\ge \) and \(\le \), respectively.
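To ground this example, here is a minimal sketch of such threshold generation in Python; the function name `candidate_conditions` and the index arithmetic are illustrative assumptions, not the actual SDMM code:

```python
def candidate_conditions(a, B):
    """Equal-frequency thresholds over a sorted domain vector `a`,
    yielding B - 1 candidate cut points for >= and <= conditions."""
    n = len(a)
    geq, leq = [], []
    for j in range(1, B):        # B - 1 cut points
        idx = (j * n) // B       # first index past the j-th cut
        geq.append(a[idx])       # condition: attribute >= a[idx]
        leq.append(a[idx - 1])   # condition: attribute <= a[idx - 1]
    return geq, leq

# For a = [w, x, y, z] and B = 2: >= uses y, <= uses x, as in the text.
print(candidate_conditions(["w", "x", "y", "z"], B=2))  # (['y'], ['x'])
```

The asymmetry mirrors the text: for \(\ge \) the threshold is the first value of the upper bin, for \(\le \) the last value of the lower bin, so both conditions select roughly the same number of records.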
https://doi.org/10.1007/978-3-030-10928-8_30. Boley M, Goldsmith BR, Ghiringhelli LM, Vreeken J (2017) Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery. In: Walsh T (ed) IJCAI 2011, International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, 16–22 July 2011, proceedings, IJCAI/AAAI, pp 1342–1347. Int J Intell Syst 7(7):649–673.

In the extreme case of \(U = 0\), all values from one distribution come before all values of the other distribution. Results for |deviation| and |\(t\)-statistic| are similar for all contexts. Some general trends are observed, but the key problem of choosing a good setting of B for all situations remains elusive, and the dataset characteristics listed in Tables 4 and 5 provide no guidance here. For binaries, it is tempting to think that a higher B would also result in smaller subgroups. The wins of coarse are interesting, as one might expect that the search space of coarse is a subset of that of fine. Run times for the CBSS setting are much worse, with more than 10% of the experiments not completing within a minute, and some taking much longer.

Here, \(a^i_1, \ldots , a^i_l\) are the values of \(\mathbf {r}^i\) for the description attributes \(a_1, \ldots , a_l\), and \(t^i\) is the value for the target attribute t. In general, both description and target attributes can be taken from an unrestricted domain \(\mathbb {A}\). Four dimensions relevant to dealing with numeric description attributes are identified, leading to a theoretical sixteen possible configurations.

A more general finding is that global might be better than local at depth 1, but that the latter becomes increasingly better at greater search depths, proving that its flexibility is useful. In order to test any potential differences under complete search strategies, the experiments were also executed with a complete version of the SD algorithm. Results are based on the symbols in the Wins columns, and count the number of left-pointing triangles (\(\vartriangleleft \), \(\blacktriangleleft \)) when a strategy is on the left, and the number of right-pointing triangles (\(\vartriangleright \), \(\blacktriangleright \)) when a strategy is on the right, of strategy pairs in the Mann–Whitney U tables. When valid is not equal to the number of datasets, it indicates that, in some experimental settings, not enough subgroups are found to create a top-10 ranking. The ranking under \(\mu _{1}\) is based on the qualities in Tables 7, 13, and 14, which are aggregated in Table 9. Here, a beam effect can be observed for dataset forestfires at depth 3 for the first context, and at depth 4 for the first and third contexts. Strategy 17-lxfb was included in the experiments because it produces optimal results at depth 1 (of which, at least, one is also a global optimum).
The relevant strategies are 2-lbfb and 4-lbcb. Pitting 7-lnca against 8-lncb. The local cut points differ in two important aspects from the global ones.

In: Džeroski S, Flach PA (eds) ILP-99, Inductive Logic Programming, Bled, Slovenia, 24–27 June 1999, proceedings, LNCS, vol 1634. Lemmerich F, Becker M, Atzmüller M (2012) Generic pattern trees for exhaustive exceptional model mining. https://doi.org/10.1145/253260.253325. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. https://doi.org/10.1109/ICDM.2010.53. Duivesteijn W, Feelders A, Knobbe A (2012) Different slopes for different folks: mining for exceptional regression models with Cook's distance.

The symbols indicate that the left strategy is: better for all datasets, with all results significant (\(\blacktriangleleft \)); better for all datasets, but with not all results significant (\(\vartriangleleft \)); or better overall, but not better for all datasets. Looking at the correlation between the average ranks for the two search strategies, this reverse trend is indeed confirmed. Classification Target. As Tables 13 and 15 demonstrate, the binaries setting consistently outperforms the nominal setting in all relevant contexts. For both target types, nominal strategies rank at the bottom of the list. Concerning the \(\mu _1\) and \(U_{10}\) rankings in the classification setting, the Spearman's rank correlations with WRAcc are \(\rho = 0.921\) and \(\rho = 0.875\) for binomial, and \(\rho = 0.827\) and \(\rho = 0.954\) for lift, respectively.

Subgroup Discovery is a local, supervised, descriptive, pattern mining paradigm. (Quality Measure) A quality measure is a function \(\varphi _{\mathcal {D}_{}}: \mathcal {I} \rightarrow \mathbb {R}\) that assigns a unique numeric value to a description \(I\) and its associated extension \(\mathcal {E}_{I}\), given a dataset \(\mathcal {D}_{}\). Sections 5.2.3–5.2.7 each focus on a different dimension, but all follow a similar setup. Section 6 presents general conclusions and lists future work.

The three conceptual bins can be represented in two alternative ways: a single nominal attribute to represent three bins of size roughly 13/3 (column \(\text {height}_n\)), or two binary columns, one representing the low bin and one representing the low/medium bin (columns \(\text {height}_l\) and \(\text {height}_{lm}\)). A more flexible alternative is to (conceptually) translate the \(B{-}1\) cut points into \(B{-}1\) binary features, each corresponding to a binary split on the respective cut point, thus allowing a wide range of sizes.
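The two representations can be made concrete with a small sketch; the height values are invented for illustration, and the cut-point choice follows the equal-frequency scheme assumed earlier:

```python
import numpy as np

heights = np.array([1.55, 1.58, 1.60, 1.62, 1.65, 1.68, 1.70,
                    1.72, 1.75, 1.78, 1.80, 1.83, 1.90])  # 13 records
B = 3
s = np.sort(heights)
n = len(heights)
# upper bounds of the cumulative bins: roughly n/B and 2n/B records each
cuts = [s[(j * n) // B - 1] for j in range(1, B)]

# Nominal representation: one column with three labels; a condition on it
# always selects a single bin of size roughly n/B.
height_n = np.searchsorted(cuts, heights, side="left")

# Binaries representation: B - 1 binary columns on the same cut points; now
# a condition can also select the low/medium bin of roughly 2n/B records.
height_l = heights <= cuts[0]
height_lm = heights <= cuts[1]

print(cuts)                             # [1.62, 1.72]
print(height_n.tolist())                # bin labels 0, 1, 2
print(height_l.sum(), height_lm.sum())  # 4 8
```

The binaries columns preserve the order information that a nominal recoding destroys: selecting "low or medium" is a single condition on \(\text {height}_{lm}\), while the nominal column would need a disjunction.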
Conclusion. The relative order and performance of strategies is very consistent, both over all quality measures and whether determined by the best individual subgroup or by the ranking of the set of top subgroups.

This section provides details on the run times incurred by the above experiments. At depths 2, 3, and 4, local outperforms its non-dynamic counterpart 15 out of 18 times (7 significant) in the first context. The tables also list Friedman F values, which are relevant for the critical difference diagrams in Fig. 4. While the findings are comparable between beam and heuristic search, very different conclusions can be drawn from the CBSS experiments. As such, no reason can be given for why other strategies surpass 17-lxfb for some datasets, and not others. Not only does the best number differ per strategy; for a single strategy it can differ over quality measures, even for the same target type. In contrast, the binaries strategies work better using a higher setting of B. Additionally, the rankings of strategies at depth 1 and at greater depths are only moderately correlated, such that findings for both depths 1 and 2 are reported. This compares 4-lbcb with 10-gbfb, here using few values for 10-gbfb.

Conclusion. The various numeric strategies behave as expected, with the extensive 1-lbfa strategy yielding the longest run times. There is hardly a difference in computation time, and especially in the regression setting, 3-lbca performs really well. A richer description language, and more extensive evaluation within a level, yield better subgroups at lower depths. In the context of SD, this is irrelevant, as only records with favourable target values (positives and numeric extremes) are of interest, and their distribution is often unbalanced.

Strategy 17-lxfb results from the work of Mampaey et al. And although some of the other strategies are able to attain the same quality score, 17-lxfb has the best average rank. So, although the selection of the cut points from the description attribute domain is unsupervised, this information about the target is always taken into account when generating refinements. Essentially two areas exist where the presence of numeric attributes requires attention: on the side of the target attribute(s) (in the case of a regression setting), and on the side of the description attributes (those attributes that are not targets, and are available to construct subgroups from). This description selects the whole dataset, as it poses no restriction on it. Noteworthy also is the behaviour of lift and |deviation|.

Data Min Knowl Disc 25(2):208–242. Springer, pp 272–285. Knowl Inf Syst 42(2):465–492. Springer briefs in computer science. In: Eliassi-Rad T, Ungar LH, Craven M, Gunopulos D (eds) KDD 2006, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 Aug 2006, proceedings. Data Min Knowl Disc 35:158–212 (2021).
The maximum search depth was 4 for the beam and CBSS search strategies, and 3 for complete search. Depth 1 results are always the same. Still, in every context, all outperforms best with respect to the qualities in Table 13. The 17,020 experiments were all performed using the SD tool Cortana (Meeng and Knobbe 2011), using the SDMM algorithm. Years of observation showed that the type of SD algorithm employed here rarely produces better results beyond the first three levels, but, to be safe, experiments are performed using search depth settings between 1 and 4. Section 5.5 evaluates to what extent result quality is influenced by the traditional beam search heuristic, compared to the complete setting.

Using the fact that \(U_2 = F_1 \cdot F_2 - U_1\), scores below 50 indicate that the first (left) strategy is better, while scores above 50 mean the second (right) strategy is better. First, consider the four strategies that combine binaries with local. As CBSS values diversity above accuracy of the subgroups, local, nominal, and coarse strategies tend to be preferred. But here the local variants of the binaries strategies perform better than the global variants. The first observation is that the general order over all depths does not differ much. For the first two, the average rank of all is better in all settings (above depth 1), and all wins about half of the comparisons.

https://doi.org/10.1007/s10115-013-0714-y. Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. In: Yang Q, Agarwal D, Pei J (eds) KDD 2012, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Beijing, China, 12–16 Aug 2012, proceedings, ACM, New York, NY, pp 868–876. Springer, Berlin, pp 288–303. Intell Data Anal 24(6). Meeng M, Duivesteijn W, Knobbe A (2014) ROCsearch: an ROC-guided search strategy for subgroup discovery. https://doi.org/10.1007/978-3-642-33486-3_18. Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. This is done in overview papers like Atzmüller (2015) and Herrera et al. (2011), as well as in papers that introduce new algorithms (Atzmüller and Lemmerich 2009; Atzmüller and Puppe 2006; Boley et al. 2008; Klösgen 1999; van Leeuwen and Knobbe 2012; Lemmerich et al. 2014; Wrobel 1997).

Therefore, the discussion starts with the simplest, the dimension interval type, with the two possible values binaries and nominal. Although this approach is very popular, due to its straightforward discretisation and ease of interpretation, it also has fundamental limitations. And while discretisation is an important, if not crucial, tool in SD, because an unequivocal subgroup description requires clear boundaries in the continuous domain, the question is at what stage in the analysis this switch from continuous to discrete is best made. This work argues and demonstrates that dynamic, or local, discretisation, in other words thresholding the data while performing the search for subgroups, is generally preferable over pre-discretisation (global). These could be supervised techniques for classification targets (Fayyad and Irani 1993), or single (Kontkanen and Myllymäki 2007) or multidimensional MDL-based methods (Nguyen et al.). This section compares options local and global of dimension discretisation timing. The cut points mentioned here are relevant for a global discretisation (or for subgroups at depth 1). The local alternative determines suitable cut points dynamically whenever a numeric attribute is encountered while mining.
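To make the timing dimension concrete, a minimal sketch under the assumption of equal-frequency cut points; the function names are illustrative and not Cortana's API:

```python
def cut_points(values, B):
    """Equal-frequency cut points for a multiset of numeric values."""
    s = sorted(values)
    return [s[(j * len(s)) // B] for j in range(1, B)]

def global_thresholds(column, B):
    # Computed once, before the search; every refinement reuses them.
    return cut_points(column, B)

def local_thresholds(column, covered, B):
    # Recomputed per candidate subgroup, on the covered records only,
    # so the resolution adapts to the data under investigation.
    return cut_points([v for v, c in zip(column, covered) if c], B)

ages = [21, 23, 25, 30, 34, 41, 45, 52, 58, 63]
print(global_thresholds(ages, B=2))          # [41]
covered = [v < 41 for v in ages]             # a parent subgroup
print(local_thresholds(ages, covered, B=2))  # [25]
```

The point is only that local recomputes thresholds on the covered records, so its cut points shift and tighten as the search descends, exactly the behaviour the experiments below evaluate.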
This work evaluates one variant of Diverse Subgroup Set Discovery (van Leeuwen and Knobbe 2011, 2012) called cover-based subgroup selection (CBSS). A general observation is that, basically, only strategies 1-lbfa and 3-lbca benefit from complete search, though marginally. However, for the remainder of this work, s denotes a subgroup, encompassing both its intension and extension. The nominal strategies, which perform poorly when ranked on quality, now rank high. Therefore, thorough experimental analysis is required that gauges combined effects. Qualities are equal 26 times, and better for best 35 times.

The experiments have shown that better performance is achieved by those settings that finely tune the placement of the threshold (for example, fine and local). An undesirable side effect of this type of discretisation is that the size of candidate subgroups is too directly tied to the granularity of the discretisation. As a result, a heuristic beam search, using a high setting of B, might actually be more extensive, and produce better results, than a complete search using fewer bins. While the primary focus of this work is on the quality of results of various strategies, run times are also relevant when choosing a practical strategy for a given dataset. As such, it is of value for those seeking information guiding an informed choice regarding the analysis of real-world data.

Data Min Knowl Disc 19(2):210–226. Springer, Berlin. https://doi.org/10.1007/s10618-009-0136-3. Grosskreutz H, Rüping S, Wrobel S (2008) Tight optimistic estimates for fast subgroup discovery.

Description Language. A description language in SD determines the nature of the descriptions it will consider and report. For each description \(I\) in the description language \(\mathcal {I}\), a quality measure is a function that quantifies how interesting the subgroup s is. This section presents the various aspects of this generic algorithm, including its description language, the discretisation algorithm, its search strategies, and the quality measures. These strategies differ along a number of dimensions, and experiments were performed to gain insights into the effects of different options within these dimensions. The usefulness of each strategy is evaluated for discrete (classification) and numeric (regression) target attributes, using multiple quality measures and search strategies, and results are compared based on both subgroup quality and redundancy. The procedure for computing the critical distance is outlined by Demšar (2006), and only its relevant details are provided here. The algorithm presented below is slightly more strict, as it requires that fewer records in the dataset are covered after adding a new condition to a conjunction, such that it selects a proper subset.
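Separate from that refinement rule, the covering idea behind CBSS can be illustrated with a hedged sketch in the spirit of van Leeuwen and Knobbe (2012); the multiplicative weight `alpha` and the exact score are assumptions of this sketch, and the variant that weights only positive records (discussed later) is omitted:

```python
def cbss_select(candidates, k, alpha=0.5):
    """Greedy cover-based subgroup selection (sketch).

    `candidates` is a list of (quality, cover) pairs, where `cover` is the
    set of record indices the subgroup selects. Each round, a candidate's
    quality is weighted by alpha**c for every record already covered c
    times, so heavily overlapping subgroups lose out to diverse ones."""
    counts = {}                  # record index -> times covered so far
    selected, pool = [], list(candidates)
    for _ in range(min(k, len(pool))):
        def score(cand):
            quality, cover = cand
            weight = sum(alpha ** counts.get(i, 0) for i in cover) / len(cover)
            return quality * weight
        best = max(pool, key=score)
        pool.remove(best)
        selected.append(best)
        for i in best[1]:
            counts[i] = counts.get(i, 0) + 1
    return selected

cands = [(0.30, {0, 1, 2, 3}), (0.29, {0, 1, 2, 4}), (0.20, {5, 6, 7})]
print(cbss_select(cands, k=2))  # the diverse 0.20 beats the overlapping 0.29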
For any particular subgroup s, with extension \(\mathcal {E}_{}\), n denotes its size, that is, the number of records in that subgroup: \(n=|\mathcal {E}_{}|\). Small descriptions, selecting large subsets of the data, are refined by adding conditions on individual attributes to form more extensive descriptions, selecting a subset of the parent subgroup, or seed. Furthermore, Sect. 5.3 combines the findings for all measures.

In: Pei J, Tseng VS, Cao L, Motoda H, Xu G (eds) PAKDD 2013, Pacific-Asia Conference on Knowledge Discovery and Data Mining, Gold Coast, Australia, 14–17 Apr 2013, proceedings, part I, LNCS, vol 7818.

Interestingly, out of the 28 differences, 24 involved 3-lbca, and another 3 involved the equivalent 4-lbcb experiment using the same number of bins.
In the regression setting, the CBSS entropy was higher for 44% of the experiments, and equal for 53%. The bulk of the work in this paper describes an experimental comparison of a considerable range of numeric strategies in SD, where these strategies are organised according to four central dimensions. First, it explains why entropy-based discretisation, with either overlapping or non-overlapping intervals, typically leads to suboptimal results. However, other drawbacks exist. This section contrasts options fine and coarse of dimension granularity. Based on the characteristics of these datasets, listed in Table 4, there is no common theme that binds them. In order to judge the relative merit of the various numeric strategies presented in an objective sense, a single metric by which to compare strategies is required.

All but one of the strategies are heuristics that naturally reduce redundancy, and these experiments gauge to what extent the (memory-wise and computationally far more demanding) CBSS technique has added benefit over a pure and straightforward SD approach. Redundancy might cause saturation, which is problematic when domain experts prefer a result set containing a limited but diverse set of subgroups, and when it limits search space exploration, and result quality, during a beam search. The experiments below use only equal-frequency discretisation, but evaluate it in both global and local discretisation contexts. This method is an anytime algorithm that, given enough time, enumerates the pattern space exhaustively. It can be interrupted at any time, while offering guarantees about how close the intermediate result is to the (theoretic) optimum. Section 5 contains the bulk of this work, with a series of experiments investigating the different settings empirically, as well as discussions about the results. Definition 2 covers the first.

Conclusion. There is no universal rule that guarantees a good number of bins, but nominal strategies prefer a lower number than binaries strategies. The algorithm takes a few inputs. So far, experiments focussed on result quality. As it is easier to achieve a smaller dispersion using small subgroups, this measure generally favours such subgroups. For other operators, they are the right (upper) bounds, as used in Mampaey et al. (2012). For regression targets, local performed clearly better at greater depth in the first binaries context, and should be the preferred choice. These experiments are furthermore repeated for both the classification task (target is nominal) and regression task (target is numeric), and the strategies are compared based on the quality of the top subgroup, and the quality and redundancy of the top-k result set.
Here, the term complete is used, instead of the more common term exhaustive, since the exhaustive alternative to beam search is often combined with heuristic numeric strategies, that include (local) discretisation or the selection method best. With respect to subgroup redundancy, complete search is compared to both a traditional beam search and a specialised redundancy-reducing beam search strategy. The detailed analyses so far mainly revolved around the popular and well-established quality measures WRAcc and |\(z\)-score|. This concerns strategies 1-lbfa and 3-lbca. The choice between these two settings is relevant in the following contexts: local with coarse and all; local with binaries and coarse; local with binaries and best; and global with coarse and all.

In: Blockeel H, Kersting K, Nijssen S, Železný F (eds) ECML PKDD 2013, European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, Prague, Czech Republic, 23–27 Sept 2013, proceedings, part III, LNCS, vol 8190. Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA. Mampaey M, Nijssen S, Feelders A, Knobbe A (2012) Efficient algorithms for finding richer subgroup descriptions in numeric and nominal data. In: Raś ZW, Skowron A (eds) ISMIS 1999, International Symposium on Methodologies for Intelligent Systems, Warsaw, Poland, 8–11 June 1999, proceedings, LNCS, vol 1609.

Demonstration of the development of joint entropy, shown on the y-axis, at increasing depth, for datasets adult and abalone, for classification and regression settings, respectively.

When comparing two strategies, all scores of their result sets \(\mathcal {F}_1\) and \(\mathcal {F}_2\), of size \(F_1\) and \(F_2\) respectively, are put together, sorted, and assigned combined ranks. Then comes a group of binaries strategies, first those combined with coarse and all, then those with coarse and best. Classification Target. Without exception, fine is the better option. The figures furthermore provide information about the significance of differences in average rank, by means of critical difference (CD) indicator bars. Black bars across the CD plot help make this call for different pairs of strategies. Here, all has a better average rank 20 out of 24 times, and a better quality for 83 out of 144 results. Finally, an extra strategy is added: the BestInterval algorithm introduced by Mampaey et al. (2012). This is an original approach to SD, in that it poses very little restriction on the numeric description attributes.

That is, the complete set of cut points obtained when creating B half-bounded intervals will occur in the set of cut points obtained when creating 2B half-bounded intervals, or more generally, any integer multiple of B.
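Under the index-based equal-frequency scheme assumed in the earlier sketches, this nesting is immediate, since \(\lfloor jmn/(mB) \rfloor = \lfloor jn/B \rfloor\); a quick check:

```python
def cut_points(values, B):
    """Equal-frequency cut points (index-based; ties not deduplicated)."""
    s = sorted(values)
    return {s[(j * len(s)) // B] for j in range(1, B)}

values = [7, 3, 1, 9, 4, 6, 2, 8, 5, 0, 11, 10]
for B, m in [(2, 2), (2, 3), (3, 2), (4, 3)]:
    assert cut_points(values, B) <= cut_points(values, m * B)
print("cut points for B nest inside those for any multiple of B")
```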
In: Chen W, Naughton JF, Bernstein PA (eds) SIGMOD 2000, International Conference on Management of Data, Dallas, TX, 16–18 May 2000, proceedings, ACM, New York, NY, pp 1–12.

For the lift and binomial quality measures, observations are similar. Of all experimental settings, results for discretisation timing are the least unequivocal, and the most dependent on the context, search depth, and quality measure. Concerning the rankings, the different settings benefit in very similar ways from the various strategies considered, despite searching for alternative types of subgroups. The default quality measures are WRAcc (Lavrač et al. 1999) for nominal targets, and |\(z\)-score| (Pieters et al. 2010) for numeric targets. For the top-10 rankings, 25 out of 28, and 20 out of 26, results are significant, for the two contexts respectively. Regression Target. Again, without exception, fine is the better option, as can be observed in Tables 14 and 16. Next, options all and best of dimension selection method are compared.

From now on, the subscript \(I\) is omitted if no confusion can arise, and a subgroup extension is simply referred to as \(\mathcal {E}_{}\). Section Overlap in the Introduction listed some literature concerning redundancy reduction, and indicated that a Diverse Subgroup Set Discovery (DSSD) variant, the CBSS covering approach (van Leeuwen and Knobbe 2011, 2012), is included in the experiments. The remainder of this work is organised as follows.

Springer, Berlin, pp 243–266. https://doi.org/10.1016/j.patrec.2005.10.010. https://doi.org/10.1016/B978-1-55860-377-6.50032-3. Dua D, Graff C (2017) UCI Machine Learning Repository.
Covering Algorithms. At least for this target type and search strategy, results are promising, as depth 4 experiments for the largest dataset, covertype, took 15, 58, and 67 seconds for lift, binomial, and WRAcc, respectively. For each result set, the average score for the top-k subgroups is computed, and by ranking these average scores it is determined what value of B, ranging from 2 to 10, results in the highest average score. The final omission is the strategy combining global, nominal, coarse, and best. Quality measures differ in the aspects of subgroups they favour, and the size of subgroups is often an important factor. The three quality measures per type mainly differ in how they treat subgroup size (this is deliberate). On the target side, several recent papers discuss the treatment of numeric target attributes (Atzmüller and Lemmerich 2009; Boley et al. 2017; Lemmerich et al. 2013; Meeng et al. 2014; Duivesteijn and Meeng 2016). Nonetheless, a number of general observations that can serve as guideline are listed and discussed below. For the third context, there are 18 ties, and 6 wins for global, all for the ionosphere and wisconsin datasets at greater depths. They are 5-lnfa, 6-lnfb, 13-gnfa, and 14-gnfb. In terms of Mann–Whitney U-scores, binaries lists a better U-score for each of the 66 results, of which 63 are significant.

(Extension) An extension \(\mathcal {E}_{I}\) corresponding to a description \(I\) is the bag of records \(\mathcal {E}_{I} \subseteq \mathcal {D}_{}\) that \(I\) covers: \(\mathcal {E}_{I}=\left\{ \mathbf {r}^i \in \mathcal {D}_{}\ |\ I\left( a_1^i, \ldots , a_l^i\right) = 1\right\} \).

http://proceedings.mlr.press/v2/kontkanen07a/kontkanen07a.pdf. Lavrač N, Gamberger D (2004) Relevancy in constraint-based subgroup discovery. Springer, Berlin, pp 459–474. In: Fürnkranz J, Scheffer T, Spiliopoulou M (eds) PKDD 2006, European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, 18–22 Sept 2006, proceedings, LNCS, vol 4213. In: Bratko I, Džeroski S (eds) ICML 1999, International Conference on Machine Learning, Bled, Slovenia, 27–30 June 1999, proceedings, Morgan Kaufmann, San Francisco, CA, USA, pp 115–123. https://hdl.handle.net/10289/1507. Fürnkranz J, Flach PA (2005) ROC 'n' rule learning: towards a better understanding of covering algorithms.
In: Webb GI, Liu B, Zhang C, Gunopulos D, Wu X (eds) ICDM 2010, IEEE International Conference on Data Mining, Sydney, Australia, 14–17 Dec 2010, proceedings, IEEE Computer Society, Los Alamitos, CA, pp 158–167. Springer, Cham, pp 500–516. https://doi.org/10.1007/3-540-63223-9_108. https://doi.org/10.1145/1150402.1150431. Knobbe A, Ho EKY (2006b) Pattern teams.

For |deviation| and |\(t\)-statistic| these statistics are almost identical. Regression Target. Overall, results for all are better for this target type. This parameter controls the number of cut points that is eventually used by the SD algorithm. These are lift (Brin et al. 1997) and binomial (Klösgen 1992) for nominal targets (tested in a target value versus rest setting), and |deviation| and |\(t\)-statistic| (Pieters et al. 2010) for numeric targets. While this theoretically would produce a valid combination, it is highly restrictive in the candidate space considered, and (probably for that reason) is not present in the literature. Although the experiments demonstrate that the different runs vary a lot in their search and reported subgroups, the resulting ranking of numeric strategies is surprisingly similar across measures. Experiments are performed in both a classification (nominal target) and a regression (numeric target) setting. Some of the detailed analyses focus mainly on WRAcc and |\(z\)-score|, as these are good representatives for the two target types. This can often be achieved without increasing complexity.

Plots showing the development of average subgroup quality (WRAcc and \(|z\)-score\(|\)) for the 10 best subgroups, for different datasets and strategies.

For the regression setting, the Spearman's rank correlations with |\(z\)-score| are \(\rho = 0.946\) and \(\rho = 0.975\) for |\(t\)-statistic|, and \(\rho = 0.946\) and \(\rho = 0.975\) for |deviation|. The CBSS procedure indeed produces higher entropy scores than the presented beam-search setting. The drop in entropy score above depth 1 occurs because, regardless of the employed strategy, the top (quality score) subgroups better capture the (skewed) target at greater depths. In case of exhaustive search, all is computationally more expensive, but, as it makes the search less greedy (more exploration), it might yield better results. Below, the dimensions are described in more detail. Table 6 presents a summary of the results, and lists, for each strategy and quality measure, the optimal number of bins. Right-pointing triangles have equivalent meanings for the right strategy. Equal coverage is impossible when n/B does not produce an integer and in case of duplicate values around cut points. Results and methods for replication are found at: http://datamining.liacs.nl/for-real.zip.
Therefore, this section presents the results of experiments performed to obtain insights into the intrinsic complexities stemming from these compound effects. Comparing 1-lbfa with 2-lbfb. Again, the agreement between classification and regression is rather high: \(\rho = 0.967\). Also, of the 26 U-score comparisons, local wins 22 (15 significant), there is 1 tie, and global wins 5 times (1 significant). Most strikingly though, there are 60 nominal results, all of which 17-lxfb wins significantly. As mentioned, 17-lxfb was not applicable to regression datasets. For binomial, these results are similar to those of WRAcc. Six results are missing (three for CBSS, three for complete) as they either use too much memory, or take too long. If this is the case, post-hoc Nemenyi tests are permitted and the critical distance now becomes \(\text {CD} = 5.531\) for classification, and \(\text {CD} = 4.541\) for regression, as indicated in the respective figures.

Critical difference diagrams for the classification targets (left), using WRAcc, and regression targets (right), using \(|z\)-score\(|\), for depth 1, 2, 3, and 4 (top to bottom).

Data Min Knowl Discov 28(5–6):1366–1397. https://doi.org/10.1007/11615576_12. Lavrač N, Flach PA, Zupan B (1999) Rule evaluation measures: a unifying view. https://doi.org/10.1007/s10618-017-0547-5. Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. Springer, pp 78–87. https://doi.org/10.1002/widm.1144.

The use of this combination has limitations, but facilitates (the design of) fast and efficient SD algorithms. Note that this particularly holds for a setting that is often referred to as exhaustive in the literature, that is, global discretisation followed by a complete traversal of the resulting, much reduced, pattern space. In the majority of SD implementations, as in this paper, descriptions consist of a conjunction of conditions on individual attributes. Besides these candidate generation dimensions, there is also a candidate selection dimension to an SD algorithm. A vector of values \(\mathbf {a}\) forms the current input domain, which can either be the entire attribute data, or only those values covered by the subgroup under investigation. For the example in Sect. 2.3, with \(\mathbf {a} = [x,y,z]\) and \(B = 2\), this yields [y,y,z], and results on depth 1 are not guaranteed to be identical. In terms of ROC-space analysis (Fawcett 2006), these subgroups fall in the lower left-hand corner, such that the upper left-hand corner can never be reached by further refinement. With (Diverse) Subgroup Set Discovery, one moves away from Subgroup Discovery and its local nature, as subgroups are no longer judged purely on their own merit (Duivesteijn 2013, p. 2), but should always be judged also on their joint merit (van Leeuwen and Knobbe 2012, p. 219).

Example. Table 2 shows an example that demonstrates the effects of various options described for the dimensions above. Specifically, the largest values for H are typically obtained at depth 1. For a single measure, with increasing depths, the amount of candidates produced by the measures that favour large subgroups remains relatively high and closest to the maximum, whereas for unweighed measures the amount rapidly decreases. While |\(t\)-statistic| and |\(z\)-score| use the same subgroup-size scaling factor \(\sqrt{n}\), the former divides by the standard deviation of the subgroup \(\sigma _s\), instead of that of the dataset (\(\sigma _{\mathcal {D}_{}}\)).
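Both families of measures are easy to state; a sketch using the standard formulations (the \(\sqrt{n}\) factor and the dataset-level \(\sigma\) for |\(z\)-score| follow the text, the remaining details are assumptions):

```python
import math

def wracc(n, N, pos_s, pos_D):
    """Weighted relative accuracy for a nominal target (value vs rest):
    coverage n/N times the gain in positive fraction within the subgroup."""
    return (n / N) * (pos_s / n - pos_D / N)

def abs_z_score(subgroup, dataset):
    """|z-score| for a numeric target: sqrt(n) times the mean shift,
    divided by the standard deviation of the whole dataset."""
    n = len(subgroup)
    mu_s = sum(subgroup) / n
    mu_d = sum(dataset) / len(dataset)
    sigma_d = math.sqrt(sum((v - mu_d) ** 2 for v in dataset) / len(dataset))
    return abs(math.sqrt(n) * (mu_s - mu_d) / sigma_d)

print(wracc(n=50, N=1000, pos_s=30, pos_D=200))  # 0.05 * (0.6 - 0.2) = 0.02
print(abs_z_score([8, 9, 10], list(range(1, 11))))
```

Swapping the dataset-level \(\sigma _{\mathcal {D}_{}}\) for the subgroup's own \(\sigma _s\) turns this into the |\(t\)-statistic| variant, which, as noted above, favours small subgroups with low internal dispersion.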
A fine strategy may produce good subgroups, but the subgroups produced by a similar coarse approach might be as good or only marginally worse, at a fraction of the computation time. Although there are various subtleties in the outcome of the experiments, the following general conclusions can be drawn: it is often best to determine numeric thresholds dynamically (locally), in a fine-grained manner, with binary splits, while considering multiple candidate thresholds per attribute.

Conclusion. Always prefer binaries.

Selection Method. The dimensions above all relate to candidate generation, influencing which candidate subgroups are generated and evaluated by the description generator. Figure 4 shows the final ranking of the strategies for the classification setting on the left, and the regression setting on the right. So, strategies that allow many top subgroups to accurately capture the relevant target values attain high quality, but low uniformity. Concerning qualities for the second and third contexts, there are 14 wins for local, 2 ties, and 12 wins for global, and 7 wins for local, 6 ties, and 15 wins for global, respectively. However, for a number of datasets, the same quality is achieved by many other strategies, some of which could be considered light heuristics. In the classification setting, only 7 top-1 results are different, on average by less than 0.7%, and all differ less than 1.7%. These are fundamental problems of discretisation, but, for more than a decade, this heuristic has been shown to generally produce (close to) equal coverage. In the classification setting, the extensive strategies 1-lbfa and 2-lbfb, and the special strategy 17-lxfb, always rank best. In the classification setting, it achieves a higher entropy in 36% of the experiments, but for 60% it was equal, and for the remainder beam search actually obtains a higher entropy. For that reason, these sections list the appropriate contexts and discuss context-dependent choices where relevant. And second, the cut points are placed closer together, demonstrating a higher resolution at greater search depths. Table 8 presents the final ranking of strategies, combining information in Tables 7, 13, 14, 15, and 16, and similar tables for lift, binomial, |deviation|, and |\(t\)-statistic| not presented here. Still, depending on other settings, there might be huge differences, both in terms of computation and redundancy. Because the variant in the second context uses a different number of bins, the fact that qualities for the ionosphere dataset are better for best than for all at depth 3 cannot be ascribed to the beam search per se. Without the former, the algorithm selects perfect (scoring) intervals, but the subgroups might be too small to be included in the result set. Furthermore, binaries lists a better quality for 65 out of 72 comparisons.

Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains.

This paper presents a generic framework that can be instantiated in various ways in order to create different strategies for dealing with numeric data. A quality measure quantifies various aspects of a subgroup (Fürnkranz and Flach 2005), and in the choice of quality measure, the analyst indicates their preference for certain aspects of the desired subgroups. Deriving more complex subgroups from simpler ones by adding conjuncts to the description one by one is known as refinement, and is the principal way of traversing the search space. In turn, the search can be less deep, and much of the combinatorial explosion of the search space is avoided.
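A generic levelwise beam search over such refinements can be sketched as follows; `refine` and `quality` are caller-supplied placeholders, and nothing here is specific to the SDMM algorithm:

```python
import heapq

def beam_search(root, refine, quality, beam_width, max_depth):
    """Levelwise beam search over descriptions (sketch).

    refine(d)  -> descriptions extending d by one condition
    quality(d) -> numeric score of d on the dataset
    Returns all evaluated (score, description) pairs, best first."""
    beam, evaluated = [root], []
    for _ in range(max_depth):
        scored = [(quality(d), d) for seed in beam for d in refine(seed)]
        evaluated.extend(scored)
        # only the beam_width best candidates seed the next level
        best = heapq.nlargest(beam_width, scored, key=lambda t: t[0])
        beam = [d for _, d in best]
    return sorted(evaluated, key=lambda t: t[0], reverse=True)
```

A complete search corresponds to an unbounded beam width; the beam heuristic trades exploration for tractability, which is exactly the trade-off the experiments in this work quantify.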
As a whole, this systematic evaluation both affirms some intuitions that, to the best of our knowledge, have never been rigorously tested, and garnered new insights into both existing strategies and how to improve future algorithm design.

The SelectSeed function (line 4) selects the appropriate candidate. Consequently, it has the option of choosing cut points appropriate for the subset of data under investigation at any point in the search space. Table 13 shows that the wins occur at depths greater than 1, indicating that the beam of the coarse variant of the strategy contained a candidate that was both not included in the beam of the fine counterpart, and proved to be a better seed for refinement than any of the candidates the latter beam contained. Still, most experiments complete within a second, and for the classification setting the longest non-1-lbfa experiment, using 3-lbca, took about 7 min.

Pattern Recogn Lett 27(8):861–874.

Then, \(U_1\) is computed for result set \(\mathcal {F}_1\) as follows:
$$\begin{aligned} U_1 = \Sigma _1 - \frac{F_1 \left( F_1 + 1 \right) }{2}, \end{aligned}$$
where \(\Sigma _1\) is the sum of ranks of result set \(\mathcal {F}_1\).
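That computation, together with the identity \(U_2 = F_1 \cdot F_2 - U_1\) quoted earlier, can be sketched as follows; the average-rank handling of ties is an assumption of this sketch:

```python
def mann_whitney_u(f1, f2):
    """U_1 for result sets F_1 and F_2: pool all quality scores, rank them
    (average 1-based ranks for ties), then U_1 = Sigma_1 - F_1(F_1 + 1)/2."""
    pooled = sorted((q, i) for i, qs in enumerate((f1, f2)) for q in qs)
    ranks = [0.0] * len(pooled)
    lo = 0
    while lo < len(pooled):
        hi = lo
        while hi + 1 < len(pooled) and pooled[hi + 1][0] == pooled[lo][0]:
            hi += 1
        avg = (lo + hi) / 2 + 1          # average rank of the tie group
        for k in range(lo, hi + 1):
            ranks[k] = avg
        lo = hi + 1
    sigma1 = sum(r for r, (_, i) in zip(ranks, pooled) if i == 0)
    u1 = sigma1 - len(f1) * (len(f1) + 1) / 2
    return u1, len(f1) * len(f2) - u1    # (U_1, U_2)

u1, u2 = mann_whitney_u([0.9, 0.8, 0.7], [0.6, 0.5, 0.4])
print(u1, u2)  # 9.0 0.0: one set ranks entirely above the other (U = 0 case)
```

Normalising by \(F_1 \cdot F_2\) and scaling to 0–100 gives a score of the kind reported in the comparison tables, with the direction convention following the paper's text.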