Association Rule Learning
Association rule learning is a rule-based machine learning technique for discovering interesting relationships between variables in large databases. Rather than modeling individual examples, it searches for rules of the form "if these items appear together, then this other item tends to appear as well," and scores candidate rules with measures of interestingness such as support (how often the items co-occur) and confidence (how often the rule holds when its antecedent does).
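As a minimal sketch of these two measures, the following computes support and confidence over a toy transaction database (the basket contents are illustrative, not from the article):

```python
# Toy transaction database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "milk"}, transactions))        # 0.5
print(confidence({"bread"}, {"milk"}, transactions))   # ~0.667
```

A rule such as "bread → milk" would then be kept or discarded depending on whether its support and confidence exceed user-chosen thresholds.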
Sometimes when working with a dataset, we are not sure whether the rows are relevant to the training task, and if so, which ones; we may want to skip the rows that do not matter. Associations are therefore judged not by intuition but by criteria computed over the data itself, such as how often the variables co-occur across a sequence of examples, or whether the same values recur in these data rows.
This problematic aspect of learning association rules can be addressed with anomaly detection. Anomaly detection algorithms attempt to find non-standard patterns in large datasets, patterns that may represent unusual relationships between features. Such anomalies are typically found with statistical methods: a simple probabilistic model (a naive Bayes classifier, for example) can flag examples that the model assigns very low probability, and those flagged examples can then be excluded before association rules are mined.
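As a minimal sketch of statistical anomaly detection (the data and the two-standard-deviation threshold are illustrative assumptions, not from the article):

```python
import statistics

# A hypothetical feature column with one unusual reading.
values = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0]

mean = statistics.mean(values)
stdev = statistics.stdev(values)

# Flag points more than two standard deviations from the mean as anomalies.
anomalies = [v for v in values if abs(v - mean) > 2 * stdev]
print(anomalies)  # [25.0]
```

A real system would fit a richer model per feature (or a joint model such as naive Bayes), but the principle is the same: score each example, and flag the ones the model finds improbable.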
In a large dataset, a feature space can represent an image as a set of numbers in which each pixel carries a numeric value. The characteristics of an image can be collected into a vector, and that vector placed in the feature space. One simple such feature is the number of pixels in the image that belong to a particular color.
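A small sketch of the color-count feature described above, using a tiny hand-made RGB grid as the "image":

```python
# A tiny 3x3 "image": each pixel is an (R, G, B) tuple.
RED = (255, 0, 0)
BLUE = (0, 0, 255)
image = [
    [RED, BLUE, RED],
    [BLUE, RED, BLUE],
    [RED, RED, BLUE],
]

# Flatten the grid into a single vector of pixels.
pixels = [p for row in image for p in row]

# One simple feature: the number of pixels of a given color.
red_count = sum(1 for p in pixels if p == RED)
print(red_count)  # 5
```

Stacking several such counts (one per color of interest) yields a feature vector that can be placed in the feature space and compared with other images.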
Clustering is the task of discovering groups and structure in data that are "similar" to some extent, not by relying on known labels or structures, but by learning from the data itself.
In the setting described here, clustering is used so that new data points are only assigned to existing clusters, without reshaping the clusters to fit the new data. In other words, the cluster structure is fixed once it has been learned, rather than revised every time more data arrives.
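This assign-without-reshaping behavior can be sketched as nearest-centroid assignment against fixed centroids (the 2-D centroids and labels below are hypothetical):

```python
import math

# Cluster centroids learned earlier; they stay fixed from here on.
centroids = {"A": (0.0, 0.0), "B": (10.0, 10.0)}

def assign(point, centroids):
    """Return the label of the nearest centroid without moving any centroid."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))

print(assign((1.0, 2.0), centroids))   # A
print(assign((9.0, 8.0), centroids))   # B
```

Contrast this with an algorithm like k-means, which would recompute the centroids after seeing the new points.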
Given a set of parameters describing the (mostly variable) data and how those parameters relate to one another, clustering can be thought of as a hierarchical algorithm for finding groups of data points that satisfy a set of criteria. The parameters fall into one of two categories: those that define the spatial arrangement of the clusters, and those that define the relationships between clusters.
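A minimal sketch of the hierarchical view: single-linkage agglomerative clustering, where the "relationship between clusters" parameter is the distance between their closest members (the four 2-D points are illustrative):

```python
import math

# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]

def single_linkage(a, b):
    """Distance between clusters = distance between their closest members."""
    return min(math.dist(p, q) for p in a for q in b)

# Start with each point in its own cluster; repeatedly merge the closest pair.
clusters = [[p] for p in points]
while len(clusters) > 2:
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i].extend(clusters.pop(j))

print(clusters)  # two clusters of two points each
```

Stopping the merging at a chosen cluster count (or distance threshold) is exactly the kind of criterion the paragraph above refers to; swapping `single_linkage` for another linkage function changes how relationships between clusters are measured.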