Data mining. Textbook - page 4

Data Anomaly Similarities

An anomaly can be described as a data value that differs significantly from the mean of the distribution. This description is quite general, however. Outliers can also arise at the level of relationships: any number of them can occur in a dataset when an observed relationship or proportion differs from the others. In that case, the observed relationships are averaged to obtain a reference distribution, and an anomalous ratio or proportion is one that has little similarity to it. Anomalies are not necessarily rare. Even when the observations are more similar to one another than the expected values suggest, the distribution they form may not be the typical or expected one, and such observations also count as outliers. There is, however, a natural distribution of possible values that observations can be expected to fit. When this distribution is known, anomalies are often easy to spot by examining the statistical distribution of the observed data.
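
As an illustration of spotting an anomaly from the statistical distribution of the data, the sketch below flags values that lie far from the mean in units of standard deviation (a z-score rule). The function name `zscore_outliers`, the sample `readings`, and the threshold of 2.5 are assumptions made for this example and are not taken from the text. A threshold below the customary 3 is used because, in a sample this small, a single extreme value inflates the standard deviation and caps the attainable z-score.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical: nothing can deviate
    return [
        (i, v)
        for i, v in enumerate(values)
        if abs(v - mean) / stdev > threshold
    ]

# Hypothetical sample: nine similar readings and one extreme value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7, 10.0]
flagged = zscore_outliers(readings, threshold=2.5)
print(flagged)  # [(6, 25.0)]
```

The rule relies on the mean and standard deviation being meaningful summaries, which is only a reasonable assumption when the data are roughly symmetric and unimodal.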

In the second scenario, no distribution is known in advance, so it is impossible to conclude that the observations are typical of any particular distribution. Even so, a candidate distribution may be available that predicts how the observations are distributed.
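
When no distribution is known, one common heuristic that avoids assuming a particular distribution is the interquartile-range rule (Tukey's fences). The text does not name this technique; it is shown here only as one possible approach, and the function name `iqr_outliers`, the multiplier `k = 1.5`, and the sample data are assumptions for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample with one value far from the rest.
sample = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1, 9.7, 10.0]
print(iqr_outliers(sample))  # [25.0]
```

Because quartiles are based on ranks, a single extreme value has little influence on the fences, unlike the mean and standard deviation.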

In the third scenario, there are enough distinct data points that the distribution estimated from them can be used to predict the observed data. This works even when the data are not close to normal or deviate from the estimated distribution to varying degrees. In this case, an average or expected value exists. The prediction is itself a distribution, and it also describes values that are not typical of the data, although such values are not necessarily anomalies. This is especially true for irregular datasets, that is, datasets that contain outliers.
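
One way to read this scenario is that the data themselves supply the reference distribution. The sketch below, under that assumption, scores a value by the fraction of reference points that are at least as extreme, using the empirical distribution of the sample. The function name `empirical_tail_fraction` and the reference data are hypothetical. A small score marks a value as atypical, but, as noted above, not necessarily as an anomaly.

```python
from bisect import bisect_left, bisect_right

def empirical_tail_fraction(reference, x):
    """Two-sided share of reference values at least as extreme as x."""
    ordered = sorted(reference)
    n = len(ordered)
    lower = bisect_right(ordered, x) / n        # share of values <= x
    upper = (n - bisect_left(ordered, x)) / n   # share of values >= x
    return min(1.0, 2 * min(lower, upper))

# Hypothetical reference sample: the integers 1 to 100.
reference = list(range(1, 101))
print(empirical_tail_fraction(reference, 50))   # 1.0  (typical)
print(empirical_tail_fraction(reference, 100))  # 0.02 (at the edge)
print(empirical_tail_fraction(reference, 200))  # 0.0  (beyond all data)
```

The score is only as reliable as the reference sample is large and representative, which is why this approach presupposes enough distinct data points.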

Anomalies are not limited to natural observations. In fact, most datasets in business, social, mathematical, or scientific fields contain unusual values or distributions at some point. To aid decision-making in these situations, patterns can be identified in data values, relationships, proportions, or departures from a normal distribution. These patterns, or anomalies, are deviations of some theoretical significance. However, the deviation is usually so small that most people do not notice it. Such a deviation may be called an outlier, an anomaly, or a difference, and each term can refer both to the observed data and to the underlying probability distribution that may have generated the data.