Data mining. Textbook - страница 13

Шрифт

Интервал

The simplest classification algorithm for classifying data is the total correlation method, also known as the correlation method. In full correlation, you have two sets of data and you are comparing data from one set to data from another set. This is easy to do for individual pieces of data. The next step is to calculate the correlation between the two datasets. This correlation of two sets of data tells you what percentage of the data is in each set. Thus, using this correlation, you can classify data as either one set or the other, indicating the parts of the data set that come from one set or the other.

This simple method often works well for data stored in simple databases with a small amount of data and slow data access speeds. For example, a database system may use a tree structure to store data, with the columns of a record representing fields in the structure. This structure did not allow data to be ranked because the data would be in two separate rows of the tree structure. This makes it impossible to make sense of the data if the data fits in only one tree structure. If the database has two data trees, you will need to compare each of the two trees. If there were a large number of trees, the comparison could be computationally expensive.

Therefore, full correlation is a poor classification method. Data correlation does not distinguish between relevant parts of the data, and the data is relatively small in both columns and rows. These problems make full correlation unsuitable for simple data classification systems and data storage systems. However, if the data is relatively large, full correlation can be applied. This example is useful for storage systems with a relatively high computational load.

Combining a data classification method with a data storage system improves both performance and usability. In particular, the size of the resulting classification algorithm is largely independent of the size of the data store. The detailed classification algorithm does not require a lot of memory to store data at all. It is often small enough to be buffered, and many organizations store their classification systems this way. Also, the performance characteristics of the storage system do not depend on the classifier. The storage system can handle data with a high degree of variability.

Следующая страница