Researchers at the Massachusetts Institute of Technology (MIT) and other institutions have presented a new system for linking and aggregating data, called "Data Civilizer."


What Data Civilizer proposes is, in practice, another approach to a long-standing problem: linking related data across many different tables.

The researchers claim that Data Civilizer quickly locates datasets containing relevant information, "automatically finds connections among many different data tables and allows users to perform database-style queries across all of them," and so reduces the effort needed to get a task done.

The system begins by analyzing every column of every table at its disposal. First, it produces a statistical summary of the data in each column. For numerical data, that might include a distribution of the frequency with which different values occur; the range of values; and the “cardinality” of the values, or the number of different values the column contains. For textual data, a summary would include a list of the most frequently occurring words in the column and the number of different words. Data Civilizer also keeps a master index of every word occurring in every table and the tables that contain it.
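
The profiling step described above can be sketched roughly as follows. This is only an illustration, not Data Civilizer's actual code: the function names `summarize_column` and `build_word_index`, the dictionary-of-lists table layout, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def summarize_column(values, top_k=3):
    """Return a small statistical summary of one column (hypothetical helper)."""
    if values and all(isinstance(v, (int, float)) for v in values):
        return {
            "kind": "numeric",
            "frequencies": Counter(values),   # how often each value occurs
            "min": min(values),               # range of values
            "max": max(values),
            "cardinality": len(set(values)),  # number of different values
        }
    # Treat everything else as text and split it into lowercase words.
    counts = Counter(w for v in values for w in str(v).lower().split())
    return {
        "kind": "text",
        "top_words": [w for w, _ in counts.most_common(top_k)],
        "cardinality": len(counts),           # number of different words
    }


def build_word_index(tables):
    """Map every word to the set of tables that contain it (hypothetical helper)."""
    index = defaultdict(set)
    for table_name, columns in tables.items():
        for values in columns.values():
            for v in values:
                if isinstance(v, str):        # only textual cells contribute words
                    for w in v.lower().split():
                        index[w].add(table_name)
    return index


# Made-up sample tables: {table name: {column name: list of cell values}}.
tables = {
    "patients": {"age": [34, 51, 34, 29],
                 "city": ["Boston", "Cambridge", "Boston", "Salem"]},
    "clinics": {"city": ["Boston", "Salem"], "beds": [120, 45]},
}

summaries = {(t, c): summarize_column(vals)
             for t, cols in tables.items() for c, vals in cols.items()}
word_index = build_word_index(tables)
```

Here `summaries[("patients", "age")]` holds the value frequencies, range, and cardinality of the `age` column, and `word_index["boston"]` lists both tables. A real system would presumably use compact sketches rather than full counters for large columns.
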
Then the system compares all of the column summaries against each other, identifying pairs of columns that appear to have commonalities — similar data ranges, similar sets of words, and the like. It assigns every pair of columns a similarity score and, on that basis, produces a map, rather like a network diagram, that traces out the connections between individual columns and between the tables that contain them.
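
A minimal sketch of this pairwise comparison is shown below. The article does not say which similarity measures Data Civilizer uses, so the choices here are assumptions: Jaccard overlap of word sets for text columns, overlap of value ranges for numeric columns, and a hypothetical `threshold` of 0.5 for drawing a link. All function names and the sample data are invented for illustration.

```python
from itertools import combinations


def is_numeric(values):
    return len(values) > 0 and all(isinstance(v, (int, float)) for v in values)


def jaccard(a, b):
    """Shared items divided by all items; 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


def range_overlap(a, b):
    """Length of the shared value range divided by the combined range."""
    lo, hi = max(min(a), min(b)), min(max(a), max(b))
    span = max(max(a), max(b)) - min(min(a), min(b))
    if hi < lo:
        return 0.0          # ranges do not intersect
    if span == 0:
        return 1.0          # both columns hold the same single value
    return (hi - lo) / span


def column_similarity(a, b):
    if is_numeric(a) and is_numeric(b):
        return range_overlap(a, b)
    if is_numeric(a) or is_numeric(b):
        return 0.0          # assumption: numeric and text columns never match
    return jaccard((str(v).lower() for v in a), (str(v).lower() for v in b))


def score_all_pairs(tables):
    """Assign a similarity score to every pair of columns."""
    columns = [((t, c), vals) for t, cols in tables.items() for c, vals in cols.items()]
    return {(ka, kb): column_similarity(va, vb)
            for (ka, va), (kb, vb) in combinations(columns, 2)}


# Made-up sample tables: {table name: {column name: list of cell values}}.
tables = {
    "patients": {"age": [34, 51, 34, 29],
                 "city": ["Boston", "Cambridge", "Boston", "Salem"]},
    "clinics": {"city": ["Boston", "Salem"], "beds": [120, 45]},
}

scores = score_all_pairs(tables)
threshold = 0.5  # hypothetical cutoff for keeping an edge in the map
column_links = {pair: s for pair, s in scores.items() if s >= threshold}
# Table-level connections follow from the column-level ones.
table_links = {(a[0], b[0]) for a, b in column_links if a[0] != b[0]}
```

With this sample data the only surviving edge connects the two `city` columns, which in turn links the `patients` and `clinics` tables. The resulting dictionary of edges plays the role of the network-like map the article describes.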





Taming data: System finds and links related data scattered across digital files, for easy querying and filtering.
Larry Hardesty | MIT News Office