| Name | Description | Size | Format |
|---|---|---|---|
| | | 1.41 MB | Adobe PDF |
Authors
Advisor(s)
Abstract(s)
Handling large datasets is a recurring problem today, and it is not a simple task given the computational limitations that still exist. One possible approach is to perform feature selection that considerably reduces the size of the data without increasing its inconsistency. “Rough Sets” differs from other feature selection techniques in its ability to deal with inconsistent data. Another approach to data reduction is known as Logical Analysis of Data (LAD). Logical Analysis of Inconsistent Data (LAID) combines the advantages of these two approaches.
With the large increase in data volume, the paradigm for handling data has been changing. Previously, data was processed on a single computer and accessed only after being loaded into memory. The current trend is to access data on disk, in a cloud environment. This work aims to validate this new paradigm using the HDF5 (Hierarchical Data Format) data system and the remote environment provided by INCD (Infraestrutura Nacional de Computação Distribuída, National Distributed Computing Infrastructure). Because HDF5 is the system adopted by the Python community for handling large datasets, Python was chosen to implement LAID.
This dissertation is a further contribution to research on Data Mining techniques (extracting knowledge from data). Specifically, it addresses feature selection applied to large datasets stored in HDF5 format, with evaluation of data inconsistency, by applying the LAID algorithm, implemented in Python, in a cloud environment.
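The requirement that feature selection must not increase inconsistency can be illustrated with a minimal sketch. This is not the LAID algorithm from the dissertation; it only assumes the usual notion that two observations are inconsistent when they agree on every selected attribute but carry different class labels. The function name `inconsistent_rows` and the toy data are hypothetical.

```python
from collections import defaultdict


def inconsistent_rows(rows, labels, attrs):
    """Count rows whose values on `attrs` also occur with a different label."""
    classes_by_key = defaultdict(set)
    size_by_key = defaultdict(int)
    for row, label in zip(rows, labels):
        # Project the row onto the selected attribute indices.
        key = tuple(row[i] for i in attrs)
        classes_by_key[key].add(label)
        size_by_key[key] += 1
    # A group is inconsistent when it contains more than one class label.
    return sum(
        size_by_key[key]
        for key, classes in classes_by_key.items()
        if len(classes) > 1
    )


# Hypothetical toy data: three binary attributes, one class label per row.
rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 0, 0)]
labels = ["a", "b", "a", "b"]

full = inconsistent_rows(rows, labels, attrs=[0, 1, 2])
reduced = inconsistent_rows(rows, labels, attrs=[0, 1])
too_small = inconsistent_rows(rows, labels, attrs=[0])
```

In this toy example the last two rows are already inconsistent with all attributes present. Dropping attribute 2 leaves that count unchanged, so the reduction is acceptable under this criterion, whereas keeping only attribute 0 makes every row inconsistent.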
Description
Keywords
Data mining; Feature selection; Data inconsistency; Logical Analysis of Data (LAD); Logical Analysis of Inconsistent Data (LAID); HDF5; Python; INCD
Citation
Apolónia, João - Seleção de atributos de dados inconsistentes [Em linha]. Lisboa: [s.n.], 2018. 111 p.