Research Project
Untitled
Publications
Qualidade de dados em bases de dados anonimizadas: uma abordagem de avaliação mista [Data quality in anonymized databases: a mixed evaluation approach]
Publication . Pombinho, Paulo; Cavique, Luís; Correia, Luís
Data quality is essential for a correct understanding of the concepts that the data represent. In data mining projects it is especially important to avoid low-quality data, because the algorithms used depend on correct data to build accurate models and predictions. In this article, we propose a quality assessment approach that considers metrics dealing with individual attributes and, additionally, a longitudinal flow analysis, which allows a quality assessment that takes contextual information into account. Data Quality per Entry and Data Quality per Attribute metrics are proposed and, finally, a Global Data Quality measure based on these metrics is proposed.
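The abstract does not give the metric definitions, so the following is only a minimal sketch of how per-entry, per-attribute, and global quality scores could be computed. It assumes that quality is the fraction of values passing a validity check and that the global score is the unweighted mean of the per-attribute scores. The record fields, the validators, and all function names are hypothetical and are not taken from the paper, and the longitudinal flow analysis is not covered here.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]
Validator = Callable[[Optional[object]], bool]


def entry_quality(record: Record, validators: Dict[str, Validator]) -> float:
    """Fraction of the checked attributes in one record that are valid."""
    checks = [validators[name](record.get(name)) for name in validators]
    return sum(checks) / len(checks)


def attribute_quality(records: List[Record], name: str, validator: Validator) -> float:
    """Fraction of records whose value for one attribute is valid."""
    return sum(validator(r.get(name)) for r in records) / len(records)


def global_quality(records: List[Record], validators: Dict[str, Validator]) -> float:
    """Assumed aggregation: unweighted mean of the per-attribute scores."""
    scores = [attribute_quality(records, n, v) for n, v in validators.items()]
    return sum(scores) / len(scores)


# Hypothetical records and validity rules, for illustration only.
records: List[Record] = [
    {"grade": 5, "outcome": "pass"},
    {"grade": None, "outcome": "pass"},      # missing grade
    {"grade": 13, "outcome": "retention"},   # grade out of range
    {"grade": 7, "outcome": "unknown"},      # outcome not in the allowed set
]
validators: Dict[str, Validator] = {
    "grade": lambda v: isinstance(v, int) and 1 <= v <= 12,
    "outcome": lambda v: v in {"pass", "retention", "dropout"},
}

per_entry = [entry_quality(r, validators) for r in records]
per_attribute = {n: attribute_quality(records, n, v) for n, v in validators.items()}
overall = global_quality(records, validators)
```

With this toy data, half of the `grade` values and three quarters of the `outcome` values are valid, so the assumed global score is 0.625. A weighted mean would be an equally plausible aggregation.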
Imbalanced learning in assessing the risk of corruption in public administration
Publication . Vasconcelos, Marcelo Oliveira; Chaim, Ricardo Matos; Cavique, Luís
This research aims to identify corruption among civil servants in the public administration of the Federal District, Brazil. For this purpose, a predictive model was built by integrating data from eight different systems and applying logistic regression to real datasets that, by their nature, contain a low percentage of the examples of interest for machine learning, a situation known as class imbalance. In this study, the class imbalance was considered extreme, at a ratio of 1:707 or, in percentage terms, 0.14% of the population belonging to the class of interest. Two approaches were used: balancing with resampling techniques, using the synthetic minority oversampling technique (SMOTE), and applying algorithms with specific parameterization to capture the patterns of the minority class without introducing bias from the dominant class. The best modeling result was obtained with the second approach, giving an area under the ROC curve of around 0.69. Based on sixty-eight features, the coefficients that correspond to the risk factors for corruption were found. A subset of twenty features is discussed in order to assess practical utility after the discovery process.
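The two approaches to class imbalance can be illustrated with a small, dependency-free sketch. It checks the 1:707 arithmetic, computes class weights with the common "balanced" heuristic (total samples divided by the number of classes times the class count), and shows the interpolation step at the core of SMOTE. The abstract does not state which parameterization the authors used, so the weighting rule is an assumption, and all function names are hypothetical. Full SMOTE also selects the neighbor from the k nearest minority samples, which is omitted here.

```python
import random
from typing import Dict, List, Sequence


def minority_share(minority: int, majority: int) -> float:
    """Share of the whole population that belongs to the minority class."""
    return minority / (minority + majority)


def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Assumed heuristic: weight_c = n_samples / (n_classes * n_c)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def smote_like_sample(point: Sequence[float], neighbor: Sequence[float],
                      rng: random.Random) -> List[float]:
    """Interpolate a synthetic point on the segment between two minority samples."""
    u = rng.random()  # uniform in [0, 1)
    return [p + u * (q - p) for p, q in zip(point, neighbor)]


# Ratio reported in the abstract: 1 minority example per 707 majority examples.
share_percent = round(minority_share(1, 707) * 100, 2)

# Weighted approach: errors on the rare class count proportionally more.
weights = balanced_class_weights({"minority": 1, "majority": 707})

# Resampling approach: one synthetic minority point from two hypothetical samples.
rng = random.Random(0)
synthetic = smote_like_sample([0.0, 0.0], [1.0, 2.0], rng)
```

Under this heuristic the minority class receives a weight 707 times that of the majority class, mirroring the imbalance ratio.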
Data pre-processing and data generation in the student flow case study
Publication . Cavique, Luís; Pombinho, Paulo; Tallón Ballesteros, Antonio J.; Correia, Luís
Education covers a range of sectors, from kindergarten to higher education. In the education system, each grade has three possible outcomes: dropout, retention, and pass to the next grade. In this work, we study data from the Department of Statistics of Education and Science (DGEEC) of the Education Ministry. DGEEC records those outcomes for each school year; this study therefore seeks a longitudinal view based on student flow. The document reports the data pre-processing, a stochastic model based on the pre-processed data, and a data generation process that uses that model.
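One way to picture a stochastic student flow model with data generation is a simple Markov-style simulation in which the outcome of each school year depends only on the current grade. This is a sketch under that assumption, not the model described in the paper. The three-grade range, the outcome probabilities, and the function names are all hypothetical and are not DGEEC figures.

```python
import random
from typing import Dict, List, Tuple

OUTCOMES = ["dropout", "retention", "pass"]

# Hypothetical per-grade probabilities, in the order (dropout, retention, pass).
OUTCOME_PROBS: Dict[int, Tuple[float, float, float]] = {
    1: (0.01, 0.04, 0.95),
    2: (0.01, 0.05, 0.94),
    3: (0.02, 0.08, 0.90),
}


def simulate_student(probs: Dict[int, Tuple[float, float, float]],
                     rng: random.Random,
                     max_years: int = 20) -> List[Tuple[int, str]]:
    """Return one (grade, outcome) pair per school year for a single student."""
    grade, last_grade = min(probs), max(probs)
    history: List[Tuple[int, str]] = []
    for _ in range(max_years):
        outcome = rng.choices(OUTCOMES, weights=probs[grade])[0]
        history.append((grade, outcome))
        if outcome == "dropout":
            break  # the student leaves the system
        if outcome == "pass":
            if grade == last_grade:
                break  # the final grade is completed
            grade += 1
        # on "retention" the student repeats the same grade next year
    return history


def generate_cohort(n: int, probs: Dict[int, Tuple[float, float, float]],
                    seed: int = 0) -> List[List[Tuple[int, str]]]:
    """Generate n synthetic student trajectories from a seeded generator."""
    rng = random.Random(seed)
    return [simulate_student(probs, rng) for _ in range(n)]


cohort = generate_cohort(100, OUTCOME_PROBS)
```

Each trajectory is a longitudinal record of one student, so counting outcomes per grade and year across `cohort` gives a synthetic flow table in the same shape as yearly dropout, retention, and pass statistics.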
Funders
Funding agency: Fundação para a Ciência e a Tecnologia
Funding programme: 3599-PPCDT
Funding Award Number: DSAIPA/DS/0039/2018