Search Results
- Dataset for corruption risk assessment in a public administration
  Publication. Vasconcelos, Marcelo Oliveira; Cavique, Luís
  This data article describes a dataset on corruption and possibly related variables, created by integrating eight different systems of the Brazilian federal government and the Federal District. It presents real data on civil servants and military personnel. To comply with GDPR legislation, the attributes that could identify a person were removed, making the data anonymized.
- Imbalanced learning in assessing the risk of corruption in public administration
  Publication. Vasconcelos, Marcelo Oliveira; Chaim, Ricardo Matos; Cavique, Luís
  This research aims to identify corruption among civil servants in the public administration of the Federal District, Brazil. For this purpose, a predictive model was created by integrating data from eight different systems and applying logistic regression to real datasets. By their nature, these datasets contain a low percentage of the examples of interest for pattern identification in machine learning, a situation known as class imbalance. In this study, the class imbalance was considered extreme, at a ratio of 1:707 or, in percentage terms, 0.14% of the population belonging to the class of interest. Two approaches were used: balancing the data by resampling with the synthetic minority oversampling technique (SMOTE), and applying algorithms with specific parameterization to capture the desired patterns of the minority class without bias from the dominant class. The best modeling result was obtained with the second approach, with an area under the ROC curve of around 0.69. Based on sixty-eight features, the respective coefficients that correspond to the risk factors for corruption were found. A subset of twenty features is discussed in order to find practical utility after the discovery process.
- Mitigating false negatives in imbalanced datasets: an ensemble approach
  Publication. Cavique, Luís; Vasconcelos, Marcelo
  Imbalanced datasets present a challenge in machine learning, especially in binary classification scenarios where one class significantly outweighs the other. This imbalance often leads to models favoring the majority class, resulting in inadequate predictions for the minority class, specifically in false negatives. In response to this issue, this work introduces the MinFNR ensemble algorithm, designed to minimize False Negative Rates (FNR) in imbalanced datasets. The new approach strategically combines data-level, algorithmic-level, and hybrid-level approaches to enhance overall predictive capabilities while minimizing computational resources using the Set Covering Problem (SCP) formulation. Through a comprehensive evaluation of diverse datasets, MinFNR consistently outperforms individual algorithms, showing its potential for applications where the cost of false negatives is substantial, such as fraud detection and medical diagnosis. This work also contributes to ongoing efforts to improve the reliability and effectiveness of machine learning algorithms in real imbalanced scenarios.
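
The last abstract mentions a Set Covering Problem (SCP) formulation for combining classifiers while minimizing false negatives, but it does not give the formulation itself. The sketch below is therefore only one possible reading, not the MinFNR algorithm: it assumes each classifier "covers" the positive examples it detects, uses the standard greedy set-cover heuristic to pick a small subset of classifiers, and combines them with a union (logical OR) vote. The function names, the classifier names `A`, `B`, `C`, and the toy labels are all hypothetical.

```python
from typing import Dict, List

def false_negative_rate(y_true: List[int], y_pred: List[int]) -> float:
    """FNR = FN / (FN + TP), computed over the positive (minority) class."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp) if (fn + tp) else 0.0

def greedy_cover(y_true: List[int], predictions: Dict[str, List[int]]) -> List[str]:
    """Greedy set-cover heuristic (an assumption, not the paper's solver):
    repeatedly pick the classifier that detects the most still-missed positives."""
    positives = {i for i, t in enumerate(y_true) if t == 1}
    covers = {name: {i for i in positives if pred[i] == 1}
              for name, pred in predictions.items()}
    uncovered = set(positives)
    chosen: List[str] = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(covers), key=lambda n: len(covers[n] & uncovered))
        gained = covers[best] & uncovered
        if not gained:  # no classifier detects the remaining positives
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

def union_vote(predictions: Dict[str, List[int]], chosen: List[str]) -> List[int]:
    """Flag an example as positive if any chosen classifier flags it."""
    n = len(next(iter(predictions.values())))
    return [int(any(predictions[name][i] == 1 for name in chosen)) for i in range(n)]

# Toy data: 4 positives among 10 examples, three hypothetical classifiers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = {
    "A": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "B": [0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    "C": [0, 1, 1, 1, 0, 0, 0, 0, 0, 0],
}
chosen = greedy_cover(y_true, predictions)
ensemble = union_vote(predictions, chosen)
print(chosen, false_negative_rate(y_true, ensemble))
```

On this toy data the greedy step selects two of the three classifiers, and their union misses none of the positives, whereas each classifier alone misses at least one. A union vote can only lower the FNR at the price of possibly more false positives, so a real application would need to check that trade-off on held-out data.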
