Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad
| dc.contributor.advisor | , Javier Mauricio | |
| dc.contributor.author | Cleves Leguízamo, Diego Andrés | |
| dc.contributor.corporatename | Universidad Santo Tomás | |
| dc.contributor.orcid | https://orcid.org/0009-0003-5914-4156 | |
| dc.date.accessioned | 2026-02-06T11:42:32Z | |
| dc.date.available | 2026-02-06T11:42:32Z | |
| dc.date.issued | 2026-02-04 | |
| dc.description | This work studies synthetic data from its theoretical foundations through its generation. Several models are implemented to synthesize categorical and quantitative data and are then compared on their ability to mask data (propensity), their measures of statistical similarity, and their execution time. The results showed that the best method for simulating categorical variables is rule-based simulation, with rules that represent the variables' real-world dependencies. The numerical variables, however, could not be synthesized adequately: the proposed models did not capture the copula properly. The conclusion indicates where the approach failed and what opportunities for improvement remain. | |
| dc.description.abstract | This work studies synthetic data from its theoretical conception to its generation. Several models are implemented to synthesize categorical and numerical data and are compared in terms of data masking capability (propensity), statistical similarity, and execution time. The results indicate that rule-based simulation is the most effective approach for categorical variables, while numerical variables could not be adequately synthesized due to the models’ inability to capture the copula structure. The conclusions discuss the identified limitations and potential improvements. | |
| dc.description.degreelevel | Pregrado | spa |
| dc.description.degreename | Profesional en estadística | spa |
| dc.format.mimetype | application/pdf | |
| dc.identifier.citation | Cleves Leguízamo, D. A. (2025). Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad. [Trabajo de Grado, Universidad Santo Tomás]. Repositorio Institucional | |
| dc.identifier.instname | instname:Universidad Santo Tomás | spa |
| dc.identifier.reponame | reponame:Repositorio Institucional Universidad Santo Tomás | spa |
| dc.identifier.repourl | repourl:https://repository.usta.edu.co | spa |
| dc.identifier.uri | http://hdl.handle.net/11634/71463 | |
| dc.language.iso | spa | |
| dc.publisher | Universidad Santo Tomás | spa |
| dc.publisher.branch | CRAI-USTA Bogotá | |
| dc.publisher.faculty | Facultad de estadística | spa |
| dc.publisher.program | Pregrado Estadística | spa |
| dc.relation.references | Abowd, J., & Lane, J. (2003). Synthetic Data and Confidentiality Protection. https://www2.census.gov/ces/tp/tp-2003-10.pdf | |
| dc.relation.references | APX. (2025). Propensity Score Evaluation. Evaluating Synthetic Data Quality. https://apxml.com/courses/evaluating-synthetic-data-quality/chapter-2-advanced-statistical-fidelity/propensity-score-evaluation | |
| dc.relation.references | Colombi, K. (2025). How to Generate Synthetic Data: A Comprehensive Guide. Tonic.ai. https://www.tonic.ai/guides/how-to-generate-synthetic-data-a-comprehensive-guide | |
| dc.relation.references | DataCebo. (2025). sdmetrics. https://docs.sdv.dev/sdmetrics | |
| dc.relation.references | Desfontaines, D. (2023). The Fundamental Trilemma of Synthetic Data Generation. https://www.tmlt.io/resources/fundamental-trilemma-synthetic-data-generation | |
| dc.relation.references | El Namaki, M. S. S. (2025). AI Data Syndrome: Could Synthetic Data Lead to Hallucinative AI Analytics Outcomes?. https://ijeber.com/uploads2025/ijeber_05_325.pdf | |
| dc.relation.references | Farhadyar, K., Bonofiglio, F., Hackenberg, M., Behrens, M., Zöller, D., & Binder, H. (2024). Combining Propensity Score Methods with Variational Autoencoders for Generating Synthetic Data in Presence of Latent Sub-Groups. BMC Medical Research Methodology, 24, Article 198. https://doi.org/10.1186/s12874-024-02327-x | |
| dc.relation.references | Genest, C., & Favre, A. (2007). Everything You Always Wanted to Know About Copula Modeling but Were Afraid to Ask. https://www.uni-muenster.de/Physik.TP/~lemm/seminarSS08/JHE-2007.pdf | |
| dc.relation.references | Hamming, R. (1986). You and Your Research. https://docs.sdv.dev/sdmetrics | |
| dc.relation.references | Jordan, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N., & Weller, A. (2024). Synthetic data: What, Why and How?. https://www.tmlt.io/resources/fundamental-trilemma-synthetic-data-generation | |
| dc.relation.references | Kooistra, E. (2024). How do I Know that the Synthetic Data is of the Right Quality for My Use Case? https://bluegen.ai/how-do-i-know-that-the-synthetic-data-is-of-the-right-quality-for-my-use-case/ | |
| dc.relation.references | Locowic, L., Monteverdi, A., & Mendoza, E. (2024). Synthetic Data Generation From Real Data Sources Using Monte Carlo Tree Search and Large Language Models. https://www.authorea.com/users/832345/articles/1225643 | |
| dc.relation.references | Mohapatra, S., Zong, J., Kerschbaum, F., & He, X. (2022). Differentially Private Data Generation with Missing Data. Proceedings of the VLDB Endowment. https://www.vldb.org/pvldb/vol17/p2022-mohapatra.pdf | |
| dc.relation.references | Office of the Privacy Commissioner of Canada. (2022). When What is Old is New Again – The Reality of Synthetic Data. Privacy Tech-Know Blog. https://www.priv.gc.ca/en/blog/20221012 | |
| dc.relation.references | Pathare, A., Mangrulkar, R., Suvarna, K., Parekh, A., Thakur, G., & Gawade, A. (2023). Comparison of Tabular Synthetic Data Generation Techniques Using Propensity and Cluster Log Metric. International Journal of Intelligent Systems and Applications in Engineering, 11(3), 100177. https://www.sciencedirect.com/science/article/pii/S2667096823000241 | |
| dc.relation.references | Ravn, L. (2024). The Overlooked Politics of Synthetic Data Performance Metrics. https://www2.census.gov/ces/tp/tp-2003-10.pdf | |
| dc.relation.references | Restrepo Lopera, J. P. (2023). Nonparametric Generation of Synthetic Data Using Copulas. https://repository.eafit.edu.co/entities/publication/e87d24db-ff18-44e6-b8b3-8eb43b1870fa | |
| dc.relation.references | Rubin, D. B. (1993). Discussion: Statistical Disclosure Limitation. Journal of Official Statistics. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf | |
| dc.relation.references | Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf | |
| dc.relation.references | Snoke, J., Raab, G. M., Nowok, B., Dibben, C., & Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3), 663–688. https://academic.oup.com/jrsssa/article/181/3/663/7072005 | |
| dc.relation.references | UNESCO. (2025). Ethics of Artificial Intelligence. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics | |
| dc.rights | Attribution-NonCommercial-NoDerivs 2.5 Colombia | en |
| dc.rights.accessrights | info:eu-repo/semantics/openAccess | |
| dc.rights.coar | http://purl.org/coar/access_right/c_abf2 | |
| dc.rights.local | Abierto (Texto Completo) | spa |
| dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/2.5/co/ | |
| dc.subject.keyword | Synthesis | |
| dc.subject.keyword | Similarity | |
| dc.subject.keyword | Propensity | |
| dc.subject.keyword | Dependence | |
| dc.subject.keyword | Copula | |
| dc.subject.lemb | Estadísticas | |
| dc.subject.lemb | Modelos estadísticos | |
| dc.subject.lemb | Análisis comparativo | |
| dc.subject.proposal | Sintetización | |
| dc.subject.proposal | Semejanza | |
| dc.subject.proposal | Propensión | |
| dc.subject.proposal | Dependencia | |
| dc.subject.proposal | Cópula | |
| dc.title | Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad | |
| dc.type | bachelor thesis | |
| dc.type.coar | http://purl.org/coar/resource_type/c_7a1f | |
| dc.type.coarversion | http://purl.org/coar/version/c_ab4af688f83e57aa | |
| dc.type.drive | info:eu-repo/semantics/bachelorThesis | |
| dc.type.version | info:eu-repo/semantics/acceptedVersion |
Files
Original bundle
1 - 1 of 1
- Name:
- 2025DiegoCleves.pdf
- Size:
- 185.21 KB
- Format:
- Adobe Portable Document Format
License bundle
1 - 3 of 3
- Name:
- license.txt
- Size:
- 807 B
- Format:
- Item-specific license agreed upon to submission
- Description:
- Name:
- 2025cartadefacultad.pdf
- Size:
- 274.44 KB
- Format:
- Adobe Portable Document Format
- Description:
- Faculty letter
- Name:
- 2025cartaderechosdeautor.pdf
- Size:
- 303.9 KB
- Format:
- Adobe Portable Document Format
- Description:
- Copyright letter

