Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad

dc.contributor.advisor: , Javier Mauricio
dc.contributor.author: Cleves Leguízamo, Diego Andrés
dc.contributor.corporatename: Universidad Santo Tomás
dc.contributor.orcid: https://orcid.org/0009-0003-5914-4156
dc.date.accessioned: 2026-02-06T11:42:32Z
dc.date.available: 2026-02-06T11:42:32Z
dc.date.issued: 2026-02-04
dc.description: This work studies synthetic data, from its theoretical conception to its generation. Several models are implemented to synthesize categorical and quantitative data and are then compared by their ability to mask data (propensity), their measures of statistical similarity, and their execution time. The results showed that simulating categorical variables with rules that represent their real-world dependencies is the best method for them. The numerical variables, however, could not be synthesized adequately because the proposed models did not capture the copula properly. The conclusion identifies where the approach failed and the opportunities for improvement.
dc.description.abstract: This work studies synthetic data from its theoretical conception to its generation. Several models are implemented to synthesize categorical and numerical data and are compared in terms of data masking capability (propensity), statistical similarity, and execution time. The results indicate that rule-based simulation is the most effective approach for categorical variables, while numerical variables could not be adequately synthesized due to the models’ inability to capture the copula structure. The conclusions discuss the identified limitations and potential improvements.
dc.description.degreelevel: Pregrado [spa]
dc.description.degreename: Profesional en estadística [spa]
dc.format.mimetype: application/pdf
dc.identifier.citation: Cleves Leguízamo, D. A. (2025). Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad. [Trabajo de Grado, Universidad Santo Tomás]. Repositorio Institucional
dc.identifier.instname: instname:Universidad Santo Tomás [spa]
dc.identifier.reponame: reponame:Repositorio Institucional Universidad Santo Tomás [spa]
dc.identifier.repourl: repourl:https://repository.usta.edu.co [spa]
dc.identifier.uri: http://hdl.handle.net/11634/71463
dc.language.iso: spa
dc.publisher: Universidad Santo Tomás [spa]
dc.publisher.branch: CRAI-USTA Bogotá
dc.publisher.faculty: Facultad de estadística [spa]
dc.publisher.program: Pregrado estadística [spa]
dc.relation.references: Abowd, J., & Lane, J. (2003). Synthetic Data and Confidentiality Protection. https://www2.census.gov/ces/tp/tp-2003-10.pdf
dc.relation.references: APX. (2025). Propensity Score Evaluation. Evaluating Synthetic Data Quality. https://apxml.com/courses/evaluating-synthetic-data-quality/chapter-2-advanced-statistical-fidelity/propensity-score-evaluation
dc.relation.references: Colombi, K. (2025). How to Generate Synthetic Data: A Comprehensive Guide. Tonic.ai. https://www.tonic.ai/guides/how-to-generate-synthetic-data-a-comprehensive-guide
dc.relation.references: DataCebo. (2025). sdmetrics. https://docs.sdv.dev/sdmetrics
dc.relation.references: Desfontaines, D. (2023). The Fundamental Trilemma of Synthetic Data Generation. https://www.tmlt.io/resources/fundamental-trilemma-synthetic-data-generation
dc.relation.references: El Namaki, M. S. S. (2025). AI Data Syndrome: Could Synthetic Data Lead to Hallucinative AI Analytics Outcomes?. https://ijeber.com/uploads2025/ijeber_05_325.pdf
dc.relation.references: Farhadyar, K., Bonofiglio, F., Hackenberg, M., Behrens, M., Zöller, D., & Binder, H. (2024). Combining Propensity Score Methods with Variational Autoencoders for Generating Synthetic Data in Presence of Latent Sub-Groups. BMC Medical Research Methodology, 24, Article 198. https://doi.org/10.1186/s12874-024-02327-x
dc.relation.references: Genest, C., & Favre, A. (2007). Everything You Always Wanted to Know About Copula Modeling but Were Afraid to Ask. https://www.uni-muenster.de/Physik.TP/~lemm/seminarSS08/JHE-2007.pdf
dc.relation.references: Hamming, R. (1986). You and Your Research. https://docs.sdv.dev/sdmetrics
dc.relation.references: Jordan, J., Szpruch, L., Houssiau, F., Bottarelli, M., Cherubin, G., Maple, C., Cohen, S. N., & Weller, A. (2024). Synthetic data: What, Why and How?. https://www.tmlt.io/resources/fundamental-trilemma-synthetic-data-generation
dc.relation.references: Kooistra, E. (2024). How do I Know that the Synthetic Data is of the Right Quality for My Use Case? https://bluegen.ai/how-do-i-know-that-the-synthetic-data-is-of-the-right-quality-for-my-use-case/
dc.relation.references: Locowic, L., Monteverdi, A., & Mendoza, E. (2024). Synthetic Data Generation From Real Data Sources Using Monte Carlo Tree Search and Large Language Models. https://www.authorea.com/users/832345/articles/1225643
dc.relation.references: Mohapatra, S., Zong, J., Kerschbaum, F., & He, X. (2022). Differentially Private Data Generation with Missing Data. Proceedings of the VLDB Endowment. https://www.vldb.org/pvldb/vol17/p2022-mohapatra.pdf
dc.relation.references: Office of the Privacy Commissioner of Canada. (2022). When What is Old is New Again – The Reality of Synthetic Data. Privacy Tech-Know Blog. https://www.priv.gc.ca/en/blog/20221012
dc.relation.references: Pathare, A., Mangrulkar, R., Suvarna, K., Parekh, A., Thakur, G., & Gawade, A. (2023). Comparison of Tabular Synthetic Data Generation Techniques Using Propensity and Cluster Log Metric. International Journal of Intelligent Systems and Applications in Engineering, 11(3), 100177. https://www.sciencedirect.com/science/article/pii/S2667096823000241
dc.relation.references: Ravn, L. (2024). The Overlooked Politics of Synthetic Data Performance Metrics. https://www2.census.gov/ces/tp/tp-2003-10.pdf
dc.relation.references: Restrepo Lopera, J. P. (2023). Nonparametric Generation of Synthetic Data Using Copulas. https://repository.eafit.edu.co/entities/publication/e87d24db-ff18-44e6-b8b3-8eb43b1870fa
dc.relation.references: Rubin, D. B. (1993). Discussion: Statistical Disclosure Limitation. Journal of Official Statistics. https://www.scb.se/contentassets/ca21efb41fee47d293bbee5bf7be7fb3/discussion-statistical-disclosure-limitation2.pdf
dc.relation.references: Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
dc.relation.references: Snoke, J., Raab, G. M., Nowok, B., Dibben, C., & Slavkovic, A. (2018). General and specific utility measures for synthetic data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3), 663–688. https://academic.oup.com/jrsssa/article/181/3/663/7072005
dc.relation.references: UNESCO. (2025). Ethics of Artificial Intelligence. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics
dc.rights: Attribution-NonCommercial-NoDerivs 2.5 Colombia [en]
dc.rights.accessrights: info:eu-repo/semantics/openAccess
dc.rights.coar: http://purl.org/coar/access_right/c_abf2
dc.rights.local: Abierto (Texto Completo) [spa]
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/2.5/co/
dc.subject.keyword: Synthesis
dc.subject.keyword: Similarity
dc.subject.keyword: Propensity
dc.subject.keyword: Dependence
dc.subject.keyword: Copula
dc.subject.lemb: Estadísticas
dc.subject.lemb: Modelos estadísticos
dc.subject.lemb: Análisis comparativo
dc.subject.proposal: Sintetización
dc.subject.proposal: Semejanza
dc.subject.proposal: Propensión
dc.subject.proposal: Dependencia
dc.subject.proposal: Cópula
dc.title: Datos sintéticos: Introducción a técnicas generativas y evaluación de calidad
dc.type: bachelor thesis
dc.type.coar: http://purl.org/coar/resource_type/c_7a1f
dc.type.coarversion: http://purl.org/coar/version/c_ab4af688f83e57aa
dc.type.drive: info:eu-repo/semantics/bachelorThesis
dc.type.version: info:eu-repo/semantics/acceptedVersion

Files

Original bundle

Name: 2025DiegoCleves.pdf
Size: 185.21 KB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 807 B
Format: Item-specific license agreed upon to submission

Name: 2025cartadefacultad.pdf
Size: 274.44 KB
Format: Adobe Portable Document Format
Description: Faculty letter

Name: 2025cartaderechosdeautor.pdf
Size: 303.9 KB
Format: Adobe Portable Document Format
Description: Copyright letter