Project Archive

An in-depth analysis of case studies that demonstrate the fusion of data science, psychology, and business strategy to create techno-human solutions.

Predictive Model for Screening Autism Spectrum Disorder (ASD) in Adults

An end-to-end Machine Learning pipeline merging psychopathology with data science to create accessible screening tools.

🎯 Context & Problem

Diagnosing Autism Spectrum Disorder in adults can be a lengthy and costly process. This project sought to answer: Can we use data from a standard psychometric screening questionnaire (the AQ-10) to build a Machine Learning model that identifies subtle patterns and offers an automated, rapid, and scalable initial risk assessment?

⚙️ Technical Methodology

The project began with unsupervised clustering (K-Means, DBSCAN) to explore the data's natural structure. Exploratory data analysis (EDA) revealed a significant class imbalance, which was addressed with techniques such as SMOTE. A "bake-off" of multiple models with hyperparameter optimization followed, achieving an average F1-score of 0.87.
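
To illustrate the imbalance handling and model bake-off described above, here is a minimal Python sketch using imbalanced-learn and scikit-learn; the synthetic data and the two candidate models are assumptions for demonstration, not the project's exact configuration.

```python
# Minimal sketch: SMOTE inside a pipeline plus a small model "bake-off".
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the AQ-10 screening features (imbalanced labels).
X, y = make_classification(n_samples=600, n_features=10, weights=[0.85, 0.15],
                           random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=42),
}

for name, model in candidates.items():
    # SMOTE runs only on the training folds, never on the validation fold.
    pipe = Pipeline([("smote", SMOTE(random_state=42)), ("clf", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.2f}")
```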

🛠️ Tech Stack

Python · Pandas · Scikit-learn · Keras/TensorFlow · XGBoost · LightGBM · Imbalanced-learn

📊 Impact & Results

A validated model that serves as a non-invasive support tool for healthcare professionals, helping to prioritize cases and allocate resources more efficiently. It demonstrates the potential of AI to create more accessible, data-driven mental health solutions.

🔗 View code on GitHub

Strategic Blueprint: Predicting Hypertension-Psychopathology Comorbidity

A complete data science solution design using CRISP-DM, demonstrating the ability to architect complex solutions.

🎯 Context & Problem

A geriatric care company needed to validate a clinical hypothesis: is there a predictable relationship between hypertension peaks and the onset of psychopathological crises? The challenge was to structure a complex and sensitive data project rigorously, ethically, and in a way that generates preventive business value.

⚙️ CRISP-DM Methodology

The deliverable is a detailed plan that includes a data preparation strategy with engineered variables such as a 'Hydration' index. It proposes interpretable models (Decision Trees, Logistic Regression) to facilitate clinical validation and includes a dedicated phase for proactively identifying biases in the historical data.

🛠️ Tech Stack

CRISP-DM · R · Logistic Regression · Decision Trees

📊 Impact & Results

A strategic plan that ensures a solid foundation, minimizing risks and maximizing impact. It demonstrates high-level technical leadership and business vision. The blueprint is ready for execution by a development team.

🔗 View Blueprint on GitHub

Identifying Diabetes Risk Profiles via Clustering

Transforming a raw dataset into actionable patient segments using unsupervised clustering and advanced feature engineering.

🎯 Context & Problem

Beyond predicting individual diabetes cases, it was crucial to understand if natural groups of individuals with combined risk profiles existed. The objective: to shift from individual analysis to population segmentation to inform personalized prevention campaigns.

⚙️ Technical Methodology

An EDA → Feature Engineering → Clustering workflow. A key finding was that "outliers" in the health variables actually signaled high-risk groups, which motivated the use of RobustScaler. Composite variables were created: a healthy_habits_score and a cardio_risk_index. The Elbow Method justified 3 clusters as the optimal segmentation.
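
A minimal sketch of the scaling and cluster-count steps is shown below; the columns and their values are synthetic stand-ins, not the project's actual composite formulas.

```python
# Minimal sketch: robust scaling plus an elbow check before fitting K-Means.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "healthy_habits_score": rng.normal(0, 1, 300),
    "cardio_risk_index": rng.normal(0, 1, 300),
    "bmi": rng.normal(28, 6, 300),
})

# RobustScaler centers on the median and scales by the IQR, so extreme but
# informative values do not dominate the distance computations.
X = RobustScaler().fit_transform(df)

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 8)}
print(inertias)  # look for the "elbow"; the project settled on k = 3

df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
```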

🛠️ Tech Stack

Python · Pandas · Scikit-learn · Matplotlib · Seaborn · Tableau

📊 Impact & Results

A strategic data asset: a segmented dataset that allows business analysts to explore risk profiles without ML knowledge. It democratizes insight for multidisciplinary teams.

🔗 View code on GitHub

Data-Driven Strategy for International Market Expansion

Applying Machine Learning to a strategic business problem, transforming macroeconomic data into defensible recommendations.

🎯 Context & Problem

An international expansion decision represents a high financial risk. This project sought to replace intuition with a rigorous data science approach to answer: "Which countries offer the best balance of economic opportunity and stability for expansion?"

⚙️ Technical Methodology

A two-phase project: I) ETL & Preparation; II) Modeling & Segmentation. PCA was used to visualize relationships between countries. K-Means and DBSCAN were applied to group countries with similar profiles. Comparing the two methods gave a more robust picture of the data structure and resulted in a coherent segmentation of nations.
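
The project itself was implemented in R; purely as an illustration of the PCA plus K-Means/DBSCAN comparison, here is a scikit-learn sketch with made-up macroeconomic indicators.

```python
# Illustrative sketch: standardize indicators, project with PCA, then compare
# a centroid-based and a density-based clustering of the same countries.
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
countries = pd.DataFrame({
    "gdp_growth": rng.normal(2, 1.5, 150),
    "inflation": rng.normal(4, 3, 150),
    "political_stability": rng.normal(0, 1, 150),
})

X = StandardScaler().fit_transform(countries)
coords = PCA(n_components=2).fit_transform(X)   # 2-D view of country relationships

kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)  # -1 marks noise

print(pd.Series(kmeans_labels).value_counts())
print(pd.Series(dbscan_labels).value_counts())
```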

🛠️ Tech Stack

R · Tidyverse · ggplot2 · cluster · fpc · dbscan

📊 Impact & Results

A ranked and segmented list of candidate countries, providing a tool for evidence-based strategic decision-making. It reduces risk and optimizes investment in the global expansion strategy.

Creating a Dataset for Video Game Economic Analysis via Ethical Web Scraping

A web scraping pipeline to extract data from a "lazy loading" website by managing XHR requests. The result is a clean, published dataset on Zenodo (with a DOI), demonstrating a complete data lifecycle.

🎯 Context & Problem

The retro video game market shows fascinating economic behavior, but its price data is scattered. The problem was: how can we systematically create a structured dataset from a dynamic web source, efficiently and respectfully?

👤 My Role & Responsibilities

I acted as the sole Data Engineer, responsible for the entire process: research and feasibility (analyzing `robots.txt`), developing the Python scraper to handle dynamic content, implementing ethical practices (rate limiting), data cleaning and structuring, and the final dataset publication.

⚙️ Technical Methodology

The main challenge was the "infinite scroll". Instead of using Selenium, I analyzed the network traffic, identified the XHR requests that loaded the data in JSON, and simulated those requests directly with `requests-html`, a much faster approach. Once the data was obtained, I used Pandas to flatten, clean, and structure the result into a CSV.
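
A minimal sketch of that approach follows; the endpoint URL, query parameters, and response fields are hypothetical placeholders, not the site's real API.

```python
# Illustrative sketch: call the JSON endpoint behind the page's "infinite
# scroll" directly, with polite rate limiting, then flatten the records.
import time

import pandas as pd
from requests_html import HTMLSession

session = HTMLSession()
records = []

for page in range(1, 4):  # a few pages for illustration
    resp = session.get(
        "https://example.com/api/listings",        # placeholder XHR endpoint
        params={"page": page, "per_page": 50},
    )
    resp.raise_for_status()
    records.extend(resp.json().get("results", []))
    time.sleep(1.5)  # rate limiting: one request every ~1.5 seconds

df = pd.json_normalize(records)                    # flatten nested JSON
df.to_csv("retro_game_prices.csv", index=False)
```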

🛠️ Tech Stack

Python · requests-html · BeautifulSoup4 · Pandas

📊 Impact & Results

The main result is a high-quality, citable dataset (DOI: 10.5281/zenodo.14043146) with 2,369 records, which can now be used by the community for economic analysis or price prediction models. The impact is the creation of a new public data asset.

U.S. Traffic Fatality Data Preparation & Analysis (CRISP-DM)

Applying the CRISP-DM framework to transform raw accident data (FARS dataset) into a robust analytical asset, including advanced feature engineering and culminating in a Principal Component Analysis (PCA).

🎯 Context & Problem

Raw data on fatal traffic accidents (FARS) is rich but fragmented and full of inconsistencies. The problem was: how can we unify, clean, and enrich this data to create a single "source of truth" that allows data scientists to build reliable predictive models?

👤 My Role & Responsibilities

I took on the role of a Data Scientist focusing on the Data Understanding and Preparation phases of CRISP-DM. I was responsible for the entire ETL pipeline: joining tables, decoding variables, logical imputation of missing values, and creating new features.

⚙️ Technical Methodology

I transformed complex categorical variables into interpretable binary flags (`DRINKING`, `NIGHT_HOUR`) and created derived variables like vehicle age. I implemented a methodical cleaning of thousands of null values and special codes. Finally, I applied PCA to confirm the relevance of the new features.
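
The pipeline was written in R with the Tidyverse; as a language-agnostic illustration of the flag and derived-variable logic, here is a small pandas sketch (the raw column names are invented; only `DRINKING` and `NIGHT_HOUR` come from the project).

```python
# Illustrative sketch: binary flags, a derived vehicle age, and a quick PCA
# check of the engineered features.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

raw = pd.DataFrame({
    "crash_year": [2019, 2019, 2020, 2020],
    "model_year": [2005, 2018, 1999, 2015],
    "hour": [23, 14, 2, 9],
    "driver_drinking_code": [1, 0, 1, 0],
})

feats = pd.DataFrame()
feats["DRINKING"] = (raw["driver_drinking_code"] == 1).astype(int)
feats["NIGHT_HOUR"] = raw["hour"].isin([20, 21, 22, 23, 0, 1, 2, 3, 4, 5]).astype(int)
feats["VEHICLE_AGE"] = raw["crash_year"] - raw["model_year"]

# PCA on the standardized features to gauge how the new variables spread the variance.
pca = PCA().fit(StandardScaler().fit_transform(feats))
print(pca.explained_variance_ratio_)
```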

🛠️ Tech Stack

R · Tidyverse (dplyr, readr) · stats (prcomp) · RStudio

📊 Impact & Results

The project delivered three clean, model-ready datasets. The impact is that this time-consuming preparation work is already done, allowing future efforts to focus directly on prediction and prevention to improve road safety.

🔗 Relevant Links

🔗 View code on GitHub

Fitbit User Analysis (Google Capstone)

An end-to-end business case study from the Google Data Analytics Certificate. I analyzed Fitbit data to extract consumer behavior insights, using a hybrid R and SQL (BigQuery) pipeline to derive actionable business recommendations.

🎯 Context & Problem

The health-tech company Bellabeat needed to understand how consumers use tracking devices to identify market opportunities. The business problem was: What trends in Fitbit data can inspire new product features or marketing campaigns?

⚙️ Technical Methodology

The key technical highlight was the use of a hybrid pipeline: I processed most files in R with Tidyverse. However, for the large heart rate dataset, I uploaded it to Google BigQuery and used SQL to clean and aggregate it before re-importing it into R. I used ggplot2 for the visualizations.
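
The aggregation was pushed down to BigQuery and only the compact result pulled back for analysis; purely as an illustration of that pattern (shown here with the Python client rather than R, and with hypothetical project, dataset, and column names), the step might look like:

```python
# Illustrative sketch: let BigQuery do the heavy aggregation, then bring only
# the small hourly-level result back into memory.
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # assumes credentials are set up

sql = """
    SELECT
      Id,
      TIMESTAMP_TRUNC(Time, HOUR) AS hour,
      AVG(Value) AS avg_heart_rate
    FROM `my-gcp-project.fitbit.heartrate_seconds`
    GROUP BY Id, hour
"""

hourly_hr = client.query(sql).to_dataframe()
print(hourly_hr.head())
```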

🛠️ Tech Stack

R (Tidyverse, ggplot2) · SQL · Google BigQuery · RStudio

📊 Impact & Results

The impact is measured by its final business recommendations. I proposed concrete actions such as a "sedentary alert" in the app or marketing campaigns focused on converting light to moderate activity, creating a clear bridge between data analysis and business strategy.

🔗 Relevant Links

🔗 View code on GitHub

Comparative Analysis of Clustering Algorithms: K-Means vs. DBSCAN

A rigorous comparative study demonstrating why DBSCAN (density-based) outperforms K-Means (centroid-based) in scenarios with noise and non-spherical clusters, validating results visually (PCA) and quantitatively (Dunn Index, Silhouette).

🎯 Context & Problem

In unsupervised ML, choosing the right algorithm is crucial. The research question was: Can we demonstrate and quantify the superiority of a density-based algorithm like DBSCAN over K-Means on a real-world dataset that presents noise and irregular-shaped clusters?

⚙️ Technical Methodology

Nulls were imputed with a robust conditional mean. OPTICS was used to guide the parameter selection for DBSCAN. Quantitative validation was the core of the project, using the `fpc` library to calculate metrics such as the Dunn Index and Silhouette Width. The results showed a drastic improvement in the Dunn Index (from 0.004 to 0.24), confirming DBSCAN's superiority on this dataset.
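
The study was carried out in R with the `dbscan` and `fpc` packages; as a rough scikit-learn sketch of the same comparison (scikit-learn has no Dunn Index, so only the silhouette is shown, and the two-moons data merely stands in for "non-spherical clusters with noise"):

```python
# Illustrative sketch: K-Means vs. DBSCAN on non-spherical data.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=400, noise=0.08, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means silhouette:", round(silhouette_score(X, km_labels), 3))

# DBSCAN labels noise points as -1; score only the points it actually clustered.
mask = db_labels != -1
if len(set(db_labels[mask])) > 1:
    print("DBSCAN silhouette:", round(silhouette_score(X[mask], db_labels[mask]), 3))
```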

🛠️ Tech Stack

R · Tidyverse · dbscan · fpc · ggbiplot

📊 Impact & Results

The impact is a clear, evidence-based demonstration of how to select the appropriate clustering model. The result is not just a set of clusters, but a deeper understanding of how and why these algorithms work, serving as a guide for other analysts.

🔗 Relevant Links

🔗 View code on GitHub

Data Warehouse & OLAP Cube Design for Multidimensional Sales Analysis

Designed a Data Warehouse solution in Microsoft SQL Server and built an OLAP Cube to enable multidimensional sales analysis (by product, customer, time, location), transforming data exploration for decision-making.

🎯 Context & Problem

A fictional company relied on static reports from its transactional database, making it impossible to answer complex business questions. The objective was to create an analytical data structure that would allow for "slicing and dicing" queries quickly and intuitively.

👤 My Role & Responsibilities

I took on the role of a Business Intelligence Analyst / Data Architect, responsible for designing the Data Warehouse schema (star schema), creating the SQL scripts for the ETL, and finally, designing and configuring the OLAP Cube for analysis.

⚙️ Technical Methodology

The solution was based on a Data Warehouse with a fact table (`FactSales`) and several dimension tables (`DimProduct`, `DimCustomer`, `DimDate`, `DimLocation`), a fundamental structure for OLAP Cube performance that allows business users to explore data from any perspective without technical knowledge.
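
The production design lives in SQL Server and an OLAP Cube; as a toy illustration of why the star schema makes "slice and dice" queries cheap, here is a pandas sketch with hypothetical keys and attributes.

```python
# Illustrative sketch: join a fact table to its dimensions, then aggregate
# along any combination of dimension attributes (the essence of slice/dice).
import pandas as pd

fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2, 2],
    "date_id": [20240101, 20240201, 20240101, 20240201],
    "amount": [120.0, 90.0, 300.0, 250.0],
})
dim_product = pd.DataFrame({"product_id": [1, 2], "category": ["Audio", "Video"]})
dim_date = pd.DataFrame({"date_id": [20240101, 20240201], "month": ["2024-01", "2024-02"]})

cube = fact_sales.merge(dim_product, on="product_id").merge(dim_date, on="date_id")
print(cube.pivot_table(values="amount", index="category", columns="month", aggfunc="sum"))
```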

🛠️ Tech Stack

Microsoft SQL Server · Data Warehousing · OLAP Cube · R · R Markdown

📊 Impact & Results

The project provides the company with a powerful Business Intelligence tool, democratizing data access and empowering managers to perform their own analyses, leading to faster, data-driven decision-making.

🔗 Relevant Links

🔗 View code on GitHub

Classification Model for Predicting Obesity Levels

Developed an end-to-end classification model to predict obesity levels from habits and lifestyle. A comparative "bake-off" of multiple algorithms was performed to find the best-performing one.

🎯 Context & Problem

Obesity is a multifactorial problem. This project sought to answer: Can we, using only data on a person's habits, predict their weight category with a useful degree of accuracy and identify the most influential lifestyle factors?

⚙️ Technical Methodology

After a thorough Exploratory Data Analysis (EDA), the data was prepared and a range of classification models, including Logistic Regression, K-Nearest Neighbors (KNN), SVM, and Random Forest, were trained to determine the best performer for this public health problem.
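
A minimal sketch of such a bake-off follows; the synthetic three-class data stands in for the habits/lifestyle features, and the hyperparameters are illustrative defaults rather than the tuned values.

```python
# Illustrative sketch: compare several classifiers with cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=1)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=2000),
    "knn": KNeighborsClassifier(n_neighbors=7),
    "svm": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=1),
}

for name, model in candidates.items():
    pipe = make_pipeline(StandardScaler(), model)  # scaling matters for KNN and SVM
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```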

🛠️ Tech Stack

Python · Pandas · Scikit-learn · Matplotlib · Seaborn

📊 Impact & Results

The project resulted in a functional model that can serve as an educational or preliminary screening tool, identifying the most potent lifestyle predictors to help design more effective prevention campaigns.

🔗 Relevant Links

🔗 View code on GitHub

Bankruptcy Prediction using Decision Trees

Building an interpretable classification model (Decision Tree) to predict the probability of a company's bankruptcy from its financial data, highlighting the importance of transparency in high-stakes applications.

🎯 Context & Problem

For investors and regulators, early detection of bankruptcy risk is critical. The problem was: Can we build a model that serves as an early warning system and also provides clear, interpretable rules to justify its predictions?

⚙️ Technical Methodology

The choice of a Decision Tree was deliberate to ensure interpretability. This model allows for the visualization of the decision rules it uses (e.g., "if debt ratio > X and ROC < Y, then high risk"), which is crucial for trust in a financial setting. Performance was evaluated with a special focus on the Recall of the minority class (at-risk companies).
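
A minimal sketch of that idea is shown below: a shallow tree whose rules can be printed verbatim, evaluated on the recall of the minority class. The synthetic financial ratios and the class_weight/max_depth settings are illustrative assumptions.

```python
# Illustrative sketch: an interpretable tree with explicit rules and a
# recall-focused evaluation of the minority (at-risk) class.
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, n_features=8, weights=[0.9, 0.1],
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=7)

# class_weight="balanced" makes the tree pay attention to the rare class.
tree = DecisionTreeClassifier(max_depth=3, class_weight="balanced", random_state=7)
tree.fit(X_tr, y_tr)

# The fitted tree reads as explicit if/then rules -- the interpretability payoff.
print(export_text(tree, feature_names=[f"ratio_{i}" for i in range(8)]))
print("Recall on the at-risk class:", recall_score(y_te, tree.predict(X_te)))
```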

🛠️ Tech Stack

Python · Pandas · Scikit-learn · Interpretable AI

📊 Impact & Results

The project provides a decision support tool for financial risk management, enabling a shift from reactive analysis to proactive risk identification and timely corrective action.

🔗 Relevant Links

🔗 View code on GitHub

Exploratory Data Analysis (EDA) on TV Series Trends

A pure Exploratory Data Analysis and storytelling exercise that transforms a raw dataset about TV series into a coherent visual narrative to answer business questions about dominant platforms and genres.

⚙️ Technical Methodology

The entire analysis was structured around a series of key questions, using Pandas for data manipulation and Matplotlib/Seaborn to create the visualizations that answered them. A significant part of the work involved preprocessing text data to reliably analyze genres.
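
For example, the genre preprocessing boils down to splitting multi-genre strings and exploding them into one row per genre; a small sketch with made-up titles and a hypothetical `genres` column:

```python
# Illustrative sketch: split, explode, and count genres, then plot the result.
import matplotlib.pyplot as plt
import pandas as pd

series = pd.DataFrame({
    "title": ["Show A", "Show B", "Show C"],
    "genres": ["Drama, Crime", "Comedy", "Drama, Sci-Fi"],
})

genre_counts = (series["genres"]
                .str.split(",")     # one string -> list of genres
                .explode()          # one row per genre
                .str.strip()
                .value_counts())

genre_counts.plot(kind="bar", title="Series per genre")
plt.tight_layout()
plt.show()
```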

🛠️ Tech Stack

Python · Pandas · Matplotlib · Seaborn · Data Storytelling

📊 Impact & Results

The result is an analytical report that offers a data-driven snapshot of the television world. The impact is the demonstration of how EDA alone can generate value by converting data into knowledge and communicating complex insights simply and visually.

🔗 Relevant Links

🔗 View code on GitHub

Titanic Survival Prediction: A Classification Classic

An execution of the data science "rite of passage," focused on excellence in the fundamentals. It stands out for its meticulous feature engineering (title extraction, family variables) and a rigorous "bake-off" of five classification models.

⚙️ Technical Methodology

The project's strength was its creative feature engineering: I transformed `Cabin` into a binary `InCabin` flag, combined `SibSp` and `Parch` into `FamilySize`, and most importantly, extracted titles (`Mr`, `Mrs`, `Master`) from the `Name` column, revealing a powerful predictive signal. Five models (Logistic Regression, Decision Tree, Random Forest, k-NN, and SVM) were compared using the area under the ROC curve (AUC) as the primary metric.
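
A condensed sketch of those three features, using the standard Titanic column names; the "+1 for the passenger" convention in FamilySize is a common choice rather than something stated above.

```python
# Illustrative sketch: title extraction, cabin flag, and family size.
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley"],
    "Cabin": [None, "C85"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
})

# Grab the honorific between the comma and the first period in the name.
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)

df["InCabin"] = df["Cabin"].notna().astype(int)
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1  # +1 counts the passenger themselves

print(df[["Title", "InCabin", "FamilySize"]])
```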

🛠️ Tech Stack

Python · Pandas · Scikit-learn · Feature Engineering · Classification

📊 Impact & Results

The final SVM model achieved an AUC score of 0.82, demonstrating good predictive power. The real impact of the project is the demonstration of a robust, replicable classification workflow that I can apply to other business problems, and a reminder that strong feature engineering often matters more than a more complex algorithm.

🔗 Relevant Links

🔗 View code on GitHub