Felipe Dias de Souza
PORTFOLIO

Data Engineer, Data Analyst, and BI Analyst skilled in DataOps, AI, Statistics, Python, Azure, and Viz.
#DataEngineer

Jan 30, 2025

Azure Data Lakehouse Pipeline

    Description: This project showcases the implementation of a modern data lakehouse architecture using Azure Data Factory, Azure Databricks, and Delta Lake. The pipeline follows a structured Bronze, Silver, and Gold layered approach to optimize data governance, transformation, and analytics.
    Objectives:
      - Automate data ingestion from SQL Server and GitHub using Azure Data Factory.
      - Implement a structured data lake using Azure Data Lake Gen2, ensuring efficient storage and processing.
      - Perform data transformations with Azure Databricks, processing data incrementally and structuring it into a Star Schema for analytics.
      - Secure data using Azure AD and Key Vault, ensuring proper authentication and governance.
      - Enable data visualization and insights using Power BI, allowing real-time reporting.
    Technologies Used:
      - Azure Data Factory (Data Ingestion & Orchestration)
      - Azure Databricks (Data Transformation)
      - Azure Data Lake Gen2 (Storage)
      - Delta Lake (Data Optimization & Transactions)
      - Power BI (Visualization & Reporting)
      - Azure AD & Key Vault (Security & Governance)
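The incremental processing named in the objectives can be illustrated with a watermark pattern: each run picks up only the rows changed since the previous run. This is a minimal pure-Python sketch, not the project's Databricks code. The `updated_at` field, the record layout, and the function name `incremental_load` are hypothetical.

```python
from datetime import datetime

def incremental_load(source_rows, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark.

    Assumes every row carries a hypothetical 'updated_at' datetime.
    """
    new_rows = [r for r in source_rows if r["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["updated_at"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

# Hypothetical source table.
source = [
    {"id": 1, "updated_at": datetime(2025, 1, 1)},
    {"id": 2, "updated_at": datetime(2025, 1, 10)},
    {"id": 3, "updated_at": datetime(2025, 1, 20)},
]

rows, watermark = incremental_load(source, datetime(2025, 1, 5))
# A second run with the stored watermark finds nothing new.
rows_again, watermark_again = incremental_load(source, watermark)
```

Storing the watermark between runs is what keeps each run proportional to the amount of changed data rather than to the full table size.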
Azure Data Pipeline

The image illustrates the end-to-end data pipeline, starting from data ingestion using Azure Data Factory to a structured lakehouse architecture in Azure Data Lake Gen2. The transformation follows the Bronze, Silver, and Gold layer approach, optimizing data quality and analytics.

This approach ensures that raw data is collected in the Bronze layer, processed into structured tables in the Silver layer, and transformed into analytical models (Star Schema) in the Gold layer, stored in Delta Lake for efficient querying.
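The flow from Bronze to Silver to a Gold Star Schema can be sketched in a few lines. The example below is an illustration in plain Python under stated assumptions, not the Databricks and Delta Lake implementation. The sample records, the column names, and the helpers `to_silver` and `to_gold` are all hypothetical.

```python
# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"order_id": "1", "customer": " Ana ", "amount": "100.50"},
    {"order_id": "2", "customer": "Bruno", "amount": "80.00"},
    {"order_id": "2", "customer": "Bruno", "amount": "80.00"},  # duplicate
    {"order_id": "3", "customer": "Ana", "amount": None},       # missing amount
]

def to_silver(rows):
    """Cast types, trim text, and drop incomplete rows and duplicate order ids."""
    seen, silver = set(), []
    for r in rows:
        if r["amount"] is None or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        silver.append({
            "order_id": int(r["order_id"]),
            "customer": r["customer"].strip(),
            "amount": float(r["amount"]),
        })
    return silver

def to_gold(silver_rows):
    """Split Silver rows into a customer dimension and a sales fact table."""
    dim_customer = {}   # customer name -> surrogate key
    fact_sales = []
    for r in silver_rows:
        key = dim_customer.setdefault(r["customer"], len(dim_customer) + 1)
        fact_sales.append(
            {"order_id": r["order_id"], "customer_key": key, "amount": r["amount"]}
        )
    return dim_customer, fact_sales

silver = to_silver(bronze)
dim_customer, fact_sales = to_gold(silver)
```

The fact table holds only keys and measures, while descriptive attributes live in the dimension, which is the property that makes a Star Schema convenient for BI tools.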

Jan 05, 2025

Data Engineering Pipeline

    Description: This project showcases a complete data engineering pipeline implemented with Azure services. It demonstrates the end-to-end process of ingesting, transforming, and analyzing large-scale data from on-premise systems to cloud analytics solutions.
    Objectives: The pipeline aims to deliver a scalable and secure data ecosystem to support real-time and batch analytics for business decision-making. It answers key business questions by transforming raw data into actionable insights using advanced cloud technologies.
    Key Steps:
  • Ingestion: Data is ingested from on-premise sources into Azure Data Lake via Azure Data Factory.
  • Transformation: Data flows through the Bronze (raw), Silver (cleaned), and Gold (aggregated) layers using Azure Databricks for processing.
  • Analytics: Transformed data is loaded into Azure Synapse Analytics and visualized in Power BI dashboards.
  • Security and Governance: The architecture ensures compliance and security using Azure Active Directory and Azure Key Vault for identity management and sensitive data protection.
Data Engineering Pipeline Architecture

This architecture highlights the core principles of modern data engineering—data lakehouse integration, scalable ETL/ELT workflows, and secure access control.
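As a rough illustration of the Gold (aggregated) layer listed in the key steps, the sketch below rolls cleaned Silver rows up into totals per group. It uses plain Python rather than Databricks, and the `region` and `amount` fields and the `aggregate_gold` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical cleaned (Silver) rows.
silver_sales = [
    {"region": "North", "amount": 120.0},
    {"region": "South", "amount": 75.5},
    {"region": "North", "amount": 30.0},
]

def aggregate_gold(rows, key, value):
    """Sum the 'value' field for each distinct 'key', producing a Gold-style summary."""
    totals = defaultdict(float)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

gold_sales_by_region = aggregate_gold(silver_sales, "region", "amount")
```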

Aug 03, 2022

Marketing Campaign

    Description: This case involves several account types. Sales occur through normal channels, and we can also use marketing tactics to drive them, such as flyers, emails, phone calls, and visits. However, not all marketing tactics are equally effective: some work better with certain account types than with others.
    Objective: This project aims to help answer these questions:
    - What is the impact of each marketing strategy and sales visit on Sales (Amount Collected)?
    - Is the same strategy valid for all the different client types?
    Implications: The analysis shows that several factors are associated with the amount collected from accounts. In particular, sales contacts emerged as the most influential drivers of return on investment. Therefore, allocating additional resources to these areas would enhance the effectiveness of the overall campaign.

This plot shows the distributions of the indicators, such as Amount Collected, Unit Sold, Monthly Target, and Client Type, helping identify which indicators are most strongly associated with sales.

This bar plot visualizes the return on investment (ROI) by variable and account type. Each bar represents a specific variable, and the height of the bar indicates the average return on investment in dollars ($) associated with that variable. The bars are grouped by account type, with different colors representing different types of accounts.
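The text does not say how the return on investment was estimated. One common reading, assumed here, is the slope of Amount Collected against spend on a tactic, fitted separately for each account type. The sketch below shows that idea with hypothetical records and a hypothetical `slope` helper. It is not the analysis behind the plot.

```python
from collections import defaultdict

def slope(x, y):
    """Ordinary least squares slope of y on x (dollars collected per dollar spent)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

# Hypothetical records: (account type, spend on one tactic, amount collected).
records = [
    ("Small", 100, 1300), ("Small", 200, 1600), ("Small", 300, 1900),
    ("Large", 100, 2500), ("Large", 200, 3000), ("Large", 300, 3500),
]

# Group spend and collected amounts by account type.
by_type = defaultdict(lambda: ([], []))
for account_type, spend, collected in records:
    by_type[account_type][0].append(spend)
    by_type[account_type][1].append(collected)

# One ROI estimate per account type for this tactic.
roi = {t: slope(x, y) for t, (x, y) in by_type.items()}
```

Fitting per account type is what lets the same tactic show different returns for different clients, which is the second question the project asks.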

Aug 03, 2022

Economic Analysis Consumer Price Index - USA

    Description: This project aims to predict the Consumer Price Index for All Urban Consumers (CPI-U) in the United States by leveraging various economic indicators. The CPI is a crucial measure of inflation, reflecting changes in the cost of living over time. By accurately forecasting CPI movements, policymakers, economists, and businesses can make informed decisions about monetary policy, budgeting, and investment strategies. Such data was obtained from FRED.
    Objective: The primary goal is to build a robust predictive model capable of forecasting CPI movements based on changes in the selected economic indicators. The model's performance will be assessed using appropriate evaluation metrics, such as mean absolute error or root mean squared error, to ensure its reliability and effectiveness.
    Implications: Accurate CPI predictions have significant implications for various stakeholders, including policymakers, investors, businesses, and consumers. Understanding future inflation trends can inform monetary policy decisions, asset allocation strategies, pricing decisions, and budget planning.
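The two evaluation metrics named in the objective are simple to compute by hand. The sketch below uses hypothetical CPI-U values and predictions purely to show the arithmetic. The function names are illustrative.

```python
import math

def mean_absolute_error(actual, predicted):
    """Average size of the errors, in the same units as the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the mean squared error; penalizes large misses more than MAE."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# Hypothetical CPI-U values and model predictions.
actual = [300.0, 302.0, 305.0, 307.0]
predicted = [301.0, 302.0, 303.0, 309.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
```

RMSE is never smaller than MAE on the same errors, and a large gap between the two suggests a few big misses rather than many small ones.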

The heatmap above shows the correlation matrix between CPI-U and the economic indicators, helping identify which indicators correlate most strongly with CPI-U.
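Each cell of such a heatmap is a Pearson correlation coefficient. As a small sketch of how the CPI-U row could be computed, the example below uses hypothetical monthly values and a hand-written `pearson` helper. The series names are illustrative, and the numbers are not FRED data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly observations.
series = {
    "cpi_u": [300.0, 302.0, 304.0, 306.0],
    "unemployment_rate": [4.0, 3.9, 3.8, 3.7],
    "labor_force_participation": [62.5, 62.7, 62.6, 62.8],
}

# Correlation of every indicator with CPI-U (one row of the matrix).
correlations = {
    name: pearson(values, series["cpi_u"])
    for name, values in series.items()
    if name != "cpi_u"
}
```

Note that two trending time series can correlate strongly without one driving the other, so these values are a screening tool rather than evidence of causation.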

The plot above shows the historical trend of CPI-U alongside each economic indicator (Unemployment Rate, Labor Force Participation Rate, Treasury and Agency Securities, All Commercial Banks Data) over time.

The feature importance plot above ranks the economic indicators by their impact on predicting CPI-U.

April 25, 2021

Steel Plate Defect

    Problem Statement: I must predict the probability for each of 7 defect categories: Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, Other_Faults.
    Context: The aim is to develop a predictive model that can accurately predict defects using relevant parameters. The model will be valuable for factories.
    Objective: Implement gradient boosting classification, an ensemble technique that builds decision trees sequentially, to construct a predictive model capable of accurately predicting the types of defect based on the identified significant variables.
    Analysis: Determine the relative contribution of each parameter to the model's prediction of Pastry, Z_Scratch, K_Scatch, Stains, Dirtiness, Bumps, and Other_Faults. Additionally, we will rigorously evaluate the model's performance using accuracy.

The plot above visualizes the ROC AUC scores for the classification models.
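ROC AUC can be read as the probability that a randomly chosen defective sample receives a higher score than a randomly chosen non-defective one. The sketch below computes it per defect category by direct pairwise comparison and then averages across categories. The labels and probabilities are hypothetical, and averaging across categories is an assumption about how a single score might be reported.

```python
def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities for two categories.
y_true = {"Pastry": [1, 0, 1, 0], "Stains": [0, 0, 1, 1]}
y_prob = {"Pastry": [0.9, 0.2, 0.4, 0.6], "Stains": [0.1, 0.3, 0.8, 0.7]}

auc_per_class = {c: roc_auc(y_true[c], y_prob[c]) for c in y_true}
mean_auc = sum(auc_per_class.values()) / len(auc_per_class)
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations. Library implementations use a sort-based method instead.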

The plot above illustrates the classification report, showing the main classification metrics: precision, recall, and F1-score.
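For a single defect category, the three metrics in a classification report come from counts of true positives, false positives, and false negatives. The sketch below shows the arithmetic on hypothetical labels. The helper name `classification_metrics` is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels for one defect category (1 = defect present).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
```

Here two of three predicted defects are real (precision 2/3) and two of four real defects are found (recall 1/2), which gives an F1 of 4/7.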

April 10, 2021

Weather Analysis

    Problem Statement: Understand the factors affecting air temperature and build a predictive model to forecast air temperature based on weather parameters.
    Context: The aim is to develop a predictive model that can accurately forecast air temperature using relevant weather parameters. The model will be valuable for weather forecasting agencies, agricultural planning, and various industries dependent on weather conditions.
    Objective: Implement Ridge Regression, a regularized linear regression technique, to construct a predictive model capable of accurately forecasting air temperature based on the identified significant weather variables. This model selection aims to achieve optimal performance while mitigating overfitting, a common challenge in machine learning.
    Analysis: Determine the relative contribution of each weather parameter to the model's prediction of air temperature. Additionally, we will rigorously evaluate the model's performance using various statistical metrics such as mean squared error (MSE) and R-squared value.
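To show how the Ridge penalty mitigates overfitting, the sketch below fits a single-feature model in closed form. With centered data the slope is `sxy / (sxx + alpha)`, so a larger `alpha` shrinks the coefficient toward zero. The humidity and temperature values are hypothetical, and `fit_ridge_1d` is an illustrative helper, not the project's model.

```python
def fit_ridge_1d(x, y, alpha):
    """Closed-form ridge fit for one feature; the intercept is not penalized."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / (sxx + alpha)      # alpha = 0 gives ordinary least squares
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical data: relative humidity (%) and air temperature (degrees Celsius).
humidity = [30.0, 50.0, 70.0, 90.0]
temperature = [30.0, 26.0, 22.0, 18.0]

ols_slope, ols_intercept = fit_ridge_1d(humidity, temperature, alpha=0.0)
ridge_slope, ridge_intercept = fit_ridge_1d(humidity, temperature, alpha=2000.0)
```

With these numbers the penalty halves the slope, from -0.2 to -0.1. Real models have several features, where the same shrinkage is applied to every coefficient.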

The plot above visualizes the air temperature trends throughout the year 2023. The blue line represents the daily temperature variations, while the red dashed line indicates the mean temperature for the year. The plot helps to observe temperature fluctuations over time and highlights the average temperature level for the entire year.

The plot above illustrates the performance comparison of different regression models, including Linear Regression, Ridge Regression, Lasso Regression, and Random Forest Regression. Each boxplot represents the distribution of Root Mean Squared Error (RMSE) scores obtained through cross-validation for a specific model. The lower the RMSE, the better the model's predictive performance. This analysis aids in selecting the most suitable regression model for the given dataset based on its predictive accuracy.
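Each boxplot is built from one RMSE score per cross-validation fold. The sketch below shows how such per-fold scores can be produced for two candidate models. It is a simplified stand-in: the data are hypothetical, only a one-feature linear model and a deliberately over-regularized ridge model are compared, and the helpers `fit_line` and `kfold_rmse` are illustrative.

```python
import math

def fit_line(x, y, alpha=0.0):
    """Least-squares line with an L2 penalty of strength alpha on the slope."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / (sxx + alpha)
    return slope, my - slope * mx

def kfold_rmse(x, y, k, alpha):
    """Return one held-out RMSE per fold; fold f tests on indices f, f+k, f+2k, ..."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(x), k))
        train_x = [a for i, a in enumerate(x) if i not in test_idx]
        train_y = [b for i, b in enumerate(y) if i not in test_idx]
        slope, intercept = fit_line(train_x, train_y, alpha)
        squared = [(y[i] - (slope * x[i] + intercept)) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores

# Hypothetical data: day index and air temperature in degrees Celsius.
days = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
temps = [20.1, 19.4, 19.0, 18.6, 17.9, 17.5, 17.1, 16.4]

rmse_by_model = {
    "linear (alpha=0)": kfold_rmse(days, temps, k=4, alpha=0.0),
    # Very large alpha: the slope is shrunk almost to zero (underfitting).
    "ridge (alpha=1e6)": kfold_rmse(days, temps, k=4, alpha=1e6),
}
mean_rmse = {name: sum(s) / len(s) for name, s in rmse_by_model.items()}
```

Because every score comes from data the model did not see during fitting, the spread of each list reflects how stable the model is across folds, which is what the boxplots display.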

The plot above compares the target temperatures with the model temperatures over time. Each point represents a specific date, with the target temperatures indicated by one set of points and the model temperatures indicated by another set. The plot helps visualize the relationship between the observed and predicted temperatures, aiding in assessing the accuracy and performance of the model across different dates.

March 15, 2021

Segment Shopping Customer

    Problem Statement: Understand the target customer so that the marketing team can plan a strategy.
    Context: The manager wants to identify the most important shopping groups based on income, age, and the mall shopping score. He wants the ideal number of groups with a label for each.
    Objective: Divide the mall's target market into approachable groups. Create subsets of the market based on demographic and behavioural criteria to better understand the target for marketing activities.
    Analysis:
    - The target group would be cluster 7, which has a high Spending Score and high Annual Income.
    - 57 percent of cluster 3 shoppers are women. We should look for ways to attract these customers using a marketing campaign targeting popular items in this cluster.
    - Cluster 5, which has a high Spending Score and low Annual Income, presents an interesting opportunity to market sales events on popular items to these customers.

This plot visualizes the bivariate clustering of annual income and spending score. The black stars represent the cluster centers obtained from the clustering algorithm. Each point on the scatter plot represents a data point, with the x-axis indicating annual income and the y-axis indicating spending score. The plot provides insights into the relationships and patterns present in the data regarding spending behavior and income levels.
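The text does not name the clustering algorithm. Since the plot shows cluster centers, the sketch below assumes k-means, which alternates between assigning points to the nearest center and moving each center to the mean of its points. The customer data, the starting centers, and the helper names are hypothetical.

```python
def assign(points, centers):
    """Group points by their nearest center (squared Euclidean distance)."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(
            range(len(centers)),
            key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
        )
        clusters[nearest].append(p)
    return clusters

def kmeans(points, centers, iterations=10):
    """Run a fixed number of k-means steps from the given starting centers."""
    for _ in range(iterations):
        clusters = assign(points, centers)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, assign(points, centers)

# Hypothetical customers: (annual income in thousands, spending score).
customers = [
    (15, 80), (17, 76), (20, 79),   # low income, high spending
    (80, 85), (85, 90), (78, 88),   # high income, high spending
    (82, 15), (88, 10),             # high income, low spending
]
start = [(20, 80), (80, 80), (80, 20)]

centers, clusters = kmeans(customers, start)
```

In practice the starting centers are chosen randomly or with a seeding heuristic, and the number of clusters is selected by comparing runs, for example with an elbow plot.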