Data Scientist by day, Data Superhero by night.

My Tool 'Box'

Tools are exciting and are constantly evolving. I love working with tools to make my work more efficient, but understandng the business problem plus the right tool is the combo I'm excited about. Nonetheless, my tool 'box':

Programming Language: Python, SQL & R.

Data Analysis & Visualization: SQL, Power BI, Tableau,Matplotlib

Power Platform Development: Power Apps, Power Automate, & SharePoint.

Deploy: Azure.

Technolgy evolves and so do we.

See my resume here

Predicting Income with Python Classification Algorithms

This task focused on using classification algorithms to predict income levels in a selected demographic, utilizing Python's robustness and Azure MLD's cloud-based platform. The analysis was performed on the Adult Dataset from the UCI Machine Learning Repository.

The goal was to gain insights into income level determinants and broader socio-economic dynamics.

Three classification algorithms (Random Forest, KNN, Decision Tree) were applied, with Random Forest showing a good performance score of approximately 85.7% accuracy and a high ROC-AUC score, indicating effective class separation and robustness against overfitting.

The EDA and recommendations for this analysis, as well as a comprehensive report,are detailed in the github repository for this project.

This project has also been done using Azure Machine Learning Designer. The details are in the report.

View Project

A Clustering Study: Customer Segmentation and Behavior Analysis

The goal of this task was to apply clustering algorithms to the UCI Online Retail Store Dataset, focusing on customer segmentation and behavior analysis.

Two clustering algorithms, K-Means and Hierarchical Clustering, were applied. The optimal number of clusters was determined using the WCSS Elbow Method and Silhouette Scoring Method. The analysis included examining the distribution of the 'Quantity' feature across clusters through boxplots and visualizing clusters in 2D scatter plots. Clusters were characterized by various purchasing behaviors: mainstream transactions, higher value transactions, bulk purchases, and outlier transactions.

The EDA and recommendations for this analysis, as well as a comprehensive report,are detailed in the github repository for this project.

View Project

Analyzing Netflix’s Trends: Streaming sentiments and text mining

This project focuses on analyzing text data from the Netflix Movie and TV Show dataset to understand the sentiments and narratives in the streaming content industry. Utilizing Python libraries and NLTK for sentiment intensity analysis, the project aims to quantify emotional content and uncover trends in various dimensions such as genre, providing a data-driven narrative of the industry's landscape.

The analysis involved sentiment analysis using VADER (Valence Aware Dictionary and sEntiment Reasoner), frequency analysis, topic modeling, word clouds, and N-gram analysis. Sentiment scores were computed and analyzed through histograms, revealing a distribution with a peak around the neutral zone. Genre sentiment analysis was conducted by splitting and expanding genres, and average sentiment scores for top actors were plotted.

The EDA and recommendations for this analysis, as well as a comprehensive report,are detailed in the github repository for this project.

View Project

Global Health Dynamics: Demographics and Fiscal Impacts using R

This research provides a detailed analysis of global health systems, examining the interplay between demographics, fiscal allocation, mortality, health system equity, and efficacy from 2010 to 2021. It employs a blend of qualitative, quantitative, and domain-specific methodologies to offer a holistic view of health systems across various socioeconomic contexts. The primary aim is to identify patterns, disparities, and correlations, offering insights into global health systems to inform policies, best practices, and data scientists focusing on economics.

The analysis encompassed descriptive statistics to understand data distribution and properties. Statistical measures such as mean, median, mode, standard deviation, skewness, kurtosis, range, IQR, variance, and quantiles were calculated. Visual EDA included boxplots, histograms, scatterplots, and multivariate plots to understand data distribution, outliers, and relationships among variables.

The EDA and recommendations for this analysis, as well as a comprehensive report,are detailed in the github repository for this project.

View Project

TIME SERIES FORECASTING WITH SARIMA MODEL

ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. This is one of the easiest and effective machine learning algorithm to performing time series forecasting. This is the combination of Auto Regression and Moving average. ARIMA is an algorithm used for forecasting Time Series Data. Using this model, we can analyze and model time-series data to make future decisions. Some of the applications of Time Series Forecasting are weather forecasting, sales forecasting, business forecasting, stock price forecasting,etc For stationary data, ARIMA model is required; seasonal data, Seasonal ARIMA (SARIMA) is required.

View Project

Google Data Studio Visualization: D&S Health Sales Dashboard

The Google ecosystem is a fantastic innovation. It was my first go-to in my journey before I was swayed by Microsoft. For this simple analysis & Visualization, we highlight D&S Health Tech. D&S is a mental health tech-ed organization. D&S has just completed a sales campaign. The Sales team needs some insights(revenue,products,coupons,etc.) from some sales data submitted. This analysis and visulaization was done using Google Data Studio, while the dataset was collected on Google Sheets.

Data Exploratory Analysis with Python

Our client is a major gas and electricity utility that supplies to corporate, SME (Small & Medium Enterprise), and residential customers. The power-liberalization of the energy market in Europe has led to significant customer churn, especially in the SME segment. They want to diagnose the source of churning SME customers. A fair hypothesis is that price changes affect customer churn. Therefore, it is helpful to know which customers are more (or less) likely to churn at their current price, for which a good predictive model could be useful. Moreover, for those customers that are at risk of churning, a discount might incentivize them to stay with our client. The head of the SME division is considering a 20% discount that is considered large enough to dissuade almost anyone from churning (especially those for whom price is the primary concern). An initial team meeting was held to discuss various hypotheses, including churn due to price sensitivity. Deeper exploration is required on the hypothesis that the churn is driven by the customers’ price sensitivities. The client plans to use the predictive model on the 1st working day of every month to indicate to which customers the 20% discount should be offered. We would need to fomulate the hypothesis as a data science problem and lay out the major steps needed to test this hypothesis. Communicate your thoughts and findings to the client, focusing on the data that we would need from the client and the analytical models we would use to test such a hypothesis.

View Project

Data Cleaning with SQL

Data cleaning is a pertininet part of end to end data analysis process. SQL, in my opinion is a great tool. Apt, concise. The quality of the data we use determines the quality of the results and insights we get. Many professionals believe that we should dedicate more time to preparing and cleaning the data rather than other processes like EDA, ML, and others. Otherwise, we could finish with a bunch of inaccurate output. In this project I drive a cleaning data process to prepare data for analysis by modifying incomplete data, removing irrelevant and duplicated rows, splitting addresses, and modifying improperly formatted data. Data cleaning is not about erasing information to simplify the dataset, but rather finding a way to maximize the accuracy of the collected data. Let’s go over cleaning techniques with a Housing dataset. It has 56K+ rows. Let’s get started!.

View Project

Data Exploration in Structured Query Language

Covid-19 was tough. It's the first pandemic I have ever witnessed and my world turned 360. This project is a simple data exploration excercise using Structured Query Language with real data sets from the Covid-19 pandemic. I do intend on working more with this data because the dataset is extensive. On Data exploration and aggregation, these are two significant aspects of data analysis while processing with transactional data stored in SQL Server for statistical inferences. For data exploration using data science languages like SQL, data often needs to be filtered, re-ordered, transformed, aggregated and visualized. There are many possibilities for achieving these functionalities.

View Project

Web Scraping with Python (BeautifulSoup Lib)

Web scraping (or data scraping) is a technique used to collect content and data from the internet. This data is usually saved in a local file so that it can be manipulated and analyzed as needed. If you’ve ever copied and pasted content from a website into an Excel spreadsheet, this is essentially what web scraping is, but on a very small scale. However, when people refer to ‘web scrapers,’ they’re usually talking about software applications. Web scraping applications (or ‘bots’) are programmed to visit websites, grab the relevant pages and extract useful information. By automating this process, these bots can extract huge amounts of data in a very short time. SR:Careerfoundry

View Project

Data Visualization with Tableau & Power BI

The analyst is currently cooking new visualizations. Check back soon!

Coming Soon!

Correlation in Python

Currently writing multiple lines of code using python. Please check in soon

Coming soon!

PORTFOLIO FOR OLISEDEME CHINONYE FAVOUR

PORTFOLIO FOR
OLISEDEME CHINONYE FAVOUR