Welcome to the world of data analysis with Python!
Here I share code examples, tips and experiences from real projects.
Although data science seems like a modern field, its roots go back centuries. From statistics to machine learning, from computer science to mathematics... Here's the fascinating journey of this interdisciplinary field!
A successful data science project depends not only on technical skills but also on the right methodology. Here are the approaches most widely used in the industry, along with my experiences from real projects...
# Agile Data Science Sprint Plan (2-week iteration)
## Sprint 1: Data Discovery & Quick Wins
Week 1:
- [ ] Stakeholder interviews and problem definition
- [ ] Data source mapping and access setup
- [ ] Initial EDA and data quality assessment
- [ ] Quick baseline model (simple heuristics)
Week 2:
- [ ] Data cleaning pipeline creation
- [ ] Feature engineering brainstorming
- [ ] First ML model experiments
- [ ] Present preliminary results to stakeholders
## Sprint 2: Model Development
Week 3:
- [ ] Advanced feature engineering
- [ ] Multiple algorithm experimentation
- [ ] Start hyperparameter tuning
- [ ] Cross-validation setup
Week 4:
- [ ] Model performance optimization
- [ ] Model alignment with business metrics
- [ ] Model interpretability analysis
- [ ] Production readiness assessment
## Sprint 3: Deployment & Optimization
Week 5:
- [ ] Model deployment pipeline
- [ ] A/B testing framework setup
- [ ] Monitoring dashboard creation
- [ ] Performance tracking
Week 6:
- [ ] Production deployment
- [ ] Start live monitoring
- [ ] Feedback collection
- [ ] Next iteration planning
Problem: Reduce customer churn by 30%
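For a churn problem like this one, the "quick baseline model (simple heuristics)" step from Sprint 1 can be as small as always predicting the most common outcome. The sketch below is only an illustration: the labels are made up, and the helper names majority_baseline and accuracy are my own, not taken from a real project.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return lambda _features=None: most_common_label


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical labels: 1 = customer churned, 0 = customer stayed
train_labels = [0, 0, 0, 1, 0, 1, 0, 0]
test_labels = [0, 1, 0, 0]

predict = majority_baseline(train_labels)
predictions = [predict() for _ in test_labels]
baseline_accuracy = accuracy(test_labels, predictions)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

Any model built in the later sprints should beat this number. Because churners are usually the minority class, accuracy alone can look good while missing every churner, so it is worth tracking recall on the churn class as well.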
Guiding the future, uncovering insights no one has seen before, and turning chaotic, seemingly meaningless data into something meaningful: just thinking about it is exciting!
Data analysis is unthinkable without SQL! Here's a code example from a real project...
-- Identifier glossary: adet = count, tarih = date, table_gunler = days (calendar) table
SELECT TOP (100) PERCENT
    derivedtbl_1.adet,
    derivedtbl_1.event,
    dbo.table_gunler.tarih
FROM (
    -- Daily SMS record count for Gaziantep, with any event on the same date and city
    SELECT TOP (100) PERCENT
        dbo.Table_antep_trabzon_sakarya_events.event,
        dbo.[Dataset 1_SMS_RAW].date,
        COUNT(dbo.[Dataset 1_SMS_RAW].NUMBER_OF_REFUGEE_SMS) AS adet
    FROM dbo.[Dataset 1_SMS_RAW]
    INNER JOIN dbo.Base_Station_Location
        ON dbo.[Dataset 1_SMS_RAW].OUTGOING_SITE_ID = dbo.Base_Station_Location.BTS_ID
    FULL OUTER JOIN dbo.Table_antep_trabzon_sakarya_events
        ON dbo.[Dataset 1_SMS_RAW].date = dbo.Table_antep_trabzon_sakarya_events.date
        AND dbo.Base_Station_Location.MX_SAHAIL = dbo.Table_antep_trabzon_sakarya_events.city
    GROUP BY
        dbo.Base_Station_Location.MX_SAHAIL,
        dbo.Table_antep_trabzon_sakarya_events.event,
        dbo.[Dataset 1_SMS_RAW].date
    HAVING (dbo.Base_Station_Location.MX_SAHAIL = 'GAZIANTEP')
) AS derivedtbl_1
-- Join against the days table so that dates without data still appear (as NULLs)
FULL OUTER JOIN dbo.table_gunler
    ON derivedtbl_1.date = dbo.table_gunler.tarih
ORDER BY dbo.table_gunler.tarih
Tip: This SQL query was developed for the D4R (Data for Refugees) project.
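To see why the query joins the daily counts against a table of days, here is a small stand-in you can run locally. It is a simplified sketch, not the project query: it uses Python's built-in sqlite3 module instead of SQL Server, a LEFT JOIN from the calendar side instead of FULL OUTER JOIN, and two tiny hypothetical tables (days and sms) with made-up dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE days (tarih TEXT PRIMARY KEY);  -- hypothetical calendar table
    CREATE TABLE sms (date TEXT, city TEXT);     -- hypothetical: one row per SMS record
    INSERT INTO days VALUES ('2017-01-01'), ('2017-01-02'), ('2017-01-03');
    INSERT INTO sms VALUES
        ('2017-01-01', 'GAZIANTEP'),
        ('2017-01-01', 'GAZIANTEP'),
        ('2017-01-03', 'GAZIANTEP');
""")

# Count records per day, then attach the counts to every calendar day
rows = conn.execute("""
    SELECT d.tarih, s.adet
    FROM days AS d
    LEFT JOIN (
        SELECT date, COUNT(*) AS adet
        FROM sms
        WHERE city = 'GAZIANTEP'
        GROUP BY date
    ) AS s ON s.date = d.tarih
    ORDER BY d.tarih
""").fetchall()

for tarih, adet in rows:
    print(tarih, adet)

conn.close()
```

The day with no SMS records still shows up, with adet set to None (NULL in SQL). Gaps like this are the reason the count column needs a missing-value strategy before modeling.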
One of the cornerstones of machine learning: splitting data in the right proportions! Here are some tips on how to do it with Python...
# Let's split the data chronologically: first 70% for training, last 30% for testing
# (df_sms_full is assumed to be already loaded and ordered by date)
n_train = int(len(df_sms_full) * 0.7)
n_test = len(df_sms_full) - n_train  # the remainder, so rounding never drops a row
xtrain_70_sms_full = df_sms_full['adet'].ffill().head(n_train)
xtest_30_sms_full = df_sms_full['adet'].ffill().tail(n_test)
ytrain_70_sms_full = df_sms_full['event'].fillna(value=0).head(n_train)
ytest_30_sms_full = df_sms_full['event'].fillna(value=0).tail(n_test)
# Let's check data dimensions
print(f"Training data size: {len(xtrain_70_sms_full)}")
print(f"Test data size: {len(xtest_30_sms_full)}")
print(f"Total data: {len(df_sms_full)}")

- ffill() → Fill missing data with the previous value (replaces fillna(method="pad"), which is deprecated in recent pandas versions)
- head(n_train) → Take the first 70% portion
- tail(n_test) → Take the last 30% portion
- fillna(value=0) → Fill missing data with 0
- int(len(df_sms_full) * 0.7) → Calculate 70% of the rows, rounded down
- len(df_sms_full) - n_train → The remaining rows, roughly 30%

Why 70%-30%? This ratio is a common rule of thumb: it leaves enough data for training while keeping the test set large enough to give reliable results. An 80%-20% split can also be used for small datasets!
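The row counts behind a 70%-30% split are easy to check with plain integers. The sketch below is an illustration rather than project code: the helper name split_sizes and the row count of 101 are assumptions chosen for the example. It shows that rounding both parts down separately can leave a row out, while taking the test size as the remainder always adds up to the total.

```python
def split_sizes(n_rows, train_fraction=0.7):
    """Return (n_train, n_test) so that the two sizes always sum to n_rows."""
    n_train = int(n_rows * train_fraction)  # round down
    n_test = n_rows - n_train               # the remainder goes to the test set
    return n_train, n_test


n_rows = 101  # hypothetical row count

# Rounding each part down separately can lose a row
naive_train = int(n_rows * 0.7)
naive_test = int(n_rows * 0.3)
print(naive_train, naive_test, naive_train + naive_test)

# Remainder-based sizes cover every row
n_train, n_test = split_sizes(n_rows)
print(n_train, n_test, n_train + n_test)
```

With sizes computed this way, head(n_train) and tail(n_test) never overlap and never skip a row, whatever the length of the dataset.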
If you'd like to discuss data science, machine learning and technology: