Hypotheses
FAMILY_MULTI_SPECTRAL_INDICES - Experiment Log
FAMILY_MULTI_SPECTRAL_INDICES
**Data Pipeline**: 1. Load Sentinel-2 zarr store for target time periods 2. Apply BRP parcel masks for consumption potato fields 3. Calculate six vegetation indices using real band values 4. Aggregate to weekly temporal resolution 5. Extract statistical features per index
Experimentnotities
FAMILY_MULTI_SPECTRAL_INDICES - Experiment Log
Experimental Design
Index Calculation Methodology
Data Pipeline: 1. Load Sentinel-2 zarr store for target time periods 2. Apply BRP parcel masks for consumption potato fields 3. Calculate six vegetation indices using real band values 4. Aggregate to weekly temporal resolution 5. Extract statistical features per index
Index Formulas Using Real Bands:
# NDVI - Normalized Difference Vegetation Index
ndvi = (B08 - B04) / (B08 + B04 + 1e-8)
# EVI - Enhanced Vegetation Index
evi = 2.5 * (B08 - B04) / (B08 + 6*B04 - 7.5*B02 + 1)
# NDWI - Normalized Difference Water Index
ndwi = (B08 - B11) / (B08 + B11 + 1e-8)
# SAVI - Soil Adjusted Vegetation Index
savi = 1.5 * (B08 - B04) / (B08 + B04 + 0.5)
# GNDVI - Green Normalized Difference Vegetation Index
gndvi = (B08 - B03) / (B08 + B03 + 1e-8)
# MSI - Moisture Stress Index
msi = B11 / (B08 + 1e-8)
Anomaly Detection Strategy
Focus on Extreme Weather Years: - 2018: Spring/summer drought (April-August) - 2022: Record heat and drought - Detection method: Z-score > 2.0 from 5-year rolling mean
Anomaly Features:
# Calculate rolling statistics
rolling_mean = index.rolling(window=260, min_periods=52).mean()
rolling_std = index.rolling(window=260, min_periods=52).std()
# Z-score anomalies
z_score = (index - rolling_mean) / (rolling_std + 1e-8)
anomaly_flag = abs(z_score) > 2.0
# Cumulative stress
cumulative_stress = z_score.clip(lower=0).cumsum()
Feature Engineering Approach
Temporal Features per Index: 1. Current values: Raw index values at t 2. Moving averages: 2, 4, 8, 12 week windows 3. Growth rates: Week-over-week change 4. Volatility: Rolling standard deviation 5. Peak timing: Week of maximum value 6. Stress duration: Consecutive weeks below threshold
Index Interaction Features:
# Vegetation-moisture interaction
veg_moisture = ndvi * ndwi
# Stress indicator combining moisture and vegetation
stress_index = msi * (1 - ndvi)
# Enhanced vegetation accounting for moisture
evi_adjusted = evi * (1 + ndwi)
# Ratio features
ndvi_ndwi_ratio = ndvi / (ndwi + 0.1)
evi_savi_ratio = evi / (savi + 0.1)
Model Ensemble Strategy
Variant A - Core Ensemble: - Random Forest with 200 trees - Features: NDVI, NDWI, EVI (3 indices × 5 features = 15 total) - No hyperparameter tuning (baseline performance)
Variant B - Full Suite: - Gradient Boosting with interaction terms - Features: All 6 indices + ratios + interactions (~50 features) - Hyperparameter optimization via RandomizedSearchCV
Variant C - Anomaly Focus: - LightGBM with custom objective for extreme events - Features: Z-scores, cumulative stress, anomaly flags - Weighted training emphasizing 2018/2022 observations
Validation Methodology
Walk-Forward Cross-Validation:
# Proper time series split
train_end = len(data) - 156 # Reserve 3 years for test
for week in range(104, train_end, 4): # Weekly steps
train = data[:week]
val = data[week:week+12] # 12-week forecast
# Fit model on train
# Predict on validation
# Store results
Baseline Comparison:
All four mandatory baselines from experiments/_shared/baselines.py:
1. Persistent (current = future)
2. Seasonal naive (52-week lag)
3. AR(2) autoregressive
4. Naive (last value)
Expected Results
Hypothesis Support Criteria: - Variant A: >5% improvement (moderate support) - Variant B: >8% improvement (strong support) - Variant C: >10% improvement on extreme years (strong support)
Feature Importance Expected: 1. NDWI during drought periods (highest) 2. MSI cumulative stress 3. NDVI growth rate changes 4. EVI peak timing 5. Index interaction terms
Implementation Notes
Data Access Pattern
import xarray as xr
from src.sources.brp.brp_api.brp import BRPApi
from src.sources.boerderij_nl.boerderij_nl_api import BoerderijApi
# Load satellite data (REAL zarr store)
sentinel = xr.open_zarr("lake_31UFU_medium_polder.zarr")
# Get parcel masks (REAL boundaries)
brp = BRPApi()
mask_2018 = brp.get_consumption_potato_mask(
bbox=(5.5, 52.5, 5.6, 52.6),
date_range=(date(2018,1,1), date(2018,12,31)),
grid_shape=(len(sentinel.y), len(sentinel.x)),
grid_coords=(sentinel.x.values, sentinel.y.values),
target_crs='EPSG:32631'
)
# Get price data (REAL market prices)
prices = BoerderijApi().get_data(
product_id="NL.157.2086",
date_range=("2015-01-01", "2024-12-31"),
granularity="weekly"
)
Computational Optimizations
- Zarr Chunking: Optimize for temporal access patterns
- Dask Integration: Parallel index calculation
- Caching: Store computed indices to disk
- Batch Processing: Process multiple parcels simultaneously
Quality Controls
- Cloud Filtering: Use SCL band, require >80% clear pixels
- Outlier Detection: Remove index values >3 std from mean
- Gap Filling: Linear interpolation for max 2-week gaps
- Validation Split: Ensure no data leakage in time series
Results Tracking
Variant A Results
- Date: [PENDING]
- MAE: [PENDING]
- Improvement: [PENDING]
- Verdict: [PENDING]
Variant B Results
- Date: [PENDING]
- MAE: [PENDING]
- Improvement: [PENDING]
- Verdict: [PENDING]
Variant C Results
- Date: [PENDING]
- MAE: [PENDING]
- Improvement: [PENDING]
- Verdict: [PENDING]
Decision Log
HE Notes
- 2025-08-23: Created hypothesis based on FAMILY_SATELLITE_NDVI_REAL learnings
- Multi-index approach addresses single NDVI weakness
- Focus on real extreme weather events (2018, 2022)
- All data sources verified as REAL through repository interfaces
EX Notes
[To be added after experiment execution]
Next Actions
- Implement index calculation pipeline
- Run Variant A as baseline
- Optimize Variant B with full feature set
- Test Variant C on extreme years specifically
- Deploy best performer for operational use
Geen Codex-samenvatting
Voeg codex_validated.md toe om de status te documenteren.