Let op: dit experiment is nog niet Codex-gevalideerd. Gebruik de bevindingen als voorlopige aanwijzingen.

Hypotheses

FAMILY_MULTI_SPECTRAL_INDICES - Experiment Log

FAMILY_MULTI_SPECTRAL_INDICES

**Data Pipeline**: 1. Load Sentinel-2 zarr store for target time periods 2. Apply BRP parcel masks for consumption potato fields 3. Calculate six vegetation indices using real band values 4. Aggregate to weekly temporal resolution 5. Extract statistical features per index

Laatste update
2025-12-01
Repo-pad
hypotheses/FAMILY_MULTI_SPECTRAL_INDICES
Codex-bestand
Ontbreekt

Experimentnotities

FAMILY_MULTI_SPECTRAL_INDICES - Experiment Log

Experimental Design

Index Calculation Methodology

Data Pipeline: 1. Load Sentinel-2 zarr store for target time periods 2. Apply BRP parcel masks for consumption potato fields 3. Calculate six vegetation indices using real band values 4. Aggregate to weekly temporal resolution 5. Extract statistical features per index

Index Formulas Using Real Bands:

# NDVI - Normalized Difference Vegetation Index
ndvi = (B08 - B04) / (B08 + B04 + 1e-8)

# EVI - Enhanced Vegetation Index  
evi = 2.5 * (B08 - B04) / (B08 + 6*B04 - 7.5*B02 + 1)

# NDWI - Normalized Difference Water Index
ndwi = (B08 - B11) / (B08 + B11 + 1e-8)

# SAVI - Soil Adjusted Vegetation Index
savi = 1.5 * (B08 - B04) / (B08 + B04 + 0.5)

# GNDVI - Green Normalized Difference Vegetation Index
gndvi = (B08 - B03) / (B08 + B03 + 1e-8)

# MSI - Moisture Stress Index
msi = B11 / (B08 + 1e-8)

Anomaly Detection Strategy

Focus on Extreme Weather Years: - 2018: Spring/summer drought (April-August) - 2022: Record heat and drought - Detection method: Z-score > 2.0 from 5-year rolling mean

Anomaly Features:

# Calculate rolling statistics
rolling_mean = index.rolling(window=260, min_periods=52).mean()
rolling_std = index.rolling(window=260, min_periods=52).std()

# Z-score anomalies
z_score = (index - rolling_mean) / (rolling_std + 1e-8)
anomaly_flag = abs(z_score) > 2.0

# Cumulative stress
cumulative_stress = z_score.clip(lower=0).cumsum()

Feature Engineering Approach

Temporal Features per Index: 1. Current values: Raw index values at t 2. Moving averages: 2, 4, 8, 12 week windows 3. Growth rates: Week-over-week change 4. Volatility: Rolling standard deviation 5. Peak timing: Week of maximum value 6. Stress duration: Consecutive weeks below threshold

Index Interaction Features:

# Vegetation-moisture interaction
veg_moisture = ndvi * ndwi

# Stress indicator combining moisture and vegetation
stress_index = msi * (1 - ndvi)

# Enhanced vegetation accounting for moisture
evi_adjusted = evi * (1 + ndwi)

# Ratio features
ndvi_ndwi_ratio = ndvi / (ndwi + 0.1)
evi_savi_ratio = evi / (savi + 0.1)

Model Ensemble Strategy

Variant A - Core Ensemble: - Random Forest with 200 trees - Features: NDVI, NDWI, EVI (3 indices × 5 features = 15 total) - No hyperparameter tuning (baseline performance)

Variant B - Full Suite: - Gradient Boosting with interaction terms - Features: All 6 indices + ratios + interactions (~50 features) - Hyperparameter optimization via RandomizedSearchCV

Variant C - Anomaly Focus: - LightGBM with custom objective for extreme events - Features: Z-scores, cumulative stress, anomaly flags - Weighted training emphasizing 2018/2022 observations

Validation Methodology

Walk-Forward Cross-Validation:

# Proper time series split
train_end = len(data) - 156  # Reserve 3 years for test
for week in range(104, train_end, 4):  # Weekly steps
    train = data[:week]
    val = data[week:week+12]  # 12-week forecast

    # Fit model on train
    # Predict on validation
    # Store results

Baseline Comparison: All four mandatory baselines from experiments/_shared/baselines.py: 1. Persistent (current = future) 2. Seasonal naive (52-week lag) 3. AR(2) autoregressive 4. Naive (last value)

Expected Results

Hypothesis Support Criteria: - Variant A: >5% improvement (moderate support) - Variant B: >8% improvement (strong support) - Variant C: >10% improvement on extreme years (strong support)

Feature Importance Expected: 1. NDWI during drought periods (highest) 2. MSI cumulative stress 3. NDVI growth rate changes 4. EVI peak timing 5. Index interaction terms

Implementation Notes

Data Access Pattern

import xarray as xr
from src.sources.brp.brp_api.brp import BRPApi
from src.sources.boerderij_nl.boerderij_nl_api import BoerderijApi

# Load satellite data (REAL zarr store)
sentinel = xr.open_zarr("lake_31UFU_medium_polder.zarr")

# Get parcel masks (REAL boundaries)
brp = BRPApi()
mask_2018 = brp.get_consumption_potato_mask(
    bbox=(5.5, 52.5, 5.6, 52.6),
    date_range=(date(2018,1,1), date(2018,12,31)),
    grid_shape=(len(sentinel.y), len(sentinel.x)),
    grid_coords=(sentinel.x.values, sentinel.y.values),
    target_crs='EPSG:32631'
)

# Get price data (REAL market prices)
prices = BoerderijApi().get_data(
    product_id="NL.157.2086",
    date_range=("2015-01-01", "2024-12-31"),
    granularity="weekly"
)

Computational Optimizations

  1. Zarr Chunking: Optimize for temporal access patterns
  2. Dask Integration: Parallel index calculation
  3. Caching: Store computed indices to disk
  4. Batch Processing: Process multiple parcels simultaneously

Quality Controls

  1. Cloud Filtering: Use SCL band, require >80% clear pixels
  2. Outlier Detection: Remove index values >3 std from mean
  3. Gap Filling: Linear interpolation for max 2-week gaps
  4. Validation Split: Ensure no data leakage in time series

Results Tracking

Variant A Results

  • Date: [PENDING]
  • MAE: [PENDING]
  • Improvement: [PENDING]
  • Verdict: [PENDING]

Variant B Results

  • Date: [PENDING]
  • MAE: [PENDING]
  • Improvement: [PENDING]
  • Verdict: [PENDING]

Variant C Results

  • Date: [PENDING]
  • MAE: [PENDING]
  • Improvement: [PENDING]
  • Verdict: [PENDING]

Decision Log

HE Notes

  • 2025-08-23: Created hypothesis based on FAMILY_SATELLITE_NDVI_REAL learnings
  • Multi-index approach addresses single NDVI weakness
  • Focus on real extreme weather events (2018, 2022)
  • All data sources verified as REAL through repository interfaces

EX Notes

[To be added after experiment execution]

Next Actions

  1. Implement index calculation pipeline
  2. Run Variant A as baseline
  3. Optimize Variant B with full feature set
  4. Test Variant C on extreme years specifically
  5. Deploy best performer for operational use

Geen Codex-samenvatting

Voeg codex_validated.md toe om de status te documenteren.