Chapter 4: Feature Engineering and Feature Stores

Learning Objectives

Section 1: Feature Engineering Fundamentals

Pre-Section Quiz — Features Fundamentals

1. A linear regression model performs poorly when its numeric features span vastly different scales. Which transformation is the most appropriate default fix?

Apply one-hot encoding to each numeric column.
Standardize each feature to mean 0 and standard deviation 1 (z-score).
Bin every feature into 100 equal-width buckets.
Use the raw values; XGBoost-style models are scale-invariant.

2. You must encode a 5-million-cardinality user_id column for a streaming click model where new IDs appear constantly. Which encoding is most suitable?

One-hot encoding.
Ordinal encoding by first-seen order.
Hashing trick into a fixed bucket count.
Mean target encoding without smoothing.

3. Which rule is the single most important guardrail when constructing time-series features?

Always use random train/validation splits to maximize sample size.
Every feature window must end strictly before the prediction time.
Encode the month as a single integer 1–12.
Replace missing values with the global mean across all dates.

Feature engineering is the bridge between raw events and the matrix of numbers a model consumes. Most production teams start with simple, robust transformations and only reach for embeddings or contextual transformer features when offline gains clearly justify the operational cost.

Scaling, Encoding, and Binning of Numeric Features

Linear models, neural networks, k-means, and PCA are all sensitive to feature magnitude; tree-based models like XGBoost and LightGBM are largely scale-invariant. The canonical transforms are z-score standardization, min-max scaling, log transforms for heavy-tailed values, and binning.

Discipline matters more than choice: fit scalers on training data only, persist parameters as part of the model artifact, and apply identical transformations online.
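As a minimal sketch of that discipline, the example below fits a z-score scaler on training values only, serializes the parameters, and reuses them unchanged on the online path. The function names, the income values, and the JSON "artifact" are illustrative assumptions; in practice a library scaler such as scikit-learn's StandardScaler would usually play this role.

```python
import json
import statistics


def fit_standardizer(train_values):
    """Learn mean and standard deviation from TRAINING values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    # Guard against constant columns to avoid division by zero.
    return {"mean": mean, "std": std if std > 0 else 1.0}


def standardize(values, params):
    """Apply previously fitted parameters; never refit at serving time."""
    return [(v - params["mean"]) / params["std"] for v in values]


# Hypothetical training column.
train_income = [20_000.0, 40_000.0, 60_000.0, 80_000.0]
params = fit_standardizer(train_income)

# Persist the parameters alongside the model artifact (here: a JSON string).
artifact = json.dumps(params)

# Online path: load the same parameters and transform a new request value.
loaded = json.loads(artifact)
train_scaled = standardize(train_income, loaded)
online_scaled = standardize([100_000.0], loaded)
```

The online value is scaled with the training mean and standard deviation, so an out-of-range request simply lands further from zero instead of shifting the scale.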

Categorical Encoding

Method | Cardinality fit | Best use
One-hot | Low (<50–100) | Country, product category
Ordinal / label | Any (ordered) | Education level, ratings
Target / mean | Medium-high | URL, zip code, merchant ID
Hashing trick | Very high | Streaming features, schema drift
Learned embeddings | Very high (IDs) | User / product IDs in recsys

For high-cardinality IDs, three strategies dominate: target encoding with out-of-fold and smoothing, the hashing trick for bounded dimensionality, and learned embeddings with a starting heuristic of d ~ min(50, sqrt(cardinality)).
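The three strategies can be sketched in a few lines each. This is an illustration under stated assumptions, not a reference implementation: the bucket count of 2**18, the smoothing weight m = 20, and all function names are hypothetical choices, and the out-of-fold part of target encoding is not shown (the statistics passed to smoothed_target_mean are assumed to come from folds that exclude the row being encoded).

```python
import hashlib
import math


def hash_bucket(value: str, num_buckets: int = 2**18) -> int:
    """Hashing trick: map any string ID, including unseen ones, to a fixed bucket."""
    # A cryptographic digest is used only because it is stable across processes,
    # unlike Python's built-in hash() for strings.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def smoothed_target_mean(sum_y: float, n: int, prior: float, m: float = 20.0) -> float:
    """Target encoding with smoothing: shrink a category mean toward the global prior."""
    return (sum_y + prior * m) / (n + m)


def embedding_dim(cardinality: int) -> int:
    """Starting heuristic from the text: d ~ min(50, sqrt(cardinality))."""
    return min(50, max(1, round(math.sqrt(cardinality))))


bucket = hash_bucket("user_123")                  # same ID always maps to the same bucket
rare = smoothed_target_mean(3.0, 3, prior=0.1)    # 3 positives out of 3 rows, pulled toward 0.1
dim = embedding_dim(5_000_000)                    # capped at 50
```

With only three observations, the smoothed mean stays close to the prior rather than jumping to 1.0, which is the point of the smoothing term.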

Text: Bag-of-Words, TF-IDF, and Embeddings

Time-Series Features

The three workhorses are lag features (e.g. x_{t-1}, x_{t-7}), rolling-window statistics (mean, std, EMA), and cyclic/calendar features (sin/cos of hour and day-of-week). The cardinal rule is no future leakage: every window must end strictly before the prediction time, and validation uses time-based splits.
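A small sketch of the three feature families, written so that every value used for index t comes from indices before t. The series, the helper names, and the choice of a 4-step window are assumptions made for illustration.

```python
import math
from statistics import fmean


def lag(series, t, k):
    """Value k steps before index t, or None if it does not exist."""
    return series[t - k] if t - k >= 0 else None


def rolling_mean(series, t, window):
    """Mean of the `window` values ending at t-1, so index t itself is excluded."""
    start = t - window
    if start < 0:
        return None
    return fmean(series[start:t])


def cyclic_hour(hour):
    """Encode hour-of-day on the unit circle so 23 and 0 end up close together."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


daily_sales = [10, 12, 11, 15, 14, 18, 20, 19]  # hypothetical series
t = 7  # we want features for predicting daily_sales[7]; only indices 0..6 are used
features = {
    "lag_1": lag(daily_sales, t, 1),
    "lag_7": lag(daily_sales, t, 7),
    "roll_mean_4": rolling_mean(daily_sales, t, 4),
    "hour_sin_cos": cyclic_hour(6),
}
```

The slice series[start:t] is what enforces the rule: its right edge is exclusive, so the window ends strictly before the prediction index.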

flowchart TD
    A[Raw time series x_t] --> B[Lag features]
    A --> C[Rolling window stats]
    A --> D[Calendar / cyclic features]
    B --> B1[x_t-1, x_t-7, x_t-30]
    C --> C1[mean, std, min, max over window W]
    C --> C2[EMA, rolling counts]
    D --> D1[hour, day_of_week, month]
    D --> D2[sin/cos cyclic encoding]
    B1 --> E[Feature vector at time t]
    C1 --> E
    C2 --> E
    D1 --> E
    D2 --> E
    E --> F{Window ends strictly before t?}
    F -->|Yes| G[Safe to train / serve]
    F -->|No| H[Future leakage - reject]

Animation: Rolling-Window Mean Over a Time Series

[Animation: a window of width W = 4 slides along a time axis running from t−14 to t; mean(W) updates as the window slides.]

A backward-looking window of width W slides one step at a time, computing a fresh rolling mean per position. The window never crosses the prediction line.

Key Points

Section 2: The Feature Store Pattern

Pre-Section Quiz — Feature Store Pattern

1. Which problem is the feature store pattern most directly designed to solve?

Reducing model file size for edge deployment.
Train-serve skew caused by reimplementing the same feature in two code paths.
Eliminating the need for a model registry.
Replacing GPUs with cheaper inference hardware.

2. Why do feature stores split storage into an offline store and an online store?

To prevent data scientists from accessing production data.
Because training needs large historical scans while inference needs millisecond key-value reads.
Because S3 cannot store more than 1 TB of feature data.
To reduce the size of the feature registry.

3. In a point-in-time join, which feature row is selected for a training example at label_time = t_p?

The chronologically latest row in the feature table, regardless of time.
The latest row where event_timestamp ≤ t_p AND created_timestamp ≤ t_p.
The row exactly at t_p; nothing else is eligible.
A random row from the entity's history to add label noise.

Without a feature store, the same "average spend over 30 days" feature gets written three times — once in Snowflake SQL, once in a Python serving microservice, once in a Spark batch job — and each implementation drifts. A feature store provides one shared system for feature definition, computation, storage, and serving with correct time semantics.

Online vs Offline Stores

The offline store (S3/Parquet, BigQuery, Snowflake, Delta) holds large historical datasets for training and backfills. The online store (Redis, DynamoDB) is keyed by entity ID for single-digit-millisecond reads at inference time.

Materialization is the process that moves features into these stores:

flowchart LR
    DS1[Warehouse / Lake] --> FE[Feature Engineering]
    DS2[Kafka / Kinesis streams] --> FE
    DS3[Operational DBs] --> FE
    FE --> REG[Feature Registry / Catalog]
    REG --> OFF[(Offline Store<br/>S3/Parquet, BigQuery,<br/>Snowflake)]
    REG --> ON[(Online Store<br/>Redis, DynamoDB)]
    OFF --> TRAIN[Training<br/>point-in-time joins]
    ON --> SERVE[Online Serving<br/>millisecond reads]
    TRAIN --> MODEL[Model Artifact]
    MODEL --> SERVE

Animation: Feature Store Architecture — One Definition, Two Stores

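[Animation: Warehouse / Lake, Kafka / Kinesis streams, and Operational DBs feed the Feature Registry (one definition per feature), which materializes to the Offline Store (S3 / Parquet / BigQuery) for training with PIT joins and to the Online Store (Redis / DynamoDB) for serving in < 10 ms; a dotted consistency line links the two stores.]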

Sources feed one feature definition. The registry materializes to an offline store (large historical scans for training) and an online store (millisecond key-value reads for serving). The dotted line marks the consistency contract: the same value lives in both places.

Feature Registry and Metadata

The registry is the catalog of feature definitions, owners, lineage, tags, and versions. It answers questions like "who owns customer_lifetime_value_90d?", "which models consume it?", and "what version did model v3.2 train on?". In Feast the registry is a file or small SQL DB; Tecton's is a rich service with UI and ACLs; SageMaker uses Feature Group definitions with IAM.

Point-in-Time Joins

The point-in-time (AS-OF) join returns, for each (entity, t_p) row, the latest feature row where event_timestamp ≤ t_p AND created_timestamp ≤ t_p. Tracking both timestamps prevents late-arriving corrections from sneaking into training.

sequenceDiagram
    participant E as Entity Timeline
    participant F as Feature Store
    participant J as PIT Join
    participant T as Training Row
    Note over E: t1: balance=1200 (event)
    Note over E: t2: balance=850 (event)
    Note over E: t_p: label_time = 2023-03-31
    Note over E: t3: balance=100 (after t_p)
    E->>F: write feature rows with<br/>event_ts + created_ts
    J->>F: find latest row where<br/>event_ts <= t_p AND<br/>created_ts <= t_p
    F-->>J: returns t2 row (balance=850)
    Note over J: t3 row excluded -<br/>not yet known at t_p
    J->>T: balance=850, label=1

Credit-risk worked example. Customer 7 defaulted with label_time = 2023-03-31. A naive "latest balance" join would pick the post-default balance = 100 row from 2023-04-02, leaking the label. A PIT join picks the 2023-03-30 row with balance = 850 — the most recent value actually known on March 31.
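The worked example can be reproduced with a few lines of plain Python. This is a sketch of the selection rule only, not how any particular feature store implements it. The event dates and balances follow the example; the created_ts values and the final late-arriving correction row (balance 900) are hypothetical additions that show why the second timestamp matters.

```python
from datetime import date

# Feature rows for customer 7. created_ts values are assumed for illustration.
feature_rows = [
    {"entity": 7, "event_ts": date(2023, 3, 29), "created_ts": date(2023, 3, 29), "balance": 1200},
    {"entity": 7, "event_ts": date(2023, 3, 30), "created_ts": date(2023, 3, 30), "balance": 850},
    {"entity": 7, "event_ts": date(2023, 4, 2), "created_ts": date(2023, 4, 2), "balance": 100},
    # Hypothetical backfilled correction: dated on t_p but written after it.
    {"entity": 7, "event_ts": date(2023, 3, 31), "created_ts": date(2023, 4, 10), "balance": 900},
]


def pit_lookup(rows, entity, t_p):
    """Latest row with event_ts <= t_p AND created_ts <= t_p, or None."""
    eligible = [
        r for r in rows
        if r["entity"] == entity and r["event_ts"] <= t_p and r["created_ts"] <= t_p
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda r: (r["event_ts"], r["created_ts"]))


def naive_latest(rows, entity):
    """Leaky baseline: latest row by event time, ignoring the label time."""
    return max((r for r in rows if r["entity"] == entity), key=lambda r: r["event_ts"])


label_time = date(2023, 3, 31)
pit_row = pit_lookup(feature_rows, 7, label_time)
leaky_row = naive_latest(feature_rows, 7)
```

Here pit_row carries balance 850 while leaky_row carries the post-default balance 100. The correction row is excluded even though its event date is on or before t_p, because it was not yet written at label time.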

Animation: Point-in-Time Join — Only Past Values Are Eligible

[Animation: a timeline split into past and future at label_time t_p. Past rows: balance=1200 (2023-03-29), balance=850 (2023-03-30). Future rows: balance=100 (2023-04-02), balance=50 (2023-04-05). PIT join result: balance = 850 (from 2023-03-30).]

Green rows have event_ts ≤ t_p and are eligible. Red rows sit after the cutoff and are excluded — the model would not have known them at prediction time. The PIT join returns the latest eligible row.

Key Points

Section 3: Feature Store Implementations

Pre-Section Quiz — Feature Store Implementations

1. Your team is fully on AWS, has minimal platform engineering staff, and wants a managed feature store. Which option is the most natural fit?

Feast self-hosted on EC2 with a custom Airflow stack.
Amazon SageMaker Feature Store.
Hopsworks installed on-prem.
Vertex AI Feature Store.

2. Which is the most accurate description of Feast?

A managed AWS service that owns its own warehouse and serving stack.
An open-source feature serving and registry layer that wires together your existing data infrastructure.
A commercial platform that includes built-in streaming compute and a governance UI by default.
A Python library that replaces Redis and DynamoDB with its own KV store.

3. A large bank needs first-class streaming features with built-in governance, lineage, and a UI. Which platform best matches that profile?

DIY Redis + nightly Airflow jobs.
Feast on top of BigQuery and Redis.
Tecton.
A single Snowflake warehouse without an online store.

Feast

Feast is an open-source feature serving and registry layer on top of existing data infrastructure. You bring the warehouse (BigQuery, Snowflake, Redshift), you bring the online store (Redis, DynamoDB, Postgres), and you bring the orchestrator (Airflow, Dagster, cron). Feast wires them together with a consistent Python SDK. The feast materialize-incremental command reads new rows from the offline store and writes the latest values to the online store.

Tecton, Hopsworks, and Databricks

Tecton is a commercial end-to-end platform built by ex-Uber Michelangelo engineers, with first-class support for real-time streaming features, managed compute (Spark/Flink), governance UI, and serving APIs. Hopsworks is open-core, popular in EU and on-prem deployments. Databricks Feature Store is the natural choice on Databricks, integrated with Delta Lake, MLflow, and Unity Catalog.

DIY Redis + Parquet

Early-stage teams can build a credible feature store from a warehouse + Redis + Airflow + a thin Python client. The risk to manage is governance; as the catalog grows you'll want lineage, access control, and a UI, which is where Feast or Tecton come in.
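To make the "thin Python client" concrete, here is a toy sketch in which a dict stands in for Redis and a list stands in for Parquet files in the warehouse. The class name, method names, and the use of ISO date strings as timestamps are all assumptions; a real client would wrap actual store connections and be driven by the scheduler.

```python
from typing import Any, Dict, List, Tuple


class ThinFeatureClient:
    """Toy DIY client: a dict plays the online store, a list the offline store."""

    def __init__(self) -> None:
        self.online: Dict[Tuple[str, str], Dict[str, Any]] = {}
        self.offline: List[Dict[str, Any]] = []

    def write(self, feature_set: str, entity_id: str, event_ts: str,
              values: Dict[str, Any]) -> None:
        # Offline store is append-only so full history is kept for training.
        self.offline.append(
            {"feature_set": feature_set, "entity_id": entity_id,
             "event_ts": event_ts, **values}
        )
        # Online store keeps only the latest values per entity key.
        # ISO-8601 date strings compare correctly as plain strings.
        key = (feature_set, entity_id)
        current = self.online.get(key)
        if current is None or event_ts >= current["event_ts"]:
            self.online[key] = {"event_ts": event_ts, **values}

    def get_online(self, feature_set: str, entity_id: str) -> Dict[str, Any]:
        """Key-value read used at inference time; empty dict if unknown."""
        return self.online.get((feature_set, entity_id), {})


client = ThinFeatureClient()
client.write("user_spend", "u1", "2024-01-01", {"avg_spend_30d": 10.0})
client.write("user_spend", "u1", "2024-01-03", {"avg_spend_30d": 12.0})
client.write("user_spend", "u1", "2024-01-02", {"avg_spend_30d": 11.0})  # late arrival
```

The late-arriving row is kept in the offline history but does not overwrite the fresher online value. Nothing here covers lineage, access control, or a UI, which is the governance gap the paragraph above describes.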

Vertex AI and SageMaker

SageMaker Feature Store groups features into Feature Groups with offline storage in S3/Parquet (queryable via Athena) and online storage in a managed DynamoDB-backed KV layer. A single write populates both. Vertex AI Feature Store is GCP's analog, deeply integrated with BigQuery and Vertex AI Pipelines.

Dimension | Feast | Tecton | SageMaker FS
Type | OSS | Commercial platform | Managed AWS
Cloud | Cloud-agnostic | Major clouds | AWS only
Online store | Pluggable (Redis, etc.) | Managed by Tecton | Managed DynamoDB-backed
Orchestration | You provide (Airflow) | Tecton-managed | Your ETL (Glue, EMR)
Streaming | DIY integration | First-class | Kinesis/Lambda; you orchestrate
Best for | Strong platform team, OSS | Mid/large orgs needing governance | AWS-centric teams wanting managed

Key Points

Section 4: Pipelining Features

Pre-Section Quiz — Pipelining Features

1. Which materialization cadence best matches a feature like clicks_last_1h?

Daily batch materialization from the warehouse.
Per-request on-demand computation.
5–15 minute incremental batch or micro-batch.
Weekly full recompute.

2. When a feature definition changes in a backwards-incompatible way, the recommended versioning practice is to:

Silently update the existing feature so all models see the new version.
Suffix the new feature (e.g. purchases_30d_v2) and let models opt in.
Delete the old feature; all consumers must migrate within 24 hours.
Rename the source table instead of the feature.

3. For real-time fraud features like num_transactions_5m, the structural way to prevent train-serve skew is to:

Write separate windowing logic in Java for serving and Python for training.
Use one shared windowing definition that emits to both online and offline stores.
Skip offline training and learn the model purely online.
Recompute features only at request time from raw events.

Materialization and Refresh Schedules

Each feature has its own freshness SLA. A useful schema:

Feature class | Example | Cadence | Pipeline
Slowly changing | customer_country | Daily | Batch SQL + nightly materialize
Daily aggregates | avg_spend_30d | Hourly to daily | Batch from warehouse
Recent activity | clicks_last_1h | 5–15 min | Incremental batch / micro-batch
Real-time | transactions_5m | Seconds | Streaming (Flink, Spark Streaming)
On-demand | time_since_last_login | Per request | Computed at predict time

Incremental materialization is critical: a feature like purchases_30d should not be recomputed from scratch every hour. Read only changed rows since last run and update those entity keys.
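A minimal sketch of that idea, assuming an in-memory event list in place of the source table and a dict in place of the online store. The function name, field names, and dates are hypothetical. Only entities with events newer than the last run are touched.

```python
from datetime import datetime, timedelta


def materialize_incremental(events, online_store, last_run, now,
                            window=timedelta(days=30)):
    """Recompute purchases_30d only for entities with events in (last_run, now]."""
    # Step 1: find entity keys that changed since the previous run.
    changed = {e["user_id"] for e in events if last_run < e["ts"] <= now}
    # Step 2: refresh the 30-day count for those keys and leave the rest alone.
    for user_id in changed:
        online_store[user_id] = sum(
            1 for e in events
            if e["user_id"] == user_id and now - window < e["ts"] <= now
        )
    return changed


events = [
    {"user_id": "a", "ts": datetime(2024, 1, 1)},
    {"user_id": "b", "ts": datetime(2024, 1, 5)},
    {"user_id": "a", "ts": datetime(2024, 1, 20)},  # arrived after the last run
]
store = {"a": 1, "b": 1}  # state left by the previous run
changed = materialize_incremental(
    events, store, last_run=datetime(2024, 1, 10), now=datetime(2024, 1, 21)
)
```

One caveat of this simplified version: an entity with no new events is never revisited, so counts do not decay as old purchases age out of the 30-day window. A production job would need an expiry pass or a periodic full refresh to handle that.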

flowchart LR
    SRC[Source Tables<br/>events, transactions] --> CDC[Detect new rows<br/>since last run]
    CDC --> COMP[Feature Computation<br/>SQL / Spark / Flink]
    COMP --> OFF[(Offline Store<br/>partitioned by date)]
    COMP --> ON[(Online Store<br/>keyed by entity_id)]
    OFF --> BACKFILL[Backfills /<br/>historical training]
    ON --> SERVE[Low-latency<br/>inference reads]
    SCHED[Scheduler<br/>Airflow / Dagster] -.triggers.-> CDC

Versioning

Train/Serve Skew Prevention

The structural fix is to use the same FeatureView definition for both get_historical_features and get_online_features. Operationally:

  1. One source of truth for transformations — never reimplement in serving code.
  2. Track event and created timestamps so backfills don't leak into training.
  3. Backward-looking windows ending strictly before prediction time.
  4. Time-based train/validation splits.
  5. Production drift monitoring on feature distributions.

flowchart TD
    DEF[One FeatureView Definition<br/>transformation + window + timestamps]
    DEF --> OFFP[Offline path:<br/>get_historical_features]
    DEF --> ONP[Online path:<br/>get_online_features]
    OFFP --> TRAIN[Training Dataset<br/>time-based split]
    ONP --> PRED[Prediction Service]
    TRAIN --> MODEL[Trained Model]
    MODEL --> PRED
    PRED --> MON[Production Monitoring<br/>distributions + freshness]
    TRAIN --> BASE[Training Baseline]
    BASE --> DRIFT{Drift detected?}
    MON --> DRIFT
    DRIFT -->|Yes| ALERT[Alert / retrain / fix definition]
    DRIFT -->|No| OK[Continue serving]
    ALERT -.update.-> DEF
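
The "one source of truth" idea can be illustrated without any feature store library. The sketch below is not the Feast API: FeatureDefinition, historical_feature, and online_feature are hypothetical names, and days are plain integers to keep the example short. What it shows is that both paths call the same transformation and the same backward-looking window.

```python
from dataclasses import dataclass
from typing import Callable, List


def mean_or_zero(amounts: List[float]) -> float:
    return sum(amounts) / len(amounts) if amounts else 0.0


@dataclass(frozen=True)
class FeatureDefinition:
    """Single definition shared by the offline and online paths."""
    name: str
    window_days: int
    transform: Callable[[List[float]], float]


avg_spend_30d = FeatureDefinition("avg_spend_30d", 30, mean_or_zero)


def amounts_in_window(events, entity_id, as_of_day, window_days):
    """Amounts with day in [as_of_day - window_days, as_of_day): strictly before as_of_day."""
    return [
        e["amount"] for e in events
        if e["user_id"] == entity_id and as_of_day - window_days <= e["day"] < as_of_day
    ]


def historical_feature(defn, events, entity_id, label_day):
    """Offline path: value as it would have been known at label_day."""
    return defn.transform(amounts_in_window(events, entity_id, label_day, defn.window_days))


def online_feature(defn, events, entity_id, today):
    """Online path: same window logic and transform, evaluated for today."""
    return defn.transform(amounts_in_window(events, entity_id, today, defn.window_days))


events = [
    {"user_id": "u1", "day": 1, "amount": 10.0},
    {"user_id": "u1", "day": 20, "amount": 30.0},
    {"user_id": "u1", "day": 40, "amount": 50.0},
]
train_value = historical_feature(avg_spend_30d, events, "u1", label_day=25)
serve_value = online_feature(avg_spend_30d, events, "u1", today=45)
```

Because neither path contains its own copy of the averaging or windowing logic, changing the definition changes both at once, which is the structural guarantee the list above asks for.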

Real-Time Features on Streams

A Kafka topic of card transactions feeds a Flink job maintaining tumbling and sliding windows of count and sum(amount) per (user_id, card_id). The aggregates are written to DynamoDB on every update, and a backfill replays the same windowing logic across historical events. Because the windowing code is one definition, offline and online stay consistent by construction.
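
A pure-Python sketch of the shared windowing idea, standing in for the Flink job. It covers only a sliding window (tumbling windows are not shown), assumes events arrive in timestamp order per key, and uses hypothetical names throughout. The same class is used for live updates and for the historical replay.

```python
from collections import defaultdict, deque


class SlidingWindowAggregator:
    """Per-key count and sum(amount) over the previous `window_seconds`."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)

    def process(self, key, ts, amount):
        """Return (count, total) of earlier events in the window, then record this one.

        The current event is excluded from its own features, so the window
        ends before the moment being scored.
        """
        q = self._events[key]
        # Evict events that fell out of the window.
        while q and q[0][0] <= ts - self.window_seconds:
            q.popleft()
        result = (len(q), sum(a for _, a in q))
        q.append((ts, amount))
        return result


def replay(events, window_seconds=300):
    """Backfill: run historical events through the same aggregator logic."""
    agg = SlidingWindowAggregator(window_seconds)
    return [
        agg.process((e["user_id"], e["card_id"]), e["ts"], e["amount"])
        for e in events
    ]


history = [
    {"user_id": "u1", "card_id": "c1", "ts": 0, "amount": 10.0},
    {"user_id": "u1", "card_id": "c1", "ts": 100, "amount": 20.0},
    {"user_id": "u1", "card_id": "c1", "ts": 350, "amount": 5.0},
]
offline_features = replay(history)
```

Feeding the same events one at a time into a long-lived SlidingWindowAggregator yields the same tuples as the replay, which is what "consistent by construction" means in this setting.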

Key Points
