Marcos Nespolo
Production

HarmonizAi

Wine recommender by food pairing — classical NLP, no LLMs, fully auditable scores.

Python · spaCy · rapidfuzz · FastAPI · SQLite · Next.js

Functional end-to-end with three interfaces (CLI, HTTP API, web). The active improvements are scoring tuning and dataset enrichment; the core pipeline is stable.

The problem

"AI sommelier" tools today fall into two failure modes:

  1. LLM-only recommenders — type a dish, get a paragraph that sounds plausible but might be hallucinated. No way to audit why that wine was picked over another.
  2. Vector retrieval over reviews — returns wines whose tasting notes look textually similar to the query, ignoring the specific structural heuristics sommeliers actually use (acidity cuts fat, tannin grips protein, oak fights raw fish).

I wanted the opposite: a system where every score component is a number you can read, where the dish→wine logic is grounded in sommelier literature, and where reproducibility is built in (same input → same output, always).

The approach

Four design choices flow from "no LLMs":

  1. Curated dish knowledge base. 101 dishes hand-mapped to attributes (Vivino food tags, target structure ranges, matching/excluding flavor keywords, suggested wine types, forbidden styles). Validated against What to Drink with What You Eat (Dornenburg & Page) and a handful of cross-checked sommelier guides — each anchor dish carries its source.
  2. Classical NLP for input parsing. spaCy PhraseMatcher (exact, multi-word) first, rapidfuzz token-set fallback only if exact fails. Tolerant to word order and typos, strict enough to reject conflicting terms (a minimal matcher sketch follows this list).
  3. Multi-signal weighted scoring with hard penalties. Four independent components, each in [0, 1], summed with explicit weights — and two penalty paths (incompatible flavors, forbidden styles) that can drop a wine to zero regardless of the rest.
  4. Three thin interfaces over one engine. CLI, FastAPI, Next.js web — all calling the same RecommendationEngine. The recommender doesn't know or care which UI is asking.
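
To make the two-stage parsing concrete, here is a minimal sketch of the exact-then-fuzzy matcher, assuming an illustrative dish table; names like DISHES and match_dish are placeholders, not the project's actual API.

import spacy
from spacy.matcher import PhraseMatcher
from rapidfuzz import fuzz, process

nlp = spacy.blank("en")                      # a blank pipeline is enough for phrase matching

DISHES = {                                   # placeholder subset of the 101-dish knowledge base
    "sushi": "sushi",
    "grilled salmon": "grilled_salmon",
    "beef carpaccio": "beef_carpaccio",
}

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("DISH", [nlp.make_doc(name) for name in DISHES])

def match_dish(query: str, threshold: int = 80) -> str | None:
    doc = nlp.make_doc(query)
    matches = matcher(doc)
    if matches:                              # stage 1: exact multi-word phrase match
        _, start, end = matches[0]
        return DISHES[doc[start:end].text.lower()]
    # stage 2: rapidfuzz token-set fallback, tolerant to typos and word order
    best = process.extractOne(query, DISHES.keys(), scorer=fuzz.token_set_ratio)
    if best is not None and best[1] >= threshold:
        return DISHES[best[0]]
    return None                              # unresolved: the engine treats this as dish_id = null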

Architecture

Diagram (mermaid · placeholder)

Pipeline: free-text query → FoodMatcher resolves it to a dish_id (or null) using exact phrase match first, fuzzy fallback second → RecommendationEngine queries SQLite with the dish's vivino_food_tags, target_structure, and flavor keywords → scorer.py computes four components per candidate and applies penalties → top-N returned with full breakdown. Every request is logged to a harmonization_requests table for later analysis.

Tech decisions & trade-offs

Why no LLMs? This is the entire premise — but the practical benefits are concrete. Zero inference cost (the system runs offline on a laptop), perfect reproducibility, every recommendation explainable down to four numbers, and the project demonstrates classical NLP and feature engineering rather than "I called the Anthropic API."

Why SQLite with FTS5? The dataset is 1,688 wines after deduplication. Postgres or a vector DB would be overkill, add deployment complexity, and gain nothing. SQLite is embedded, has full-text search via FTS5, and ships zero-config.
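
For illustration, candidate retrieval with FTS5 needs nothing beyond Python's standard library; the table and column names below are assumptions, not the project's actual schema.

import sqlite3

con = sqlite3.connect("harmonizai.db")       # illustrative path
con.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS wines_fts
    USING fts5(name, flavor_notes, food_tags)
""")

# Full-text candidate retrieval, e.g. every wine whose food tags mention shellfish.
rows = con.execute(
    "SELECT name FROM wines_fts WHERE wines_fts MATCH ?",
    ("shellfish",),
).fetchall()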

Why curate 101 dishes by hand instead of auto-generating? Quality of the dish→attribute mapping is the ceiling of the entire system. Auto-generated mappings from review text reproduce the biases that vector retrieval already has. Each anchor dish was validated against at least two sommelier sources; the YAML logs the source.

Why split scoring into four components instead of one model? Auditability is the product. Every recommendation surfaces [Food: 0.85] [Flavor: 0.60] [Struct: 0.90] [Rating: 0.65], so a user — or me, debugging — can see exactly which signal carried the match. A neural ranker would score better on a benchmark and worse on the actual goal.

Dataset

Scraped from Vivino: 437 raw JSON files → merged → deduplicated by wine.id → normalised into SQLite.
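
A sketch of that merge-and-dedup step, assuming one JSON file per scrape page and a wine.id field per record; the raw layout shown here is an assumption, not the scraper's actual output format.

import json
from pathlib import Path

wines: dict[int, dict] = {}
for path in Path("data/raw").glob("*.json"):          # the 437 scraped files
    for record in json.loads(path.read_text(encoding="utf-8")):
        wine = record.get("wine", {})
        wine_id = wine.get("id")
        if wine_id is not None:
            wines.setdefault(wine_id, wine)           # keep the first occurrence per wine.id

print(f"{len(wines)} unique wines after dedup")       # 1,688 in the real dataset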

  • 1,688 unique wines (from 437 raw JSONs after dedup)
  • 16 countries (France, Italy, Portugal, Argentina lead)
  • 92% structure coverage (acidity/body/tannin/sweetness)
  • 101 curated dishes (11 cuisines, literature-validated)

Wine type distribution: 1,284 reds · 264 whites · 51 sparkling · 44 fortified · 34 rosés · 11 dessert. Cuisines covered: Brazilian, Italian, French, Japanese, Argentinian, Spanish, Chinese, Thai, Indian, Portuguese, international.

Scoring detail

score =
    0.40 × s_food_tags     # match against wine.style.food
  + 0.15 × s_flavor        # match against wine.taste.flavor keywords
                           #   (with penalty for excluded flavors)
  + 0.45 × s_structure     # range fit on body/acidity/tannin/sweetness
  + 0.01 × s_rating        # tiebreaker only
  − style_penalty          # fatal (-4.0) if wine falls in dish.avoid_styles

Two hard rules complement the weighted sum:

  • Flavor exclusion: a wine with oak flavor is zeroed out for sushi (flavor_keywords_exclude: [oak, vanilla, smoke]).
  • Style exclusion: Moscato or Late Harvest gets a fatal penalty on savoury dishes (avoid_styles: [Moscato, Late Harvest]).

Both rules live in dishes.yaml per dish, so adding a new dish adds new rules without touching the engine.
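
A minimal sketch of how those two rules can be applied on top of the weighted sum; the dish entry mirrors the fields quoted above, while the real implementation lives in scorer.py and may differ in detail.

STYLE_PENALTY = 4.0                      # fatal: larger than any possible weighted sum

sushi = {                                # mirrors a dishes.yaml entry (illustrative values)
    "flavor_keywords_exclude": ["oak", "vanilla", "smoke"],
    "avoid_styles": ["Moscato", "Late Harvest"],
}

def apply_hard_rules(weighted_score: float, wine: dict, dish: dict) -> float:
    flavors = {f.lower() for f in wine.get("flavors", [])}
    excluded = {k.lower() for k in dish["flavor_keywords_exclude"]}
    if flavors & excluded:
        return 0.0                       # flavor exclusion: wine is zeroed out
    if wine.get("style") in dish["avoid_styles"]:
        weighted_score -= STYLE_PENALTY  # style exclusion: fatal penalty
    return max(weighted_score, 0.0)

# e.g. an oaky Chardonnay drops to 0.0 for sushi regardless of its other components
apply_hard_rules(0.84, {"flavors": ["oak", "citrus"], "style": "Chardonnay"}, sushi)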

What's there

  • CLI with full breakdown printout (auditable in the terminal).
  • FastAPI at POST /api/recommend returning structured JSON with breakdown, label image, Vivino link, Google Shopping link (a minimal endpoint sketch follows this list).
  • Next.js web with skeleton-loader cards consuming the API.
  • Price-intent extraction (budget/moderate/premium, max_price) parsed from the query — surfaced in the response, not yet wired into ranking (waiting on price enrichment).
  • Request log in SQLite for offline tuning analysis.
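
A hedged sketch of that endpoint: the request and response shapes follow the description above, and the placeholder function stands in for the shared RecommendationEngine (model and field names are assumptions).

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RecommendRequest(BaseModel):
    query: str
    top_n: int = 5

class WineOut(BaseModel):
    name: str
    score: float
    breakdown: dict[str, float]          # food / flavor / structure / rating

@app.post("/api/recommend", response_model=list[WineOut])
def recommend(req: RecommendRequest) -> list[WineOut]:
    # In the project this delegates to the same RecommendationEngine the CLI and web UI use;
    # here a placeholder keeps the sketch self-contained.
    scored = engine_recommend(req.query, top_n=req.top_n)
    return [WineOut(name=n, score=s, breakdown=b) for n, s, b in scored]

def engine_recommend(query: str, top_n: int) -> list[tuple[str, float, dict[str, float]]]:
    # Placeholder standing in for RecommendationEngine.recommend().
    example = ("Example Chablis", 0.84,
               {"food": 0.85, "flavor": 0.60, "structure": 0.90, "rating": 0.65})
    return [example][:top_n]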

What I'd do differently

  • Curate dishes with a sommelier from day one. I validated against books and articles, but a real sommelier reviewing the YAML would catch edge cases (regional pairings, modern fusion) faster than I can.
  • Build the eval set before tuning weights. I tuned weights by reading top-5 outputs by hand. The tests/test_coverage.py with 50 synthetic queries came later — having it earlier would have made the blockbuster fix less anecdotal.
  • Don't ship price intent without price data. The NLP detects "vinho barato pra sushi" cleanly; the dataset has no prices. Either ship the enrichment together or leave the feature out until it can do its job.

Status & links