About

Assort Design is a demo web app that showcases an agentic AI pipeline for professional audiences. It turns raw input (pasted text or a URL) into audience-specific, decision-oriented artifacts using a multi-step controlled loop: route → generate → evaluate → cite/risk → persist (with a revision loop if needed).

Live demo: assortdemo.duckdns.org
Source code: github.com/sanuwar/assort-design

Pipeline Overview

  1. Ingest: paste text, provide a URL, or use sample content.
  2. Route: classify the audience (or honor a manual selection).
  3. Specialist generate: produce one-line summary, decision bullets, tags, key clues, and mind map.
  4. Evaluate: check required sections, word count, and quality rules.
  5. Revise: iterate up to max retries, then persist outputs.

Key idea: this is not a single LLM call. The system uses a deterministic ML model for routing, multiple LLM calls for generation and evaluation, dedicated tool nodes for citation and risk analysis, and may iterate the generation loop until all constraints are satisfied.
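
As a heavily simplified illustration, the controlled loop can be sketched in plain Python. Every function body below is a hypothetical stand-in for the real ML/LLM nodes, not the app's implementation:

```python
# Minimal sketch of the route -> generate -> evaluate -> revise loop.
# All function bodies are illustrative stand-ins, not the real nodes.

MAX_RETRIES = 3

def route(text):
    # Real app: ML router with LLM fallback; here a fixed audience.
    return "medical_affairs"

def generate(text, audience, feedback=None):
    # Real app: LLM call using the audience profile; feedback guides revisions.
    return {"summary": f"[{audience}] {text[:40]}", "bullets": ["takeaway"]}

def evaluate(output):
    # Real app: evaluator LLM; here a trivial structural check.
    ok = bool(output.get("summary")) and len(output.get("bullets", [])) >= 1
    return {"passed": ok, "feedback": None if ok else "missing sections"}

def run_pipeline(text):
    audience = route(text)
    feedback = None
    for attempt in range(1, MAX_RETRIES + 1):
        output = generate(text, audience, feedback)
        result = evaluate(output)
        if result["passed"]:
            return {"attempts": attempt, "output": output}
        feedback = result["feedback"]
    return {"attempts": MAX_RETRIES, "output": output, "failed": True}
```

The real pipeline adds the citation, risk, and gate tool nodes between evaluation and persistence; those are covered in the lifecycle steps below.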

What You Get

Each processed document produces:

  • One-line summary — audience-specific headline
  • Decision bullets (3–5) — structured takeaways for the routed audience
  • Tags — extracted topics, normalized via alias map
  • Key clues — notable signals or evidence phrases
  • Mind map — Mermaid diagram of concept relationships
  • Citation claims — specific claim/quote pairs with character-level source offsets and confidence scores
  • Risk & compliance flags — severity-ranked issues with category, text span, and suggested fix
  • Attempt history — every revision attempt with evaluation feedback (pass/fail, missing sections, word count, reasons)

Lifecycle (Chronological)

Step 1 — Intake
You submit pasted text or a URL. The app applies input size limits and safety checks (SSRF protections for URLs, redirect limits, private network blocking).
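
A minimal sketch of the private-network guard, using only the Python standard library. The function name and exact rule set are illustrative, not the app's actual implementation:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_fetch_allowed(url: str) -> bool:
    """Illustrative SSRF guard: reject non-HTTP(S) schemes and URLs whose
    host resolves to a private, loopback, link-local, or reserved address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parsed.hostname, None)
    except socket.gaierror:
        return False
    for info in infos:
        addr = ipaddress.ip_address(info[4][0])
        if addr.is_private or addr.is_loopback or addr.is_link_local or addr.is_reserved:
            return False
    return True
```

A production guard would also re-check each redirect target, which is why the app pairs this check with a redirect limit.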

Step 2 — Audience Routing (ML router or LLM)
A TF-IDF + Logistic Regression ML model attempts routing first (Mode A: ML-first). The model returns a probability distribution over three specialist audiences: Commercial, Medical Affairs, and R&D. If the top-class probability is below the confidence threshold (ml_router_threshold), or the gap between the top two classes is too narrow (ml_router_margin), the system falls back to an LLM routing call. If no trained model exists, the LLM is used directly. Either path returns a confidence score, candidate audiences, routing reasons, and top feature signals. Results below the LLM confidence threshold fall back to a Cross-Functional audience.

Step 3 — Specialist Generation (LLM)
Using the chosen audience profile (system prompt, required sections, max words from agent_profiles.yaml), the LLM generates structured artifacts: one-line summary, decision bullets, tags, key clues, and a Mermaid mind map.

Step 4 — Evaluation (LLM)
An evaluator LLM validates the output against constraints: all required sections must be present in the decision bullets, and the total output must stay within the word limit. It returns a structured pass/fail with per-section feedback and a word count.
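
The structural part of those checks can be expressed deterministically. This sketch (hypothetical names) mirrors only the hard rules; the real evaluator is an LLM that also returns qualitative feedback:

```python
def check_constraints(bullets, full_text, required_sections, max_words):
    """Deterministic sketch of the evaluator's structural checks:
    required sections present in the bullets, output within the word limit."""
    joined = " ".join(bullets).lower()
    missing = [s for s in required_sections if s.lower() not in joined]
    word_count = len(full_text.split())
    passed = not missing and word_count <= max_words
    return {"passed": passed, "missing_sections": missing, "word_count": word_count}
```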

Step 5 — Tool Steps (citation extraction + risk & compliance flagging)
After evaluation, three tool nodes run in sequence:

  • tool_citation — extracts factual claim/quote pairs from the generated content, each with character-level source offsets and a confidence score. Supports citability review.
  • tool_risk — scans output for risk and compliance issues. Each flag includes a severity level (low/medium/high), a category (e.g. regulatory, safety, legal), the flagged text span, and a suggested fix.
  • tool_gate — decides whether to persist the result or trigger another revision based on gate criteria (e.g. critical risk flags, citation support count).
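
The gate's persist-or-revise decision might be sketched as follows. The names and exact criteria are illustrative; the real gate criteria live in the app's configuration:

```python
def gate_decision(risk_flags, citations, min_citations=1):
    """Sketch of tool_gate: block persistence when any high-severity risk
    flag exists or citation support is too thin. Criteria are illustrative."""
    has_critical = any(f["severity"] == "high" for f in risk_flags)
    enough_support = len(citations) >= min_citations
    return "persist" if (not has_critical and enough_support) else "revise"
```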

Step 6 — Revision Loop (optional)
If evaluation fails or the gate blocks the result, the system retries generation with corrective context until it passes, reaches max_retries, or hits the pipeline timeout.

Step 7 — Persistence
All results are written to SQLite: the input document, routing artifacts, each generation attempt, evaluation feedback, citation claims, risk flags, tag summaries, and final output. Every field is inspectable in the UI after the job completes.

LangGraph Orchestration

LangGraph is used to orchestrate the pipeline as a stateful graph. Each step is modeled as a node; edges determine when to continue, revise, or stop. This makes the workflow explicit, inspectable, and reliable under retries and timeouts.

Core nodes:

  • route_audience — ML router (TF-IDF + LR), LLM fallback, manual override
  • specialist_generate — generates structured artifacts (LLM)
  • evaluate — validates constraints and returns pass/fail (LLM)
  • tool_citation — extracts claim/quote pairs with source offsets and confidence
  • tool_risk — flags risk and compliance content by severity and category
  • tool_gate — decides whether to persist or trigger a revision
  • persist_attempt — writes the current attempt to the database
  • revise — increments attempt count and injects evaluator feedback
  • persist_results — finalizes job status, replaces tags/clues, stores tool outputs

Multiple attempts on a record mean the graph drove a revision loop based on evaluator feedback.
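
The node-and-edge control flow can be illustrated with a small pure-Python dispatcher. This is not the LangGraph API, only the shape of conditional-edge routing between nodes:

```python
# Pure-Python sketch of conditional-edge dispatch (NOT the LangGraph API):
# each node returns an updated state plus the name of the next node.

def run_graph(nodes, state, start, max_steps=20):
    current = start
    for _ in range(max_steps):
        if current == "END":
            return state
        state, current = nodes[current](state)
    raise RuntimeError("graph did not terminate")

def evaluate_node(state):
    state["attempts"] += 1
    passed = state["attempts"] >= 2  # pretend the first attempt fails
    return state, ("persist" if passed else "revise")

def revise_node(state):
    state["feedback"] = "add missing section"
    return state, "evaluate"

def persist_node(state):
    state["status"] = "completed"
    return state, "END"

nodes = {"evaluate": evaluate_node, "revise": revise_node, "persist": persist_node}
final = run_graph(nodes, {"attempts": 0}, "evaluate")
```

In the real graph, LangGraph's conditional edges play the role of the returned next-node name, and the state object carries the full attempt history.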

Where to See the Evidence (UI)

  • Records — full history of processed documents with status, audience, and tag counts
  • Job Details — routing method (ML/LLM/manual), confidence, top signals, attempt log, evaluation feedback, citation claims, and risk flags per attempt
  • View Source — the exact input text the pipeline ran on
  • Tag Insights — aggregated intelligence across all documents (see below)

Tag Intelligence & Insights ● Evolving — Under Development

The Tag Insights page derives pattern intelligence from the canonical tags and inferred topic domains accumulated across all processed documents. It is intentionally designed as an evolving, data-driven feature — the more documents are processed, the more meaningful the signals become.

What the Insights page shows:

  • Topic Lanes (Domains) — documents are bucketed into department-aligned domains (Clinical & Medical Strategy, Translational Science & Drug R&D, Regulatory Launch & Market Strategy, Corporate & Investor Updates, General / Other) based on which canonical tags appear. Each lane links to the filtered document list for that domain.
    Business logic: domain assignment uses a curated keyword-to-domain indicator map in tag_intel.py; documents are assigned to the domain whose indicator tags best match their canonical tag set.
  • Rising Tags — tags appearing significantly more often in the most recent 20 documents compared to the prior 20, surfacing emerging topics. Requires ≥ 40 documents of history.
    Business logic: delta = count(recent 20) − count(prior 20); tags with delta ≥ 2 are surfaced.
  • Co-occurring Tag Pairs — the most frequently co-occurring tag pairs within single documents, showing which concepts cluster together across the corpus.
    Business logic: counts unordered (tag_a, tag_b) pairs from canonical tag lists stored per job in DocumentTagSummary.
  • Bridge Tags (Cross-Domain) — tags that appear in documents spanning multiple distinct domains, identifying concepts that cross topic boundaries.
    Business logic: a tag is a bridge tag if its associated documents span ≥ 2 different inferred domains.
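
The rising-tag and co-occurrence logic described above can be sketched with the standard library. The window size and delta threshold follow the stated rules; function names are illustrative:

```python
from collections import Counter
from itertools import combinations

def rising_tags(tag_lists, window=20, min_delta=2):
    """Tags whose count in the most recent `window` documents exceeds the
    prior `window` by at least `min_delta` (needs 2 * window docs of history)."""
    if len(tag_lists) < 2 * window:
        return []
    recent = Counter(t for tags in tag_lists[-window:] for t in tags)
    prior = Counter(t for tags in tag_lists[-2 * window:-window] for t in tags)
    return sorted(t for t in recent if recent[t] - prior[t] >= min_delta)

def cooccurring_pairs(tag_lists, top_n=5):
    """Most frequent unordered tag pairs within single documents."""
    pairs = Counter()
    for tags in tag_lists:
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs.most_common(top_n)
```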

Tag normalization (Tag Alias Manager):
Before any aggregation, raw LLM-generated tags are normalized to canonical forms via an alias map. Examples: genai → generative ai, rct → clinical trial, u.s. fda approval → fda approval. Built-in defaults live in tag_intel.py; database entries managed via the Tag Aliases page override them at runtime. This normalization is what makes cross-document tag aggregation reliable.
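
A minimal sketch of alias normalization with runtime overrides. The dictionary contents mirror the examples above; the function name and deduplication behavior are assumptions:

```python
# Illustrative alias map: built-in defaults, overridden at runtime by
# database-backed entries (the Tag Aliases page in the real app).
DEFAULT_ALIASES = {
    "genai": "generative ai",
    "rct": "clinical trial",
    "u.s. fda approval": "fda approval",
}

def normalize_tags(raw_tags, runtime_aliases=None):
    """Map raw tags to canonical forms, preserving order, dropping duplicates."""
    aliases = {**DEFAULT_ALIASES, **(runtime_aliases or {})}
    seen, canonical = set(), []
    for tag in raw_tags:
        t = tag.strip().lower()
        t = aliases.get(t, t)
        if t not in seen:
            seen.add(t)
            canonical.append(t)
    return canonical
```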

Note: Tag intelligence is an evolving feature under active development. Domain indicators, rising-tag thresholds, co-occurrence logic, and bridge-tag detection are all working implementations, but are expected to be refined as the corpus grows and business requirements around topic taxonomy become clearer. The alias map in particular is expected to expand as new documents surface noisy or variant tag forms.

Safety & Limits

  • Input limit: maximum size enforced for both pasted text and URL fetch results
  • URL safety: SSRF protections (blocks private/local network targets) + redirect limits
  • Rate limiting: per-IP limit to reduce abuse

Source Code & Deployment

The project is containerized and published as an image, built via CI on merges, and served from a Linux VPS. If you're reviewing this as part of an engineering evaluation, the repository includes Docker + CI/CD + deployment notes.

Routing Decision Mechanism (ML + LLM Fallback)

Why this matters

Assort Design uses a hybrid routing strategy to determine the target audience for each document. There are two routing paths:

  • ML-first routing using a TF-IDF + Logistic Regression classifier
  • LLM fallback routing when the ML router is uncertain (low confidence or ambiguity)

This is a practical agentic design choice: use fast, low-cost ML routing when confidence is strong; escalate to LLM reasoning when the ML result is uncertain.

Routing Behavior (High Level)

Primary path: ML router
The ML router attempts to identify the audience first (e.g., commercial, medical_affairs, r_and_d).

Fallback path: LLM router
If the ML router is not confident enough, the pipeline falls back to the LLM for the final routing decision.

Current confidence threshold
If ML top confidence is below the configured threshold (default: 0.58, set in agent_profiles.yaml), the request is sent to the LLM router.

Ambiguity guardrail (margin rule)
If the top two ML class probabilities are too close, the result is treated as ambiguous and can also fall back to the LLM.

Where the Routing Logic Lives

1) Training pipeline — app/train_router.py

  • Loads labeled examples from the database: document.content (input text) + job.audience (label)
  • Builds a scikit-learn pipeline: TfidfVectorizer (unigrams + bigrams, sublinear TF, up to ~30k features) + LogisticRegression (lbfgs solver, balanced class weights)
  • Splits data into train/test (80/20 stratified)
  • Evaluates model performance (accuracy + per-class precision/recall/F1)
  • Saves artifacts to artifacts/: vectorizer.pkl, classifier.pkl, metadata.json
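
The training pipeline could be sketched with scikit-learn roughly as follows. The hyperparameters mirror the ones listed above; the toy data stands in for the real (document.content, job.audience) rows, and other details are assumptions:

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy labeled examples standing in for rows queried from the database.
texts = [
    "quarterly sales uplift and brand positioning for launch",
    "phase 3 trial endpoints and adverse event profile",
    "target discovery assay and preclinical candidate screening",
] * 4
labels = ["commercial", "medical_affairs", "r_and_d"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True,
                              max_features=30_000)),
    ("clf", LogisticRegression(solver="lbfgs", class_weight="balanced",
                               max_iter=1000)),
])
pipeline.fit(texts, labels)

# The real script pickles the vectorizer and classifier separately
# to artifacts/ along with metadata; this sketch just shows picklability.
blob = pickle.dumps(pipeline)
probs = pipeline.predict_proba(["trial safety readout for physicians"])[0]
```

The real script also performs the 80/20 stratified split and reports per-class metrics before saving artifacts.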

2) Inference logic — app/ml_router.py

  • load() — lazily loads .pkl artifacts into memory
  • predict() — vectorizes incoming text, computes class probabilities, applies threshold + margin decision rules

Core decision rules:

  • If top probability (p1) is below threshold → treat as uncertain
  • If top two probabilities are too close (p1 − p2 < margin) → treat as ambiguous
  • Otherwise → use the top predicted class directly
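
These rules reduce to a few lines of Python. The 0.58 threshold is the documented default; the margin default here is illustrative (the real value comes from agent_profiles.yaml):

```python
def routing_decision(probs, threshold=0.58, margin=0.12):
    """Apply the threshold + margin guardrails to ML class probabilities.
    `margin` here is an illustrative default, not the app's configured value."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top_class, p1), (_, p2) = ranked[0], ranked[1]
    if p1 < threshold:
        return {"decision": "fallback_llm", "reason": "low_confidence", "p1": p1}
    if p1 - p2 < margin:
        return {"decision": "fallback_llm", "reason": "ambiguous", "p1": p1}
    return {"decision": top_class, "reason": "ml_confident", "p1": p1}
```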

3) Pipeline integration — app/graph.py

Inside route_audience_node(), the routing flow is:

  1. Try loading ML artifacts
  2. If artifacts exist → run ML prediction
  3. If ML is confident → use ML result directly (routing_source = "ml")
  4. If ML is uncertain/ambiguous → fall back to LLM (routing_source = "ml+llm_fallback")
  5. If no ML artifacts exist → use LLM routing only (routing_source = "llm")

How ML Confidence Is Computed (p1 and p2)

Step 1: Convert text to TF-IDF features
The incoming document text is transformed into a sparse feature vector using the trained TfidfVectorizer. Uses the vocabulary learned during training; includes unigrams and bigrams; produces a numeric feature vector (not raw text matching).

Step 2: Logistic Regression outputs class probabilities
The classifier computes a score for each audience class and converts those scores to probabilities via a softmax:

P(class_i) = exp(W_i · X + b_i) / Σ_j exp(W_j · X + b_j)

Example output: commercial: 0.12, medical_affairs: 0.75, r_and_d: 0.13

Step 3: Extract top two probabilities
The router sorts probabilities in descending order: p1 = highest (top predicted class), p2 = second-highest (runner-up).

Step 4: Apply routing guardrails

  • Confidence rule: p1 < threshold
  • Ambiguity rule: (p1 − p2) < margin

If either rule is triggered, the ML result is treated as not reliable enough, and the pipeline falls back to the LLM.

What Is Stored in the ML Artifacts (.pkl)

vectorizer.pkl stores the learned vocabulary, tokenization / TF-IDF transformation logic, and feature mapping rules. It does not store original training documents for direct text matching.

classifier.pkl stores learned logistic regression coefficients (weights), intercepts, and class mapping / model state. It does not store text samples.

Key idea: This is mathematical pattern recognition, not document-to-document text matching. The vectorizer converts text to numbers; the classifier multiplies those numbers by learned weights to produce class probabilities.

Auto-Retraining Behavior (Production / VPS)

The ML router can retrain automatically in production after enough new completed jobs.

Trigger: After each completed job, graph.py calls _maybe_retrain(session). It compares the total completed jobs in the database against n_docs from artifacts/metadata.json. If the difference is at least the retrain interval (default: 20), retraining is triggered.

After retraining: _ml_router.reload() loads the updated model into memory immediately — no app restart required.

Routing modes over time:

  • Before ML artifacts exist → LLM-only routing
  • After initial training → ML-first with LLM fallback
  • As labeled jobs accumulate → automatic retraining every N jobs + hot reload

Configuration: RETRAIN_EVERY_N_JOBS=20 (retrain every 20 jobs, default) or RETRAIN_EVERY_N_JOBS=0 (disable auto-retraining).

Concurrency note: A threading lock in graph.py prevents duplicate retraining from concurrent requests in a multi-user environment.
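
A sketch of the trigger plus lock follows. The function name, the injected retrain callable, and the non-blocking acquire policy are assumptions about the real _maybe_retrain, not its actual code:

```python
import threading

_retrain_lock = threading.Lock()

def maybe_retrain(completed_jobs, trained_on, interval=20, retrain=lambda: "retrained"):
    """Sketch of the auto-retrain trigger: retrain once enough new completed
    jobs have accumulated since the model was last trained. `retrain` stands
    in for the real train-and-hot-reload step."""
    if interval <= 0:  # RETRAIN_EVERY_N_JOBS=0 disables auto-retraining
        return None
    if completed_jobs - trained_on < interval:
        return None
    # Non-blocking acquire: concurrent requests skip instead of queueing,
    # so only one retrain runs at a time (an assumption about the policy).
    if not _retrain_lock.acquire(blocking=False):
        return None
    try:
        return retrain()
    finally:
        _retrain_lock.release()
```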

Architecture at a Glance

  • Web: FastAPI + Jinja2 templates
  • Orchestration: LangGraph (stateful graph — nodes, edges, retries, timeouts)
  • ML routing: scikit-learn TF-IDF + Logistic Regression (ML-first, LLM fallback)
  • LLM: OpenAI Responses API — generation, evaluation, LLM routing fallback
  • Tool nodes: citation extraction, risk & compliance flagging, quality gate
  • Tag intelligence: domain inference, rising tags, co-occurrence, bridge tags, alias normalization
  • Tracing: optional request-level tracing for node/LLM visibility
  • Persistence: SQLite — documents, jobs, attempts, tags, clues, claims, risk flags, tag summaries

Key files:

  • app/main.py — routes, handlers, UI flow
  • app/graph.py — LangGraph pipeline (nodes, edges, state, retries, timeouts)
  • app/ml_router.py — ML router (lazy load, predict, top-feature signals)
  • app/train_router.py — CLI training script (query DB → train → save artifacts)
  • app/tools.py — tool node implementations (citation, risk, gate)
  • app/tag_intel.py — tag normalization, domain inference, insights logic
  • app/llm.py — LLM calls, parsing, timeouts, mock mode, tracing hooks
  • app/config.py — loads and validates agent profiles and routing config
  • app/models.py — data model (all traceability tables)
  • app/db.py — SQLite init, schema migrations, index setup
  • app/agent_profiles.yaml — prompts, routing thresholds, required sections, max words

Project Visuals (Interactive)

Zoom and pan the diagrams directly from this page.
