CASBench v1.0

Convexia Sourcing and Diligence Benchmarks

1. Executive Summary

CASBench v1.0 benchmarks Convexia against commercial databases and frontier LLM agents on 80 asset-sourcing queries at a high-precision operating point. Convexia's advantage is concentrated in preclinical assets, ex-USA programs, and long-tail sources. Clinical and approved-stage queries are easier for every system.

Headline Results:

  • R@P≥0.95: Convexia 0.921 vs best baseline 0.814
  • False positives per query: Convexia 0.44 vs best baseline 1.25

Where Convexia Dominates:

  • Preclinical: Convexia 0.893 vs best baseline 0.652
  • Ex-USA (China-linked proxy): Convexia 0.874 vs best baseline 0.621
  • Long-tail sources (beyond ClinicalTrials.gov): near parity on registries, large gains on non-standard sources
    • TTO portfolios: +51.3 pp
    • Conference posters/abstracts: +39.1 pp
    • Grant databases: +46.7 pp
  • Patents: 98.1% vs 81.5%, +16.6 pp
  • Lower analyst cleanup: 0 FP on 69% of queries vs 38% for best baseline

What Converges:

  • Clinical and late/approved: high scores across systems; gaps shrink vs preclinical
  • Easy queries: systems are closer; separation grows on hard regimes
  • Robustness: Convexia P10 0.71 vs best baseline 0.57

[Figure: Executive Summary: Key Metrics Comparison. Convexia vs best baseline on Recall at High Precision (R@P≥0.95) and False Positives per Query.]

Key Insight: Convexia achieves 13% higher recall while producing 64% fewer false positives, delivering superior precision and completeness for sourcing workflows.

2. Benchmark Overview and Design Principles

Design Principles:

  1. Workflow realism: CASBench uses thesis-style sourcing queries with explicit constraints.
  2. High-precision evaluation: benchmarks include metrics that punish false positives.
  3. Evidence requirements: outputs are scored on citation support, not just correctness. A correct answer without traceable evidence is treated as partially complete.
  4. Leakage controls: both benchmarks implement time-based freezing and document timestamp filtering so that future outcomes and post-hoc updates are excluded from inputs.

Benchmark Scope:

CASBench evaluates: asset discovery completeness under strict precision constraints, long-tail source coverage, and citation correctness for structured thesis queries.

Why Convexia is Structurally Advantaged

Convexia is built to answer thesis-style sourcing questions end-to-end. Its sourcing layer combines proprietary connectors with advanced reasoning models that identify where undiscovered assets are most likely to exist. By reasoning over mechanism adjacencies, platform applicability, and historical licensing patterns, Convexia surfaces opportunities in channels that traditional searches might miss entirely.

This is especially valuable for deprioritized or non-marketed programs, where the signal is scattered across filings, investor materials, company sites, and other low-indexed disclosures, rather than structured pipeline databases or out-licensing catalog pages.

Downstream, Convexia compounds that advantage by turning retrieved evidence into decision outputs, including a proprietary probability-of-success model that can be customized to a firm's investment style via reinforcement fine-tuning, plus auto-generated outreach lists and investor-ready deal packs.

3. CASBench v1.0: Convexia Asset Sourcing Benchmark

3.1 What CASBench Measures

CASBench measures how effectively a system can answer a sourcing thesis query such as: "NaV1.8 Small Molecules Preclinical in China."

Given a structured query, a system must return a candidate set of drug assets that satisfy the query's explicit constraints and provide evidence for each constrained claim. Constraint dimensions are query-dependent. Systems are scored on constraint satisfaction only for dimensions that are present in the query specification; unconstrained dimensions are not scored.

CASBench is designed for workflows where an analyst is willing to review a finite list, but cannot afford to chase incorrect assets. The primary metric is therefore recall at very high precision (R@P≥0.95).
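
Stated formally (our notation, consistent with the ranking protocol in Section 4.4, where k indexes prefixes of a system's ranked candidate list for query Q; the metric is defined as 0 when no prefix qualifies):

$$\mathrm{R@P{\ge}0.95}(Q) \;=\; \max_{\,k\;:\;\mathrm{TP}@k/k \,\ge\, 0.95}\; \frac{\mathrm{TP}@k}{\lvert \mathrm{gold}(Q) \rvert}$$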

3.2 Dataset Accounting and Splits

Split | Queries | Gold assets | Purpose
Development | 20 | 1,029 | System iteration only
Test | 80 | 4,137 | Reported leaderboard
Total | 100 | 5,166 | Total benchmark size

All CASBench metrics in this report are computed on the 80-query test set only.

To make CASBench representative of real sourcing workflows, we created a fixed pool of 100 thesis-style queries.

  • Authors and roles: Queries were drafted by domain experts (scientists, BD, and diligence analysts) who routinely source or evaluate therapeutics.
  • Sampling strategy: Queries were chosen to cover a balanced range of sourcing intents (broad landscape scans, narrow constraint satisfaction searches, and exclusion-heavy screens). The final pool is stratified to avoid over-weighting any single therapeutic area, modality, or stage.
  • Exclusion rules: We excluded queries that were too underspecified to yield an evaluable gold set (for example, purely exploratory prompts with no constraints), queries that trivially name a single known asset or company as the answer, and near-duplicates that differ only by wording.
  • Freeze and governance: The query pool and the dev/test split were finalized and frozen prior to running benchmark evaluations.

Email founders@convexia.bio for the complete test stack.

Test Set Stratification

Difficulty: Easy: 24 | Medium: 32 | Hard: 24
Stage focus: Preclinical: 34 | Clinical/Approved: 30 | Shelved/Paused: 16
Therapeutic area: Onc: 22 | CNS: 10 | Imm: 12 | CV: 8 | Rare: 14 | Other: 14

3.3 Definition, Registry Schema, and Deduplication

Asset: A unique drug program defined by the tuple: active ingredient + modality + primary target + sponsor context.

Gold asset (gold-list asset): For a given query, a gold asset is an asset that (i) satisfies all query constraints under the adjudication rubric, (ii) is publicly disclosed on or before the query's as-of date, (iii) is supported by at least two independent public evidence artifacts, and (iv) is canonicalized and deduplicated to a single asset_id. The set of gold assets for a query is its gold list. See Section 3.4 for construction details.

Asset Registry Schema (a minimal code sketch follows the list):

  • asset_id (canonical)
  • inn_name
  • code_names[]
  • sponsor_id
  • target_ids[]
  • modality
  • indications[]
  • stage
  • earliest_disclosure_date and earliest_disclosure_source_type
  • evidence_artifact_ids[]
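
A minimal sketch of a registry record as a Python dataclass; field names mirror the schema above, while the types and defaults are illustrative assumptions rather than the production schema:

from dataclasses import dataclass, field

@dataclass
class AssetRecord:
    # Canonical benchmark identifier after deduplication (Section 3.3).
    asset_id: str
    inn_name: str | None
    code_names: list[str] = field(default_factory=list)
    sponsor_id: str = ""
    target_ids: list[str] = field(default_factory=list)
    modality: str = ""
    indications: list[str] = field(default_factory=list)
    stage: str = ""
    earliest_disclosure_date: str = ""         # "YYYY-MM-DD"
    earliest_disclosure_source_type: str = ""  # e.g. "patent", "registry"
    evidence_artifact_ids: list[str] = field(default_factory=list)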

Query Difficulty Labeling (Easy, Medium, Hard)

Each thesis query is assigned a difficulty label after the gold set is constructed but before any system is evaluated, ensuring labels reflect human effort required to compile the gold list rather than system performance. Labeling is based on two properties observed during gold set construction: (1) the evidence regime required to recover the gold list and (2) constraint complexity in the query specification.

"Standard sources" are high-coverage, well-indexed structured or semi-structured corpora with consistent schemas and broad access.

"Long-tail sources" are fragmented or poorly indexed channels, especially unstructured evidence that requires crawling plus OCR/transcription/translation to extract signal.

Deduplication Rules (a code sketch follows the list):

  • Code names vs INN/USAN: mapped via a curated synonym table (example: BMS-986165 and deucravacitinib refer to the same asset).
  • Partnered or co-developed assets: counted once under the sponsor with development control as of the benchmark as-of date.
  • Fixed-dose combinations: counted as distinct assets; co-administration regimens are not treated as new assets.
  • Salts, esters, and formulations: treated as the same asset unless they represent a distinct clinical development path.
  • Biosimilars: treated as separate assets from the originator due to different sponsors and regulatory paths.
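
A minimal sketch of the canonicalization step implied by these rules, assuming a curated synonym table; the table contents and field names are illustrative, and the production pipeline also applies the partnering, salt/ester, and biosimilar rules above:

# Illustrative synonym table: code name -> INN (example from the text).
SYNONYMS = {
    "bms-986165": "deucravacitinib",
}

def canonical_key(name: str, modality: str, primary_target: str, sponsor: str) -> tuple:
    """Build the dedup key: active ingredient + modality + primary target + sponsor context."""
    norm = name.strip().lower()
    norm = SYNONYMS.get(norm, norm)  # collapse code names onto the INN/USAN
    return (norm, modality.lower(), primary_target.lower(), sponsor.lower())

def dedupe(candidates: list[dict]) -> list[dict]:
    """Keep one candidate per canonical key (first occurrence wins)."""
    seen, unique = set(), []
    for c in candidates:
        key = canonical_key(c["asset_name"], c["modality"], c["primary_target"], c["sponsor"])
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique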

3.4 Ground Truth Construction

Gold lists were constructed by at least two independent reviewers per query, drawn from a pool of domain experts including contracted pharmaceutical consultants, biotech customers participating in validation studies, and analysts who had previously conducted comparable competitive intelligence exercises. Reviewers were blinded to system outputs during initial gold-list construction. A third reviewer adjudicated disagreements when necessary.

When sources conflict on stage, indication, or sponsor control, adjudication uses the following precedence order (a code sketch follows the list):

  1. Trial registries and regulatory documents (ClinicalTrials.gov, EUCTR/CTIS, labels, assessment reports)
  2. Peer-reviewed publications and official conference abstracts/posters
  3. Patents (for identity, target, modality; weaker for stage)
  4. SEC filings and official investor presentations
  5. Company press releases
  6. Secondary aggregators (third-party databases, news summaries)
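
A minimal sketch of this precedence rule, assuming each conflicting claim is tagged with one of the six source categories above (the category strings are illustrative):

# Lower rank wins; order mirrors the precedence list above.
PRECEDENCE = [
    "registry_or_regulatory",   # 1. trial registries, labels, assessment reports
    "peer_reviewed",            # 2. publications, official abstracts/posters
    "patent",                   # 3. identity/target/modality (weaker for stage)
    "sec_or_investor",          # 4. SEC filings, investor presentations
    "press_release",            # 5. company press releases
    "aggregator",               # 6. third-party databases, news summaries
]
RANK = {cat: i for i, cat in enumerate(PRECEDENCE)}

def adjudicate(claims: list[dict]) -> dict:
    """Given conflicting claims like {'stage': 'Phase 1', 'source_category': 'patent'},
    return the claim backed by the highest-precedence source."""
    return min(claims, key=lambda c: RANK[c["source_category"]])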

3.5 Data Sources and System Cards

CASBench is sensitive to source coverage. In this report, source category refers to the type of evidence artifact ingested and retrievable by a system. Source category is not exclusive: a single asset can have evidence across multiple categories over time.

Convexia CAS:

  • Patents: USPTO, EPO, WIPO, and national patent offices.
  • Clinical registries: ClinicalTrials.gov, EUCTR, CTIS, JPRN, ChiCTR, ANZCTR, and other regional registries.
  • Publications: PubMed and preprints.
  • Technology transfer offices (TTOs): 500+ university and hospital portfolios.
  • Grant databases: NIH and major international grant sources.
  • Conference materials: proceedings, posters, and abstracts.
  • Non-English sources: coverage includes Chinese, Japanese, Korean, and German sources.
  • Audio, video, image disclosures
  • Corporate filings and websites: SEC filings, annual reports, investor decks, and company websites.
  • News and press releases: pharmaceutical news outlets, wire services, and company announcements.
  • Social media and forums: X, LinkedIn, Reddit discussions, and industry-specific online communities.
  • Crowdfunding and venture pages: platforms like Kickstarter or AngelList for early-stage biotech projects.
  • Regulatory documents: FDA/EMA approvals, warning letters, and inspection reports.
  • Other digital breadcrumbs

Competitor Systems:

  • Commercial Database A: A curated database for tracking and analyzing global pharmaceutical research and development pipelines, with optional advanced analytics and AI-enhanced tooling.
  • Commercial Database B: A platform that integrates biological, chemical, and pharmacological data to support drug discovery and development decision-making.
  • Commercial Database C: A comprehensive intelligence platform covering pipeline and marketed drugs across the pharmaceutical industry.
  • Commercial Database D: An AI-powered biopharma intelligence platform offering data on drugs, clinical trials, and competitor pipelines.
  • Frontier LLM agents: advanced AI systems based on large language models that operate autonomously, often with browsing capabilities, to complete multi-step research tasks.

Note on anonymization: Commercial database vendors are anonymized in this report. Several vendors' terms of service restrict the publication of benchmarking or comparative analysis without prior written consent. To ensure compliance with contractual obligations while still providing methodologically rigorous and reproducible results, we report performance metrics for commercial systems using anonymized identifiers.

3.6 Leakage Controls and Data Freeze

As-of date: 2025-11-01. For every system, evidence access was time-filtered to the as-of date. Documents created after 2025-11-01 were excluded, and documents updated after 2025-11-01 were evaluated using their as-of snapshot version when available. Retrieval was conducted against an as-of constrained corpus; where strict as-of snapshotting was not technically available for a given source, we applied the closest feasible time-bounded proxy and documented the residual limitation.
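
A minimal sketch of this time filter, assuming documents carry created/updated dates and optional stored version history; snapshot_as_of is a hypothetical helper standing in for source-specific snapshot retrieval:

from datetime import date

AS_OF = date(2025, 11, 1)  # benchmark as-of date

def snapshot_as_of(doc: dict, cutoff: date) -> dict | None:
    """Hypothetical helper: return the latest stored version of `doc` dated on or
    before `cutoff`; None if no snapshot exists (the residual-limitation case)."""
    versions = [v for v in doc.get("versions", []) if v["date"] <= cutoff]
    return max(versions, key=lambda v: v["date"]) if versions else None

def as_of_view(doc: dict) -> dict | None:
    """Return the evaluable view of a document under the freeze, or None."""
    if doc["created"] > AS_OF:
        return None                        # created after the freeze: excluded
    if doc.get("updated") and doc["updated"] > AS_OF:
        return snapshot_as_of(doc, AS_OF)  # post-freeze update: use the snapshot
    return doc                             # untouched since the freeze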

3.7 Metrics and Scoring

CASBench reports three metrics:

  1. Candidate-set precision and recall canonicalized to benchmark asset IDs
  2. R@P≥0.95: recall under a strict precision constraint using a uniform external ranking protocol.
    1. Each candidate asset is scored on constraint satisfaction using a query-specific rubric derived from the structured query specification.
  3. Evidence quality (CASBench-Source): fraction of required claims supported by valid citations.

Additional diagnostic metrics:

  • False positives per query (FP/Q) in the high-precision prefix used for R@P≥0.95
  • Recall by earliest-disclosure source type on the hard-query subset (long-tail coverage analysis)
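
For concreteness, a minimal sketch of the candidate-set metrics, assuming outputs have already been canonicalized to benchmark asset IDs (Section 4.3); FP/Q averages the false-positive count over queries when the input set is the high-precision prefix:

def candidate_metrics(predicted: set[str], gold: set[str]) -> dict:
    """Candidate-set precision/recall over canonical asset IDs, plus the
    false-positive count that FP/Q averages across queries."""
    tp = len(predicted & gold)
    return {
        "precision": tp / len(predicted) if predicted else 0.0,
        "recall": tp / len(gold) if gold else 0.0,
        "false_positives": len(predicted - gold),
    }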

3.8 Head-to-Head Results

System | R@P≥0.95 | 95% CI | Candidate precision | Candidate recall | CASBench-Source | FP/Q
Convexia | 0.921 | (0.89-0.95) | 0.969 | 0.973 | 0.927 | 0.44
Commercial Database A | 0.814 | (0.78-0.85) | 0.936 | 0.866 | 0.939 | 1.25
Commercial Database B | 0.744 | (0.71-0.78) | 0.907 | 0.809 | 0.881 | 1.63
Commercial Database C | 0.806 | (0.77-0.84) | 0.923 | 0.835 | 0.932 | 1.34
Commercial Database D | 0.703 | (0.66-0.74) | 0.875 | 0.771 | 0.905 | 1.79
GPT-5.2 Agent (xhigh) | 0.558 | (0.52-0.60) | 0.801 | 0.648 | 0.812 | 2.05
Claude Opus 4.5 (thinking) | 0.577 | (0.54-0.62) | 0.816 | 0.664 | 0.827 | 1.92
Gemini 3 Pro Preview Agent (high) | 0.593 | (0.55-0.64) | 0.824 | 0.672 | 0.803 | 2.14

3.9 Stratified Performance

[Figure: Performance Across Query Difficulty Levels. R@P≥0.95 stratified by query difficulty (see tables below).]

Key Insight: Convexia's advantage grows on harder queries where complex constraints and long-tail sources become critical. The gap vs. the best commercial database increases from 1.0 pp on easy queries to 20.6 pp on hard queries.

By Query Difficulty

Stratum | Convexia R@P≥0.95 | Commercial Database A | Commercial Database B | Commercial Database C | Commercial Database D | GPT | Opus | Gemini
Easy (n=24) | 0.952 | 0.942 | 0.889 | 0.931 | 0.872 | 0.619 | 0.702 | 0.742
Medium (n=32) | 0.927 | 0.821 | 0.748 | 0.818 | 0.703 | 0.562 | 0.631 | 0.625
Hard (n=24) | 0.882 | 0.676 | 0.595 | 0.664 | 0.533 | 0.375 | 0.389 | 0.415

Convexia's advantage is largest on hard queries, where thesis constraints are complex and long-tail sources matter most.

By Development Stage Focus

Stratum | Convexia R@P≥0.95 | Commercial Database A | Commercial Database B | Commercial Database C | Commercial Database D | GPT-5.2 | Opus 4.5 | Gemini
Preclinical (n=34) | 0.893 | 0.652 | 0.540 | 0.642 | 0.480 | 0.287 | 0.306 | 0.326
Clinical/approved (n=30) | 0.933 | 0.924 | 0.882 | 0.918 | 0.852 | 0.725 | 0.748 | 0.761
Shelved/Paused (n=16) | 0.957 | 0.613 | 0.521 | 0.609 | 0.313 | 0.452 | 0.401 | 0.523

Preclinical sourcing is a highly differentiating regime because many early assets first appear outside traditional curated drug databases. Shelved/paused sourcing also requires extensive reasoning and cross-source analysis.

3.10 Long-Tail Source Coverage Analysis

To isolate long-tail coverage, we assign each gold asset to exactly one earliest-disclosure bucket, defined by the earliest dated evidence artifact that supports the asset's gold inclusion (from the required evidence_artifact_ids list). Buckets are mutually exclusive and based on artifact type.
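
A minimal sketch of the bucket assignment, assuming each required evidence artifact carries an ISO date and a source type (field names are illustrative; ISO dates sort correctly as strings):

def earliest_disclosure_bucket(evidence_artifacts: list[dict]) -> str:
    """Assign a gold asset to exactly one bucket: the artifact type of its
    earliest dated required evidence artifact, e.g.
    {"date": "2024-02-01", "source_type": "patent"}."""
    earliest = min(evidence_artifacts, key=lambda a: a["date"])
    return earliest["source_type"]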

Hard-query subset: 24 queries; 1,347 total gold assets.

[Figure: Long-Tail Source Coverage Analysis. Recall by earliest-disclosure source type, Convexia vs best competitor (see table below).]

Key Insight: Convexia demonstrates substantial advantages in long-tail sources critical for early-stage asset discovery, including TTO portfolios (+51.3 pp), grant databases (+46.7 pp), and conference abstracts (+39.1 pp). These gains reflect comprehensive coverage of non-traditional sources that commercial databases often overlook.

Source type (earliest disclosure) | N assets | Convexia recall | Best competitor recall | Δ (percentage points)
Clinical trial registries | 410 | 98.29% | 97.07% | 1.2
Patent databases | 280 | 98.1% | 81.5% | 16.6
Publications and preprints | 120 | 88.33% | 75.83% | 12.5
Regulatory documents | 40 | 92.50% | 90.00% | 2.5
Corporate filings and company websites | 85 | 81.18% | 54.12% | 27.1
Press releases and news | 65 | 75.38% | 50.77% | 24.6
Conference abstracts and posters | 105 | 82.86% | 43.81% | 39.1
TTO portfolios | 80 | 68.75% | 17.50% | 51.3
Grant databases | 30 | 70.00% | 23.33% | 46.7
Academic theses and dissertations | 45 | 64.44% | 22.22% | 42.2
Audio, video, and image disclosures | 12 | 50.00% | 16.67% | 33.3
Social media and forums | 20 | 55.00% | 20.00% | 35.0
Crowdfunding and venture pages | 5 | 40.00% | 20.00% | 20.0
Other digital breadcrumbs | 50 | 54.00% | 24.00% | 30.0

Key point: the largest gaps come from non-standard sources that are typically outside curated drug databases. These earliest-disclosure buckets are a practical proxy for ‘hard-to-see’ assets.

China-linked Asset Recovery at High Precision

China-linked gold assets in the CASBench v1.0 test set: 1,218 of 4,137 (29.4%)

System | Recall on China-linked assets at P≥0.95 | China-linked assets recovered (TP, of 1,218)
Convexia | 87.4% | 1,065
Commercial Database A | 57.7% | 703
Commercial Database B | 60.5% | 737
Commercial Database C | 62.1% | 756
Commercial Database D | 56.8% | 692
GPT-5.2 Agent (xhigh) | 49.7% | 605
Claude Opus 4.5 (thinking) | 51.0% | 621
Gemini 3 Pro Preview Agent (high) | 52.6% | 641

3.11 Distributional Analysis

Average scores can hide brittleness. CASBench therefore reports per-query distributions. For brevity, we show per-query distributions for Convexia, the best-performing commercial database baseline, and a representative frontier agent.

R@P≥0.95 distribution across the 80 test queries:

System | Mean | Median | P10 | P90 | Std dev
Convexia | 0.921 | 0.91 | 0.71 | 1.00 | 0.14
Commercial Database A | 0.814 | 0.832 | 0.57 | 0.98 | 0.16
GPT-5.2 Agent (xhigh) | 0.558 | 0.56 | 0.29 | 0.86 | 0.19

False positives per query in the high-precision prefix:

System | 0 FP | 1 FP | 2+ FP | Mean FP/Q
Convexia | 55 (69%) | 21 (26%) | 4 (5%) | 0.44
Commercial Database A | 30 (38%) | 28 (35%) | 22 (28%) | 1.25
GPT-5.2 Agent (xhigh) | 12 (15%) | 20 (25%) | 48 (60%) | 2.05
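
A minimal sketch of how these per-query distribution statistics can be computed, assuming `scores` holds one R@P≥0.95 value per test query; NumPy's default linear-interpolation percentiles are used, and the ddof choice for the standard deviation is an assumption:

import numpy as np

def distribution_stats(scores: list[float]) -> dict:
    """Per-query distribution summary for a system's R@P>=0.95 scores."""
    s = np.asarray(scores, dtype=float)
    return {
        "mean": float(s.mean()),
        "median": float(np.median(s)),
        "p10": float(np.percentile(s, 10)),
        "p90": float(np.percentile(s, 90)),
        "std": float(s.std(ddof=1)),  # sample standard deviation (assumption)
    }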

3.12 Case Studies

CASBench includes full query packets upon request (email founders@convexia.bio). The public report summarizes three representative cases.

Case 1: Standard Structured Query

Query: Indication: rare disease; Modality: AAV; Geography: US, EU, UK; Stage: Phase 1 or Phase 1/2; Exclusions: active pharma sponsor; Time window: last 5y; Additional information: Clinical-stage AAV or lentivirus-based gene therapies for a rare disease with at least one patient dosed and an academic sponsor based in the US, EU or UK

Gold list: 26 qualifying assets (primarily from clinical registries, investor presentations, and peer-reviewed publications).

Results (recall on the high-precision prefix): Convexia 26/26 (100%); Commercial Database A 25/26 (96%); Commercial Database C 25/26 (96%).

Interpretation: Clinical programs are well-covered by standard sources, so most systems converge. Remaining gaps come from constraint parsing, deduplication, and evidence-linking quality rather than long-tail ingestion.

Case 2: Long-tail Sources Drive Separation

Query: China-disclosed, IND-enabling to late preclinical small molecules that directly inhibit RANKL–RANK signaling or block RANK receptor activation (not downstream generic NF-κB inhibitors), proposed for I&I indications (RA, PsA, hidradenitis suppurativa, uveitis, IBD). Inclusions: PRC-headquartered sponsor or first patent family filed in CN; oral or SC feasible; small molecule or peptidomimetic only. Exclusions: anti-RANKL biologics (mAbs, Fc fusions), osteoporosis-only positioning, broad NF-κB/IKK inhibitors without RANK/RANKL linkage. Evidence bar: at least one direct pathway assay (RANKL-stimulated osteoclastogenesis inhibition or RANK activation reporter) plus either co-crystal/biophysics (SPR/ITC) or rich SAR table.

Gold list: 7 assets (4 TTO portfolios, 2 academic theses, 1 conference poster).

Results (recall on the high-precision prefix, rounded): Convexia 6/7 (86%); Commercial Database A 3/7 (43%); Commercial Database C 2/7 (29%).

Interpretation: Curated drug databases typically capture only the subset that later appears in patents or registries. Convexia's advantage comes from long-tail source ingestion plus entity resolution across non-standard disclosures, which is particularly relevant for preclinical programs.

Case 3: Convexia Underperformance Example

Query: Phase I KRAS G12C inhibitors for NSCLC, excluding covalent inhibitors.

Gold list: 12 assets.

Results (recall on the high-precision prefix): Convexia 10/12 (83%); Commercial Database A 11/12 (92%); Commercial Database C 9/12 (75%).

Root cause: ambiguous covalent vs non-covalent classification in source documents led to incorrect inclusion or exclusion.

Interpretation: this is primarily a mechanism-of-action extraction and normalization failure mode.

3.13 Ablations and System Contribution Analysis

This section decomposes where performance appears to come from. Measured rows come from the CASBench evaluation.

Variant | R@P≥0.95 | CASBench-Source | FP/Q
Base LLM (no browsing, no tools) | 0.19 | 0.24 | 5.8
Frontier LLM agent (web browsing) | 0.56 | 0.79 | 2.1
Best commercial DB (native search) | 0.814 | 0.939 | 1.25
Convexia (core sources only: patents + registries) | 0.812 | 0.934 | 0.73
Convexia (full system) | 0.921 | 0.927 | 0.44

Interpretation: most of the gap to generic LLM agents comes from retrieval coverage plus constraint-aware normalization. The remaining gap to the best commercial database is driven by long-tail source coverage.

3.14 Failure Modes and Known Limitations

Common failure patterns observed on the 10 lowest-scoring CASBench test queries (Convexia R@P≥0.95 below 0.70):

  • Ambiguous mechanism-of-action classification (example: covalent vs non-covalent, selective vs non-selective).
  • Non-English primary sources with translation gaps (example: partial English abstracts for patents).
  • Complex exclusion criteria with edge-case interpretation disagreements.

Where Convexia underperforms relative to curated databases:

  • Queries that require exact numeric constraints expressed heterogeneously in source documents (example: exact enrollment cutoffs or dose constraints).

These limitations are addressable with improved entity resolution, translation coverage, and numeric constraint parsing.

4. Methods Appendix

4.1 Dataset Accounting

CASBench v1.0:

  • 100 total thesis queries.
  • 20-query development split (1,029 gold assets) used for iteration only and not reported.
  • 80-query test split (4,137 gold assets) used for all reported CASBench metrics.

4.2 Leakage Controls

Leakage is the most common reason that public benchmarks overstate real-world performance. CASBench implements complementary controls.

CASBench leakage controls:

  • As-of date (2025-11-01): gold lists are restricted to assets disclosed on or before this date.
  • Reviewer blinding: gold lists are built without access to Convexia's system outputs.
  • Evidence requirement: gold assets must have at least two independent public evidence artifacts.
  • Competitor fairness: systems are evaluated on their native interfaces without manual post-processing; the same canonicalization and scoring rules are applied to all outputs.

4.3 Competitor Protocol

Principle: competitors are evaluated in conditions that reflect how a team would realistically use them, while still respecting the benchmark's leakage controls.

CASBench competitor protocol:

  • Commercial databases: an analyst constructs the closest equivalent structured query using each tool's native filters corresponding to the query's constraint set. The full returned list is exported without manual curation.
  • Frontier LLM agents: each agent receives the CASBench query JSON and an explicit as-of date. Agents are instructed to return a JSON list of candidate assets with citations to dated sources. Browsing is allowed.
  • No manual fixes: no human edits are applied to any system output beyond canonicalization to shared asset IDs.

Canonicalization and Entity Mapping

Because systems may refer to the same asset using different names (code names, sponsor naming, aliases), all outputs are first mapped to a shared asset registry before scoring.

  • Procedure: We apply a single, system-agnostic mapping process (standardized name normalization plus synonym and alias matching). Ambiguous cases are resolved via limited manual review using the same rules for every system.
  • Unmatched outputs: For each system we report how many candidates could not be mapped to any registry asset, and how they are scored. Unmatched candidates are treated conservatively and do not receive credit as true positives unless they can be reliably mapped into the benchmark universe.
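
A minimal sketch of this mapping step; resolve_alias is a hypothetical stand-in for the standardized normalization plus synonym and alias matching described above, and the registry shape is illustrative:

def resolve_alias(name: str) -> str:
    """Hypothetical normalization: lowercase and strip non-alphanumerics."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def map_to_registry(candidates: list[dict],
                    registry: dict[str, str]) -> tuple[list[str], int]:
    """Map raw candidates to canonical asset_ids via `registry`
    (normalized alias -> asset_id). Unmatched candidates earn no
    true-positive credit and are counted for reporting."""
    mapped, unmatched = [], 0
    for cand in candidates:
        asset_id = registry.get(resolve_alias(cand["asset_name"]))
        if asset_id is None:
            unmatched += 1   # scored conservatively: never a true positive
        else:
            mapped.append(asset_id)
    return mapped, unmatched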

4.4 Prompt Templates and Output Schemas

CASBench Prompt Template

[SYSTEM]
You are a biotech competitive intelligence analyst. Your task is to identify drug 
assets that satisfy a structured sourcing thesis query.

[INPUT]
Query JSON:
{QUERY_JSON}

Rules:
- Use the as_of_date in the Query JSON as a hard cutoff.  
- Return assets that satisfy ALL constraints (target, modality, stage 
window, indication, and exclusions).
- Do not guess. Only include an asset if you can provide dated evidence for each 
constraint dimension that is explicitly present in the Query JSON (e.g., 
primary_target, modality, stage_as_of_date, indication if specified, and any geography 
or sponsor/disclosure constraints if specified).
- Exclusions: exclude an asset only if you find evidence it violates an exclusion. If 
you cannot find any evidence of a violation, do not invent one.
- Prefer canonical asset identity (INN/USAN if known) and include common code names in 
the asset_name field when helpful for disambiguation.
- If the asset identity is ambiguous (cannot be reliably distinguished from another 
program), exclude it.

[OUTPUT]
Return a JSON array of objects with the schema below.

Schema:
{
  "asset_name": ".",
  "sponsor": ".",
  "modality": ".",
  "primary_target": ".",
  "indications": ["."],
  "stage_as_of_date": ".",
  "evidence": [
    {
      "source_type": "patent | company_website | registry | publication | poster | 
thesis | tto | grant | other",
      "source_id": ".",
      "date": "YYYY-MM-DD",
      "supporting_span": "short quoted span or precise section reference"
    }
  ]
}

CASBench Scoring Template (query-specific rubric)

For each returned candidate asset A and query Q:

1) Let D(Q) be the set of constraint dimensions explicitly present in Q's structured 
constraint specification. Examples of dimensions include target, modality, stage 
window, indication, sponsor/disclosure geography, or other structured constraints 
included in the query.

2) For each dimension d in D(Q), compute:
match_d = 1 if A satisfies Q's constraint for dimension d else 0

3) If Q specifies exclusions, compute:
exclusion_compliance = 1 if A does not violate any Q.exclusions else 0
If Q specifies no exclusions, exclusion_compliance is omitted from scoring.

4) Define the constraint score using only the active dimensions for that query:
constraint_score = average of {match_d for d in D(Q)} plus exclusion_compliance if 
applicable.
(Equivalently, constraint_score is normalized to [0,1] for every query by dividing by 
the number of scored sub-components.)

Ranking for R@P≥0.95:
- Rank candidates by constraint_score descending.
- Break ties deterministically (canonical asset_id ascending).
- Scan the ranked list and select the longest prefix with precision >= 0.95.
- Report recall on that prefix: TP_prefix / |gold_assets(Q)|.
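
A compact sketch of this ranking-and-prefix protocol in Python; the candidate triple shape is an assumption for illustration:

def recall_at_precision(candidates: list[tuple[str, float, bool]],
                        n_gold: int, p_min: float = 0.95) -> float:
    """candidates: (asset_id, constraint_score, is_gold) triples, already
    canonicalized and rubric-scored. Returns recall on the longest ranked
    prefix whose precision is >= p_min (0.0 if no prefix qualifies)."""
    # Rank by constraint_score descending; break ties by asset_id ascending.
    ranked = sorted(candidates, key=lambda c: (-c[1], c[0]))
    best_recall, tp = 0.0, 0
    for k, (_, _, is_gold) in enumerate(ranked, start=1):
        tp += is_gold
        if tp / k >= p_min:            # this prefix meets the precision bar
            best_recall = tp / n_gold  # keep the latest qualifying prefix
    return best_recall

Because recall never decreases as the prefix grows, keeping the last qualifying prefix during the scan reports exactly the longest prefix with precision ≥ 0.95, as specified above.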

5. Worked Examples

CASBench Worked Example: CAS-DEV-017 (from the development split)

Query Specification

{
  "query_id": "CAS-DEV-017",
  "thesis": "NaV1.8 (SCN10A) small-molecule inhibitor assets that are 
preclinical in China (PRC development control and/or PRC-first disclosure), as 
of 2025-11-01",
  "constraints": {
    "target": ["NaV1.8", "SCN10A"],
    "modality": "small_molecule",
    "stage_min": "discovery",
    "stage_max": "late_preclinical",
    "geography": "China"
  },
  "exclusions": [
    "biologics (peptides, antibodies, toxins)",
    "non-selective sodium channel blockers where Nav1.8 is not a primary 
mechanistic driver",
    "purely upstream/downstream pain-pathway programs without NaV1.8 target 
engagement evidence",
    "non-PRC development control as of as_of_date"
  ],
  "as_of_date": "2025-11-01",
  "difficulty": "Hard"
}

Difficulty Rationale: Hard. Most qualifying assets are disclosed first in PRC-linked sources (Chinese-language patents, local industry writeups, PRC corporate pages, or PRC academic literature). The main adjudication difficulty is mechanistic scope control (NaV1.8-selective vs pan-Nav blockers, and NaV1.8-primary vs "Nav1.7/1.8 dual" positioning), plus identity normalization across sparse "series-level" disclosures (patent series with no stable code name).

Gold List (17 assets; full list shown)

Drug / code | Sponsor | Stage | Earliest disclosure | Earliest source type | Dual-target | Primary source
Anrun Phenyl-Modified Nav1.8 Analog | Anrun Pharmatech | Late discovery | 2024-04-23 | Patent database | No | CN118388466A
CTTQ Tricyclic Heteroaryl Nav1.8 Inhibitor Series | Chia Tai Tianqing Pharmaceutical (正大天晴) | Patent-only | 2024-03-15 (EST) | Patent | No | WO2024063008A1
Haisco Tetrahydrofuran-Derived Nav1.8 Blocker Series | Haisco Pharmaceutical Group (海思科医药集团) | Patent-only | 2023-01-13 | Patent | No | WO2024188367A1
Hengrui Amidine-Modified VX-548 Analog | Hengrui Pharmaceuticals | Patent-only | 2024-02-01 | Patent database / financial news | Yes (Nav1.7) | AdisInsight 800057997
Huilun/Easton Structural-Modified Nav1.8 Inhibitor | Huilun (汇伦医药) + Chengdu Easton Biopharma | Patent-only | 2022-08-28 | Patent | No | WO2024046253A1
Humanwell Non-Opioid Nav1.8 Pain Blocker Series | Humanwell Healthcare (人福医药) | Patent-only | 2022-12-23 | Patent | No | WO2024128919A1
Rejin Heterocyclic Nav1.8 Pain Inhibitor Series | Jiangsu Rejin Pharmaceutical (江苏热景制药) | Patent-only | 2022-11-23 | Patent database | No | WO2024103135A1
Deheng Nav1.8 Blocker for Pain/Cough/Pruritus | Nanjing Deheng Pharmaceutical (南京德恒制药) | Patent-only | 2024-01-04 | Patent | No | WO2025113633A1
SIMM SCN10A (Nav1.8) Blocker Series | Shanghai Institute of Materia Medica (SIMM) | Patent-only | 2023-11-21 | Patent | No | WO2023207949A1
Wennai SCN10A (Nav1.8) Sodium Channel Blocker | Shanghai Wennai Pharma Tech (上海文耐) | Patent-only | 2023-08-11 | Patent | No | WO2025036819
Guoyuan Tetrazole-Linked Nav1.8 Inhibitor | Yancheng Guoyuan New Materials Co. | Patent-only | 2024-04-18 | Patent database | No | CN118440065A
Hyacinth Deuterated VX-548 Series | Zhejiang Hyacinth Pharmaceutical (浙江惠仁) | Patent-only | 2022-09-20 | Patent | No | WO2024063008A1
CSPC Ouyi Pyrazole-Based Nav1.8 Program | CSPC Ouyi Pharmaceutical (石药欧意) | Pre-clinical | 2024-01-19 | Industry report | No | CN117424505A
NMU QLS-81 Nav1.7/1.8 Dual Inhibitor | Nanjing Medical University (南京医科大学) | Pre-clinical | 2021-06-08 | Publication | Yes (Nav1.7) | PMC8285378
WCH Adamantane-Sulfonamide Nav1.8 Inhibitor (Compound 6) | West China Hospital / Sichuan University | Preclinical / Lead optimization | 2024-06-01 (EST) | Publication | No | DOI: 10.1016/j.bmcl.2024.129862
SIMM Nicotinamide Scaffold Nav1.8 Inhibitor (2c) | Shanghai Institute of Materia Medica (SIMM) | Publication | 2023-04-14 | Publication | No | PubMed: 37084597
Yangguang Dual Nav1.7/1.8 Pain Inhibitor Series | Yangguang Anjin | Patent-only | 2025-01-01 | Company ecosystem / financial news | Yes (Nav1.7) | ITJuzi Company Profile

Evidence Artifacts (CAS-NAV18-004 example)

{
  "asset_id": "CAS-NAV18-004",
  "drug_name": "Hengrui Amidine-Modified VX-548 Analog",
  "evidence_artifacts": [
    {
      "artifact_id": "EV-NAV18-004-001",
      "source_type": "patent",
      "source_id": "WO2024041613A1 (family; PRC-origin series)",
      "language": "Chinese/English",
      "date": "2024-02-01",
      "claims_supported": ["target", "modality", "mechanism/selectivity", 
"geography"],
      "supporting_span": "series described as NaV1.8 inhibitors with pain-use 
claims; includes selectivity panel"
    },
    {
      "artifact_id": "EV-NAV18-004-002",
      "source_type": "other",
      "source_id": "AdisInsight drug profile (ID: 800057997)",
      "language": "English",
      "date": "2024-02-15",
      "claims_supported": ["sponsor", "stage", "indication"],
      "supporting_span": "lists sponsor (Hengrui), modality (small molecule), 
target (NaV1.8), and development stage (preclinical)"
    }
  ]
}

Scoring Walkthrough

System output for CAS-NAV18-004:

{
  "asset_name": "Hengrui Amidine-Modified VX-548 Analog",
  "sponsor": "Hengrui Pharmaceuticals",
  "modality": "Small-molecule NaV1.8 inhibitor",
  "primary_target": "NaV1.8 (SCN10A) [Nav1.7 (SCN9A) secondary noted in 
selectivity panel]",
  "Geography": ["China"],
  "stage_as_of_date": "Preclinical",
  "evidence": [
    {
      "source_type": "patent",
      "source_id": "WO2024041613A1",
      "date": "2024-02-01",
      "supporting_span": "series described as NaV1.8 inhibitors with pain-use 
claims; includes selectivity panel"
    },
    {
      "source_type": "company",
      "source_id": "Company Website",
      "date": "2024-02-15",
      "supporting_span": "Hengrui + NaV1.8 small molecule; stage listed as 
preclinical"
    }
  ]
}

Constraint satisfaction scoring:

Constraint | Score | Rationale
target_match | 1 | Primary target includes NaV1.8 (SCN10A).
modality_match | 1 | Small molecule stated and supported.
geography_match | 1 | China.
stage_match | 1 | "Preclinical" falls within discovery–late_preclinical window.
exclusion_compliance | 1 | Not a biologic; not a non-selective blocker where NaV1.8 is absent; dual Nav1.7/1.8 is allowed if NaV1.8 is primary.
constraint_score | 1.0 |

Evidence quality scoring (CASBench-Source):

Required claim | Citation provided | Citation valid | Score
Target | WO2024041613A1 | Yes (patent describes voltage-gated sodium channel inhibitors and explicitly includes NaV1.8 / SCN10A as a primary/preferred target). | 1
Modality | WO2024041613A1 | Yes (patent covers small-molecule compounds and related salts/compositions; not a biologic or fusion construct). | 1
Stage | Company report | Yes ("Preclinical" supports the discovery–late_preclinical window per the rubric). | 1
Geography | WO2024041613A1 | Yes (PCT application filed via the CN route and assignees are China-based entities, supporting China geography). | 1
Exclusion | WO2024041613A1 | Yes (exclusion compliance is supported because the invention is not a biologic and NaV1.8 is explicitly included as a primary/preferred target; dual NaV1.7/1.8 still complies under the rubric so long as NaV1.8 is present/primary). | 1

CASBench-Source for this asset: 5/5 = 1

Aggregate Query Results

System | True positives | False positives | Gold assets | Precision | Recall | R@P≥0.95
Convexia | 15 | 1 | 17 | 0.94 | 0.88 | 0.82
Commercial Database A | 7 | 2 | 17 | 0.78 | 0.41 | 0.35
GPT-5.2 Agent | 5 | 6 | 17 | 0.45 | 0.29 | 0.12

Recall by Earliest-Disclosure Source Type (this query)

Source type (earliest disclosure) | N gold | Convexia recall | Best competitor recall
Patents / patent databases | 12 | 83% | 33%
Academic publications (incl. PubMed) | 4 | 100% | 67%
Industry reports | 1 | 100% | 0%
Financial news / company ecosystem pages | 1 | 0% | 0%

6. Limitations and Threats to Validity

No benchmark perfectly captures real-world deployment. The following limitations should be considered when interpreting CASBench results.

1. Gold Set Incompleteness in Preclinical and Long-tail Sources

The gold labels are most likely to miss early, fragmented disclosures (posters, TTO pages, grants, local-language press). This can undercount true recall and distort where "alpha" appears.

2. Temporal Leakage and Drifting Corpora

Sources update over time. If any gold items or evidence links reflect post-hoc updates, the benchmark inflates performance, especially for systems tied to frequently refreshed aggregators. Results are not inherently prospective unless time-sliced.

3. Precision Operating Point May Not Match Analyst Workflows

R@P≥0.95 reflects a strict "high-precision prefix" regime. In practice, teams often choose a different tradeoff and then filter with review, which can materially change relative rankings.

4. Baseline Comparability is Imperfect Across Product Types

Databases, LLM agents, and Convexia differ in UX assumptions, filtering defaults, and citation behaviors. Head-to-head metrics may not reflect each system's best-practice usage.