Convexia Sourcing and Diligence Benchmarks
CASBench v1.0
1. Executive Summary
CASBench v1.0 benchmarks Convexia vs commercial databases and frontier LLM agents on 80 asset sourcing queries at a high-precision operating point. Convexia's advantage is concentrated in preclinical, ex-USA, and long-tail sources. Clinical and approved-stage queries are easier for everyone.
Headline Results:
- R@P≥0.95: Convexia 0.921 vs best baseline 0.814
- False positives per query: Convexia 0.44 vs best baseline 1.25
Where Convexia Dominates:
- Preclinical: Convexia 0.893 vs best baseline 0.652
- Ex-USA (China-linked proxy): Convexia 0.874 vs best baseline 0.621
- Long-tail sources (beyond ClinicalTrials.gov): near parity on registries, large gains on non-standard sources
- TTO portfolios: +51.3 pp
- Conference posters/abstracts: +39.1 pp
- Grant databases: +46.7 pp
- Patents: 98.1% vs 81.5%, +16.6 pp
- Lower analyst cleanup: 0 FP on 69% of queries vs 38% for best baseline
What Converges:
- Clinical and late/approved: high scores across systems; gaps shrink vs preclinical
- Easy queries: systems are closer; separation grows on hard regimes
- Robustness: Convexia P10 0.71 vs best baseline 0.57
[Figure: Recall at High Precision (R@P≥0.95) by system]
[Figure: False Positives per Query by system]
Key Insight: Convexia achieves 13% higher recall while producing 64% fewer false positives, delivering superior precision and completeness for sourcing workflows.
2. Benchmark Overview and Design Principles
Design Principles:
- Workflow realism: CASBench uses thesis-style sourcing queries with explicit constraints.
- High-precision evaluation: benchmarks include metrics that punish false positives.
- Evidence requirements: outputs are scored on citation support, not just correctness. A correct answer without traceable evidence is treated as partially complete.
- Leakage controls: both benchmarks implement time-based freezing and document timestamp filtering so that future outcomes and post-hoc updates are excluded from inputs.
Benchmark Scope:
CASBench evaluates: asset discovery completeness under strict precision constraints, long-tail source coverage, and citation correctness for structured thesis queries.
Why Convexia is Structurally Advantaged
Convexia is built to answer thesis-style sourcing questions end-to-end. Its sourcing layer combines proprietary connectors with advanced reasoning models that identify where undiscovered assets are most likely to exist. By reasoning over mechanism adjacencies, platform applicability, and historical licensing patterns, Convexia surfaces opportunities in channels that traditional searches might miss entirely.
This is especially valuable for deprioritized or non-marketed programs, where the signal is scattered across filings, investor materials, company sites, and other low-indexed disclosures, rather than structured pipeline databases or out-licensing catalog pages.
Downstream, Convexia compounds that advantage by turning retrieved evidence into decision outputs, including a proprietary probability-of-success model that can be customized to a firm's investment style via reinforcement fine tuning, plus auto-generated outreach lists and investor-ready deal packs.
3. CASBench v1.0: Convexia Asset Sourcing Benchmark
3.1 What CASBench Measures
CASBench measures how effectively a system can answer a sourcing thesis query such as: "NaV1.8 Small Molecules Preclinical in China."
Given a structured query, a system must return a candidate set of drug assets that satisfy the query's explicit constraints and provide evidence for each constrained claim. Constraint dimensions are query-dependent. Systems are scored on constraint satisfaction only for dimensions that are present in the query specification; unconstrained dimensions are not scored.
CASBench is designed for workflows where an analyst is willing to review a finite list, but cannot afford to chase incorrect assets. The primary metric is therefore recall at very high precision (R@P≥0.95).
3.2 Dataset Accounting and Splits
| Split | Queries | Gold assets | Purpose |
|---|---|---|---|
| Development | 20 | 1,029 | System iteration only |
| Test | 80 | 4,137 | Reported leaderboard |
| Total | 100 | 5,166 | Total benchmark size |
All CASBench metrics in this report are computed on the 80-query test set only.
To make CASBench representative of real sourcing workflows, we created a fixed pool of 100 thesis-style queries.
- Authors and roles: Queries were drafted by domain experts (scientists, BD, and diligence analysts) who routinely source or evaluate therapeutics.
- Sampling strategy: Queries were chosen to cover a balanced range of sourcing intents (broad landscape scans, narrow constraint satisfaction searches, and exclusion-heavy screens). The final pool is stratified to avoid over-weighting any single therapeutic area, modality, or stage.
- Exclusion rules: We excluded queries that were too underspecified to yield an evaluable gold set (for example, purely exploratory prompts with no constraints), queries that trivially name a single known asset or company as the answer, and near-duplicates that differ only by wording.
- Freeze and governance: The query pool and the dev/test split were finalized and frozen prior to running benchmark evaluations.
Email founders@convexia.bio for the complete test stack.
Test Set Stratification
| Dimension | Breakdown |
|---|---|
| Difficulty | Easy: 24; Medium: 32; Hard: 24 |
| Stage focus | Preclinical: 34; Clinical/Approved: 30; Shelved/Paused: 16 |
| Therapeutic area | Onc: 22; CNS: 10; Imm: 12; CV: 8; Rare: 14; Other: 14 |
3.3 Definition, Registry Schema, and Deduplication
Asset: A unique drug program defined by the tuple: active ingredient + modality + primary target + sponsor context.
Gold asset (gold-list asset): For a given query, a gold asset is an asset that (i) satisfies all query constraints under the adjudication rubric, (ii) is publicly disclosed on or before the query's as-of date, (iii) is supported by at least two independent public evidence artifacts, and (iv) is canonicalized and deduplicated to a single asset_id. The set of gold assets for a query is its gold list. View more in Section 3.4.
Asset Registry Schema:
- asset_id (canonical)
- inn_name
- code_names[]
- sponsor_id
- target_ids[]
- modality
- indications[]
- stage
- earliest_disclosure_date and earliest_disclosure_source_type
- evidence_artifact_ids[]
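For concreteness, a registry record can be represented as a typed structure. The sketch below is illustrative only: the field names follow the schema above, while the types and defaults are assumptions (the benchmark does not publish an implementation).

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AssetRecord:
    """One canonical, deduplicated asset in the CASBench registry (sketch).

    Field names mirror the schema above; types and defaults are assumptions.
    """
    asset_id: str                        # canonical identifier after dedup
    inn_name: Optional[str] = None       # INN/USAN when assigned
    code_names: list[str] = field(default_factory=list)  # e.g. ["BMS-986165"]
    sponsor_id: str = ""
    target_ids: list[str] = field(default_factory=list)
    modality: str = ""
    indications: list[str] = field(default_factory=list)
    stage: str = ""
    earliest_disclosure_date: str = ""   # ISO "YYYY-MM-DD"
    earliest_disclosure_source_type: str = ""
    evidence_artifact_ids: list[str] = field(default_factory=list)
```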
Query Difficulty Labeling (Easy, Medium, Hard)
Each thesis query is assigned a difficulty label after the gold set is constructed but before any system is evaluated, ensuring labels reflect human effort required to compile the gold list rather than system performance. Labeling is based on two properties observed during gold set construction: (1) the evidence regime required to recover the gold list and (2) constraint complexity in the query specification.
"Standard sources" are high-coverage, well-indexed structured or semi-structured corpora with consistent schemas and broad access.
"Long-tail sources" are fragmented or poorly indexed channels, especially unstructured evidence that requires crawling plus OCR/transcription/translation to extract signal.
Deduplication Rules:
- Code names vs INN/USAN: mapped via a curated synonym table (example: BMS-986165 and deucravacitinib refer to the same asset).
- Partnered or co-developed assets: counted once under the sponsor with development control as of the benchmark as-of date.
- Fixed-dose combinations: counted as distinct assets; co-administration regimens are not treated as new assets.
- Salts, esters, and formulations: treated as the same asset unless they represent a distinct clinical development path.
- Biosimilars: treated as separate assets from the originator due to different sponsors and regulatory paths.
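A minimal sketch of the first rule (mapping code names to INN/USAN via a curated synonym table) follows. Only the BMS-986165/deucravacitinib pair comes from the rules above; the table entries and the asset_id are hypothetical placeholders.

```python
from typing import Optional

# Curated synonym table (sketch): alias -> canonical name. The asset_id
# below is a hypothetical placeholder, not a benchmark identifier.
SYNONYMS = {
    "bms-986165": "deucravacitinib",
    "deucravacitinib": "deucravacitinib",
}
CANONICAL_IDS = {"deucravacitinib": "CAS-TYK2-001"}  # hypothetical asset_id

def canonical_asset_id(raw_name: str) -> Optional[str]:
    """Map any alias to its canonical asset_id; None if unmatched."""
    canonical_name = SYNONYMS.get(raw_name.strip().lower())
    return CANONICAL_IDS.get(canonical_name) if canonical_name else None

# Both aliases resolve to the same asset, so they count once in a gold list.
assert canonical_asset_id("BMS-986165") == canonical_asset_id("deucravacitinib")
```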
3.4 Ground Truth Construction
Gold lists were constructed by at least two independent reviewers per query, drawn from a pool of domain experts including contracted pharmaceutical consultants, biotech customers participating in validation studies, and analysts who had previously conducted comparable competitive intelligence exercises. Reviewers were blinded to system outputs during initial gold-list construction. A third reviewer adjudicated disagreements when necessary.
When sources conflict on stage, indication, or sponsor control, adjudication uses the following precedence order:
1. Trial registries and regulatory documents (ClinicalTrials.gov, EUCTR/CTIS, labels, assessment reports)
2. Peer-reviewed publications and official conference abstracts/posters
3. Patents (for identity, target, modality; weaker for stage)
4. SEC filings and official investor presentations
5. Company press releases
6. Secondary aggregators (third-party databases, news summaries)
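A sketch of how this precedence order resolves a conflicting field is below; the source-type labels are shorthand for the six tiers above, not benchmark identifiers.

```python
# Precedence tiers from the list above, highest first.
PRECEDENCE = [
    "registry_or_regulatory",   # 1. trial registries, regulatory documents
    "publication_or_abstract",  # 2. peer-reviewed papers, conference abstracts
    "patent",                   # 3. patents (identity/target/modality)
    "sec_or_investor",          # 4. SEC filings, investor presentations
    "press_release",            # 5. company press releases
    "aggregator",               # 6. secondary aggregators
]
RANK = {source: i for i, source in enumerate(PRECEDENCE)}

def adjudicate(claims: list[tuple[str, str]]) -> str:
    """claims: (source_type, claimed_value) pairs for one disputed field.
    Returns the value asserted by the highest-precedence source."""
    _, value = min(claims, key=lambda claim: RANK[claim[0]])
    return value

# A registry says Phase 1 while an aggregator still says Preclinical:
print(adjudicate([("aggregator", "Preclinical"),
                  ("registry_or_regulatory", "Phase 1")]))  # -> "Phase 1"
```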
3.5 Data Sources and System Cards
CASBench is sensitive to source coverage. In this report, source category refers to the type of evidence artifact ingested and retrievable by a system. Source category is not exclusive: a single asset can have evidence across multiple categories over time.
Convexia CAS:
- Patents: USPTO, EPO, WIPO, and national patent offices.
- Clinical registries: ClinicalTrials.gov, EUCTR, CTIS, JPRN, ChiCTR, ANZCTR, and other regional registries.
- Publications: PubMed and preprints.
- Technology transfer offices (TTOs): 500+ university and hospital portfolios.
- Grant databases: NIH and major international grant sources.
- Conference materials: proceedings, posters, and abstracts.
- Non-English sources: coverage includes Chinese, Japanese, Korean, and German sources.
- Audio, video, and image disclosures
- Corporate filings and websites: SEC filings, annual reports, investor decks, and company websites.
- News and press releases: pharmaceutical news outlets, wire services, and company announcements.
- Social media and forums: X, LinkedIn, Reddit discussions, and industry-specific online communities.
- Crowdfunding and venture pages: platforms like Kickstarter or AngelList for early-stage biotech projects.
- Regulatory documents: FDA/EMA approvals, warning letters, and inspection reports.
- Other digital breadcrumbs
Competitor Systems:
- Commercial Database A: A curated database for tracking and analyzing global pharmaceutical research and development pipelines, with optional advanced analytics and AI-enhanced tooling.
- Commercial Database B: A platform that integrates biological, chemical, and pharmacological data to support drug discovery and development decision-making.
- Commercial Database C: A comprehensive intelligence platform covering pipeline and marketed drugs across the pharmaceutical industry.
- Commercial Database D: An AI-powered biopharma intelligence platform offering data on drugs, clinical trials, and competitor pipelines.
- Frontier LLM agents: advanced AI systems built on large language models that operate autonomously, often with browsing capabilities, to perform tasks and achieve goals.
3.6 Leakage Controls and Data Freeze
As-of date: 2025-11-01. For every system, evidence access was time-filtered to the as-of date. Documents created after 2025-11-01 were excluded, and documents updated after 2025-11-01 were evaluated using their as-of snapshot version when available. Retrieval was conducted against an as-of constrained corpus; where strict as-of snapshotting was not technically available for a given source, we applied the closest feasible time-bounded proxy and documented the residual limitation.
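In pseudocode terms, the filter behaves roughly as follows. The document fields ('created', 'updated', 'snapshots') are assumptions about how a time-sliced corpus might be stored, not a description of any system's internals.

```python
from datetime import date
from typing import Optional

AS_OF = date(2025, 11, 1)

def as_of_view(doc: dict) -> Optional[dict]:
    """Return the version of `doc` visible at the as-of date, or None if it
    is excluded. Assumes ISO 'created'/'updated' dates and an optional
    'snapshots' list of {'date': ..., 'content': ...} dicts."""
    if date.fromisoformat(doc["created"]) > AS_OF:
        return None                      # created after the freeze: excluded
    if date.fromisoformat(doc["updated"]) <= AS_OF:
        return doc                       # never touched post-freeze: use as-is
    # Updated post-freeze: fall back to the latest compliant snapshot.
    eligible = [snap for snap in doc.get("snapshots", [])
                if date.fromisoformat(snap["date"]) <= AS_OF]
    if eligible:
        return max(eligible, key=lambda snap: snap["date"])
    return None  # no as-of snapshot available: apply a proxy and log the gap
```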
3.7 Metrics and Scoring
CASBench reports three metrics:
- Candidate-set precision and recall, canonicalized to benchmark asset IDs.
- R@P≥0.95: recall under a strict precision constraint, computed with a uniform external ranking protocol in which each candidate asset is scored on constraint satisfaction against a query-specific rubric derived from the structured query specification.
- Evidence quality (CASBench-Source): fraction of required claims supported by valid citations.
Additional diagnostic metrics:
- False positives per query (FP/Q) in the high-precision prefix used for R@P≥0.95.
- Recall by earliest-disclosure source type on the hard-query subset (long-tail coverage analysis).
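To make the R@P≥0.95 protocol concrete, the sketch below implements the ranking and prefix rule described in the scoring template (Section 4.4): rank candidates by constraint_score, break ties deterministically, take the longest prefix whose precision stays at or above 0.95, and report recall on that prefix. Input shapes are assumptions.

```python
def recall_at_precision(candidates: list[tuple[str, float]],
                        gold: set[str],
                        min_precision: float = 0.95) -> tuple[float, int]:
    """candidates: (asset_id, constraint_score) pairs for one query.
    Returns (recall on the longest prefix with precision >= min_precision,
    false positives inside that prefix)."""
    # Rank by constraint_score descending; tie-break by asset_id ascending.
    ranked = sorted(candidates, key=lambda c: (-c[1], c[0]))
    best_recall, best_fp = 0.0, 0
    tp = fp = 0
    for depth, (asset_id, _) in enumerate(ranked, start=1):
        tp += asset_id in gold
        fp += asset_id not in gold
        if tp / depth >= min_precision:   # this prefix qualifies;
            best_recall = tp / len(gold)  # the longest qualifying one wins
            best_fp = fp
    return best_recall, best_fp
```

The headline R@P≥0.95 in the tables below is this per-query recall averaged over the 80 test queries, and FP/Q is the corresponding average of the prefix false-positive counts.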
3.8 Head-to-Head Results
| System | R@P≥0.95 | 95% CI | Candidate precision | Candidate recall | CASBench-Source | FP/Q |
|---|---|---|---|---|---|---|
| Convexia | 0.921 | (0.89-0.95) | 0.969 | 0.973 | 0.927 | 0.44 |
| Commercial Database A | 0.814 | (0.78-0.85) | 0.936 | 0.866 | 0.939 | 1.25 |
| Commercial Database B | 0.744 | (0.71-0.78) | 0.907 | 0.809 | 0.881 | 1.63 |
| Commercial Database C | 0.806 | (0.77-0.84) | 0.923 | 0.835 | 0.932 | 1.34 |
| Commercial Database D | 0.703 | (0.66-0.74) | 0.875 | 0.771 | 0.905 | 1.79 |
| GPT-5.2 Agent (xhigh) | 0.558 | (0.52-0.60) | 0.801 | 0.648 | 0.812 | 2.05 |
| Claude Opus 4.5 (thinking) | 0.577 | (0.54-0.62) | 0.816 | 0.664 | 0.827 | 1.92 |
| Gemini 3 Pro Preview Agent (high) | 0.593 | (0.55-0.64) | 0.824 | 0.672 | 0.803 | 2.14 |
3.9 Stratified Performance
Key Insight: Convexia's advantage grows on harder queries, where complex constraints and long-tail sources become critical. The gap over the best commercial database widens from 1.0 percentage points on easy queries to 20.6 points on hard queries.
By Query Difficulty
| Stratum | Convexia R@P≥0.95 | Commercial Database A | Commercial Database B | Commercial Database C | Commercial Database D | GPT | Opus | Gemini |
|---|---|---|---|---|---|---|---|---|
| Easy (n=24) | 0.952 | 0.942 | 0.889 | 0.931 | 0.872 | 0.619 | 0.702 | 0.742 |
| Medium (n=32) | 0.927 | 0.821 | 0.748 | 0.818 | 0.703 | 0.562 | 0.631 | 0.625 |
| Hard (n=24) | 0.882 | 0.676 | 0.595 | 0.664 | 0.533 | 0.375 | 0.389 | 0.415 |
Convexia's advantage is largest on hard queries, where thesis constraints are complex and long-tail sources matter most.
By Development Stage Focus
| Stratum | Convexia R@P≥0.95 | Commercial Database A | Commercial Database B | Commercial Database C | Commercial Database D | GPT-5.2 | Opus 4.5 | Gemini |
|---|---|---|---|---|---|---|---|---|
| Preclinical (n=34) | 0.893 | 0.652 | 0.540 | 0.642 | 0.480 | 0.287 | 0.306 | 0.326 |
| Clinical/approved (n=30) | 0.933 | 0.924 | 0.882 | 0.918 | 0.852 | 0.725 | 0.748 | 0.761 |
| Shelved/Paused (n=16) | 0.957 | 0.613 | 0.521 | 0.609 | 0.313 | 0.452 | 0.401 | 0.523 |
Preclinical sourcing is a highly differentiating regime because many early assets first appear outside traditional curated drug databases. Shelved/paused sourcing also requires extensive reasoning and cross-source analysis.
3.10 Long-Tail Source Coverage Analysis
To isolate long-tail coverage, we assign each gold asset to exactly one earliest-disclosure bucket, defined by the earliest dated evidence artifact that supports the asset's gold inclusion (from the required evidence_artifact_ids list). Buckets are mutually exclusive and based on artifact type.
Hard-query subset: 24 queries; 1,347 total gold assets.
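A sketch of the bucket-assignment rule, assuming each evidence artifact carries an ISO date and a source_type field:

```python
from datetime import date

def earliest_disclosure_bucket(evidence_artifacts: list[dict]) -> str:
    """Each gold asset gets exactly one bucket: the source type of the
    earliest dated artifact among its required evidence_artifact_ids."""
    earliest = min(evidence_artifacts,
                   key=lambda art: date.fromisoformat(art["date"]))
    return earliest["source_type"]
```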
Key Insight: Convexia demonstrates substantial advantages in long-tail sources critical for early-stage asset discovery, including TTO portfolios (+51.3pp), grant databases (+46.7pp), and conference abstracts (+39.1pp). These gains reflect comprehensive coverage of non-traditional sources that commercial databases often overlook.
| Source type (earliest disclosure) | N assets | Convexia recall | Best competitor recall | Δ (percentage points) |
|---|---|---|---|---|
| Clinical trial registries | 410 | 98.29% | 97.07% | 1.2 |
| Patent databases | 280 | 98.1% | 81.5% | 16.6 |
| Publications and preprints | 120 | 88.33% | 75.83% | 12.5 |
| Regulatory documents | 40 | 92.50% | 90.00% | 2.5 |
| Corporate filings and company websites | 85 | 81.18% | 54.12% | 27.1 |
| Press releases and news | 65 | 75.38% | 50.77% | 24.6 |
| Conference abstracts and posters | 105 | 82.86% | 43.81% | 39.1 |
| TTO portfolios | 80 | 68.75% | 17.50% | 51.3 |
| Grant databases | 30 | 70.00% | 23.33% | 46.7 |
| Academic theses and dissertations | 45 | 64.44% | 22.22% | 42.2 |
| Audio, video, and image disclosures | 12 | 50.00% | 16.67% | 33.3 |
| Social media and forums | 20 | 55.00% | 20.00% | 35.0 |
| Crowdfunding and venture pages | 5 | 40.00% | 20.00% | 20.0 |
| Other digital breadcrumbs | 50 | 54.00% | 24.00% | 30.0 |
Key point: the largest gaps come from non-standard sources that are typically outside curated drug databases. These earliest-disclosure buckets are a practical proxy for "hard-to-see" assets.
China-linked Asset Recovery at High Precision
China-linked gold assets in the CASBench v1.0 test set: 1,218 of 4,137 (29.4%)
| System | Recall on China-linked assets at P≥0.95 | China-linked assets recovered (TP, of 1,218) |
|---|---|---|
| Convexia | 87.4% | 1,065 |
| Commercial Database A | 57.7% | 703 |
| Commercial Database B | 60.5% | 737 |
| Commercial Database C | 62.1% | 756 |
| Commercial Database D | 56.8% | 692 |
| GPT-5.2 Agent (xhigh) | 49.7% | 605 |
| Claude Opus 4.5 (thinking) | 51.0% | 621 |
| Gemini 3 Pro Preview Agent (high) | 52.6% | 641 |
3.11 Distributional Analysis
Average scores can hide brittleness. CASBench therefore reports per-query distributions. For brevity, we show per-query distributions for Convexia, the best-performing commercial database baseline, and a representative frontier agent.
R@P≥0.95 distribution across the 80 test queries:
| System | Mean | Median | P10 | P90 | Std dev |
|---|---|---|---|---|---|
| Convexia | 0.921 | 0.91 | 0.71 | 1.00 | 0.14 |
| Commercial Database A | 0.814 | 0.832 | 0.57 | 0.98 | 0.16 |
| GPT-5.2 Agent (xhigh) | 0.558 | 0.56 | 0.29 | 0.86 | 0.19 |
False positives per query in the high-precision prefix:
| System | 0 FP | 1 FP | 2+ FP | Mean FP/Q |
|---|---|---|---|---|
| Convexia | 55 (69%) | 21 (26%) | 4 (5%) | 0.44 |
| Commercial Database A | 30 (38%) | 28 (35%) | 22 (28%) | 1.25 |
| GPT-5.2 Agent (xhigh) | 12 (15%) | 20 (25%) | 48 (60%) | 2.05 |
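A sketch of how the per-query summaries above can be computed; the quantile and standard-deviation conventions are assumptions, since the report does not specify them.

```python
import statistics

def distribution_summary(per_query_scores: list[float]) -> dict:
    """Summarize per-query R@P>=0.95 scores as in the table above.
    Quantile and stdev conventions are assumptions."""
    deciles = statistics.quantiles(per_query_scores, n=10, method="inclusive")
    return {
        "mean": statistics.mean(per_query_scores),
        "median": statistics.median(per_query_scores),
        "p10": deciles[0],    # first decile
        "p90": deciles[-1],   # ninth decile
        "stdev": statistics.stdev(per_query_scores),  # sample std dev
    }

def fp_histogram(per_query_fp: list[int]) -> dict:
    """Bucket per-query false-positive counts as 0 / 1 / 2+ FP."""
    return {
        "0 FP": sum(1 for n in per_query_fp if n == 0),
        "1 FP": sum(1 for n in per_query_fp if n == 1),
        "2+ FP": sum(1 for n in per_query_fp if n >= 2),
        "mean FP/Q": statistics.mean(per_query_fp),
    }
```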
3.12 Case Studies
CASBench includes full query packets upon request (email founders@convexia.bio). The public report summarizes three representative cases.
Case 1: Standard Structured Query
Query: Indication: rare disease; Modality: AAV; Geography: US, EU, UK; Stage: Phase 1 or Phase 1/2; Exclusions: active pharma sponsor; Time window: last 5y; Additional information: Clinical-stage AAV or lentivirus-based gene therapies for a rare disease with at least one patient dosed and an academic sponsor based in the US, EU or UK
Gold list: 26 qualifying assets (primarily from clinical registries, investor presentations, and peer-reviewed publications).
Results (recall on the high-precision prefix): Convexia 26/26 (100%); Commercial Database A 25/26 (96%); Commercial Database C 25/26 (96%).
Interpretation: Clinical programs are well covered by standard sources, so most systems converge. Remaining gaps come from constraint parsing, deduplication, and evidence-linking quality rather than long-tail ingestion.
Case 2: Long-tail Sources Drive Separation
Query: China-disclosed, IND-enabling to late preclinical small molecules that directly inhibit RANKL–RANK signaling or block RANK receptor activation (not downstream generic NF-κB inhibitors), proposed for I&I indications (RA, PsA, hidradenitis suppurativa, uveitis, IBD). Inclusions: PRC-headquartered sponsor or first patent family filed in CN; oral or SC feasible; small molecule or peptidomimetic only. Exclusions: anti-RANKL biologics (mAbs, Fc fusions), osteoporosis-only positioning, broad NF-κB/IKK inhibitors without RANK/RANKL linkage. Evidence bar: at least one direct pathway assay (RANKL-stimulated osteoclastogenesis inhibition or RANK activation reporter) plus either co-crystal/biophysics (SPR/ITC) or rich SAR table.
Gold list: 7 assets (4 TTO portfolios, 2 academic theses, 1 conference poster).
Results (recall on the high-precision prefix, rounded): Convexia 6/7 (86%); Commercial Database A 3/7 (43%); Commercial Database C 2/7 (29%).
Interpretation: Curated drug databases typically capture only the subset that later appears in patents or registries. Convexia's advantage comes from long-tail source ingestion plus entity resolution across non-standard disclosures, which is particularly relevant for preclinical programs.
Case 3: Convexia Underperformance Example
Query: Phase I KRAS G12C inhibitors for NSCLC, excluding covalent inhibitors.
Gold list: 12 assets.
Results (recall on the high-precision prefix): Convexia 10/12 (83%); Commercial Database A 11/12 (92%); Commercial Database C 9/12 (75%).
Root cause: ambiguous covalent vs non-covalent classification in source documents led to incorrect inclusion or exclusion.
Interpretation: this is primarily a mechanism-of-action extraction and normalization failure mode.
3.13 Ablations and System Contribution Analysis
This section decomposes where performance appears to come from. Measured rows come from the CASBench evaluation.
| Variant | R@P≥0.95 | CASBench-Source | FP/Q |
|---|---|---|---|
| Base LLM (no browsing, no tools) | 0.19 | 0.24 | 5.8 |
| Frontier LLM agent (web browsing) | 0.56 | 0.79 | 2.1 |
| Best commercial DB (native search) | 0.814 | 0.939 | 1.25 |
| Convexia (core sources only: patents + registries) | 0.812 | 0.934 | 0.73 |
| Convexia (full system) | 0.921 | 0.927 | 0.44 |
Interpretation: most of the gap to generic LLM agents comes from retrieval coverage plus constraint-aware normalization. The remaining gap to the best commercial database is driven by long-tail source coverage.
3.14 Failure Modes and Known Limitations
Common failure patterns observed on the 10 lowest-scoring CASBench test queries (Convexia R@P≥0.95 below 0.70):
- Ambiguous mechanism-of-action classification (example: covalent vs non-covalent, selective vs non-selective).
- Non-English primary sources with translation gaps (example: partial English abstracts for patents).
- Complex exclusion criteria with edge-case interpretation disagreements.
Where Convexia underperforms relative to curated databases:
- Queries that require exact numeric constraints expressed heterogeneously in source documents (example: exact enrollment cutoffs or dose constraints).
These limitations are addressable with improved entity resolution, translation coverage, and numeric constraint parsing.
4. Methods Appendix
4.1 Dataset Accounting
CASBench v1.0:
- 100 total thesis queries.
- 20-query development split (1,029 gold assets) used for iteration only and not reported.
- 80-query test split (4,137 gold assets) used for all reported CASBench metrics.
4.2 Leakage Controls
Leakage is the most common reason that public benchmarks overstate real-world performance. CASBench implements complementary controls.
CASBench leakage controls:
- As-of date (2025-11-01): gold lists are restricted to assets disclosed on or before this date.
- Reviewer blinding: gold lists are built without access to Convexia's system outputs.
- Evidence requirement: gold assets must have at least two independent public evidence artifacts.
- Competitor fairness: systems are evaluated on their native interfaces without manual post-processing; the same canonicalization and scoring rules are applied to all outputs.
4.3 Competitor Protocol
Principle: competitors are evaluated in conditions that reflect how a team would realistically use them, while still respecting the benchmark's leakage controls.
CASBench competitor protocol:
- Commercial databases: an analyst constructs the closest equivalent structured query using each tool's native filters corresponding to the query's constraint set. The full returned list is exported without manual curation.
- Frontier LLM agents: each agent receives the CASBench query JSON and an explicit as-of date. Agents are instructed to return a JSON list of candidate assets with citations to dated sources. Browsing is allowed.
- No manual fixes: no human edits are applied to any system output beyond canonicalization to shared asset IDs.
Canonicalization and Entity Mapping
Because systems may refer to the same asset using different names (code names, sponsor naming, aliases), all outputs are first mapped to a shared asset registry before scoring.
- Procedure: We apply a single, system-agnostic mapping process (standardized name normalization plus synonym and alias matching). Ambiguous cases are resolved via limited manual review using the same rules for every system.
- Unmatched outputs: For each system we report how many candidates could not be mapped to any registry asset, and how they are scored. Unmatched candidates are treated conservatively and do not receive credit as true positives unless they can be reliably mapped into the benchmark universe.
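The sketch below illustrates the mapping step and the conservative handling of unmatched candidates; the normalization rule is a plausible stand-in for the benchmark's "standardized name normalization", not its actual code.

```python
import re

def normalize(name: str) -> str:
    """Standardized name normalization (sketch): lowercase and strip
    separators so 'BMS 986165' and 'bms-986165' compare equal."""
    return re.sub(r"[\s\-_./]+", "", name.lower())

def map_to_registry(candidates: list[str],
                    alias_to_id: dict[str, str]) -> tuple[list[str], int]:
    """Map raw system outputs to shared registry asset_ids using one
    system-agnostic alias table. Returns (mapped_ids, unmatched_count);
    unmatched candidates receive no true-positive credit."""
    mapped, unmatched = [], 0
    for raw_name in candidates:
        asset_id = alias_to_id.get(normalize(raw_name))
        if asset_id is None:
            unmatched += 1        # reported per system, scored conservatively
        else:
            mapped.append(asset_id)
    return mapped, unmatched
```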
4.4 Prompt Templates and Output Schemas
CASBench Prompt Template
[SYSTEM]
You are a biotech competitive intelligence analyst. Your task is to identify drug
assets that satisfy a structured sourcing thesis query.
[INPUT]
Query JSON:
{QUERY_JSON}
Rules:
- Use the as_of_date in the Query JSON as a hard cutoff.
- Return assets that satisfy ALL constraints (target, modality, stage
window, indication, and exclusions).
- Do not guess. Only include an asset if you can provide dated evidence for each
constraint dimension that is explicitly present in the Query JSON (e.g.,
primary_target, modality, stage_as_of_date, indication if specified, and any geography
or sponsor/disclosure constraints if specified).
- Exclusions: exclude an asset only if you find evidence it violates an exclusion. If
you cannot find any evidence of a violation, do not invent one.
- Prefer canonical asset identity (INN/USAN if known) and include common code names in
the asset_name field when helpful for disambiguation.
- If the asset identity is ambiguous (cannot be reliably distinguished from another
program), exclude it.
[OUTPUT]
Return a JSON array of objects with the schema below.
Schema:
{
"asset_name": ".",
"sponsor": ".",
"modality": ".",
"primary_target": ".",
"indications": ["."],
"stage_as_of_date": ".",
"evidence": [
{
"source_type": "patent | company_website | registry | publication | poster |
thesis | tto | grant | other",
"source_id": ".",
"date": "YYYY-MM-DD",
"supporting_span": "short quoted span or precise section reference"
}
]
}
CASBench Scoring Template
CASBench scoring template (query-specific rubric)
For each returned candidate asset A and query Q:
1) Let D(Q) be the set of constraint dimensions explicitly present in Q's structured
constraint specification. Examples of dimensions include target, modality, stage
window, indication, sponsor/disclosure geography, or other structured constraints
included in the query.
2) For each dimension d in D(Q), compute:
match_d = 1 if A satisfies Q's constraint for dimension d else 0
3) If Q specifies exclusions, compute:
exclusion_compliance = 1 if A does not violate any Q.exclusions else 0
If Q specifies no exclusions, exclusion_compliance is omitted from scoring.
4) Define the constraint score using only the active dimensions for that query:
constraint_score = (sum of match_d over d in D(Q), plus exclusion_compliance
when applicable) / (number of scored sub-components)
This normalization keeps constraint_score in [0,1] for every query.
Ranking for R@P≥0.95:
- Rank candidates by constraint_score descending.
- Break ties deterministically (canonical asset_id ascending).
- Scan the ranked list and select the longest prefix with precision >= 0.95.
- Report recall on that prefix: TP_prefix / |gold_assets(Q)|.
5. Worked Examples
CASBench Worked Example: CAS-DEV-017 (from Development stack)
Query Specification
{
"query_id": "CAS-DEV-017",
"thesis": "NaV1.8 (SCN10A) small-molecule inhibitor assets that are
preclinical in China (PRC development control and/or PRC-first disclosure), as
of 2025-11-01",
"constraints": {
"target": ["NaV1.8", "SCN10A"],
"modality": "small_molecule",
"stage_min": "discovery",
"stage_max": "late_preclinical",
"geography": "China"
},
"exclusions": [
"biologics (peptides, antibodies, toxins)",
"non-selective sodium channel blockers where Nav1.8 is not a primary
mechanistic driver",
"purely upstream/downstream pain-pathway programs without NaV1.8 target
engagement evidence",
"non-PRC development control as of as_of_date"
],
"as_of_date": "2025-11-01",
"difficulty": "Hard"
}
Difficulty Rationale: Hard. Most qualifying assets are disclosed first in PRC-linked sources (Chinese-language patents, local industry writeups, PRC corporate pages, or PRC academic literature). The main adjudication difficulty is mechanistic scope control (NaV1.8-selective vs pan-NaV blockers, and NaV1.8-primary vs "NaV1.7/1.8 dual" positioning), plus identity normalization across sparse "series-level" disclosures (patent series with no stable code name).
Gold List (17 assets; full list shown)
| Drug / code | Sponsor | Stage | Earliest disclosure | Earliest source type | Dual-target | Primary Source |
|---|---|---|---|---|---|---|
| Anrun Phenyl-Modified Nav1.8 Analog | Anrun Pharmatech | Late discovery | 2024-04-23 | Patent database | No | CN118388466A |
| CTTQ Tricyclic Heteroaryl Nav1.8 Inhibitor Series | Chia Tai Tianqing Pharmaceutical (正大天晴) | Patent-only | 2024-03-15 (EST) | Patent | No | WO2024063008A1 |
| Haisco Tetrahydrofuran-Derived Nav1.8 Blocker Series | Haisco Pharmaceutical Group (海思科医药集团) | Patent-only | 2023-01-13 | Patent | No | WO2024188367A1 |
| Hengrui Amidine-Modified VX-548 Analog | Hengrui Pharmaceuticals | Patent-only | 2024-02-01 | Patent database / financial news | Yes (Nav1.7) | AdisInsight 800057997 |
| Huilun/Easton Structural-Modified Nav1.8 Inhibitor | Huilun (汇伦医药) + Chengdu Easton Biopharma | Patent-only | 2022-08-28 | Patent | No | WO2024046253A1 |
| Humanwell Non-Opioid Nav1.8 Pain Blocker Series | Humanwell Healthcare (人福医药) | Patent-only | 2022-12-23 | Patent | No | WO2024128919A1 |
| Rejin Heterocyclic Nav1.8 Pain Inhibitor Series | Jiangsu Rejin Pharmaceutical (江苏热景制药) | Patent-only | 2022-11-23 | Patent database | No | WO2024103135A1 |
| Deheng Nav1.8 Blocker for Pain/Cough/Pruritus | Nanjing Deheng Pharmaceutical (南京德恒制药) | Patent-only | 2024-01-04 | Patent | No | WO2025113633A1 |
| SIMM SCN10A (Nav1.8) Blocker Series | Shanghai Institute of Materia Medica (SIMM) | Patent-only | 2023-11-21 | Patent | No | WO2023207949A1 |
| Wennai SCN10A (Nav1.8) Sodium Channel Blocker | Shanghai Wennai Pharma Tech (上海文耐) | Patent-only | 2023-08-11 | Patent | No | WO2025036819 |
| Guoyuan Tetrazole-Linked Nav1.8 Inhibitor | Yancheng Guoyuan New Materials Co. | Patent-only | 2024-04-18 | Patent database | No | CN118440065A |
| Hyacinth Deuterated VX-548 Series | Zhejiang Hyacinth Pharmaceutical (浙江惠仁) | Patent-only | 2022-09-20 | Patent | No | WO2024063008A1 |
| CSPC Ouyi Pyrazole-Based Nav1.8 Program | CSPC Ouyi Pharmaceutical (石药欧意) | Pre-clinical | 2024-01-19 | Industry report | No | CN117424505A |
| NMU QLS-81 Nav1.7/1.8 Dual Inhibitor | Nanjing Medical University (南京医科大学) | Pre-clinical | 2021-06-08 | Publication | Yes (Nav1.7) | PMC8285378 |
| WCH Adamantane-Sulfonamide Nav1.8 Inhibitor (Compound 6) | West China Hospital / Sichuan University | Preclinical / Lead optimization | 2024-06-01 (EST) | Publication | No | DOI: 10.1016/j.bmcl.2024.129862 |
| SIMM Nicotinamide Scaffold Nav1.8 Inhibitor (2c) | Shanghai Institute of Materia Medica (SIMM) | Publication | 2023-04-14 | Publication | No | PubMed: 37084597 |
| Yangguang Dual Nav1.7/1.8 Pain Inhibitor Series | Yangguang Anjin | Patent-only | 2025-01-01 | Company ecosystem / financial news | Yes (Nav1.7) | ITJuzi Company Profile |
Evidence Artifacts (CAS-NAV18-004 example)
{
"asset_id": "CAS-NAV18-004",
"drug_name": "Hengrui Amidine-Modified VX-548 Analog",
"evidence_artifacts": [
{
"artifact_id": "EV-NAV18-004-001",
"source_type": "patent",
"source_id": "WO2024041613A1 (family; PRC-origin series)",
"language": "Chinese/English",
"date": "2024-02-01",
"claims_supported": ["target", "modality", "mechanism/selectivity",
"geography"],
"supporting_span": "series described as NaV1.8 inhibitors with pain-use
claims; includes selectivity panel"
},
{
"artifact_id": "EV-NAV18-004-002",
"source_type": "other",
"source_id": "AdisInsight drug profile (ID: 800057997)",
"language": "English",
"date": "2024-02-15",
"claims_supported": ["sponsor", "stage", "indication"],
"supporting_span": "lists sponsor (Hengrui), modality (small molecule),
target (NaV1.8), and development stage (preclinical)"
}
]
}
Scoring Walkthrough
System output for CAS-NAV18-004:
{
"asset_name": "Hengrui Amidine-Modified VX-548 Analog",
"sponsor": "Hengrui Pharmaceuticals",
"modality": "Small-molecule NaV1.8 inhibitor",
"primary_target": "NaV1.8 (SCN10A) [Nav1.7 (SCN9A) secondary noted in
selectivity panel]",
"Geography": ["China"],
"stage_as_of_date": "Preclinical",
"evidence": [
{
"source_type": "patent",
"source_id": "WO2024041613A1",
"date": "2024-02-01",
"supporting_span": "series described as NaV1.8 inhibitors with pain-use
claims; includes selectivity panel"
},
{
"source_type": "company",
"source_id": "Company Website",
"date": "2024-02-15",
"supporting_span": "Hengrui + NaV1.8 small molecule; stage listed as
preclinical"
}
]
}
Constraint satisfaction scoring:
| Constraint | Score | Rationale |
|---|---|---|
| target_match | 1 | Primary target includes NaV1.8 (SCN10A). |
| modality_match | 1 | Small molecule stated and supported. |
| geography_match | 1 | China-linked sponsor; CN-origin patent family supports the China constraint. |
| stage_match | 1 | "Preclinical" falls within discovery–late_preclinical window. |
| exclusion_compliance | 1 | Not a biologic; not a non-selective blocker where NaV1.8 is absent; dual Nav1.7/1.8 is allowed if NaV1.8 is primary. |
| constraint_score | 1.0 | Mean of the five scored sub-components. |
Evidence quality scoring (CASBench-Source):
| Required Claim | Citation Provided | Citation Valid | Score |
|---|---|---|---|
| Target | WO2024041613A1 | Yes (patent describes voltage-gated sodium channel inhibitors and explicitly includes NaV1.8 / SCN10A as a primary/preferred target). | 1 |
| Modality | WO2024041613A1 | Yes (patent claims small-molecule compounds and related salts/compositions; not a biologic or fusion construct). | 1 |
| Stage | Company Report | Yes ("Preclinical" falls within the discovery–late_preclinical window per the rubric). | 1 |
| Geography | WO2024041613A1 | Yes (PCT application filed via the CN route and assignees are China-based entities, supporting China geography). | 1 |
| Exclusion | WO2024041613A1 | Yes (exclusion compliance holds because the invention is not a biologic and NaV1.8 is explicitly included as a primary/preferred target; dual NaV1.7/1.8 still complies under the benchmark's rule so long as NaV1.8 is primary). | 1 |
| CASBench-Source for this asset | 5/5 = 1.00 | | |
Aggregate Query Results
| System | True Positives | False Positives | Gold Assets | Precision | Recall | R@P≥0.95 |
|---|---|---|---|---|---|---|
| Convexia | 15 | 1 | 17 | 0.94 | 0.88 | 0.82 |
| Commercial Database A | 7 | 2 | 17 | 0.78 | 0.41 | 0.35 |
| GPT-5.2 Agent | 5 | 6 | 17 | 0.45 | 0.29 | 0.12 |
Recall by Earliest-Disclosure Source Type (this query)
| Source type (earliest disclosure) | N Gold | Convexia Recall | Best Competitor Recall |
|---|---|---|---|
| Patents / patent databases | 12 | 83% | 33% |
| Academic publications (incl. PubMed) | 3 | 100% | 67% |
| Industry reports | 1 | 100% | 0% |
| Financial news / company ecosystem pages | 1 | 0% | 0% |
6. Limitations and Threats to Validity
No benchmark perfectly captures real-world deployment. The following limitations should be considered when interpreting CASBench results.
1. Gold Set Incompleteness in Preclinical and Long-tail Sources
The gold labels are most likely to miss early, fragmented disclosures (posters, TTO pages, grants, local-language press). This can undercount true recall and distort where "alpha" appears.
2. Temporal Leakage and Drifting Corpora
Sources update over time. If any gold items or evidence links reflect post-hoc updates, the benchmark inflates performance, especially for systems tied to frequently refreshed aggregators. Results are not inherently prospective unless time-sliced.
3. Precision Operating Point May Not Match Analyst Workflows
R@P≥0.95 reflects a strict "high-precision prefix" regime. In practice, teams often choose a different tradeoff and then filter with review, which can materially change relative rankings.
4. Baseline Comparability is Imperfect Across Product Types
Databases, LLM agents, and Convexia differ in UX assumptions, filtering defaults, and citation behaviors. Head-to-head metrics may not reflect each system's best-practice usage.