Analyzing a Genetic Variant of “Unknown Significance”: a case study of Science-ing in the agentic era [Part 1]

I’m a genetic testing geek. My future husband and I compared genomes before our first date (23andMe). Later I had my full gene workup done by Nebula Genomics. When I needed standard embryo genetic testing, I sprang for the full genome variety. And as information has come in about the disease history of family members, I’ve reexamined my genetics to look for any clues. For this, I did some genetic counseling and a specialized panel with Invitae.

As part of this I discovered a “variant of unknown significance” (VUS) in a gene that predisposes me to a particular disease. I’ll stay vague here for privacy, but note that it’s not BRCA1 (which might be the default supposition). This variant is not present in population databases – meaning it hasn’t really been seen before. The report from Invitae states “Advanced modeling of protein sequence and biophysical properties (such as structural, functional, and spatial information, amino acid conservation, physicochemical variation, residue mobility, and thermodynamic stability) [indicates] that this variant is not expected to disrupt protein function.” In summary – it’s unknown, but Invitae’s best guess is it’s not a big deal.

I like to geek out about this, I now have a reliable “graduate student” (Claude Code) to collaborate with, and I wanted more experience with protein folding and DNA prediction models. So we decided to see what the science said outside of Invitae’s proprietary modeling. I’m currently 25% of the way into my 1M context window on this project and it’s already been a great one!

First the sanity check – is this result for real?

I had Claude Code cross-reference my full genome from Nebula with the specialized Invitae panel. No need to go down the analysis rabbit hole before confirming the variant is really there. I used the prompt:

I’d like you to verify that this variant of unknown significance also appears as such in my full genome as a cross reference.

This came back with:

Confirmed. The variant is present in your Nebula whole-genome VCF. The Nebula WGS independently confirms the same VUS reported by Invitae. This is a solid orthogonal cross-reference: two independent sequencing platforms (Invitae panel + BGI/Nebula WGS via MegaBOLT) agree on the call with balanced allele depth.

Out of curiosity, I asked Claude Code how it ran the query. It located the genome files, installed the query tools (htslib and bcftools), verified the reference build to confirm the coordinates matched between the two files, and then queried the variant site with a small window around the exact position. Finally, it helped interpret the record, including the quality metrics, about which Claude said “Every quality metric is clean, so the call is high-confidence and matches the Invitae finding exactly.”
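
To make the “balanced allele depth” part concrete, here is a minimal sketch of the interpretation step in Python. It is not the query Claude ran (that used bcftools, and the exact command isn't shown here). The VCF record is invented for illustration, and the 0.3 to 0.7 band for a "balanced" heterozygous call is my own rough assumption, not a standard threshold.

```python
def parse_vcf_sample(line: str) -> dict:
    """Summarize one single-sample VCF data line (tab-separated)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, qual, flt = fields[:7]
    # FORMAT (column 9) names the per-sample keys; column 10 holds the values.
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    ref_depth, alt_depth = (int(x) for x in sample["AD"].split(",")[:2])
    total = ref_depth + alt_depth
    return {
        "site": f"{chrom}:{pos} {ref}>{alt}",
        "filter": flt,
        "qual": None if qual == "." else float(qual),
        "genotype": sample["GT"],
        "ref_depth": ref_depth,
        "alt_depth": alt_depth,
        "alt_fraction": alt_depth / total if total else float("nan"),
    }


def looks_like_clean_het(summary: dict, lo: float = 0.3, hi: float = 0.7) -> bool:
    """Heuristic: PASS filter, heterozygous genotype, roughly balanced reads.

    The lo/hi band is an illustrative assumption, not a published cutoff.
    """
    is_het = summary["genotype"] in {"0/1", "0|1", "1|0"}
    return summary["filter"] == "PASS" and is_het and lo <= summary["alt_fraction"] <= hi


# Hypothetical record, NOT real data from my genome.
EXAMPLE_LINE = "chr17\t43106487\t.\tA\tC\t512.6\tPASS\t.\tGT:AD:DP:GQ\t0/1:15,14:29:99"

summary = parse_vcf_sample(EXAMPLE_LINE)
print(summary["site"], summary["genotype"], round(summary["alt_fraction"], 2))
print("clean heterozygous call:", looks_like_clean_het(summary))
```

A heterozygous variant should show up in roughly half the reads, so an alt fraction near 0.5 with a PASS filter is what "clean" looks like in practice.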

Getting started with the analysis

I started out with the friendly prompt:

hello! I’d like to use arc institutes Evo model for a small project with claude code, this folder contains a variety of genetic data from myself from nebula genomics (full genome screen), 23and me, as well as a specialized report in which I have a variant of unknown significance. I’m particularly interested in something related to that, what would you suggest

Then we kicked it off, initially with Evo 2. Claude decided to score my variant against verified pathogenic nonsense variants on the same gene by fetching the 512 base pairs before the variant and computing the likelihood of my variant given that context; it did the same for the clearly deleterious nonsense variants. It gave me the option of running on Google Colab’s GPU or a local GPU, but in another session I had figured out that NVIDIA hosts an Evo 2 model for public use, so I got a free API key for that.
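
The “512 base pairs before the variant” step is mostly an exercise in not getting the coordinates off by one. A small sketch of that windowing, assuming a 1-based position on the forward strand of a reference sequence held in a plain string (the function name and the toy sequence are mine, for illustration only):

```python
from typing import Tuple


def left_context(reference: str, pos: int, window: int = 512) -> Tuple[str, str]:
    """Return (context, ref_base) for a 1-based position on the forward strand.

    `context` is up to `window` bases immediately before the variant;
    `ref_base` is the reference base at the variant position itself.
    """
    if not 1 <= pos <= len(reference):
        raise ValueError("position outside the reference sequence")
    idx = pos - 1                      # convert 1-based VCF-style position to 0-based
    start = max(0, idx - window)       # clip at the start of the sequence
    return reference[start:idx], reference[idx]


# Toy sequence, not real genomic data.
toy_reference = "ACGTACGTAC"
context, ref_base = left_context(toy_reference, pos=5, window=3)
print(context, ref_base)
```

For a gene on the minus strand, "before the variant" in transcript orientation would mean reverse-complementing the other side of the site; this sketch deliberately ignores that and stays on the forward strand.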

Thankfully, and not surprisingly, my variant came out more “benign” (that is, less surprising to the Evo 2 model) than the pathogenic variants. I asked Claude to write things up in the style of an academic paper.

Human Review Round 1

But we didn’t stop there. I started to act in more of a peer-review capacity. First I had some back and forth on the plots we had, asking for additional labels and explanations. Then I asked Claude to include every known pathogenic or conflicting mutation on the gene related to my disease of interest, along with an explanation of how each was classified that way. The results in the end made sense to me.

Agentic Review Round 1

I sent this to a friend and collaborator who himself does a lot of agentic development; whereas I have one OpenClaw, one Codex and one Claude Code Max running, he has multiple of each. I mentioned I was particularly interested in whether we had the right methodology for computing the probability. So it was no surprise that, after we chatted, he offered to have his review agent (because who doesn’t have one of those!) review my paper. He sent me the review as a markdown file.

This review brought up a few points which I thought were very fair:

1. Our scoring departs from the canonical Evo 2 variant-effect scoring: we scored P(alt_base | 512 bp left context) at the variant position only, while the published BRCA1 zero-shot protocol computes a pseudo-log-likelihood across the entire sequence.

2. My benign pool for comparison was too small and should focus on mutations of the same type.

3. I should use other models besides Evo 2, especially ones that have simple interfaces.
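
Point 1 is easier to see with a toy example. The sketch below contrasts the two scoring styles using a stand-in "model" that I made up (it simply favours repeating the previous base); it is not Evo 2 and does not use the Evo 2 API. The thing to notice is that the single-position score only sees the variant base, while the full-sequence score also picks up how the change affects the likelihood of everything downstream.

```python
import math
from typing import Callable, Dict

LogProbFn = Callable[[str], Dict[str, float]]


def toy_logprobs(context: str) -> Dict[str, float]:
    """Hypothetical stand-in for a DNA language model: next-base log-probs."""
    if not context:
        probs = {b: 0.25 for b in "ACGT"}
    else:
        prev = context[-1]
        probs = {b: (0.55 if b == prev else 0.15) for b in "ACGT"}
    return {b: math.log(p) for b, p in probs.items()}


def single_position_score(model: LogProbFn, seq: str, idx: int, alt: str,
                          window: int = 512) -> float:
    """log P(alt | left context) - log P(ref | left context) at one position."""
    context = seq[max(0, idx - window):idx]
    logp = model(context)
    return logp[alt] - logp[seq[idx]]


def sequence_loglik(model: LogProbFn, seq: str) -> float:
    """Sum of log P(base_i | bases before i) over the whole sequence."""
    return sum(model(seq[:i])[seq[i]] for i in range(len(seq)))


def full_sequence_delta(model: LogProbFn, seq: str, idx: int, alt: str) -> float:
    """Log-likelihood of the mutated sequence minus that of the reference."""
    mutated = seq[:idx] + alt + seq[idx + 1:]
    return sequence_loglik(model, mutated) - sequence_loglik(model, seq)


ref_seq = "AAAAAA"  # toy sequence, 0-based variant index 3, A>C
single = single_position_score(toy_logprobs, ref_seq, idx=3, alt="C")
full = full_sequence_delta(toy_logprobs, ref_seq, idx=3, alt="C")
print(f"single-position: {single:.3f}  full-sequence: {full:.3f}")
```

In this toy case the full-sequence delta is exactly twice the single-position score, because the variant also makes the next base less likely. With a real long-context model the downstream contribution can be spread over many positions, which is presumably why the reviewer flagged the difference.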

I gave the markdown file to my Claude Code graduate student and had it develop a plan to address these issues.

Comparing to other models outside of Evo 2

Claude quickly came up with a command to do so, and I had it explain where it got the information. There’s an API, it turns out! The rest.ensembl.org API is provided by Ensembl, a joint project based at the Wellcome Genome Campus in Hinxton, UK. They very helpfully provide:

  • Ensembl genome browser & annotation: stable gene/transcript IDs (e.g. ENST00000357654), canonical/MANE transcripts, regulatory features.
  • VEP (Variant Effect Predictor): tool behind /vep/human/region/…. It consequence-annotates variants and plugs in third-party scores.
  • REST API: public, rate-limited HTTP interface you’re calling (max ~15 req/sec per IP; no key required for light use).
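
For anyone who would rather call the endpoint from Python than from the shell, here is a minimal sketch using only the standard library. The helper names are mine, and the option names passed through (such as AlphaMissense=1) are simply whatever query parameters you would otherwise put in the URL by hand; nothing here is an official Ensembl client.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def vep_region_url(chrom: str, pos: int, alt: str, **options) -> str:
    """Build a /vep/human/region URL for a single-base substitution."""
    region = f"{chrom}:{pos}-{pos}"
    params = {"content-type": "application/json", **options}
    # Keep "/" and "," readable so the URL matches what you'd type by hand.
    query = urllib.parse.urlencode(params, safe="/,")
    return f"{ENSEMBL_REST}/vep/human/region/{region}/{alt}?{query}"


def fetch_vep(url: str, timeout: float = 30.0):
    """Perform the HTTP GET and decode the JSON body (network access needed)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def pick_transcript(vep_response, transcript_id: str):
    """Return the consequence block for one transcript, or None if absent."""
    for record in vep_response:
        for consequence in record.get("transcript_consequences", []):
            if consequence.get("transcript_id") == transcript_id:
                return consequence
    return None


url = vep_region_url("17", 43106487, "C", AlphaMissense=1)
print(url)
# To actually query (respecting the rate limit):
#   consequence = pick_transcript(fetch_vep(url), "ENST00000357654")
```

Filtering to a single transcript matters because VEP returns one consequence block per overlapping transcript, and the scores you care about are usually the ones on the canonical/MANE transcript.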

They also integrate with third-party predictors, and the REST calls can surface their scores. Some I used:

  • SIFT: originally JCVI/Ng & Henikoff
  • PolyPhen-2: Harvard/Sunyaev lab
  • AlphaMissense: Google DeepMind
  • CADD: University of Washington / Kircher & Shendure labs
  • REVEL, MetaRNN, PrimateAI, ESM1b, BayesDel: various academic groups, aggregated via dbNSFP (Liu lab, Baylor/UTHealth)

Here’s an example query and result with a BRCA1 mutation [note: not my gene or mutation of interest, for anonymity!]. This corresponds to BRCA1 p.Cys61Gly (c.181T>G), a classic RING-domain missense variant. The GRCh38 coordinate is 17:43106487 A>C (BRCA1 is on the minus strand, so the coding-strand T>G reads as A>C on the forward strand), and the MANE transcript is ENST00000357654:

curl -s 'https://rest.ensembl.org/vep/human/region/17:43106487-43106487/C?content-type=application/json&CADD=1&AlphaMissense=1&dbNSFP=REVEL_score,REVEL_rankscore,CADD_phred,MetaRNN_score,MetaRNN_pred,PrimateAI_score,ESM1b_score,BayesDel_noAF_score' \
  | jq '.[0].transcript_consequences[] | select(.transcript_id=="ENST00000357654") | {
      gene: .gene_symbol,
      transcript: .transcript_id,
      cds_pos: .cds_start,
      codons,
      aa_change: .amino_acids,
      protein_pos: .protein_start,
      consequence: .consequence_terms,
      AlphaMissense: .alphamissense,
      CADD_phred: .cadd_phred,
      CADD_raw: .cadd_raw,
      PolyPhen2: {score: .polyphen_score, call: .polyphen_prediction},
      SIFT: {score: .sift_score, call: .sift_prediction},
      PrimateAI: .primateai_score,
      MetaRNN_score: .metarnn_score,
      MetaRNN_pred: .metarnn_pred,
      REVEL_score: .revel_score,
      REVEL_rankscore: .revel_rankscore,
      ESM1b_score: .esm1b_score,
      BayesDel_noAF: .bayesdel_noaf_score
    }'
{
  "gene": "BRCA1",
  "transcript": "ENST00000357654",
  "cds_pos": 181,
  "codons": "Tgt/Ggt",
  "aa_change": "C/G",
  "protein_pos": 61,
  "consequence": [
    "missense_variant"
  ],
  "AlphaMissense": {
    "am_pathogenicity": 0.9904,
    "am_class": "likely_pathogenic"
  },
  "CADD_phred": "invalid_field",
  "CADD_raw": 4.174515,
  "PolyPhen2": {
    "score": 0.027,
    "call": "benign"
  },
  "SIFT": {
    "score": 0,
    "call": "deleterious"
  },
  "PrimateAI": 0.648884236813,
  "MetaRNN_score": "0.99176705,0.99176705,.,0.99176705,0.99176705,0.99176705,0.99176705,.,0.99176705,0.99176705,0.99176705,.,.,.,0.99176705,0.99176705,.,.,0.99176705,0.99176705",
  "MetaRNN_pred": "D,D,.,D,D,D,D,.,D,D,D,.,.,.,D,D,.,.,D,D",
  "REVEL_score": "0.948,0.948,.,0.948,0.948,0.948,0.948,.,.,0.948,0.948,.,.,.,.,0.948,.,.,0.948,0.948",
  "REVEL_rankscore": 0.98989,
  "ESM1b_score": "-16.379795,-14.650078,-14.3566675,-16.26193,-12.273527,-14.650078,-14.650078,-10.750148,-14.281068,-14.6825905,-14.650078,-14.650078,-14.650078,-14.650078,-12.273527,-15.487761,-14.650078,-15.99081,-10.750148,-16.31102",
  "BayesDel_noAF": 0.565018
}

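Pretty cool that this is all easily available! (The "invalid_field" under CADD_phred means that particular field request was rejected; the raw CADD score still comes through.)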
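
Several of the dbNSFP-derived fields above come back as one long comma-separated string, with a value per transcript/isoform and "." where there is no score. A small sketch for collapsing those into something readable is below. The parsing convention (commas as separators, "." as missing) is inferred from the output above, and the summary fields are my own choice; which entry corresponds to the MANE transcript can't be recovered from the string alone.

```python
from typing import List, Optional, Union


def parse_multi_score(raw: Union[str, int, float]) -> List[Optional[float]]:
    """Split a per-transcript score string into floats; '.' becomes None."""
    if isinstance(raw, (int, float)):
        return [float(raw)]
    tokens = [tok.strip() for tok in str(raw).split(",")]
    return [None if tok in (".", "") else float(tok) for tok in tokens]


def summarize(raw: Union[str, int, float]) -> Optional[dict]:
    """Report how many scores are present and their range."""
    values = [v for v in parse_multi_score(raw) if v is not None]
    if not values:
        return None
    return {
        "n_scored": len(values),
        "min": min(values),
        "max": max(values),
        "distinct": sorted(set(values)),
    }


# Shortened strings in the same shape as the REVEL and ESM1b fields above.
print(summarize("0.948,0.948,.,0.948"))
print(summarize("-16.379795,-14.650078,.,-10.750148"))
```

For REVEL and MetaRNN every scored isoform has the same value, so the summary collapses to a single number. ESM1b genuinely differs per isoform, so there the range is the honest thing to report.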

After querying for my gene and mutation of interest, Claude Code integrated this information into the paper. Of interest was that not all models agreed [AKA still a variant of uncertain significance, Evo 2 and Invitae concordance notwithstanding]. Of course I’d prefer it if all the models were like “no big deal u got this”. But no matter what they all say, nature has the final say, and staying on top of the science as it evolves will hopefully result in clearer and more accurate predictions over time.

Rerunning the analysis on my “own” GPU

I asked if we could set up Evo 2 on my M3 Max MacBook Pro (answer: no), so we went the AWS route to rent a GPU. I set up an AWS instance and a permissions file (and budget) for my Claude Code to connect to. Claude Code helped me do this, but for this part I took things out of auto mode and chose to approve each command one by one. I don’t want Claude getting any ideas here… we definitely made sure to set a budget with alerts.

Evo 2 needs CUDA compute capability ≥ 8.0 (Ampere or newer) plus enough VRAM for the 40B model in FP16/BF16. As of April 2026 our research turned up:

| Evo 2 size | Min VRAM (FP16) | Viable AWS instance | On-demand $/hr (us-east-1) |
| --- | --- | --- | --- |
| 40B (the real thing) | ~80 GB, realistically 160 GB+ with activations | p5.48xlarge (8× H100 80 GB) or p4de.24xlarge (8× A100 80 GB) | ~$98 / ~$40 |
| 40B with tensor parallelism across 2 GPUs | 2× 80 GB | p4d.24xlarge (8× A100 40 GB) might work; p4de is safer | ~$33 / ~$40 |
| 7B (cheaper, still Evo 2) | ~16 GB | g6e.xlarge (1× L40S 48 GB), g5.2xlarge (1× A10G 24 GB), p3.2xlarge (1× V100 16 GB; V100 is compute capability 7.0, may fail) | ~$2 / ~$1.2 / ~$3 |
| 1B / base | ~3 GB | g5.xlarge (1× A10G 24 GB) | ~$1 |
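
The "Min VRAM" column is mostly back-of-the-envelope arithmetic: FP16 and BF16 store each parameter in 2 bytes, so the weights alone take about 2 bytes times the parameter count. A quick sketch of that calculation follows. The 1.2× overhead factor in `fits` is a placeholder I made up to stand in for activations and runtime buffers, not a measured number.

```python
BYTES_PER_PARAM_FP16 = 2  # FP16 and BF16 both use 2 bytes per parameter


def weights_gb(n_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


def fits(n_params: float, gpu_gb: float, n_gpus: int = 1, overhead: float = 1.2) -> bool:
    """Rough feasibility check; `overhead` is an assumed safety factor."""
    return weights_gb(n_params) * overhead <= gpu_gb * n_gpus


for name, n_params in [("40B", 40e9), ("7B", 7e9), ("1B", 1e9)]:
    print(f"Evo 2 {name}: ~{weights_gb(n_params):.0f} GB of weights in FP16/BF16")

print("40B on one 80 GB GPU:", fits(40e9, 80))
print("40B across two 80 GB GPUs:", fits(40e9, 80, n_gpus=2))
print("7B on one 24 GB A10G:", fits(7e9, 24))
```

This lines up with the table: 40B is about 80 GB before any activations, which is why a single 80 GB card is not realistically enough, while 7B at about 14 GB fits comfortably on a 24 GB or 48 GB card.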


To use these, I had to request some GPU instances. It turns out this requires a manual review. So while we waited for approval we honed the statistical approach. Stay tuned for part 2…