In Part 1, “Analyzing a Genetic Variant of Unknown Significance: a case study of science-ing in the agentic era,” I went over my analysis of my personal genetic variant, using Claude Code and Evo 2 (the NVIDIA-hosted version), and compared the results with various variant pathogenicity predictors.
Part 1 technique recap: scoring based on a short window before the variant
My initial use of Evo 2 in part 1 was simple: I just input the previous 512 base pairs as context, in part because I was limited by what NVIDIA’s hosted Evo 2 API supported. Although limited in its hosted capabilities, it’s awesome that NVIDIA supports this, and you can access it with a free API key (after registering), getting access to the model without any extra setup.
An example call is as simple as:
```shell
curl -sS -X POST \
  https://health.api.nvidia.com/v1/biology/arc/evo2-40b/generate \
  -H "Authorization: Bearer ${NVIDIA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "sequence": "ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT",
    "num_tokens": 1,
    "top_k": 4,
    "enable_logits": true
  }'
```
Go ahead, try it yourself! You’ll see, not surprisingly (since even a human can spot that A comes next), that the logit corresponding to A is the highest. To interpret this, I needed to postprocess the logits vector: convert to log-probabilities via log-softmax, then compute the variant effect delta. The output was a delta between the mutant and reference log-probabilities, where a negative number means the model assigns lower likelihood to the mutant sequence than to the reference, suggesting the substitution violates evolutionary constraints.
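The postprocessing is only a few lines. A minimal sketch in plain Python (the logits values here are made up for illustration, and I'm assuming a 4-entry vector ordered A, C, G, T — check the API response for the actual ordering):

```python
import math

def log_softmax(logits):
    # Subtract the max for numerical stability, then normalize in log space.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Hypothetical logits for the next base, ordered A, C, G, T.
logits = [8.1, 1.2, 0.9, 1.4]
logp = dict(zip("ACGT", log_softmax(logits)))

# Variant effect delta: mutant log-prob minus reference log-prob.
# Negative delta = the model finds the mutant less likely than the reference.
delta = logp["G"] - logp["A"]
```

With the toy logits above, A dominates and the A→G delta comes out strongly negative, which is the shape of signal I was looking for.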
NVIDIA’s hosted model supports only predicting the next nucleotide from a preceding sequence, so to do anything more complex I needed to run the model myself. And I wanted to do that, because the sequence after my variant carries useful scoring signal too. For that, my M3 Mac wouldn’t cut it; I needed a bona fide GPU. I decided to try driving the process with Claude Code to rent and use one from Amazon. That is the subject of this blog post. Welcome to part 2 of the agentic science-ing. Let us therefore do a science.
Part 2’s Technique: Scoring over a larger window around the variant
Referencing the technique in Arc’s BRCA1 notebook [my variant is not in BRCA1, but the technique stands], this time I used a wide 8,192-base-pair window around the genomic position of my variant, instead of just a short stretch beforehand. This way, a pathogenic variant that disrupts a downstream splice motif or a transcription factor binding site, or that simply sits in a heavily constrained stretch of DNA, gets caught in a way single-position scoring would miss. Just as in part 1, I also compared the delta for my variant of unknown significance against a variety of variants of known significance (from deleterious to benign) so I could further calibrate where my variant falls.
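In sketch form, the windowed score sums per-position log-probabilities over the whole window for the reference and mutant sequences, and the delta is their difference. Everything below is my own toy stand-in to show the bookkeeping — `score_window` would really be an Evo 2 forward pass over the window, not a list sum:

```python
WINDOW = 8192  # total window size around the variant

def window_bounds(variant_pos, seq_len, window=WINDOW):
    # Center the window on the variant, clipped to the sequence ends.
    half = window // 2
    start = max(0, variant_pos - half)
    end = min(seq_len, start + window)
    return start, end

def score_window(per_base_logp, start, end):
    # Sum of per-position log-probabilities; a real run would get these
    # from the model's logits over the window, not a precomputed list.
    return sum(per_base_logp[start:end])

def variant_delta(ref_logp, mut_logp, variant_pos):
    # Negative delta: the mutant window is less likely than the reference window.
    start, end = window_bounds(variant_pos, len(ref_logp))
    return score_window(mut_logp, start, end) - score_window(ref_logp, start, end)
```

The point of the window is visible in the structure: a mutation that degrades the log-probability of *any* position inside the window — a splice motif 2 kb downstream, say — moves the delta, not just the variant position itself.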
Spoiler alert: The results didn’t fundamentally change
My results matched part 1 in the ordering of my variant and the other variants by potential severity. Both parts agree on direction: my variant of unknown significance scores as less deleterious than the curated pathogenic-missense comparators under both. Whatever the model “thinks” about this variant doesn’t change between scoring conventions. The magnitude of the results is different, though: the single-position rule (part 1) produces deltas on the order of 1 to 10, while working with the window produces deltas on the order of 0.001 to 0.02, meaning I saw roughly three orders of magnitude of compression of the signal. I didn’t do a deep statistical analysis of how much to read into these results, but my interpretation is that there’s less raw signal in part 2’s more accurate approach.
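One plausible (unverified — this is my arithmetic guess at the mechanism, not something confirmed against the notebook) explanation for the compression: if the windowed delta is reported as a per-position average rather than a raw sum, dividing by roughly 8,192 positions alone shrinks a part-1-scale delta by over three orders of magnitude:

```python
window = 8192
single_position_delta = 5.0  # typical part 1 scale (order 1 to 10)

# Averaging the same signal over every position in the window:
per_position_delta = single_position_delta / window
# ~6e-4, approaching the 0.001-0.02 scale seen in part 2
```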
The ops: provisioning a GPU agentically, and spinning up Evo 2 on AWS infrastructure
I had to decide on a permissions philosophy, manually request a GPU quota from AWS (which was a surprise to me), and then develop and launch the analysis.
IAM philosophy: a scoped dedicated user for the coding agent
I’ve been an AWS user for many a hobbyist project over the years. And I have humbly crawled to AWS for a rebate after accidentally running a little wild with resources and racking up an unexpected $500 bill. I had no desire to repeat that process when I handed the reins to my coding agent, so I approached this part gingerly. At the same time, this full-time working mom of 4 ain’t got time to AWS by hand, so: proceed cautiously, create some guardrails, and then let my agent take the wheel.
I created a dedicated user for my coding agent (as always, my bud Claude Code). For this dedicated user I worked up an IAM (Amazon’s identity management) policy that gave it access to as little as possible, and ensured that it could not create new users, modify policies, or rotate its own credentials. It also has no access to billing and no ability to change instance type at runtime. I gave it regional isolation too, restricting it to a region where I have no other resources. Finally, and importantly, I set a budget of $200/month with alerts at 50, 80, and 100%; that can only be done with a console action under the admin login.
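For flavor, here is the shape of such a policy, written out as a Python dict. This is a simplified, hypothetical sketch — not my actual policy — and the region and action lists are illustrative only:

```python
import json

agent_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Allow basic EC2 lifecycle work, but only in one isolated region.
            "Sid": "Ec2InOneRegion",
            "Effect": "Allow",
            "Action": ["ec2:RunInstances", "ec2:TerminateInstances",
                       "ec2:StartInstances", "ec2:StopInstances",
                       "ec2:Describe*"],
            "Resource": "*",
            "Condition": {"StringEquals": {"aws:RequestedRegion": "us-west-2"}},
        },
        {
            # Explicitly deny self-escalation and billing access.
            "Sid": "NoEscalationOrBilling",
            "Effect": "Deny",
            "Action": ["iam:*", "aws-portal:*", "budgets:*"],
            "Resource": "*",
        },
    ],
}
print(json.dumps(agent_policy, indent=2))
```

The explicit Deny on `iam:*` is the load-bearing part: even if an Allow slipped in elsewhere, the agent still couldn’t mint users or rewrite its own policy.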
Now – I was coached heavily by Claude on how to set this up. So if Claude were out to get me, I think it’s possible there’d be a back door in the policy. I love a good Claude Code run in auto mode, but this part was all manual: approving each step and auditing as I went. I also had to do much of the setup in the console myself, which was another check on the process.
Quota request: aka waiting around for a few days and bugging support (where’s my flying car!)
Something I learned in this process is that AWS accounts default to zero vCPUs of GPU on-demand quota. The quota is per-region and per-instance-family: “Running On-Demand G and VT instances” covers g5, g6, and g6e (Ampere and Ada Lovelace consumer-grade GPUs); “Running On-Demand P instances” covers p4d, p4de, and p5 (Ampere and Hopper datacenter-grade GPUs). This is current as of the writing of this article.
I filed a G-family quota increase via Service Quotas, asking for 8 vCPUs (more than I needed, but I wanted to future-proof the ask). Nothing happened for a while, and then I had the sense to add a narrative reply to the support case explaining the scientific use case. It stuck around in support for more than a day despite replies from customer service, until I opened up the chat and talked with customer support in real time. They mentioned that the original request had never actually been opened! This was frustratingly manual for just a tiny bit of compute, and it delayed the analysis by a few days. But I’ll have my GPU quota ready and waiting for my next cool project, whatever it is.
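The same request can also be filed programmatically via the Service Quotas API. A sketch of the request shape (the quota code below is my best recollection of the G-and-VT on-demand quota — verify it with `list_service_quotas` before relying on it; the actual call would go through boto3):

```python
# Quota code for "Running On-Demand G and VT instances" (verify before use).
request = {
    "ServiceCode": "ec2",
    "QuotaCode": "L-DB2E81BA",
    "DesiredValue": 8.0,  # vCPUs: enough for a g6.xlarge (4 vCPUs) with headroom
}

# The actual call would be:
#   import boto3
#   boto3.client("service-quotas").request_service_quota_increase(**request)
```

Programmatic or not, approval still goes through a human-reviewed support case, so the narrative reply matters either way.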
Handing the quota over to my agent: I’m glad Claude took the wheel!
Now that I had the policy and the GPU quota, it was time to drive! Thanks to the setup above, the agent was able to start the GPU instance itself. I ran the agent locally on my Mac and had it SSH into the GPU instance through a shared SSH connection, which saved time and gave me a central dispatch location. This process showed its value as the agent cycled through the various software installs and realized it needed a g6.xlarge instance instead of the g5.xlarge it originally started with. I was very glad that after the initial setup, I didn’t have to deal with those headaches. The longer analysis runs were wrapped in a tmux session so they could survive disconnects.
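The tmux wrapping is the simple part: start the long run in a detached session over SSH, and a dropped connection can’t kill it. A sketch of the kind of command the agent issues (the hostname and script name are placeholders, not my actual setup):

```python
import shlex

host = "ubuntu@gpu-instance"  # placeholder hostname
# Detached (-d) tmux session named evo2; reattach later with: tmux attach -t evo2
remote_cmd = "tmux new-session -d -s evo2 'python score_variants.py'"

ssh_cmd = ["ssh", host, remote_cmd]
print(shlex.join(ssh_cmd))
```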
Cost: $2.10 from start to finish, not bad
| Phase | Wall time | Compute time | Cost |
|---|---|---|---|
| g5.xlarge launch + install + smoke (failed on FP8) | ~30 min | ~30 min | $0.50 |
| Terminate g5, launch g6.xlarge, reinstall | ~25 min | ~25 min | $0.34 |
| Smoke test (1B model) | ~2 min | ~2 min | $0.03 |
| 7B calibration + position-matched null | ~25 min | ~25 min | $0.34 |
| Result inspection, instance stopped (no compute billing) | ~30 min | 0 min | $0 |
| EBS volume orphan period | ~5 min | n/a | <$0.01 |
| Terminate, EBS auto-delete | n/a | n/a | n/a |
| Total | ~2 hours | ~80 min | ~$2.10 |
That’s for a one-shot run of canonical Evo 2 PLL scoring on a few dozen variants plus a position-matched null. Doing this with the 40B reference checkpoint would require P-family quota (a separate request, harder to approve) and would run on a p4de.24xlarge at ~$40/hr, roughly $160 for an equivalent-scope analysis. I was happy with this level of accuracy, but it’s good to know the path if I ever want to go up in accuracy.
A model modeling
This was a very fun project: it helped me get familiar with Evo 2, let me analyze my own genome for something potentially useful, and got me set up to code agentically on AWS infrastructure. I also closed things out by drafting a 10-page PDF report with charts and graphs of all my results, as well as interpretations. In prior eras this might have been a semester project, a bachelor’s thesis, or even a master’s thesis. For me, it was a few nights’ work after my baby went to bed. I’m loving the future of science!