Eyes Wide Open: How Karpathy’s Autoresearch Framework Could Democratize Glaucoma Research
Introduction
Glaucoma is a chronic optic neuropathy that progressively destroys retinal ganglion cells (RGCs) and leads to irreversible vision loss. It affects millions worldwide – an estimated 64.3 million people in 2013, projected to rise above 110 million by 2040 (physionet.org). Worryingly, about half of all cases remain undiagnosed until vision loss has already begun (physionet.org). Traditional glaucoma care focuses on lowering intraocular pressure (IOP) through medications or surgery, but these treatments cannot reverse damage or fully prevent blindness (pmc.ncbi.nlm.nih.gov) (physionet.org). As a result, there is an urgent need for new discoveries in areas like neuroprotection, RGC/optic nerve regeneration, and innovative gene and cell therapies. However, academic and pharmaceutical research on these frontiers remains under-resourced, partly because they are long-term, high-risk efforts. Meanwhile, advances in machine learning (ML) and artificial intelligence (AI) are enabling new approaches to data analysis and generative design.
Recent work (for example, Andrej Karpathy’s “autoresearch” project (www.theneuron.ai) (medium.com)) suggests that AI agents can autonomously run hundreds of small experiments on a single GPU based only on simple high-level instructions. In this paradigm, a human writes a short program.md describing the research goal, and an AI agent iteratively tweaks the model or hyperparameters, running 5-minute training runs, keeping successful changes, and discarding others (medium.com) (www.theneuron.ai). Overnight, this loop can perform on the order of 100 experiments, exploring architecture and parameter space without manual coding.
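Conceptually, that keep/revert loop is simple. The sketch below is a hypothetical illustration only — the function names and the toy scoring function are invented for this article, not taken from Karpathy's actual repo, and a real loop would launch train.py instead of calling a stand-in scorer:

```python
# Hypothetical sketch of an autoresearch-style keep/revert loop.
# run_trial is a stand-in for a ~5-minute training run returning a metric.
import copy
import random

def run_trial(config):
    # Toy stand-in scorer: in reality this would train a model and
    # return the validation metric (higher is better here).
    return -((config["lr"] - 0.01) ** 2) - 0.001 * config["depth"]

def autoresearch_loop(n_trials=100):
    best = {"lr": 0.1, "depth": 8}        # initial training configuration
    best_score = run_trial(best)
    for _ in range(n_trials):
        candidate = copy.deepcopy(best)
        # The agent proposes a mutation (an LLM would edit code;
        # here we simply jitter one hyperparameter).
        key = random.choice(list(candidate))
        if key == "lr":
            candidate["lr"] *= random.choice([0.5, 2.0])
        else:
            candidate["depth"] = max(1, candidate["depth"] + random.choice([-1, 1]))
        score = run_trial(candidate)
        if score > best_score:            # keep successful changes...
            best, best_score = candidate, score
        # ...otherwise revert (i.e., `best` stays unchanged)
    return best, best_score
```

With ~5-minute trials, 100 such iterations is exactly the "overnight" budget described above.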
This article explores how Karpathy’s autoresearch framework could be applied to glaucoma research by motivated patients, caregivers, citizen scientists, and open-source developers. We will survey under-explored glaucoma research areas (neuroprotection, regeneration, etc.) and identify machine-learning tasks in each domain where small-model experimentation could plausibly help. For each task we suggest specific public datasets, baseline models/architectures, evaluation metrics, and outline what the agent’s program.md instructions might look like. We then discuss practical steps for a community to set up and share such experiments, including hardware considerations, data preparation, and collaboration platforms. We examine the specific context of vision restoration therapies and whether autoresearch-style loops might speed up optimization of neural prostheses or other interventions. Finally, we address how citizen-generated hypotheses could be validated and escalated to clinicians, and lay out a concrete 90-day roadmap for launching a patient-led autoresearch initiative – including how to avoid pitfalls of “research theater” and ensure real impact. Throughout, we cite current sources on glaucoma research and AI in vision, aiming for a balanced, realistic, and accessible guide.
1. The Glaucoma Research Landscape & Unmet Needs
Glaucoma research spans multiple fronts – from understanding disease mechanisms to developing new therapies for neuroprotection and vision restoration. Many promising areas are under-resourced:
- Neuroprotection: Interventions that protect RGCs from dying (independent of IOP). Examples include neurotrophic factors and metabolic support. For instance, implants releasing ciliary neurotrophic factor (CNTF) have shown potential in early trials (pmc.ncbi.nlm.nih.gov), and other molecules like nerve growth factor and citicoline are being investigated (pmc.ncbi.nlm.nih.gov) (pmc.ncbi.nlm.nih.gov). However, these are not yet standard care, and more work is needed to translate them to patients. A 2025 review warns that neuroprotective glaucoma therapies are a “future treatment” needing further trials (pmc.ncbi.nlm.nih.gov), reflecting an unmet need.
- RGC Regeneration & Optic Nerve Regeneration: Once RGCs and their axons die, current medicine has no way to reverse that. Some animal studies use gene therapies to reprogram RGCs or stimulate regrowth. For example, CRISPR-based repression of PTEN (a negative growth regulator) has promoted axon regrowth in rat neural cells (pmc.ncbi.nlm.nih.gov), and experiments co-deleting PTEN and SOCS3 drove sustained optic nerve regeneration in mice (pmc.ncbi.nlm.nih.gov). However, these breakthroughs remain in lab models. The underlying biology – e.g. how to recapitulate retinal development or bypass growth inhibitors – is complex. There is a huge demand for modalities (small molecules, genes, biomaterials) that could stimulate RGC survival or axon regrowth, but progress to human trials is slow.
- Gene and Cell Therapies: New technologies like CRISPR, viral vectors, and stem-cell-derived RGCs hold promise for glaucoma. Strategies include gene editing to reduce IOP (e.g. targeting aqueous humor production) or modulate neurodegenerative pathways (pubmed.ncbi.nlm.nih.gov) (pmc.ncbi.nlm.nih.gov). Stem cells could (theoretically) replace lost trabecular meshwork cells or RGCs and secrete protective factors (pubmed.ncbi.nlm.nih.gov). Early work has shown that certain transcription factors (e.g. Oct4-Sox2-Klf4) can reprogram non-RGCs into RGC-like neurons in mice (restoring vision in optic nerve injury) (pmc.ncbi.nlm.nih.gov). Yet these approaches face safety and delivery challenges before reaching patients. Several recent reviews highlight gene therapy as an exciting but not-yet-clinical frontier for glaucoma (pubmed.ncbi.nlm.nih.gov) (pmc.ncbi.nlm.nih.gov). In sum, molecular and cell innovations are advancing, but resources and trial data are limited – creating an opportunity for computational exploration (e.g. designing optimal viral constructs or predicting effective gene edits).
- Electrical and Optogenetic Stimulation for Vision Restoration: For patients with advanced glaucoma (or combined diseases like retinitis pigmentosa), artificial vision prostheses or optogenetic therapies aim to bypass damaged RGCs. Retinal implants (epiretinal or subretinal electrode arrays) and cortical implants have generated artificial percepts (“phosphenes”), but resolution is low and results vary widely. A recent 2025 review on AI in visual prostheses notes that “AI algorithms show promise in optimizing prosthetic vision, particularly through enhanced image saliency extraction and stimulation strategies,” though so far most studies are simulations (pmc.ncbi.nlm.nih.gov). In other words, machine learning can help transform camera images into patterns of stimulation that are most informative given the device’s limits. Optogenetics (making surviving retinal cells light-sensitive) and transcorneal electrical stimulation (TES) pulses are also being trialed for glaucoma-related vision loss. All these areas need extensive parameter tuning (e.g. spatiotemporal patterns of stimulation, gene expression vectors) — tasks potentially suitable for autonomous ML search.
- IOP-Independent Mechanisms: Many people continue to lose vision despite well-controlled IOP. Factors like impaired ocular blood flow, neurovascular dysfunction, or metabolic stress in the optic nerve head are recognized but not fully understood. Genetic studies suggest significant “IOP-independent” components of glaucoma risk (pubmed.ncbi.nlm.nih.gov) (pubmed.ncbi.nlm.nih.gov). Biomarkers of these processes (beyond pressure) are urgently needed. Also, many glaucoma patients have “normal-tension” disease, highlighting that high IOP is not the only culprit. Research into vascular factors or other damage pathways is ongoing but fragmented. Computational modeling or mining of large datasets (e.g. genome-wide association studies) could help identify novel mechanisms or therapeutic targets in this domain.
- Biomarker Discovery via Imaging and Fields: Early detection and monitoring of glaucoma often rely on imaging (fundus photos, OCT) and functional tests (visual fields). Advanced algorithms could uncover subtle biomarkers that human clinicians miss. For example, deep learning has begun to detect pre-perimetric visual field loss (changes invisible to standard field analysis) (pmc.ncbi.nlm.nih.gov). Similarly, AI has been used to analyze OCT layer thickness profiles to predict glaucoma before overt damage. However, there are not yet widely accepted AI biomarkers used clinically for screening or risk stratification. Computational bottlenecks here include the need for large, well-labeled datasets and robust validation protocols (pmc.ncbi.nlm.nih.gov) (pmc.ncbi.nlm.nih.gov). Public challenges (REFUGE, AIROGS, etc.) have begun to standardize data, but coverage of early-stage disease is thin (pmc.ncbi.nlm.nih.gov). Further machine-driven discovery of multi-modal biomarkers (combining OCT, fields, genetics, etc.) remains an open frontier.
Where can small-model ML help? Many of the above describe high-level problems. The bottlenecks often are data scarcity, many interacting variables, and slowly moving biology. Where an autoresearch agent shines is in automating small-scale experiments on available data. For example, if there is a modest dataset of OCT scans with and without early glaucoma, a citizen scientist can set up a rapid model-testing loop to find what architecture best distinguishes them. Likewise, small transformers on genomics or literature could suggest novel gene or drug candidates. The key is focusing on narrow tasks with defined metrics (classification accuracy, AUC, loss) and iterating quickly. Areas with limited public data (e.g. TES parameters or novel gene-cocktails) might rely on synthetic data or proxies. In the next section, we map specific ML tasks in glaucoma to the autoresearch approach.
2. Mapping Autoresearch to Glaucoma Problems
Karpathy’s autoresearch framework is domain-agnostic: it can run experiments in any ML task provided by a prepare.py and train.py with a well-defined evaluation metric. We identify several concrete glaucoma-related tasks and specify how an agent could tackle each. Each use case below includes: a publicly available dataset (if possible), a starting model or architecture, an evaluation metric, and a sketch of program.md instructions.
2.1 OCT Image Analysis (Structural Detection and Segmentation)
- Task: Early Glaucoma Detection from OCT Scans. OCT imaging provides cross-sectional views of retinal layers. Thinning of the retinal nerve fiber layer (RNFL) and ganglion cell complex (GCC) can precede visual field loss. We can treat this as a classification task (glaucoma vs healthy) or regression (e.g. output RNFL thickness).
- Dataset: A recent release, SYN-OCT (www.nature.com), is a synthetic dataset of 200,000 circumpapillary OCT images (100k glaucoma, 100k normal) generated by GANs. Each image has associated RNFL thickness and segmentation masks. These are publicly available on Zenodo (www.nature.com). (Though synthetic, they are statistically validated to mimic real OCT (www.nature.com).) Alternatively, one could use the OCT-DL dataset (www.nature.com) (2064 images of various retinal diseases) or smaller clinical OCT collections.
- Model: Start with a small convolutional neural network (CNN). For classification, a model with ~3–5 convolutional layers (e.g. a truncated ResNet-18, or a custom small CNN) can work. For segmentation of RNFL/GCC, an encoder-decoder like a tiny U-Net (depth 3–4) is suitable. The initial train.py could implement a simple CNN and training loop, with default hyperparameters.
- Metric: If doing glaucoma classification on OCT, use AUC (area under the ROC curve) or accuracy on a validation split. For segmentation, use the Dice coefficient or IoU on RNFL layer masks (SYN-OCT provides masks (www.nature.com)).
- Example program.md: "Goal: Maximize validation AUC for detecting glaucoma from OCT images. Allowed modifications: number of conv layers, filter counts, kernel sizes, activation functions, learning rate, optimizer choice, batch size, etc. After each 5-minute training run, evaluate AUC on the held-out set. If AUC improves, keep the change; otherwise revert." (medium.com) (www.theneuron.ai)
The agent will thus try variations (e.g. adding layers, adjusting width, switching from Adam to RMSProp) to improve AUC.
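A starting train.py might define a baseline like the following. This is a minimal sketch, not a validated architecture — the layer counts, filter widths, and input size are exactly the kind of knobs the agent would mutate:

```python
import torch
import torch.nn as nn

class TinyOCTNet(nn.Module):
    """Small CNN baseline for glaucoma-vs-healthy OCT classification.
    n_filters and n_blocks are agent-tunable hyperparameters."""
    def __init__(self, n_filters=16, n_blocks=3):
        super().__init__()
        layers, in_ch = [], 1  # single-channel (grayscale) B-scan
        for i in range(n_blocks):
            out_ch = n_filters * (2 ** i)   # double the width each block
            layers += [nn.Conv2d(in_ch, out_ch, 3, padding=1),
                       nn.ReLU(),
                       nn.MaxPool2d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(in_ch, 1))  # one glaucoma logit

    def forward(self, x):
        return self.head(self.features(x))

model = TinyOCTNet()
logits = model(torch.randn(4, 1, 128, 128))  # batch of 4 fake 128x128 B-scans
```

Passing the logits through a sigmoid gives per-image glaucoma probabilities for computing validation AUC.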
- Task: RNFL/GCC Layer Segmentation. Precisely measuring RNFL thickness is crucial. Using synthetic OCT scans (with provided segmentations) or any real OCT with annotated layers, one can frame this as a segmentation task.
- Dataset: SYN-OCT again provides RNFL segmentation masks (www.nature.com). Another source: some academic groups have labeled OCT B-scans (though often proprietary). If needed, one might use generic OCT segmentation datasets (like Duke retina OCT fluid challenge (www.nature.com)) as proxies.
- Model: A small U-Net-like CNN, perhaps channel-trimmed from a baseline. E.g., use 3 down/up blocks, starting with 16 filters. The agent is allowed to change depth and width.
- Metric: Dice score or mean IoU of the predicted RNFL mask vs truth.
- Example program.md: "Goal: Maximize the Dice score for RNFL layer segmentation on OCT. The base model is a 3-block U-Net. The agent may vary the number of filters, add dropout, or change learning rate. Train for 5 minutes each trial and compute Dice on validation. Keep modifications that increase Dice."
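The Dice metric itself is easy to standardize across the community. A minimal implementation for binary masks (the epsilon term guards against empty masks) might look like:

```python
import torch

def dice_score(pred_mask, true_mask, eps=1e-6):
    """Dice coefficient between two binary masks:
    2*|A intersect B| / (|A| + |B|), with eps to avoid division by zero."""
    pred = pred_mask.float().flatten()
    true = true_mask.float().flatten()
    inter = (pred * true).sum()
    return (2 * inter + eps) / (pred.sum() + true.sum() + eps)
```

Identical masks score ~1.0 and disjoint masks score ~0, so "keep modifications that increase Dice" is a well-defined rule for the agent.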
- Task: Progression Prediction via Serial OCT. Using sequential OCT, predict future thinning. If longitudinal OCT data exist (e.g. UK Biobank or private clinic data), the goal could be to predict RNFL change or a binary “fast progressor” label.
- Dataset: Public longitudinal OCT data specific to glaucoma are scarce. However, one could repurpose SR OCT challenge data (or SYN-OCT images with simulated progression) to simulate this task. Alternatively, use UK Biobank OCT images (though not glaucoma-specific and not easily accessible to citizen scientists). For illustration, assume a dataset of OCT scans at time0 and time1 with labels.
- Model: A Siamese or concatenated CNN taking pairs of OCT images and outputting a probability of progression. A simpler starting point is to feed only the baseline (time-0) scan and predict whether thinning at follow-up exceeds a cut-off.
- Metric: AUC for binary progression classification, or MSE if trying to predict thickness change.
- Example program.md: "Goal: Identify eyes that will have rapid RNFL loss. Input: baseline OCT; label: >5 µm thinning after 1 year. We use a CNN classifier. Allowed changes include network depth, learning rate, augmentation. Use validation AUC as the metric."
2.2 Visual Field (VF) Analysis
- Task: Predict Future Visual Field Loss. Given one or more past Humphrey visual field tests (point-wise sensitivity values), forecast future sensitivity or rate of progression. This is a classic glaucoma management problem.
- Dataset: The GRAPE dataset (www.nature.com) (2023) provides longitudinal follow-up of 263 eyes (1,115 records) with VF and fundus/OCT, including annotated progression. Another resource is the University of Washington Humphrey Visual Field (UWHVF) longitudinal database (www.nature.com) (28,943 fields from many patients). GRAPE, however, is well curated and public, with both VF and outcomes.
- Model: A simple approach is a feed-forward (fully connected) network on the 54-point VF data (or on compressed global indices). For progression prediction, a small MLP or 1D-CNN can handle the 54 input features. Another idea: pad the 54 points into a small 2-D grid and treat it as a tiny image for a small CNN (e.g., 3×3 kernels).
- Metric: If predicting future mean deviation or point values, use MSE (lower is better). If classifying “fast progressor vs not”, use AUC.
- Example program.md: "Goal: Minimize MSE of predicted visual field. Alternatively, maximize AUC for classifying rapid loss. Base model: 2-layer perceptron on 54 VF values. Agent can adjust hidden size, activation, or add dropout. After each 5-min train, compute metric on val set."
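The 2-layer perceptron baseline from that program.md fits in a few lines. This is a sketch with random stand-in data; hidden size and dropout rate are assumptions the agent would tune:

```python
import torch
import torch.nn as nn

# Minimal 2-layer perceptron on 54 pointwise VF sensitivities,
# regressing the 54 future sensitivities (MSE objective).
vf_mlp = nn.Sequential(
    nn.Linear(54, 32),   # 54 Humphrey test points -> hidden layer
    nn.ReLU(),
    nn.Dropout(0.2),     # one of the agent-tunable knobs
    nn.Linear(32, 54),   # predict the 54 future point values
)

fields = torch.randn(8, 54)                               # 8 fake baseline fields
pred = vf_mlp(fields)
loss = nn.functional.mse_loss(pred, torch.randn(8, 54))   # validation MSE
```

Swapping the output layer for a single logit turns the same skeleton into the "fast progressor" classifier, scored with AUC instead of MSE.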
- Task: Identify Fast Progressors. Using a series of past VFs, classify which eyes will lose vision quickly.
- Dataset: Use the annotated progression status in GRAPE (www.nature.com) (they marked eyes as progressed). Or take UWHVF and label top decile of MD loss as “fast”.
- Model: Could concatenate features from two or three consecutive fields (or differences) into a small network. Possibly include baseline IOP and age if available.
- Metric: AUC for distinguishing fast vs slow progressors.
- Example program.md: "Goal: Maximize AUC for predicting rapid field progression. Input features: second-order differences of VF1 & VF2, plus IOP. Use small FC network. Agent may tune layer widths, learning rate, batch size."
2.3 Drug/Compound Screening (In Silico Candidate Discovery)
- Task: Predict Candidate Neuroprotective/Regenerative Compounds. Use ML to find small molecules that might protect RGCs or encourage regeneration. For example, many known compounds (like nicotinamide, valproate) show neuroprotective effects. We can train models to recognize chemotypes correlated with known efficacy and then search chemical space.
- Dataset: This is challenging due to lack of a dedicated glaucoma drug database. As a proxy, one could use MolNet datasets (e.g. HIV inhibition, BBB permeability) or any bioactivity dataset. Alternatively, compile a list of compounds tested in optic nerve injury models (from literature mining) with labels. In practice, one might start with a more generic property (e.g. blood-brain barrier penetration data from MoleculeNet).
- Model: A small transformer or graph neural network on SMILES strings. A transformer (GPT-2 style) with few layers, or a simple graph convolutional net (e.g. 3 GCN layers), can be implemented in train.py.
- Metric: If we treat this as classification (active vs inactive), use AUROC. If predicting affinity or logP, use RMSE.
- Example program.md: "Goal: Maximize classification ROC-AUC for identifying neuroprotective-like compounds. Base model: small transformer on SMILES. Agent may adjust number of transformer layers, dropout, learning rate, or use alternative featurizations (e.g. fingerprint input). After each 5-min, evaluate AUC on val molecules."
(Note: Because public data for actual neuroprotection is scarce, this task is more illustrative. In practice, citizen scientists could create a custom dataset of known neuroprotective compounds vs controls and follow this pattern.)
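One deliberately simple "alternative featurization" the agent might fall back to is a bag-of-characters vector per SMILES string. This toy sketch (the vocabulary and classifier sizes are illustrative, and it ignores real chemistry like multi-character atom tokens) shows the pipeline shape without requiring a cheminformatics library:

```python
import torch
import torch.nn as nn

# Toy featurization: count occurrences of common SMILES characters.
# A real setup would use Morgan fingerprints or a learned SMILES encoder.
VOCAB = "CNOSPFIclBrn()=#[]+-0123456789@/\\"
CHAR_IDX = {c: i for i, c in enumerate(VOCAB)}

def featurize(smiles):
    v = torch.zeros(len(VOCAB))
    for ch in smiles:
        if ch in CHAR_IDX:
            v[CHAR_IDX[ch]] += 1.0
    return v

# Tiny MLP emitting one "active vs inactive" logit per molecule.
clf = nn.Sequential(nn.Linear(len(VOCAB), 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.stack([featurize(s) for s in ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]])
logits = clf(x)
```

The agent could then compare this featurization head-to-head against the SMILES transformer on the same validation AUC.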
2.4 Gene Regulatory Network Modeling (Single-Cell RGC)
- Task: Identify Regenerative TF Combinations. Use single-cell RNA-seq data from RGCs to learn transcriptional patterns of regenerative growth. For example, some RGC subtypes regenerate better than others. An ML model might predict a “regenerative state” label, and one could inspect which transcription factors are important.
- Dataset: A 2018 study provides RGC single-cell transcriptomes (GEO accession GSE115404) (pmc.ncbi.nlm.nih.gov), identifying distinct RGC subtypes. We can use this dataset (or a subset) where cells are labeled by subtype or by experimental condition (e.g. pre- vs post-injury).
- Model: A small transformer or MLP operating on gene expression vectors (each cell has thousands of gene abundances). Practically, one would preselect the top ~500 genes (e.g. highly variable genes). The train.py might implement a mini-transformer (e.g. 4 layers, embedding 256) or a simple 2-layer perceptron.
- Metric: For unsupervised analysis, one could use silhouette score; more simply, if cells can be labeled “regenerating” vs “non-regenerating”, use classification accuracy/AUC.
- Example program.md: "Goal: Build a model distinguishing regenerating vs non-regenerating RGC gene-expression profiles. Start with a 3-layer transformer. Agent can change embed dim, depth, learning rate, or add batchnorm. Optimize validation accuracy."
After runs, the best model’s attention weights or learned features might highlight key transcription factors for experimentation.
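A cheap alternative to inspecting attention weights is gradient-based input saliency: rank genes by how strongly the classifier's output depends on them. The sketch below uses an untrained toy MLP and random data in place of a model trained on GSE115404, so the rankings here are meaningless — only the technique is real:

```python
import torch
import torch.nn as nn

# Toy stand-in for a trained regenerating-vs-non classifier.
n_genes = 500
model = nn.Sequential(nn.Linear(n_genes, 64), nn.ReLU(), nn.Linear(64, 1))

# Fake expression matrix (32 cells x 500 genes); track input gradients.
cells = torch.randn(32, n_genes, requires_grad=True)
score = model(cells).sum()     # sum of "regenerating" logits over the batch
score.backward()

# Mean absolute gradient per gene = crude importance score.
gene_importance = cells.grad.abs().mean(dim=0)
top10 = torch.topk(gene_importance, k=10).indices  # candidate genes/TFs to inspect
```

Run on a genuinely trained model, the top-ranked genes become a shortlist of transcription factors for wet-lab follow-up.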
2.5 Electrophysiology Signal Analysis
- Task: Detect Subclinical RGC Dysfunction via ERG. Pattern electroretinogram (pERG) or other electrophysiological signals can reveal RGC health. For example, delayed or reduced ERG responses may precede visual field defects. We can attempt to classify signals as “normal” vs “glaucoma suspect.”
- Dataset: Public ERG datasets in glaucoma are rare. One could use a surrogate: a dataset from animals (retinal degeneration) or synthetic signals. If unavailable, even generic 1D electrophysiology datasets (e.g. ECG) could illustrate the pipeline.
- Model: A 1D CNN (e.g. 2 conv layers followed by FC) on the time-series data. Alternatively, an LSTM can be used if sequences are longer.
- Metric: Accuracy or AUC in classifying a subtle dysfunction vs normal. Possibly F1 if classes are imbalanced.
- Example program.md: "Goal: Maximize validation accuracy for classifying ERG traces (healthy vs early glaucoma pattern). Use a 1D CNN. Agent may adjust filter sizes, stride, or add recurrent layer. Keep any changes that improve accuracy."
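The 2-conv-layer baseline above is only a few lines in PyTorch. The trace length (512 samples) and filter sizes here are assumptions the agent would tune:

```python
import torch
import torch.nn as nn

# Minimal 1D CNN: two conv layers, then global pooling and a linear head.
erg_net = nn.Sequential(
    nn.Conv1d(1, 8, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv1d(8, 16, kernel_size=5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(16, 1),   # logit: healthy vs early-glaucoma pattern
)

traces = torch.randn(4, 1, 512)   # batch of 4 fake single-channel ERG traces
logits = erg_net(traces)
```

Global average pooling makes the network tolerant of different trace lengths, which is useful if recordings vary between devices.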
2.6 Literature Mining (Hypothesis Generation)
- Task: Fine-tune a Small Language Model to Surface Novel Insights. With thousands of glaucoma research papers in PubMed, an ML agent could look for connections or repurpose candidates. For instance, link neuroprotective pathways to existing drugs. We can treat this as a language modeling problem or as a retrieval problem.
- Dataset: Compile a corpus of glaucoma-related abstracts (e.g. use PubMed search for “glaucoma gene therapy” etc). One can download ~10,000 abstracts via NCBI APIs. For a simpler start, use PMC open-access glaucoma articles.
- Model: A small transformer language model (e.g. 6-layer GPT-2) or even BERT fine-tuned. For autoresearch purposes, we likely fine-tune a causal model (GPT) on the text.
- Metric: Standard practice is to optimize validation loss (equivalently, perplexity). If doing classification (e.g. given an abstract, predict a label for a drug or pathway), use accuracy/AUC.
- Example program.md: "Goal: Minimize validation perplexity of a small GPT-2 on the glaucoma literature corpus. Use 5-minute fine-tuning runs. Agent can vary number of layers, hidden size, learning rate, context length. Keep changes that reduce perplexity."
Once trained, one can prompt this model to generate hypotheses (e.g. “Top candidate repurposable drugs for neuroprotection in glaucoma: ...”).
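For anyone standardizing this metric across the community: perplexity is just the exponential of the mean cross-entropy loss (in nats), so "minimize validation loss" and "minimize perplexity" are the same instruction to the agent:

```python
import math

def perplexity(mean_ce_loss_nats):
    """Perplexity = exp(mean token cross-entropy in nats).
    A model that spreads probability evenly over N tokens has perplexity N."""
    return math.exp(mean_ce_loss_nats)
```

For example, a uniform guess over a 50-token vocabulary gives loss ln(50) and hence perplexity 50.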
In each of these domains, the key is that a single GPU and brief runs allow many trials. We are not expecting the agent to code new algorithms from scratch but to tweak an existing training script. The human role is writing program.md to guide the agent’s search toward a glaucoma-specific goal (like maximizing AUC on a fundus dataset or predicting RNFL thickness). The examples above illustrate how train.py could be set up initially and how a program.md prompt steers the agent to improve a chosen metric (medium.com) (www.theneuron.ai).
3. Practical Citizen Science Implementation Guide
How can motivated individuals with limited resources (e.g. a single RTX 3060 or a MacBook with Apple Silicon) actually apply autoresearch to glaucoma problems? The good news is Karpathy’s repo is small and has guidance for scaling down. Here are key steps and tips:
- Environment Setup: Clone the karpathy/autoresearch repo. You’ll need a modern Python and, ideally, access to an LLM API (the agent itself is typically a pre-trained LLM such as GPT-4 or Claude that edits the code). For GPUs, install PyTorch with proper CUDA/Metal support. For Apple Silicon, use one of the forks (e.g. MLX) or a PyTorch build for M1/M2 (see the repo’s docs). On Windows/Linux with a 3060 or 4070, standard PyTorch CUDA works.
- Configuring for Small GPU: The default autoresearch uses a ~50M-parameter GPT-like model and sequences of length 1024 (medium.com), which may be heavy. For an RTX 3060 (12 GB), you should reduce model size and sequence length. In train.py, set MAX_SEQ_LEN=512 or even 256. Drop the number of layers and width (the medium GPT is ~8 layers; try 4 layers, 256 width). The instructions in the community mention lowering “DEPTH”, “WIDTH”, etc. You can also reduce the optimizer’s memory footprint by using smaller batch sizes (even 16 or 8). The agent can still mutate these parameters, but giving it a smaller starting point ensures runs stay under 5 minutes. The autoresearch GitHub README and issue discussions also note that Mac M1 chips can handle shorter sequences (e.g. 256 tokens) due to limited memory; similar scaling applies to any GPU.
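A scaled-down starting configuration might look like the block below. The variable names are illustrative (match them to whatever train.py actually uses), and the parameter-count formula is the standard rough estimate for a GPT-style transformer stack:

```python
# Hypothetical scaled-down config for a 12 GB RTX 3060.
MAX_SEQ_LEN = 256   # down from the default 1024
DEPTH = 4           # transformer layers, down from ~8
WIDTH = 256         # embedding width
BATCH_SIZE = 16     # drop to 8 if you still hit out-of-memory errors

# Rough transformer parameter count: ~12 * DEPTH * WIDTH^2
# (attention + MLP weights per block, ignoring embeddings).
approx_params = 12 * DEPTH * WIDTH ** 2   # ~3.1M, well under the ~50M default
```

At ~3M parameters and 256-token sequences, 5-minute runs are comfortably within reach of consumer hardware.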
- Preparing Glaucoma Data: Each task’s data must be loaded and split. Public glaucoma datasets include:
- Fundus Datasets: ORIGA(-light) (650 labeled images (pubmed.ncbi.nlm.nih.gov)), RIM-ONE DL (485 images with cup/disc segmentations (github.com)), REFUGE (1200+ images, with training/test splits (refuge.grand-challenge.org)), the new Hillel Yaffe Glaucoma Dataset (HYGD) with ~1200 fundus images and high-quality labels (physionet.org). EyePACS/AIROGS (tens of thousands of retinal images) is also publicly accessible via registration (e.g. Kaggle).
- OCT Datasets: SYN-OCT (200k synthetic B-scans with RNFL masks (www.nature.com) (www.nature.com)), OCTDL (2064 images of various retinal diseases (www.nature.com)), and others from public challenges.
- Visual Field Data: GRAPE (263 eyes longitudinal VF plus images (www.nature.com)). UWHVF (28k VF tests) is open if you download from University of Washington repository (www.nature.com). Some Kaggle challenges include VF data.
- Electrophysiology: No large open glaucoma ERG dataset is known, but one could start with any accessible norm vs glaucoma signal data.
- Chemical/Gene Data: Standard datasets like MoleculeNet (for compounds) or GEO (for genes) can be repurposed. E.g. download GSE115404 raw counts (via GEO query (pmc.ncbi.nlm.nih.gov)) and preprocess to expression matrices.
For each, you need a prepare.py that loads the data and defines train_set, val_set, and an evaluation function. Karpathy’s template expects prepare.py to output training data and an evaluation routine that returns a loss or metric. For example, prepare.py for RIM-ONE might load the fundus images labeled as glaucoma or normal, split them into train/val folders, and define a function computing validation AUC.
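Two pieces of that contract — a deterministic split and a metric function — can be written in pure Python. This is a hypothetical sketch (the exact interface autoresearch expects may differ; the rank-based AUC below is the standard definition, equivalent to the Wilcoxon statistic):

```python
import random

def auc(labels, scores):
    """Rank-based AUC: probability a random positive outscores a random
    negative (ties count half). labels are 0/1, scores are model outputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_splits(items, val_frac=0.2, seed=0):
    """Fixed shuffle + 80/20 split, so every 5-minute run
    sees exactly the same train/val partition."""
    rng = random.Random(seed)
    items = items[:]            # don't mutate the caller's list
    rng.shuffle(items)
    k = int(len(items) * (1 - val_frac))
    return items[:k], items[k:]
```

Pinning the seed is the important part: if each run re-splits the data differently, the agent's keep/revert decisions become noise.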
- Adjusting Data for Small Scale: If datasets are large (like EyePACS or SYN-OCT), you can subsample to create a “tiny” dataset of a few hundred examples (the model can still learn something valuable on a small corpus). The autoresearch repo even mentions using “TinyStories”-style tiny datasets to run on tiny hardware. For example, pick 500 images from ORIGA (balanced), or 1,000 VF fields from GRAPE. Likewise, for language, one could use a 5,000-abstract subset of PubMed glaucoma papers. The key is a fixed dataset that the agent iterates over. Be sure to pre-shuffle and split 80/20 so each 5-minute run sees the same train/val split.
- Writing program.md Strategies: The community should share different program.md prompts (like “recipes”) in version control. Each file could encode a research strategy. For instance, one strategy might say “increase network depth if depth <6, else reduce learning rate,” while another might say “focus on data augmentation changes.” Over time, groups can compare which strategies yielded better metrics on leaderboards. A good program.md includes a goal (e.g. maximize AUC or minimize validation loss) and hints at allowable mutations (layers, filters, LR). The agent’s LLM uses these instructions to propose code edits. Keep metrics standardized (e.g. always report AUC for glaucoma classification tasks) so experiments are comparable.
- Community Collaboration: To make this effort scalable, a citizen-science community should organize:
- Shared Experiment Logs: Post each experiment’s results (e.g. “Run #27 of program-v1 achieved Val AUC=0.82 with width=4, depth=3”).
- Standardized Metrics: Define metrics for each task: e.g. “OCT glaucoma AUC”, “VF progression AUC”, etc. A shared leaderboard (akin to autoresearch’s val_bpb) can track top scores. For example, a Slack bot or GitHub Action might collect each agent’s best AUC weekly.
- Version-Controlled program.md: Host all program.md files in a GitHub repo. Members can fork and propose new strategies (via pull requests) while keeping historical versions. This way multiple approaches can be tested in parallel (e.g. “program_word2vec.md” vs “program_transformer.md”).
- Data and Code Sharing: Use public repos or notebooks for data prep scripts, and share the train.py modifications found by the agent (to reproduce in standard ML frameworks). Linking to the original dataset sources (Kaggle, PhysioNet, Zenodo) ensures others can download the same data.
By lowering technical barriers (the agent edits code, user edits instructions in Markdown), and by coordinating efforts (shared logs, leaderboards), citizen scientists can collectively explore hyperparameter/model choices for these glaucoma ML problems. In essence, they invest human creativity in defining goals, and let the agent run the grind of 100 experiments overnight per goal (medium.com) (www.theneuron.ai).
4. Vision Restoration Specifically
Vision restoration – regaining sight after damage – is a particularly exciting target for AI-driven optimization. Current AI-assisted vision restoration research includes retinal implants, cortical prostheses, and optogenetics. Here’s how an autoresearch loop could fit in:
- Optimizing Visual Prosthesis Encoding: Modern prostheses (retinal implants or cameras linked to electrode arrays) try to translate a camera image into electrical stimulation patterns that the brain interprets as sight. The challenge is that the “bandwidth” of electrodes is very limited (often just tens to a few hundred points) (pmc.ncbi.nlm.nih.gov). An ML model (a small CNN or transformer) can be trained to map input images to ideal stimulation maps, but the best hyperparameters or architectures for this translation are unknown. An autoresearch agent could run 100 variations of a “neural encoder” model in hours. For example, set up a dataset of image→stimulation pairs (either simulated phosphenes or patient data) and have the agent optimize the encoder network to minimize a reconstruction loss or maximize a utility metric (contrast preservation, recognition accuracy). The agent might try adding attention layers, changing convolution sizes, or tuning learning rates. Over many runs, one could find small networks that deliver more salient prosthetic outputs. Some recent work already uses AI to extract visual saliency for prostheses (pmc.ncbi.nlm.nih.gov); autoresearch could automate the tuning of such pipelines.
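A toy version of such an encoder is a small CNN that pools a camera frame down to one output per electrode. The sizes below (64×64 input, 10×10 electrode grid) are assumptions chosen for illustration, not any real device's geometry:

```python
import torch
import torch.nn as nn

# Toy "neural encoder": camera frame -> per-electrode stimulation amplitudes.
encoder = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((10, 10)),       # one cell per electrode (10x10 grid)
    nn.Conv2d(8, 1, 1), nn.Sigmoid(),     # amplitude normalized to [0, 1]
)

frame = torch.rand(1, 1, 64, 64)          # fake grayscale camera frame
stim_map = encoder(frame)                 # (1, 1, 10, 10) stimulation map
```

Trained against simulated-phosphene reconstructions, this is the kind of network whose depth, kernel sizes, and pooling the agent would mutate overnight.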
- Optogenetic Stimulation Patterns: In optogenetic therapy, surviving RGCs or other retinal cells are made light-sensitive (via introduced genes). The inputs from a camera must then be encoded into light pulses. Here again, an ML model can control patterns. One could frame a toy task: a small network transforms the camera image into a light-intensity map (same dimensions as the cell array). The agent’s objective could be to maximize some metric of effective stimulation (e.g. maximize activation of target cells in a simulated retina). Each trial might run a quick simulation of the response. Over iterations, the agent might explore pulse durations or spatial filters. For instance, adjusting the aggressiveness of a high-pass filter on the camera input might be beneficial for some patterns. The point is that many analog parameters (filter kernels, nonlinearity, temporal pulse coding) can be swept automatically.
- Pulse Pattern Optimization (TES and Implants): Even non-machine-learning domains can benefit from quick search. For example, a recent study (Xie et al. 2025) found that shorter pulse durations and insertion of interphase intervals significantly improved cortical activation for retinal implants (pmc.ncbi.nlm.nih.gov). This suggests the parameter space of electrical stimulation has strong, non-intuitive effects. An autoresearch agent could treat the stimulation protocol parameters (phase duration, frequency, interval) as “network parameters” and run many small experiments (each simulated or empirical) to maximize cortical response. For instance, set up a simplified electrical model (or use recorded evoked-potential data) in prepare.py and let the agent tweak train.py parameters like pulse timing to maximize a defined response amplitude. This is akin to automating the parameter sweeps that neuroscientists currently do by hand.
- Viral Vector Design and Scaffold Geometry: In more exploratory therapy development, the agent’s looping approach could also tackle biomedical optimizations. For example, design of AAV viral capsids or promoters to target RGCs could be guided by small predictive models (e.g. logistic regression on sequence features). Autoresearch could repeatedly try modifying a model that predicts tropism or expression (trained on e.g. small viral libraries) to improve that prediction. Similarly, if someone has simulation code for growth in nerve scaffolds (for optic nerve repair), the agent could tweak geometric parameters to maximize axon extension. These are advanced, but conceptually fit – the “agent as experimenter” could adjust model or simulation parameters for improved outcomes.
In summary, any aspect of vision prosthesis or restoration that relies on parameterized algorithms could be improved via rapid iterations. Importantly, the limitation is that for many of these tasks we generally only have simulated data; actual patient testing of hundreds of variants isn’t possible. But autoresearch can operate in silico to propose the best candidates for later clinical testing. As the prosthesis review noted, “ensuring phosphenes are reliably generated at precise locations… is an important challenge” and “AI-driven models have shown potential” in this area (pmc.ncbi.nlm.nih.gov). Autoresearch could significantly accelerate finding those AI models’ best configurations.
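As a sketch of the in-silico search described above, the snippet below grid-searches pulse parameters against a made-up evoked-response function. The function’s shape (rewarding short phases and an interphase gap, loosely echoing the Xie et al. finding) is purely illustrative:

```python
import itertools

def toy_response(phase_us, gap_us, freq_hz):
    """Hypothetical evoked-response model (illustrative only): rewards short
    phase durations, a modest interphase gap, and mid-range frequency."""
    return (1.0 / phase_us) * (1.0 + min(gap_us, 100) / 100.0) - abs(freq_hz - 20) * 0.001

# Parameter grid standing in for a stimulation protocol search space.
grid = {
    "phase_us": [100, 200, 400, 800],
    "gap_us": [0, 50, 100],
    "freq_hz": [10, 20, 40],
}

# Exhaustively evaluate every combination and keep the best-responding one.
best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=lambda p: toy_response(**p),
)
```

An agent-driven loop would do the same thing adaptively rather than exhaustively, and against recorded evoked-potential data instead of a toy formula.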
5. Bridging to Clinical Impact
Computational results must ultimately connect back to real glaucoma research and care. How can ideas generated by patient-led autoresearch be validated and advanced?
- Collaboration with Research Groups: Citizen scientists should reach out to established glaucoma research consortia. Examples include the International Glaucoma Genetics Consortium (IGGC) and the NEIGHBORHOOD consortium, which pool genetic and clinical data (pubmed.ncbi.nlm.nih.gov) (pmc.ncbi.nlm.nih.gov). Findings from autoresearch (e.g. a novel candidate gene or drug repurposing hypothesis) could be shared with such groups for experimental follow-up. Tissue culture labs (e.g. at major universities) or wet-lab researchers might test compounds on RGC survival. Academic clinicians can correlate any biomarker or image classifier with their patient data under an IRB. Starting dialogues between hackathon-style groups and formal labs is key.
- Engaging Patient Advocacy Organizations: Groups like the Glaucoma Research Foundation or Cure Glaucoma Foundation often fund patient-centered innovation. They could sponsor proof-of-concept projects or citizen competitions using autoresearch. These organizations have clinician networks and could help route promising model leads to the clinic. For example, if an agent flags an existing FDA-approved drug as neuroprotective, an advocacy group could assist in setting up a small trial under proper protocols. Highlighting successes will require framing outputs as hypotheses (not medical advice) and ensuring transparency.
- Ethical and Safety Guardrails: Citizen scientists must use only de-identified public data or fully synthetic data. Any use of actual patient records requires an IRB-approved protocol (and likely patient consent). Output from autoresearch loops should be clearly labeled as hypothesis-generating. For instance, “This model suggests Drug X may protect RGCs – experimental validation needed.” Critical medical decisions must remain with doctors. Risks include inadvertently distributing models that predict personal outcomes (glaucoma progression) – explicit disclaimers that these are not diagnostic tools are necessary. Data privacy best practices (e.g. using aggregated or anonymized fields) are a must.
- Precedents in Citizen Science: It is not unprecedented for amateurs to contribute to medical/neuroscience research. The Eyewire project (MIT’s crowdsourced neuron-mapping game) mobilized volunteers to reconstruct retinal neural circuits (www.citizenscience.gov). In ophthalmology, non-experts have helped annotate images in OpenAI-funded challenges (e.g. labeled datasets for eye disease). Outside eye care, games like Foldit (protein folding puzzles) and Galaxy Zoo (classifying galaxies) show that citizen participation can solve hard scientific problems. These successes encourage the idea that many hands (and now AIs) can indeed aid complex research. The autoresearch approach is like giving each person an AI-powered lab assistant: previous crowdsourced efforts only used humans to analyze fixed tasks, whereas here the human sets the goal and the AI does the iteration.
By being transparent, cautious, and collaborative, a citizen science autoresearch initiative can earn trust. It should emphasize “generating leads, not prescriptions.” If the community documents methods and shares code openly, professional researchers can reproduce findings. For example, if someone finds a new combination of RGC-protective factors, they could publish it in a preprint or alert a lab. Citation-style references (as we do here) help bridge: e.g. “We treated your list of candidate drugs in context of known pathways (pmc.ncbi.nlm.nih.gov).” Ultimately, this is a form of open science – patient-driven but scientifically rigorous. If ethical standards are maintained, such grassroots innovation has great potential to spark new collaborations and ultimately feed into peer-reviewed ophthalmology research.
6. A Concrete 90-Day Roadmap
A focused, time-boxed plan can rally a community of 10–50 people (with at least one GPU or Apple Silicon each) to launch an autoresearch-for-glaucoma effort. Here is a suggested phased plan:
- Week 1–2: Formation & Setup
- Recruitment and Kickoff: Create a communication channel (e.g. Slack or Discord) and a GitHub repo for the project. Publicize to glaucoma patient forums, biohacker groups, and AI meetups.
- Hardware Check: Ensure everyone can install PyTorch and clone Karpathy’s repo (or the Maple fork). Hold a setup session where each member runs a sample autoresearch loop on a toy dataset (e.g. CIFAR-10 subset) to verify the environment.
- Dataset Selection: Decide on 1–3 initial tasks (e.g. OCT classification, VF progression). For each, assign a small team to prepare data: e.g. one team downloads RIM-ONE images (github.com), another retrieves GRAPE fields (www.nature.com), another collects literature abstracts. Teams should split data 80/20 and create `prepare.py` stubs.
- Baseline Models: For each task, finalize a simple `train.py`: e.g., a tiny CNN for RIM-ONE, an MLP for VFs. Choose evaluation metrics (AUC, Dice, MSE).
- Initial `program.md` Drafting: Each team writes an initial instruction file (program.md) stating the goal and allowed changes. E.g. for RIM-ONE: “maximize glaucoma detection AUC”; for GRAPE: “minimize VF MSE.”
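A minimal `prepare.py` stub for the 80/20 split step might look like the following; the file names are placeholders, and real RIM-ONE or GRAPE loading code would replace them:

```python
# prepare.py -- minimal sketch of a dataset-prep stub (paths are placeholders).
import random
from pathlib import Path

def split_files(paths, train_frac=0.8, seed=42):
    """Deterministic 80/20 split so every team member gets the same folds."""
    paths = sorted(paths)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(paths)
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]

# Stand-in for a directory of RIM-ONE fundus images.
files = [Path(f"img_{i:03d}.png") for i in range(100)]
train, val = split_files(files)
```

Fixing the seed matters here: the agent’s overnight runs are only comparable if every run trains and validates on identical folds.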
- Week 3–6: First Experiment Cycles
- Run Autoresearch Loops: Each subgroup runs the agent on their task overnight (roughly 100 5-min runs). Use a single program.md to start, then let participants add variations (e.g. “program_temp1.md”).
- Collect Results: Each morning, teams examine the logs (the repo auto-logs each run). Record the best metric achieved, the model parameters at that time, and any notable changes the agent found. For transparency, push these results to the shared GitHub (perhaps in CSV or JSON).
- Iteration & Feedback: Compare runs. Did any strategy beat the baseline significantly? If a sub-team sees little progress, they should tweak program.md (e.g. being more aggressive with learning rate changes). Each weekend, synthesize findings in a community meeting.
- Tools: Use Git for version control on program.md and on the code templates. Consider a shared Google Sheet or wiki table for leaderboards (e.g. “OCT-AUC: best=0.85 by Alice; VF-RMSE: best=2.1 by Bob”). This motivates healthy competition and transparency.
- Week 7–12: Refinement and Outreach
- Refine Experiments: Based on early results, refine promising tasks. For example, perhaps the RIM-ONE classifier topped 0.90 AUC – now try adding data augmentation or a slightly deeper net. Encourage branching: some can try different architectures (e.g. a tiny Vision Transformer instead of a CNN). Agents can run multiple `program.md` variants in parallel.
- Result Synthesis: Create short reports on each domain (OCT, VF, etc.), summarizing what worked. For instance, “We improved GCC segmentation Dice from 0.60 to 0.75 by switching from ReLU to GELU activation.” Use lay language so non-experts can follow (with a glossary for ML terms).
- Community Presentation: By week 10, write a blog post or slide deck summarizing the initiative so far. Highlight any nontrivial findings (even “null” results are useful to share). Invite feedback from online forums; perhaps contact a researcher for comments (“We found X neural network tweaks help classify early glaucoma – does this align with physiology?”).
- Plan Outreach: Identify one or two ophthalmology labs or clinicians interested in collaborating and reach out with the initial results. For example, connect with the authors of the HYGD dataset or the GRAPE team on Twitter/LinkedIn and mention your citizen findings. Explore possibilities for co-validation (e.g. send them the trained model weights to test on their data).
- Beyond 12 Weeks: Next Steps
- Continue looping on the most promising tasks and new ones. For example, if RIM-ONE yields good results, next tackle REFUGE. Perhaps build composite models (ensemble of CNNs).
- Publish a project page or preprint describing the effort.
- Consider organizing a hackathon to bring in more minds, possibly in partnership with a glaucoma charity.
By structuring this way, the community can make steady progress, learn together, and start bridging to experts by the end of 90 days.
7. Risks, Limitations & Honest Assessment
The autoresearch-for-glaucoma idea is ambitious, so it requires honesty about potential pitfalls:
- Risk of Overfitting and Spurious Patterns: Small models on small, noisy datasets often latch onto coincidences. An agent might find a tweak that improves validation AUC simply by overfitting to idiosyncrasies. For example, if a subset of images had a subtle annotation mark, the network might use that instead of true glaucoma features. This leads to “gradient descent foolery.” To mitigate:
- Always use held-out test sets (completely separate from any tuning) for final evaluation.
- Limit complexity: keep models modest, and watch if the agent excessively deepens or widens the net beyond reason.
- If a model achieves near-perfect score too quickly, question it.
- Use sanity checks: e.g. scramble labels and see if AUC drops to random (if not, there is leakage).
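The label-scrambling check is easy to script. The sketch below computes AUC from scratch (Mann–Whitney form) on synthetic scores that contain real signal, then repeats it with shuffled labels; a shuffled-label AUC that stays far from 0.5 signals leakage:

```python
import numpy as np

def auc(labels, scores):
    """Mann-Whitney AUC: probability a random positive outranks a random negative."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)               # synthetic binary labels
scores = labels + 0.5 * rng.normal(size=500)   # model scores with real signal
real_auc = auc(labels, scores)

shuffled = rng.permutation(labels)             # break the label-feature link
null_auc = auc(shuffled, scores)
# real_auc should sit well above 0.5; null_auc should hover near 0.5.
# If null_auc stays high, information is leaking into the scores.
```

The same two-line check can be dropped into any team’s evaluation script before a result goes on the leaderboard.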
- Bias and Data Quality: Public glaucoma datasets often come from narrow populations (e.g. ORIGA from Singapore) (pubmed.ncbi.nlm.nih.gov). A model tuned to those may not generalize. Citizen experiments should note this limitation. Ideally, multiple datasets (from different cohorts) are used to check whether findings are robust.
- False Leads (“Research Theater”): Running tons of experiments feels productive, but if every improvement is only on synthetic or trivial datasets, it might not benefit patients. To avoid this:
- Focus on tasks with clinical relevance (e.g. early detection from routine OCT).
- Tie outcomes to real measures when possible (e.g. AUC for progression, not just a tiny loss delta).
- Prioritize interpretability: if the agent “finds” a new biomarker, try to ensure it makes sense (e.g. is it focusing on known anatomical changes?).
- No Clinical Guarantee: It must be crystal clear: output from these loops is hypothesis generation, not medical advice. A model suggesting a new drug must be vetted in the lab before any patient use. Overclaiming is dangerous. Label all shared results with disclaimers: “This is an AI exploration and not a peer-reviewed finding.”
- “Small Model” Limitation: Very small networks have limited capacity and may miss complex patterns. Big models often deliver breakthroughs but require huge data. Here we accept the limited scope: the hope is that even small improvements can guide research. We should not expect these models to replace deep learning on massive data; they are best at quickly trying obvious ideas.
- Agent Trustworthiness: The agent (e.g. GPT-4) might hallucinate or deviate. It is important that results are reproducible: after an agent run, a human should check which changes were kept and re-run training to confirm the metric. Keep the agent honest by including statements in `program.md` like “only accept actual improvements in the evaluation metric.”
Despite these challenges, the key safeguard is transparency and critical follow-up. Document everything. When a model shows a pattern, verify it. If many citizen scientists see the same anomaly (e.g. all high-AUC models for an OCT task emphasize the nasal retina region), that strengthens the case. The goal is accelerating the idea generation phase, not avoiding careful science afterwards.
Conclusion
Glaucoma is a complex, silent blinding disease with many unmet research needs – from protecting neurons to restoring vision. At the same time, AI has democratized experimentation: one person with a GPU and some determination can run automated hyperparameter searches that would take teams weeks manually. Karpathy’s autoresearch framework essentially hands each citizen an AI lab assistant. By writing clear high-level goals in Markdown, community researchers can let an agent churn through experiments and cut straight to promising leads.
We have outlined how this can be done in practice: identifying glaucoma ML tasks, selecting data (fundus and OCT images, visual fields, molecular datasets), defining models and metrics, and using program instructions to guide the search. We sketched a 90-day community roadmap and noted bridges to clinicians to ensure that valuable output can inform actual glaucoma science. The approach is very much “citizen science”: opening up scientific discovery tools in an accessible way, while still relying on expert oversight where it matters.
Citations: We have referenced the latest resources in both glaucoma research and AI. Key facts (disease prevalence, half undiagnosed (physionet.org)), promising therapies (CNTF implants (pmc.ncbi.nlm.nih.gov), gene editing (pmc.ncbi.nlm.nih.gov)), and potential pitfalls (AI in imaging (pmc.ncbi.nlm.nih.gov)) are grounded in current literature. Autoresearch itself is described in Karpathy’s walkthrough (medium.com) and review (www.theneuron.ai). These should lend credibility to the vision outlined here.
By the end of it all, we hope the reader feels empowered: if you are a patient, caregiver, or passionate hobbyist, you could be part of driving glaucoma research forward. The tools and data exist, the problems are clear, and with coordination and an AI agent, we can accelerate learning. As with any research, the journey will have false starts, but even failures teach us something – often steering human minds toward the right approaches. With eyes wide open to both the possibilities and the pitfalls, citizen-led autoresearch could become a powerful complement to traditional glaucoma science.
Start Here
The easiest way to dip your toes into autoresearch for glaucoma today: Run a tiny classification on ORIGA fundus images.
- Get the data: Download the ORIGA-light dataset (650 retinal fundus images labeled normal vs glaucoma) (pubmed.ncbi.nlm.nih.gov). Split ~80% train / 20% validation.
- Initial model: Use or adapt the sample script from [karpathy/autoresearch] for image classification. For example, a bit of code to load ORIGA images and train a small CNN (2–3 conv layers) to distinguish glaucoma vs healthy.
- Write `program.md`: In text, set the goal to “maximize validation AUC for glaucoma detection”, and instruct the agent it may tweak model depth, learning rate, etc. For instance: “Goal: Maximize AUC on glaucoma vs normal for the ORIGA dataset. The agent should try adjusting convolutional layer sizes, number of filters, and learning rate. Each trial is 5 minutes of training. If the validation AUC improves, keep the change. Repeat.”
- Run the loop: Launch autoresearch (point it to your `prepare.py`, `train.py`, and `program.md`). Let it run for several hours or overnight on your RTX 3060. It will perform ~100 experiments automatically.
- Check results: Examine the console or log to see the best validation AUC achieved (should be >0.8 if all goes well). You now have a model and training script that the AI agent refined.
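For orientation, here is roughly what the tiny CNN and a few training steps could look like in PyTorch. The architecture and the stand-in tensors are illustrative – real code would load ORIGA images in `prepare.py`:

```python
import torch
import torch.nn as nn

# Minimal 2-conv-layer classifier in the spirit of the walkthrough above.
# Fake tensors stand in for ORIGA fundus images and labels.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1),                          # logit for glaucoma vs normal
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(16, 3, 64, 64)                  # stand-in fundus batch
y = torch.randint(0, 2, (16, 1)).float()       # stand-in labels

for _ in range(3):                             # a few steps of the 5-minute loop
    loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Everything the agent is allowed to tweak – conv sizes, filter counts, learning rate – appears explicitly in these few lines, which is what makes the loop tractable for a small model.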
This simple weekend experiment already gives you firsthand experience with building an ML pipeline without writing new code by hand. Document what you tried and share your program.md and results with the community. Each small success (AUC bumps, interesting network changes) is a building block. You are literally instructing an AI to do research on your glaucoma problem of choice – and in doing so, you learn glaucoma data science and gain a real chance to contribute to understanding or treating vision loss.
Good luck! Keep questions and findings open-source, and remember: these are research tools, not medical advice. Check your runs carefully and enjoy the process of discovery.
