Repository: Orchestra-Research/AI-research-SKILLs
Branch: main
Commit: 28f2d29236f2
Files: 499
Total size: 7.4 MB

Directory structure:
gitextract_o6a5td4x/

├── .claude-plugin/
│   └── marketplace.json
├── .github/
│   └── workflows/
│       ├── claude.yml
│       ├── publish-npm.yml
│       └── sync-skills.yml
├── .gitignore
├── 0-autoresearch-skill/
│   ├── SKILL.md
│   ├── references/
│   │   ├── agent-continuity.md
│   │   ├── progress-reporting.md
│   │   └── skill-routing.md
│   └── templates/
│       ├── findings.md
│       ├── progress-presentation.html
│       ├── research-log.md
│       └── research-state.yaml
├── 01-model-architecture/
│   ├── .gitkeep
│   ├── litgpt/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── custom-models.md
│   │       ├── distributed-training.md
│   │       ├── supported-models.md
│   │       └── training-recipes.md
│   ├── mamba/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── architecture-details.md
│   │       ├── benchmarks.md
│   │       └── training-guide.md
│   ├── nanogpt/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── architecture.md
│   │       ├── data.md
│   │       └── training.md
│   ├── rwkv/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── architecture-details.md
│   │       ├── rwkv7.md
│   │       └── state-management.md
│   └── torchtitan/
│       ├── SKILL.md
│       └── references/
│           ├── checkpoint.md
│           ├── custom-models.md
│           ├── float8.md
│           └── fsdp.md
├── 02-tokenization/
│   ├── .gitkeep
│   ├── huggingface-tokenizers/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── algorithms.md
│   │       ├── integration.md
│   │       ├── pipeline.md
│   │       └── training.md
│   └── sentencepiece/
│       ├── SKILL.md
│       └── references/
│           ├── algorithms.md
│           └── training.md
├── 03-fine-tuning/
│   ├── axolotl/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── api.md
│   │       ├── dataset-formats.md
│   │       ├── index.md
│   │       └── other.md
│   ├── llama-factory/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── _images.md
│   │       ├── advanced.md
│   │       ├── getting_started.md
│   │       ├── index.md
│   │       └── other.md
│   ├── peft/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── unsloth/
│       ├── SKILL.md
│       └── references/
│           ├── index.md
│           ├── llms-full.md
│           ├── llms-txt.md
│           └── llms.md
├── 04-mechanistic-interpretability/
│   ├── nnsight/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── README.md
│   │       ├── api.md
│   │       └── tutorials.md
│   ├── pyvene/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── README.md
│   │       ├── api.md
│   │       └── tutorials.md
│   ├── saelens/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── README.md
│   │       ├── api.md
│   │       └── tutorials.md
│   └── transformer-lens/
│       ├── SKILL.md
│       └── references/
│           ├── README.md
│           ├── api.md
│           └── tutorials.md
├── 05-data-processing/
│   ├── .gitkeep
│   ├── nemo-curator/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── deduplication.md
│   │       └── filtering.md
│   └── ray-data/
│       ├── SKILL.md
│       └── references/
│           ├── integration.md
│           └── transformations.md
├── 06-post-training/
│   ├── grpo-rl-training/
│   │   ├── README.md
│   │   ├── SKILL.md
│   │   ├── examples/
│   │   │   └── reward_functions_library.py
│   │   └── templates/
│   │       └── basic_grpo_training.py
│   ├── miles/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── api-reference.md
│   │       └── troubleshooting.md
│   ├── openrlhf/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── algorithm-comparison.md
│   │       ├── custom-rewards.md
│   │       ├── hybrid-engine.md
│   │       └── multi-node-training.md
│   ├── simpo/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── datasets.md
│   │       ├── hyperparameters.md
│   │       └── loss-functions.md
│   ├── slime/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── api-reference.md
│   │       └── troubleshooting.md
│   ├── torchforge/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── api-reference.md
│   │       └── troubleshooting.md
│   ├── trl-fine-tuning/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── dpo-variants.md
│   │       ├── online-rl.md
│   │       ├── reward-modeling.md
│   │       └── sft-training.md
│   └── verl/
│       ├── SKILL.md
│       └── references/
│           ├── api-reference.md
│           └── troubleshooting.md
├── 07-safety-alignment/
│   ├── .gitkeep
│   ├── constitutional-ai/
│   │   └── SKILL.md
│   ├── llamaguard/
│   │   └── SKILL.md
│   ├── nemo-guardrails/
│   │   └── SKILL.md
│   └── prompt-guard/
│       └── SKILL.md
├── 08-distributed-training/
│   ├── accelerate/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── custom-plugins.md
│   │       ├── megatron-integration.md
│   │       └── performance.md
│   ├── deepspeed/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── 08.md
│   │       ├── 09.md
│   │       ├── 2020.md
│   │       ├── 2023.md
│   │       ├── assets.md
│   │       ├── index.md
│   │       ├── mii.md
│   │       ├── other.md
│   │       └── tutorials.md
│   ├── megatron-core/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── benchmarks.md
│   │       ├── parallelism-guide.md
│   │       ├── production-examples.md
│   │       └── training-recipes.md
│   ├── pytorch-fsdp2/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── pytorch_dcp_async_recipe.md
│   │       ├── pytorch_dcp_overview.md
│   │       ├── pytorch_dcp_recipe.md
│   │       ├── pytorch_ddp_notes.md
│   │       ├── pytorch_device_mesh_tutorial.md
│   │       ├── pytorch_examples_fsdp2.md
│   │       ├── pytorch_fsdp1_api.md
│   │       ├── pytorch_fsdp2_tutorial.md
│   │       ├── pytorch_fully_shard_api.md
│   │       ├── pytorch_tp_tutorial.md
│   │       ├── ray_train_fsdp2_example.md
│   │       └── torchtitan_fsdp_notes.md
│   ├── pytorch-lightning/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── callbacks.md
│   │       ├── distributed.md
│   │       └── hyperparameter-tuning.md
│   └── ray-train/
│       ├── SKILL.md
│       └── references/
│           └── multi-node.md
├── 09-infrastructure/
│   ├── .gitkeep
│   ├── lambda-labs/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── modal/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── skypilot/
│       ├── SKILL.md
│       └── references/
│           ├── advanced-usage.md
│           └── troubleshooting.md
├── 10-optimization/
│   ├── .gitkeep
│   ├── awq/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── bitsandbytes/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── memory-optimization.md
│   │       ├── qlora-training.md
│   │       └── quantization-formats.md
│   ├── flash-attention/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── benchmarks.md
│   │       └── transformers-integration.md
│   ├── gguf/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── gptq/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── calibration.md
│   │       ├── integration.md
│   │       └── troubleshooting.md
│   ├── hqq/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── ml-training-recipes/
│       ├── SKILL.md
│       └── references/
│           ├── architecture.md
│           ├── biomedical.md
│           ├── domain-specific.md
│           ├── experiment-loop.md
│           ├── optimizers.md
│           └── scaling-and-selection.md
├── 11-evaluation/
│   ├── .gitkeep
│   ├── bigcode-evaluation-harness/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── benchmarks.md
│   │       ├── custom-tasks.md
│   │       └── issues.md
│   ├── lm-evaluation-harness/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── api-evaluation.md
│   │       ├── benchmark-guide.md
│   │       ├── custom-tasks.md
│   │       └── distributed-eval.md
│   └── nemo-evaluator/
│       ├── SKILL.md
│       └── references/
│           ├── adapter-system.md
│           ├── configuration.md
│           ├── custom-benchmarks.md
│           └── execution-backends.md
├── 12-inference-serving/
│   ├── .gitkeep
│   ├── llama-cpp/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── optimization.md
│   │       ├── quantization.md
│   │       └── server.md
│   ├── sglang/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── deployment.md
│   │       ├── radix-attention.md
│   │       └── structured-generation.md
│   ├── tensorrt-llm/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── multi-gpu.md
│   │       ├── optimization.md
│   │       └── serving.md
│   └── vllm/
│       ├── SKILL.md
│       └── references/
│           ├── optimization.md
│           ├── quantization.md
│           ├── server-deployment.md
│           └── troubleshooting.md
├── 13-mlops/
│   ├── .gitkeep
│   ├── mlflow/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── deployment.md
│   │       ├── model-registry.md
│   │       └── tracking.md
│   ├── swanlab/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── integrations.md
│   │       └── visualization.md
│   ├── tensorboard/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── integrations.md
│   │       ├── profiling.md
│   │       └── visualization.md
│   └── weights-and-biases/
│       ├── SKILL.md
│       └── references/
│           ├── artifacts.md
│           ├── integrations.md
│           └── sweeps.md
├── 14-agents/
│   ├── .gitkeep
│   ├── a-evolve/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── README.md
│   │       ├── api.md
│   │       ├── architecture.md
│   │       ├── design-patterns.md
│   │       ├── examples.md
│   │       ├── issues.md
│   │       ├── releases.md
│   │       └── tutorials.md
│   ├── autogpt/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── crewai/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── flows.md
│   │       ├── tools.md
│   │       └── troubleshooting.md
│   ├── langchain/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── agents.md
│   │       ├── integration.md
│   │       └── rag.md
│   └── llamaindex/
│       ├── SKILL.md
│       └── references/
│           ├── agents.md
│           ├── data_connectors.md
│           └── query_engines.md
├── 15-rag/
│   ├── .gitkeep
│   ├── chroma/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── integration.md
│   ├── faiss/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── index_types.md
│   ├── pinecone/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── deployment.md
│   ├── qdrant/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── sentence-transformers/
│       ├── SKILL.md
│       └── references/
│           └── models.md
├── 16-prompt-engineering/
│   ├── .gitkeep
│   ├── dspy/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── examples.md
│   │       ├── modules.md
│   │       └── optimizers.md
│   ├── guidance/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── backends.md
│   │       ├── constraints.md
│   │       └── examples.md
│   ├── instructor/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── examples.md
│   │       ├── providers.md
│   │       └── validation.md
│   └── outlines/
│       ├── SKILL.md
│       └── references/
│           ├── backends.md
│           ├── examples.md
│           └── json_generation.md
├── 17-observability/
│   ├── .gitkeep
│   ├── langsmith/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── phoenix/
│       ├── SKILL.md
│       └── references/
│           ├── advanced-usage.md
│           └── troubleshooting.md
├── 18-multimodal/
│   ├── .gitkeep
│   ├── audiocraft/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── blip-2/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── clip/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── applications.md
│   ├── cosmos-policy/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── libero-commands.md
│   │       └── robocasa-commands.md
│   ├── llava/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── training.md
│   ├── openpi/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── checkpoints-and-env-map.md
│   │       ├── config-recipes.md
│   │       ├── pytorch-gotchas.md
│   │       ├── remote-client-pattern.md
│   │       └── training-debugging.md
│   ├── openvla-oft/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── aloha-workflow.md
│   │       ├── config-troubleshooting.md
│   │       ├── libero-workflow.md
│   │       └── paper-and-checkpoints.md
│   ├── segment-anything/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── stable-diffusion/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── whisper/
│       ├── SKILL.md
│       └── references/
│           └── languages.md
├── 19-emerging-techniques/
│   ├── .gitkeep
│   ├── knowledge-distillation/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── minillm.md
│   ├── long-context/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── extension_methods.md
│   │       ├── fine_tuning.md
│   │       └── rope.md
│   ├── model-merging/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── evaluation.md
│   │       ├── examples.md
│   │       └── methods.md
│   ├── model-pruning/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── wanda.md
│   ├── moe-training/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── architectures.md
│   │       ├── inference.md
│   │       └── training.md
│   └── speculative-decoding/
│       ├── SKILL.md
│       └── references/
│           ├── lookahead.md
│           └── medusa.md
├── 20-ml-paper-writing/
│   ├── academic-plotting/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── data-visualization.md
│   │       ├── diagram-generation.md
│   │       └── style-guide.md
│   ├── ml-paper-writing/
│   │   ├── SKILL.md
│   │   ├── references/
│   │   │   ├── checklists.md
│   │   │   ├── citation-workflow.md
│   │   │   ├── reviewer-guidelines.md
│   │   │   ├── sources.md
│   │   │   └── writing-guide.md
│   │   └── templates/
│   │       ├── README.md
│   │       ├── aaai2026/
│   │       │   ├── README.md
│   │       │   ├── aaai2026-unified-supp.tex
│   │       │   ├── aaai2026-unified-template.tex
│   │       │   ├── aaai2026.bib
│   │       │   ├── aaai2026.bst
│   │       │   └── aaai2026.sty
│   │       ├── acl/
│   │       │   ├── README.md
│   │       │   ├── acl.sty
│   │       │   ├── acl_latex.tex
│   │       │   ├── acl_lualatex.tex
│   │       │   ├── acl_natbib.bst
│   │       │   ├── anthology.bib.txt
│   │       │   ├── custom.bib
│   │       │   └── formatting.md
│   │       ├── colm2025/
│   │       │   ├── README.md
│   │       │   ├── colm2025_conference.bib
│   │       │   ├── colm2025_conference.bst
│   │       │   ├── colm2025_conference.sty
│   │       │   ├── colm2025_conference.tex
│   │       │   ├── fancyhdr.sty
│   │       │   ├── math_commands.tex
│   │       │   └── natbib.sty
│   │       ├── iclr2026/
│   │       │   ├── fancyhdr.sty
│   │       │   ├── iclr2026_conference.bib
│   │       │   ├── iclr2026_conference.bst
│   │       │   ├── iclr2026_conference.sty
│   │       │   ├── iclr2026_conference.tex
│   │       │   ├── math_commands.tex
│   │       │   └── natbib.sty
│   │       ├── icml2026/
│   │       │   ├── algorithm.sty
│   │       │   ├── algorithmic.sty
│   │       │   ├── example_paper.bib
│   │       │   ├── example_paper.tex
│   │       │   ├── fancyhdr.sty
│   │       │   ├── icml2026.bst
│   │       │   └── icml2026.sty
│   │       └── neurips2025/
│   │           ├── Makefile
│   │           ├── extra_pkgs.tex
│   │           ├── main.tex
│   │           └── neurips.sty
│   ├── presenting-conference-talks/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── slide-templates.md
│   └── systems-paper-writing/
│       ├── SKILL.md
│       ├── references/
│       │   ├── checklist.md
│       │   ├── reviewer-guidelines.md
│       │   ├── section-blueprints.md
│       │   ├── systems-conferences.md
│       │   └── writing-patterns.md
│       └── templates/
│           ├── asplos2027/
│           │   ├── main.tex
│           │   └── references.bib
│           ├── nsdi2027/
│           │   ├── main.tex
│           │   ├── references.bib
│           │   └── usenix-2020-09.sty
│           ├── osdi2026/
│           │   ├── main.tex
│           │   ├── references.bib
│           │   └── usenix-2020-09.sty
│           └── sosp2026/
│               ├── main.tex
│               └── references.bib
├── 21-research-ideation/
│   ├── brainstorming-research-ideas/
│   │   └── SKILL.md
│   └── creative-thinking-for-research/
│       └── SKILL.md
├── 22-agent-native-research-artifact/
│   ├── compiler/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── ara-schema.md
│   │       ├── exploration-tree-spec.md
│   │       └── validation-checklist.md
│   ├── research-manager/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── event-taxonomy.md
│   │       ├── provenance-tags.md
│   │       └── session-protocol.md
│   └── rigor-reviewer/
│       ├── SKILL.md
│       └── references/
│           └── review-dimensions.md
├── CITATION.cff
├── CLAUDE.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── WELCOME.md
├── anthropic_official_docs/
│   ├── best_practices.md
│   └── skills_overview.md
├── demos/
│   ├── README.md
│   ├── autoresearch-norm-heterogeneity/
│   │   └── README.md
│   ├── autoresearch-rl-brain-scan/
│   │   └── README.md
│   └── scientific-plotting-demo/
│       ├── README.md
│       └── figures/
│           ├── gen_fig_andes_architecture_gemini.py
│           ├── gen_fig_andes_workflow.py
│           └── gen_fig_experiment_results.py
├── dev_data/
│   ├── GITHUB_SKILLS_SYNC_SETUP.md
│   ├── PROJECT_ANALYSIS.md
│   ├── RESEARCH_QUESTIONNAIRE.md
│   ├── RESEARCH_QUESTIONNAIRE_PART1.md
│   ├── RESEARCH_QUESTIONNAIRE_PART2.md
│   ├── RESEARCH_QUESTIONNAIRE_PART3.md
│   ├── SCRAPING_STATUS.md
│   ├── SKILL_BUILD_PLAN.md
│   ├── SKILL_STRUCTURE_VERIFICATION.md
│   └── deep_research_report_1.md
├── docs/
│   ├── ROADMAP.md
│   ├── SKILL_CREATION_GUIDE.md
│   ├── SKILL_TEMPLATE.md
│   ├── npm-package-plan.md
│   ├── npm-package-ux-mockup.html
│   └── writing-assets/
│       ├── ML_paper_guide.md
│       └── ml_paper_writing_sources.md
├── package.json
├── packages/
│   └── ai-research-skills/
│       ├── .gitignore
│       ├── README.md
│       ├── bin/
│       │   └── cli.js
│       ├── package.json
│       └── src/
│           ├── agents.js
│           ├── ascii.js
│           ├── index.js
│           ├── installer.js
│           └── prompts.js
└── video-promo/
    └── ai-research-skills-promo/
        ├── .gitignore
        ├── package.json
        ├── remotion.config.ts
        ├── src/
        │   ├── AIResearchSkillsPromo.tsx
        │   ├── Root.tsx
        │   ├── components/
        │   │   ├── AgentDetection.tsx
        │   │   ├── CallToAction.tsx
        │   │   ├── CategorySelection.tsx
        │   │   ├── InstallProgress.tsx
        │   │   ├── OrchestraLogo.tsx
        │   │   ├── StatsDisplay.tsx
        │   │   ├── SuccessScreen.tsx
        │   │   └── Terminal.tsx
        │   └── index.ts
        └── tsconfig.json

================================================
FILE CONTENTS
================================================

================================================
FILE: .claude-plugin/marketplace.json
================================================
{
  "name": "ai-research-skills",
  "owner": {
    "name": "Orchestra Research",
    "email": "zechen@orchestra-research.com"
  },
  "metadata": {
    "description": "Comprehensive library of 98 AI research engineering skills enabling autonomous AI research from hypothesis to experimental verification",
    "version": "1.2.0"
  },
  "plugins": [
    {
      "name": "model-architecture",
      "description": "LLM architectures and implementations including LitGPT, Mamba, NanoGPT, RWKV, and TorchTitan. Use when implementing, training, or understanding transformer and alternative architectures.",
      "source": "./",
      "strict": false,
      "skills": [
        "./01-model-architecture/litgpt",
        "./01-model-architecture/mamba",
        "./01-model-architecture/nanogpt",
        "./01-model-architecture/rwkv",
        "./01-model-architecture/torchtitan"
      ]
    },
    {
      "name": "tokenization",
      "description": "Text tokenization for LLMs including HuggingFace Tokenizers and SentencePiece. Use when training custom tokenizers or handling multilingual text.",
      "source": "./",
      "strict": false,
      "skills": [
        "./02-tokenization/huggingface-tokenizers",
        "./02-tokenization/sentencepiece"
      ]
    },
    {
      "name": "fine-tuning",
      "description": "LLM fine-tuning frameworks including Axolotl, LLaMA-Factory, PEFT, and Unsloth. Use when fine-tuning models with LoRA, QLoRA, or full fine-tuning.",
      "source": "./",
      "strict": false,
      "skills": [
        "./03-fine-tuning/axolotl",
        "./03-fine-tuning/llama-factory",
        "./03-fine-tuning/peft",
        "./03-fine-tuning/unsloth"
      ]
    },
    {
      "name": "mechanistic-interpretability",
      "description": "Neural network interpretability tools including TransformerLens, SAELens, NNSight, and pyvene. Use when analyzing model internals, finding circuits, or understanding how models compute.",
      "source": "./",
      "strict": false,
      "skills": [
        "./04-mechanistic-interpretability/nnsight",
        "./04-mechanistic-interpretability/pyvene",
        "./04-mechanistic-interpretability/saelens",
        "./04-mechanistic-interpretability/transformer-lens"
      ]
    },
    {
      "name": "data-processing",
      "description": "Data curation and processing at scale including NeMo Curator and Ray Data. Use when preparing training datasets or processing large-scale data.",
      "source": "./",
      "strict": false,
      "skills": [
        "./05-data-processing/nemo-curator",
        "./05-data-processing/ray-data"
      ]
    },
    {
      "name": "post-training",
      "description": "RLHF and preference alignment including TRL, GRPO, OpenRLHF, SimPO, verl, slime, miles, and torchforge. Use when aligning models with human preferences, training reward models, or large-scale RL training.",
      "source": "./",
      "strict": false,
      "skills": [
        "./06-post-training/grpo-rl-training",
        "./06-post-training/miles",
        "./06-post-training/openrlhf",
        "./06-post-training/simpo",
        "./06-post-training/slime",
        "./06-post-training/torchforge",
        "./06-post-training/trl-fine-tuning",
        "./06-post-training/verl"
      ]
    },
    {
      "name": "safety-alignment",
      "description": "AI safety and content moderation including Constitutional AI, LlamaGuard, NeMo Guardrails, and Prompt Guard. Use when implementing safety filters, content moderation, or prompt injection detection.",
      "source": "./",
      "strict": false,
      "skills": [
        "./07-safety-alignment/constitutional-ai",
        "./07-safety-alignment/llamaguard",
        "./07-safety-alignment/nemo-guardrails",
        "./07-safety-alignment/prompt-guard"
      ]
    },
    {
      "name": "distributed-training",
      "description": "Multi-GPU and multi-node training including DeepSpeed, PyTorch FSDP, Accelerate, Megatron-Core, PyTorch Lightning, and Ray Train. Use when training large models across GPUs.",
      "source": "./",
      "strict": false,
      "skills": [
        "./08-distributed-training/accelerate",
        "./08-distributed-training/deepspeed",
        "./08-distributed-training/megatron-core",
        "./08-distributed-training/pytorch-fsdp2",
        "./08-distributed-training/pytorch-lightning",
        "./08-distributed-training/ray-train"
      ]
    },
    {
      "name": "infrastructure",
      "description": "GPU cloud and compute orchestration including Modal, Lambda Labs, and SkyPilot. Use when deploying training jobs or managing GPU resources.",
      "source": "./",
      "strict": false,
      "skills": [
        "./09-infrastructure/lambda-labs",
        "./09-infrastructure/modal",
        "./09-infrastructure/skypilot"
      ]
    },
    {
      "name": "optimization",
      "description": "Model optimization and quantization including Flash Attention, bitsandbytes, GPTQ, AWQ, GGUF, and HQQ. Use when reducing memory, accelerating inference, or quantizing models.",
      "source": "./",
      "strict": false,
      "skills": [
        "./10-optimization/awq",
        "./10-optimization/bitsandbytes",
        "./10-optimization/flash-attention",
        "./10-optimization/gguf",
        "./10-optimization/gptq",
        "./10-optimization/hqq",
        "./10-optimization/ml-training-recipes"
      ]
    },
    {
      "name": "evaluation",
      "description": "LLM benchmarking and evaluation including lm-evaluation-harness, BigCode Evaluation Harness, and NeMo Evaluator. Use when benchmarking models or measuring performance.",
      "source": "./",
      "strict": false,
      "skills": [
        "./11-evaluation/bigcode-evaluation-harness",
        "./11-evaluation/lm-evaluation-harness",
        "./11-evaluation/nemo-evaluator"
      ]
    },
    {
      "name": "inference-serving",
      "description": "Production LLM inference including vLLM, TensorRT-LLM, llama.cpp, and SGLang. Use when deploying models for production inference.",
      "source": "./",
      "strict": false,
      "skills": [
        "./12-inference-serving/llama-cpp",
        "./12-inference-serving/sglang",
        "./12-inference-serving/tensorrt-llm",
        "./12-inference-serving/vllm"
      ]
    },
    {
      "name": "mlops",
      "description": "ML experiment tracking and lifecycle including Weights & Biases, MLflow, and TensorBoard. Use when tracking experiments or managing models.",
      "source": "./",
      "strict": false,
      "skills": [
        "./13-mlops/mlflow",
        "./13-mlops/tensorboard",
        "./13-mlops/weights-and-biases"
      ]
    },
    {
      "name": "agents",
      "description": "LLM agent frameworks including LangChain, LlamaIndex, CrewAI, and AutoGPT. Use when building chatbots, autonomous agents, or tool-using systems.",
      "source": "./",
      "strict": false,
      "skills": [
        "./14-agents/autogpt",
        "./14-agents/crewai",
        "./14-agents/langchain",
        "./14-agents/llamaindex"
      ]
    },
    {
      "name": "rag",
      "description": "Retrieval-Augmented Generation including Chroma, FAISS, Pinecone, Qdrant, and Sentence Transformers. Use when building semantic search or document retrieval systems.",
      "source": "./",
      "strict": false,
      "skills": [
        "./15-rag/chroma",
        "./15-rag/faiss",
        "./15-rag/pinecone",
        "./15-rag/qdrant",
        "./15-rag/sentence-transformers"
      ]
    },
    {
      "name": "prompt-engineering",
      "description": "Structured LLM outputs including DSPy, Instructor, Guidance, and Outlines. Use when extracting structured data or constraining LLM outputs.",
      "source": "./",
      "strict": false,
      "skills": [
        "./16-prompt-engineering/dspy",
        "./16-prompt-engineering/guidance",
        "./16-prompt-engineering/instructor",
        "./16-prompt-engineering/outlines"
      ]
    },
    {
      "name": "observability",
      "description": "LLM application monitoring including LangSmith and Phoenix. Use when debugging LLM apps or monitoring production systems.",
      "source": "./",
      "strict": false,
      "skills": [
        "./17-observability/langsmith",
        "./17-observability/phoenix"
      ]
    },
    {
      "name": "multimodal",
      "description": "Vision, audio, and multimodal models including CLIP, Whisper, LLaVA, BLIP-2, Segment Anything, Stable Diffusion, AudioCraft, Cosmos Policy, OpenPI, and OpenVLA-OFT. Use when working with images, audio, multimodal tasks, or vision-language-action robot policies.",
      "source": "./",
      "strict": false,
      "skills": [
        "./18-multimodal/audiocraft",
        "./18-multimodal/blip-2",
        "./18-multimodal/clip",
        "./18-multimodal/cosmos-policy",
        "./18-multimodal/llava",
        "./18-multimodal/openpi",
        "./18-multimodal/openvla-oft",
        "./18-multimodal/segment-anything",
        "./18-multimodal/stable-diffusion",
        "./18-multimodal/whisper"
      ]
    },
    {
      "name": "emerging-techniques",
      "description": "Advanced ML techniques including MoE Training, Model Merging, Long Context, Speculative Decoding, Knowledge Distillation, and Model Pruning. Use when implementing cutting-edge optimization or architecture techniques.",
      "source": "./",
      "strict": false,
      "skills": [
        "./19-emerging-techniques/knowledge-distillation",
        "./19-emerging-techniques/long-context",
        "./19-emerging-techniques/model-merging",
        "./19-emerging-techniques/model-pruning",
        "./19-emerging-techniques/moe-training",
        "./19-emerging-techniques/speculative-decoding"
      ]
    },
    {
      "name": "autoresearch",
      "description": "Autonomous research orchestration using a two-loop architecture. Manages the full research lifecycle from literature survey to paper writing, routing to domain-specific skills for execution. Use when starting a research project, running autonomous experiments, or managing multi-hypothesis research.",
      "source": "./",
      "strict": false,
      "skills": [
        "./0-autoresearch-skill"
      ]
    },
    {
      "name": "ml-paper-writing",
      "description": "Write publication-ready ML/AI/Systems papers for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, OSDI, NSDI, ASPLOS, SOSP. Includes LaTeX templates, citation verification, reviewer guidelines, publication-quality figure generation, systems paper structural blueprints, and conference presentation slides.",
      "source": "./",
      "strict": false,
      "skills": [
        "./20-ml-paper-writing/ml-paper-writing",
        "./20-ml-paper-writing/academic-plotting",
        "./20-ml-paper-writing/systems-paper-writing",
        "./20-ml-paper-writing/presenting-conference-talks"
      ]
    },
    {
      "name": "ideation",
      "description": "Research ideation frameworks including structured brainstorming and creative thinking. Use when exploring new research directions, generating novel ideas, or seeking fresh angles on existing work.",
      "source": "./",
      "strict": false,
      "skills": [
        "./21-research-ideation/brainstorming-research-ideas",
        "./21-research-ideation/creative-thinking-for-research"
      ]
    },
    {
      "name": "agent-native-research-artifact",
      "description": "Agent-Native Research Artifact (ARA) tooling: compile any research input (paper, repo, notes) into a structured artifact, record session provenance as a post-task epilogue, and run Seal Level 2 epistemic review. Use when ingesting research into a falsifiable, agent-traversable artifact, capturing how a research project actually evolved, or auditing an ARA for evidence-claim alignment.",
      "source": "./",
      "strict": false,
      "skills": [
        "./22-agent-native-research-artifact/compiler",
        "./22-agent-native-research-artifact/research-manager",
        "./22-agent-native-research-artifact/rigor-reviewer"
      ]
    }
  ]
}


================================================
FILE: .github/workflows/claude.yml
================================================
name: Claude Code
on:
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]
  issues:
    types: [opened, assigned]

permissions:
  contents: write
  pull-requests: write
  issues: write

jobs:
  claude:
    if: |
      (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude') && contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association)) ||
      (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude') && contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.comment.author_association)) ||
      (github.event_name == 'issues' && contains(github.event.issue.body, '@claude') && contains(fromJSON('["OWNER", "MEMBER", "COLLABORATOR"]'), github.event.issue.author_association))
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: anthropics/claude-code-action@v1
        with:
          claude_code_oauth_token: ${{ secrets.CLAUDE_CODE_OAUTH_TOKEN }}
          github_token: ${{ secrets.GITHUB_TOKEN }}


================================================
FILE: .github/workflows/publish-npm.yml
================================================
name: Publish to npm

on:
  push:
    branches: [main]
    paths:
      - 'packages/ai-research-skills/**'

permissions:
  id-token: write
  contents: read

jobs:
  publish:
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: packages/ai-research-skills

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 2

      - name: Check if version changed
        id: version
        run: |
          CURRENT=$(node -p "require('./package.json').version")
          PREVIOUS=$(git show HEAD~1:packages/ai-research-skills/package.json 2>/dev/null | node -p "JSON.parse(require('fs').readFileSync('/dev/stdin','utf8')).version" 2>/dev/null || echo "")
          echo "current=$CURRENT"
          echo "previous=$PREVIOUS"
          if [ "$CURRENT" != "$PREVIOUS" ]; then
            echo "changed=true" >> $GITHUB_OUTPUT
            echo "version=$CURRENT" >> $GITHUB_OUTPUT
          else
            echo "changed=false" >> $GITHUB_OUTPUT
          fi

      - name: Check if version already published
        if: steps.version.outputs.changed == 'true'
        id: published
        run: |
          VERSION=${{ steps.version.outputs.version }}
          if npm view @orchestra-research/ai-research-skills@$VERSION version 2>/dev/null; then
            echo "already_published=true" >> $GITHUB_OUTPUT
            echo "Version $VERSION already on npm, skipping"
          else
            echo "already_published=false" >> $GITHUB_OUTPUT
          fi

      - name: Setup Node.js
        if: steps.version.outputs.changed == 'true' && steps.published.outputs.already_published == 'false'
        uses: actions/setup-node@v4
        with:
          node-version: '24'
          registry-url: 'https://registry.npmjs.org'

      - name: Install dependencies
        if: steps.version.outputs.changed == 'true' && steps.published.outputs.already_published == 'false'
        run: npm ci

      - name: Publish to npm
        if: steps.version.outputs.changed == 'true' && steps.published.outputs.already_published == 'false'
        run: |
          echo "Publishing v${{ steps.version.outputs.version }} to npm..."
          unset NODE_AUTH_TOKEN
          npm config delete //registry.npmjs.org/:_authToken || true
          npm publish --access public --provenance

      - name: Skip reason
        if: steps.version.outputs.changed != 'true'
        run: echo "Version unchanged, skipping publish"


================================================
FILE: .github/workflows/sync-skills.yml
================================================
name: Sync Skills to Orchestra

on:
  push:
    branches:
      - main
  workflow_dispatch: # Allow manual trigger

jobs:
  sync-skills:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 2 # Fetch last 2 commits to detect changes

      - name: Detect changed skill folders
        id: changes
        run: |
          # Get list of changed files in last commit
          CHANGED_FILES=$(git diff --name-only HEAD^..HEAD)

          echo "Changed files:"
          echo "$CHANGED_FILES"

          # Find skill directories - supports two patterns:
          # Pattern 1: XX-category/skill-name/SKILL.md (nested skills)
          # Pattern 2: XX-category/SKILL.md (standalone skills like 20-ml-paper-writing)

          SKILL_DIRS=""

          # Pattern 1: Nested skills (XX-category/skill-name/)
          NESTED=$(echo "$CHANGED_FILES" | grep -E '^[0-9]{2}-[^/]+/[^/]+/' | sed -E 's|^([0-9]{2}-[^/]+/[^/]+)/.*|\1|' | sort -u)
          if [ -n "$NESTED" ]; then
            SKILL_DIRS="$NESTED"
          fi

          # Pattern 2: Standalone skills (XX-category/ with SKILL.md directly inside)
          STANDALONE=$(echo "$CHANGED_FILES" | grep -E '^[0-9]{2}-[^/]+/SKILL\.md$' | sed -E 's|^([0-9]{2}-[^/]+)/SKILL\.md$|\1|' | sort -u)
          if [ -n "$STANDALONE" ]; then
            if [ -n "$SKILL_DIRS" ]; then
              SKILL_DIRS=$(printf "%s\n%s" "$SKILL_DIRS" "$STANDALONE" | sort -u)
            else
              SKILL_DIRS="$STANDALONE"
            fi
          fi

          echo "Changed skill directories:"
          echo "$SKILL_DIRS"

          # Convert to JSON array for matrix
          if [ -z "$SKILL_DIRS" ]; then
            SKILLS_JSON="[]"
            SKILL_COUNT=0
          else
            SKILLS_JSON=$(echo "$SKILL_DIRS" | jq -R -s -c 'split("\n") | map(select(length > 0))')
            SKILL_COUNT=$(echo "$SKILL_DIRS" | grep -c . || echo "0")
          fi

          echo "skills=$SKILLS_JSON" >> $GITHUB_OUTPUT
          echo "count=$SKILL_COUNT" >> $GITHUB_OUTPUT

      - name: Process and sync skills
        if: steps.changes.outputs.count > 0
        env:
          ORCHESTRA_API_URL: ${{ secrets.ORCHESTRA_API_URL }}
          ORCHESTRA_SYNC_API_KEY: ${{ secrets.ORCHESTRA_SYNC_API_KEY }}
        run: |
          SKILLS='${{ steps.changes.outputs.skills }}'

          echo "Processing $(echo $SKILLS | jq 'length') skill(s)..."

          # Install jq and zip for JSON parsing and packaging
          sudo apt-get update && sudo apt-get install -y jq zip

          # Loop through each skill directory
          echo "$SKILLS" | jq -r '.[]' | while read SKILL_PATH; do
            echo "==================================================="
            echo "Processing: $SKILL_PATH"
            echo "==================================================="

            # Check if SKILL.md exists
            if [ ! -f "$SKILL_PATH/SKILL.md" ]; then
              echo "⚠️  WARNING: No SKILL.md found in $SKILL_PATH, skipping"
              continue
            fi

            # Extract skill name from SKILL.md frontmatter
            SKILL_NAME=$(grep -A 20 "^---$" "$SKILL_PATH/SKILL.md" | grep "^name:" | head -1 | sed 's/name: *//;s/"//g;s/'\''//g' | tr -d '\r')

            # Extract author from SKILL.md frontmatter
            AUTHOR=$(grep -A 20 "^---$" "$SKILL_PATH/SKILL.md" | grep "^author:" | head -1 | sed 's/author: *//;s/"//g;s/'\''//g' | tr -d '\r')

            # Default values
            if [ -z "$SKILL_NAME" ]; then
              # Extract from directory name as fallback
              SKILL_NAME=$(basename "$SKILL_PATH")
              echo "⚠️  No 'name' in frontmatter, using directory name: $SKILL_NAME"
            fi

            if [ -z "$AUTHOR" ]; then
              AUTHOR="Orchestra Research"
              echo "⚠️  No 'author' in frontmatter, defaulting to: $AUTHOR"
            fi

            echo "Skill Name: $SKILL_NAME"
            echo "Author: $AUTHOR"
            echo "Path: $SKILL_PATH"

            # Create temporary directory for zipping
            TEMP_DIR=$(mktemp -d)
            SKILL_DIR="$TEMP_DIR/$SKILL_NAME"
            mkdir -p "$SKILL_DIR"

            # Copy all contents of skill directory (SKILL.md, references/, scripts/, assets/, etc.)
            cp -r "$SKILL_PATH"/* "$SKILL_DIR/" 2>/dev/null || true

            # Create zip file (exclude hidden files and .gitkeep)
            ZIP_FILE="$TEMP_DIR/${SKILL_NAME}.zip"
            cd "$TEMP_DIR"
            zip -r "$ZIP_FILE" "$SKILL_NAME" -x "*/.*" "*/.gitkeep" "*.DS_Store"
            cd -

            # Verify zip was created
            if [ ! -f "$ZIP_FILE" ]; then
              echo "❌ ERROR: Failed to create zip file for $SKILL_NAME"
              continue
            fi

            echo "✓ Created zip: $(ls -lh "$ZIP_FILE" | awk '{print $5}')"

            # Write SKILL.md content to temp file (avoid argument length limits)
            SKILL_MD_FILE="$TEMP_DIR/skill.md"
            cat "$SKILL_PATH/SKILL.md" > "$SKILL_MD_FILE"

            # Encode zip to base64 and write to temp file (avoid argument length limits)
            ZIP_BASE64_FILE="$TEMP_DIR/base64.txt"
            base64 -w 0 "$ZIP_FILE" > "$ZIP_BASE64_FILE" 2>/dev/null || base64 "$ZIP_FILE" > "$ZIP_BASE64_FILE"

            # Prepare JSON payload (use --rawfile for large content)
            JSON_PAYLOAD=$(jq -n \
              --arg skillName "$SKILL_NAME" \
              --arg skillPath "$SKILL_PATH" \
              --arg author "$AUTHOR" \
              --rawfile skillMdContent "$SKILL_MD_FILE" \
              --rawfile zipBase64 "$ZIP_BASE64_FILE" \
              '{
                skillName: $skillName,
                skillPath: $skillPath,
                author: $author,
                skillMdContent: $skillMdContent,
                zipBase64: $zipBase64
              }')

            # Send to Orchestra API (write JSON to file to avoid argument length limits)
            echo "📤 Uploading to Orchestra..."
            JSON_FILE="$TEMP_DIR/payload.json"
            echo "$JSON_PAYLOAD" > "$JSON_FILE"

            RESPONSE=$(curl -s -w "\n%{http_code}" -L \
              -X POST \
              -H "Content-Type: application/json" \
              -H "X-Admin-API-Key: $ORCHESTRA_SYNC_API_KEY" \
              -d @"$JSON_FILE" \
              "$ORCHESTRA_API_URL/api/admin/sync-github-skill")

            HTTP_CODE=$(echo "$RESPONSE" | tail -n1)
            BODY=$(echo "$RESPONSE" | sed '$d')

            echo "HTTP Status: $HTTP_CODE"
            echo "Response: $BODY"

            if [ "$HTTP_CODE" = "200" ]; then
              ACTION=$(echo "$BODY" | jq -r '.action // "synced"')
              SOURCE=$(echo "$BODY" | jq -r '.source // "unknown"')
              echo "✅ SUCCESS: Skill $SKILL_NAME $ACTION (source: $SOURCE)"
            else
              ERROR_MSG=$(echo "$BODY" | jq -r '.error // "Unknown error"')
              echo "❌ FAILED: $ERROR_MSG"
              exit 1
            fi

            # Cleanup
            rm -rf "$TEMP_DIR"

            echo ""
          done

          echo "==================================================="
          echo "✅ Sync completed successfully!"
          echo "==================================================="

      - name: No changes detected
        if: steps.changes.outputs.count == 0
        run: |
          echo "ℹ️  No skill changes detected in this commit"
          echo "Only commits that modify skill directories will trigger sync"


================================================
FILE: .gitignore
================================================
# Python
__pycache__/
*.py[cod]
*$py.class
*.so

# LaTeX auxiliary files
*.aux
*.bbl
*.blg
*.out
*.fls
*.fdb_latexmk
*.synctex.gz
*.toc
*.lof
*.lot
*.nav
*.snm
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.manifest
*.spec
pip-log.txt
pip-delete-this-directory.txt

# Virtual environments
venv/
ENV/
env/
.venv

# IDEs
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# Pytest
.pytest_cache/
.coverage
htmlcov/

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# ML/Data
*.h5
*.pkl
*.pth
*.ckpt
*.safetensors
wandb/
runs/
outputs/
checkpoints/
*.log

# Environment variables
.env
.env.local

# Temporary files
tmp/
temp/
*.tmp

# Skill Seeker metadata and build artifacts
.metadata/
*_data/
!dev_data/
*_github_data.json
*_extracted.json
output/
*.zip
0-autoresearch-skill/background_docs/
0-autoresearch-skill/twitter_thread_draft.md
0-autoresearch-skill/social_posts.md
0-autoresearch-skill/image_generation_brief.md
0-autoresearch-skill/autoresearch-loops-image.png


================================================
FILE: 0-autoresearch-skill/SKILL.md
================================================
---
name: autoresearch
description: Orchestrates end-to-end autonomous AI research projects using a two-loop architecture. The inner loop runs rapid experiment iterations with clear optimization targets. The outer loop synthesizes results, identifies patterns, and steers research direction. Routes to domain-specific skills for execution, supports continuous agent operation via Claude Code /loop and OpenClaw heartbeat, and produces research presentations and papers. Use when starting a research project, running autonomous experiments, or managing a multi-hypothesis research effort.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Autonomous Research, Two-Loop Architecture, Experiment Orchestration, Research Synthesis, Project Management]
---

# Autoresearch

Autonomous research orchestration for AI coding agents. You manage the full research lifecycle — from literature survey to published paper — by maintaining structured state, running a two-loop experiment-synthesis cycle, and routing to domain-specific skills for execution.

You are a research project manager, not a domain expert. You orchestrate; the domain skills execute.

**This runs fully autonomously.** Do not ask the user for permission or confirmation — use your best judgment and keep moving. Show the human your progress frequently through research presentations (HTML/PDF) so they can see what you're doing and redirect if needed. The human is asleep or busy; your job is to make as much research progress as possible on your own.

## Getting Started

Users arrive in different states. Determine which and proceed:

| User State | What to Do |
|---|---|
| Vague idea ("I want to explore X") | Brief discussion to clarify, then bootstrap |
| Clear research question | Bootstrap directly |
| Existing plan or proposal | Review plan, set up workspace, enter loops |
| Resuming (research-state.yaml exists) | Read state, continue from where you left off |

If things are clear, don't over-discuss — proceed to full autoresearch. Most users want you to just start researching.

**Step 0 — before anything else**: Set up the agent continuity loop. See [Agent Continuity](#agent-continuity-mandatory--set-up-first). This is MANDATORY. Without it, the research stops after one cycle.

### Initialize Workspace

Create this structure at the project root:

```
{project}/
├── research-state.yaml       # Central state tracking
├── research-log.md           # Decision timeline
├── findings.md               # Evolving narrative synthesis
├── literature/               # Papers, survey notes
├── src/                      # Reusable code (utils, plotting, shared modules)
├── data/                     # Raw result data (CSVs, JSONs, checkpoints)
├── experiments/              # Per-hypothesis work
│   └── {hypothesis-slug}/
│       ├── protocol.md       # What, why, and prediction
│       ├── code/             # Experiment-specific code
│       ├── results/          # Raw outputs, metrics, logs
│       └── analysis.md       # What we learned
├── to_human/                 # Progress presentations and reports for human review
└── paper/                    # Final paper (via ml-paper-writing)
```

- **`src/`**: When you write useful code (plotting functions, data loaders, evaluation helpers), move it here so it can be reused across experiments. Don't duplicate code in every experiment directory.
- **`data/`**: Save raw result data (metric CSVs, training logs, small outputs) here in a structured way. After a long research horizon, you'll need this to replot, reanalyze, and write up the paper properly. Name files descriptively (e.g., `trajectory_H1_runs001-010.csv`). Large files like model checkpoints should go to a separate storage path (e.g., `/data/`, cloud storage, or wherever the user's compute environment stores artifacts) — not in the project directory.

Initialize `research-state.yaml`, `research-log.md`, and `findings.md` from [templates/](templates/). Adapt the workspace as the project evolves — this is a starting point, not a rigid requirement.
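A minimal bootstrap sketch for creating this layout, assuming plain `pathlib`; the seeded file contents here are placeholders, and the real starting contents should come from [templates/](templates/):

```python
# Minimal workspace bootstrap (sketch). Directory names mirror the layout above;
# the seed text is a placeholder, so copy the real starting files from templates/.
from pathlib import Path

def init_workspace(root: str = ".") -> None:
    root_path = Path(root)
    # Core directories from the layout above
    for d in ["literature", "src", "data", "experiments", "to_human", "paper"]:
        (root_path / d).mkdir(parents=True, exist_ok=True)
    # Central state files; seed them only if they do not already exist
    seeds = {
        "research-state.yaml": "# central state tracking (see templates/research-state.yaml)\n",
        "research-log.md": "# Research Log\n",
        "findings.md": "# Findings\n",
    }
    for name, text in seeds.items():
        path = root_path / name
        if not path.exists():
            path.write_text(text)

if __name__ == "__main__":
    init_workspace()
```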

## The Two-Loop Architecture

This is the core engine. Everything else supports it.

```
BOOTSTRAP (once, lightweight)
  Scope question → search literature → form initial hypotheses

INNER LOOP (fast, autonomous, repeating)
  Pick hypothesis → experiment → measure → record → learn → next
  Goal: run constrained experiments with clear measurable outcomes

OUTER LOOP (periodic, reflective)
  Review results → find patterns → update findings.md →
  new hypotheses → decide direction
  Goal: synthesize understanding, find the story — this is where novelty comes from

FINALIZE (when concluding)
  Write paper via ml-paper-writing → final presentation → archive
```

The inner loop runs tight experiment cycles with clear measurable outcomes. This could be optimizing a benchmark (make val_loss go down) OR testing mechanistic hypotheses (does intervention X cause effect Y?). The outer loop steps back to ask: what do these results *mean*? What patterns emerge? What's the story? Research is open-ended — the two loops let you both optimize and discover.

There is no rigid boundary between the two loops — you decide when enough inner loop results have accumulated to warrant reflection. Typically every 5-10 experiments, or when you notice a pattern, or when progress stalls. The agent's judgment drives the rhythm.

### Research is Non-Linear

The two-loop structure is a rhythm, not a railroad. At any point during research you can and should:

- **Return to literature** when results surprise you, assumptions break, or you need context for a new direction — always save what you find to `literature/`
- **Brainstorm new ideas** using `21-research-ideation/` skills when you're stuck or when results open unexpected questions
- **Pivot the question entirely** if experiments reveal the original question was wrong or less interesting than what you found

This is normal. Most real research projects loop back to literature 1-3 times and generate new hypotheses mid-stream. Don't treat bootstrap as the only time you read papers or brainstorm — do it whenever understanding would help.

## Bootstrap: Literature and Hypotheses

Before entering the loops, understand the landscape. Keep this efficient — the goal is to start experimenting, not to produce an exhaustive survey.

1. **Search literature** for the research question. Use multiple sources — never stop at one:
   - **Exa MCP** (`web_search_exa`) if available — best for broad discovery and finding relevant papers quickly
   - **Semantic Scholar** (`pip install semanticscholar`) — best for ML/AI papers, citation graphs, and specific paper lookup. See `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code examples
   - **arXiv** (`pip install arxiv`) — best for recent preprints and open-access papers
   - **CrossRef** — best for DOI lookup and BibTeX retrieval
   - Keep searching until you have good coverage. If one source comes up empty, try another with different keywords

   **Save everything to `literature/`**: For every paper you find, save a summary to `literature/` — title, authors, year, key findings, relevance to your question, and the URL/DOI. Create one file per paper and a running `literature/survey.md` with all summaries. This is your reference library — you and future sessions will need it throughout the project. A minimal search-and-save sketch follows this list.

2. **Identify gaps** from the literature
   - What's been tried? What hasn't? Where do existing methods break?
   - What do Discussion sections flag as future work?

3. **Form initial hypotheses** — invoke `21-research-ideation/` skills
   - `brainstorming-research-ideas` for structured diverge-converge workflow
   - `creative-thinking-for-research` for deeper cognitive frameworks
   - Each hypothesis must be testable with a clear prediction

4. **Define the evaluation**
   - Set the proxy metric and baseline before running experiments
   - The metric should be computable quickly (minutes, not hours)
   - Lock evaluation criteria upfront to prevent unconscious metric gaming

5. **Record** in research-state.yaml, log the bootstrap in research-log.md
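
A minimal sketch of the literature-search step (step 1 above), assuming the `arxiv` package mentioned there; the query, file naming, and summary fields are illustrative, and each paper still needs a manual relevance note:

```python
# Sketch: search arXiv for a research question and save one summary file per
# paper plus a running survey file under literature/. Query and field choices
# are illustrative; fill in the Relevance line after reading each paper.
from pathlib import Path

import arxiv  # pip install arxiv

def survey(query: str, out_dir: str = "literature", max_results: int = 10) -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    survey_md = out / "survey.md"
    client = arxiv.Client()
    search = arxiv.Search(query=query, max_results=max_results,
                          sort_by=arxiv.SortCriterion.Relevance)
    for paper in client.results(search):
        slug = paper.entry_id.rsplit("/", 1)[-1]        # e.g. 2401.01234v1
        authors = ", ".join(a.name for a in paper.authors)
        entry = (
            f"# {paper.title}\n\n"
            f"- Authors: {authors}\n"
            f"- Year: {paper.published.year}\n"
            f"- URL: {paper.entry_id}\n\n"
            f"Abstract: {paper.summary}\n\n"
            f"Relevance: TODO\n"
        )
        (out / f"{slug}.md").write_text(entry)          # one file per paper
        with survey_md.open("a") as f:                  # running survey
            f.write(entry + "\n---\n")

survey("sparse autoencoder interpretability")
```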

## The Inner Loop

Rapid iteration with clear measurable outcomes. Two flavors:

- **Optimization**: make a metric go up/down (val_loss, accuracy, throughput). Think Karpathy's autoresearch.
- **Discovery**: test mechanistic hypotheses about why something works. The metric is a measurement (does grokking happen faster? does entropy increase before forgetting?), not just a target to optimize.

```
1.  Pick the highest-priority untested hypothesis
2.  Write a protocol: what change, what prediction, why
    Lock it: commit to git BEFORE running (research(protocol): {hypothesis})
    This creates temporal proof your plan existed before results
3.  Run the experiment (invoke the relevant domain skill)
4.  Sanity check before trusting results (see the sketch after this block):
    - Did training converge? No NaN/Inf?
    - Does baseline reproduce expected performance?
    - Data loading correct? (spot-check a few samples)
5.  Measure the proxy metric
6.  Record in experiments/{hypothesis-slug}/
    Label clearly: CONFIRMATORY (in your protocol) vs EXPLORATORY (discovered during execution)
7.  If positive: keep, note WHY it worked
8.  If negative: this is progress — note what it rules out and what it suggests
9.  Update research-state.yaml
10. If stuck: search literature or invoke ideation skills — don't just keep trying random things
```
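
A minimal sketch of the sanity check in step 4, assuming the run wrote a `metrics.jsonl` with a `loss` field on each line; field names, paths, and the baseline tolerance are illustrative:

```python
# Sketch: sanity-check a run before trusting its results. Assumes each line of
# metrics.jsonl is a JSON object with a "loss" field; names and the tolerance
# are illustrative, not a fixed convention of this skill.
import json
import math
from pathlib import Path

def sanity_check(metrics_path: str, expected_baseline: float, tol: float = 0.02) -> bool:
    lines = Path(metrics_path).read_text().splitlines()
    losses = [json.loads(line)["loss"] for line in lines if line.strip()]
    if not losses:
        print("FAIL: no metrics logged")
        return False
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        print("FAIL: NaN/Inf in loss curve")
        return False
    if losses[-1] > losses[0]:
        print("WARN: final loss above initial loss, check convergence")
    if abs(losses[-1] - expected_baseline) > tol * expected_baseline:
        print(f"WARN: final loss {losses[-1]:.4f} far from expected baseline {expected_baseline:.4f}")
    return True
```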

**Never stop.** Even if something fails, find a path forward. Debug, adjust, simplify, or pivot — but keep the research moving. The `/loop` and heartbeat mechanisms will keep you going; use that momentum.

### Route to Domain Skills

When you need domain-specific execution, search the skills library:

| Research Activity | Look In |
|---|---|
| Data preparation | `05-data-processing/` |
| Model training / fine-tuning | `01-model-architecture/`, `03-fine-tuning/`, `06-post-training/` |
| Distributed training | `08-distributed-training/` |
| Optimization (quantization, attention) | `10-optimization/` |
| Evaluation / benchmarks | `11-evaluation/` |
| Inference / serving | `12-inference-serving/` |
| Interpretability analysis | `04-mechanistic-interpretability/` |
| Experiment tracking (W&B, MLflow) | `13-mlops/` |
| Cloud compute | `09-infrastructure/` |

Read the relevant SKILL.md before starting — it has workflows, common issues, and code examples. See [references/skill-routing.md](references/skill-routing.md) for a complete guide.

### Track the Experiment Trajectory

Maintain a running record of measurable outcomes across experiments:

```json
{
  "experiment_id": "run_014",
  "hypothesis": "H3",
  "metric_value": 0.847,
  "baseline": 0.812,
  "delta": "+0.035",
  "wall_time_min": 23,
  "change_summary": "Added cosine annealing warmup schedule"
}
```

This trajectory produces the optimization plot (like Karpathy's progress chart) — include it in progress reports. Humans love seeing the upward curve.
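
A minimal sketch for persisting this trajectory as a JSONL file in `data/` and rendering the progress plot; the file paths and matplotlib styling are illustrative:

```python
# Sketch: append one record per experiment to data/trajectory.jsonl and plot
# the metric across experiments. Field names mirror the JSON example above;
# paths and styling are illustrative.
import json
from pathlib import Path

import matplotlib.pyplot as plt

TRAJECTORY = Path("data/trajectory.jsonl")

def record(entry: dict) -> None:
    TRAJECTORY.parent.mkdir(exist_ok=True)
    with TRAJECTORY.open("a") as f:
        f.write(json.dumps(entry) + "\n")

def plot_progress(out_path: str = "to_human/progress.png") -> None:
    rows = [json.loads(line) for line in TRAJECTORY.read_text().splitlines()]
    xs = range(1, len(rows) + 1)
    ys = [r["metric_value"] for r in rows]
    plt.plot(xs, ys, marker="o")
    plt.axhline(rows[0]["baseline"], linestyle="--", label="baseline")
    plt.xlabel("experiment")
    plt.ylabel("metric")
    plt.legend()
    Path(out_path).parent.mkdir(exist_ok=True)
    plt.savefig(out_path, dpi=150)

record({"experiment_id": "run_014", "hypothesis": "H3", "metric_value": 0.847,
        "baseline": 0.812, "delta": "+0.035", "wall_time_min": 23,
        "change_summary": "Added cosine annealing warmup schedule"})
plot_progress()
```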

## The Outer Loop

Step back from individual experiments. Synthesize.

```
1. Review all results since last reflection
2. Cluster by type: what kinds of changes worked? Which didn't?
3. Ask WHY — identify the mechanism behind successes and failures
4. Update findings.md with current understanding
5. Search literature if results were surprising or assumptions need revisiting
6. Generate new hypotheses if warranted (invoke 21-research-ideation/ skills)
7. Decide direction (see criteria below)
8. Update research-state.yaml with new direction
9. Log the reflection in research-log.md
10. If there's something meaningful, generate a progress presentation
```

### Deciding Direction

Don't just pick randomly — use these criteria:

**DEEPEN** — a supported result raises follow-up questions
- Does the effect hold under different conditions? What's the mechanism?
- Action: generate sub-hypotheses (H1.1, H1.2) → back to inner loop

**BROADEN** — current results are solid, but adjacent questions are untested
- The current contribution is clear, but new questions have emerged that could extend it.
- Action: generate new root hypotheses → back to inner loop

**PIVOT** — results invalidate key assumptions or something more interesting appeared
- A core assumption was wrong, or an unexpected finding is more promising than the original question.
- Action: return to literature with new questions → re-bootstrap

**CONCLUDE** — sufficient evidence for a contribution
- At least one hypothesis is strongly supported (or a coherent set of negative results)
- Key ablations completed, error analysis done
- findings.md reads like a paper backbone — a human could write the abstract from it
- No critical open questions that would change the story

Note: coherent negative results are a valid contribution. "X does NOT work because Y" is publishable if the reasoning is rigorous.

### findings.md Is Your Project Memory

This file serves two purposes: it's the research narrative for humans AND your accumulated knowledge base as an agent. Read it at the start of every session, /loop tick, or heartbeat to remember what you've learned.

After every outer loop, update it to answer:

- What do we know so far? (Current Understanding)
- What patterns explain our results? (Patterns and Insights)
- What specific things did we learn not to repeat? (Lessons and Constraints)
- What remains open? (Open Questions)

The "Lessons and Constraints" section is especially important — it captures specific actionable learnings like "weight decay > 0.1 diverges at this scale" or "baseline only reproduces with batch_size=64." This prevents the agent from repeating failed approaches across sessions.

**Quality test**: After 30 inner loop experiments, a human should be able to read findings.md and write a paper abstract from it. If they can't, the outer loop isn't synthesizing — it's just logging.

## Agent Continuity (MANDATORY — Set Up First)

**Before doing anything else**, set up the wall-clock loop. This is what keeps the research running continuously. Without it, the agent stops after one cycle.

### Claude Code

**First action — run this immediately:**

```
/loop 20m Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a report in to_human/ and opening the HTML/PDF. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper.
```

This fires every 20 minutes regardless of what's happening. It's a rhythm that keeps you working — not a research phase boundary. If your previous work isn't done, just continue it.

### OpenClaw

**First action — set up a 20-minute cron job in the current session. You MUST do this before any research work.**

Use the `cron.add` tool to create a recurring job bound to this chat session:

```json
{
  "name": "autoresearch-loop",
  "schedule": { "kind": "every", "everyMs": 1200000 },
  "sessionTarget": "current",
  "payload": {
    "kind": "agentTurn",
    "message": "Continue autoresearch. Read research-state.yaml and findings.md. Re-read the autoresearch SKILL.md occasionally to stay aligned. Step back and reflect holistically — is the research making real progress? Are you deepening understanding or just running experiments? If stalling, pivot or search literature for new ideas. Keep making research progress — never idle, never stop. Update findings.md, research-log.md, and research-state.yaml when there's new progress. Git commit periodically and clean up the repo if needed. Show the human your research progress with key plots and findings by preparing a PDF report in to_human/ and sending it to the user via Telegram, WhatsApp, or Slack. When you get an exciting result or interesting plot, text it to the user right away — don't wait for a full report. Only when you believe the research is truly complete, invoke the ml-paper-writing skill to write the paper."
  }
}
```

Key details:
- `sessionTarget: "current"` binds the cron to this chat session so it maintains conversation context across ticks
- `everyMs: 1200000` = 20 minutes
- After creation, verify with `cron.list` that the job exists and is enabled
- If the cron fires while you're mid-experiment, just continue — the tick is a nudge, not a restart

### What the Loop Does

The `/loop` and cron job are purely **wall-clock rhythm**. They are completely separate from your research loops (inner/outer). On each tick:

1. Read `research-state.yaml` and `findings.md` — remember where you are
2. Check if anything is broken (failed experiments, stalled training, errors)
3. If on track → keep working on whatever you were doing
4. If stuck or something's wrong → step back, diagnose, fix, then continue
5. Never idle. Always be making progress.

## Progress Reporting

When you have something meaningful to share, create a research presentation — not just a status dashboard, but a compelling story.

**When to report** (your judgment):
- After an outer loop that found a significant pattern
- When the optimization trajectory shows clear progress (include the plot!)
- After a pivot in direction
- Before requesting human input on a decision
- When concluding

**What to include** (adapt to what's compelling):
- The research question and why it matters
- Key results with visualizations (plots, metric tables)
- The optimization trajectory chart (metric over experiments)
- What was tried and why (selective, not exhaustive)
- Current understanding (the findings narrative)
- What's planned next

For Claude Code: generate HTML and `open` it. If HTML fails to open or render, convert to PDF as fallback (use `weasyprint`, `playwright pdf`, or `wkhtmltopdf`). For OpenClaw: generate PDF directly.

See [references/progress-reporting.md](references/progress-reporting.md) for template scaffolding and the optimization plot approach. Use the template as a starting point — be creative with what you show.

## Git Protocol

Commit at natural research milestones:

| When | Message Pattern |
|---|---|
| Workspace initialized | `research(init): {project} — {question}` |
| Experiment protocol locked | `research(protocol): {hypothesis}` |
| Significant results | `research(results): {hypothesis} — {outcome}` |
| Outer loop direction change | `research(reflect): {direction} — {reason}` |
| Paper draft complete | `research(paper): {title}` |

**Hard rule**: Protocol commits MUST precede result commits. Never combine them. The git history is your lightweight pre-registration — it proves what you planned before you saw results. Don't commit after every experiment — commit when there's meaningful progress.

## Concluding: Paper Writing

When the outer loop decides to CONCLUDE:

1. Ensure findings.md has a clear, well-supported narrative
2. Study 2-3 top related papers to learn their format, style, and section structure
3. Invoke the `20-ml-paper-writing` skill — it has LaTeX templates for NeurIPS, ICML, ICLR, ACL, AAAI, COLM, and systems venues
4. Feed it the accumulated literature, experimental results, and findings
5. Follow its citation verification workflow — never hallucinate references
6. Generate a final comprehensive research presentation

Proceed autonomously through the writing process. If the ml-paper-writing skill suggests human collaboration points, adapt and keep going — produce the best draft you can. The human will review and provide feedback.

## Research Discipline

Principles to enforce continuously — not tied to any specific phase:

- **Lock before you run**: Commit your experiment protocol to git before executing. This proves your plan existed before you saw results. Never combine protocol + results in one commit.
- **Confirmatory vs exploratory**: Results matching your locked protocol are confirmatory. Everything else is exploratory — interesting but requiring more skepticism.
- **Negative results are progress**: A refuted hypothesis tells you something. Log what it rules out and what it suggests. Don't treat it as failure.
- **Sanity check before analysis**: Verify training converged, baselines reproduce, and data is correct before trusting your primary metric.
- **Return to literature when confused**: Don't guess — search. If results surprise you or assumptions break, go find papers. Use Exa MCP for discovery, Semantic Scholar for specific ML/AI paper lookup, arXiv for preprints.
- **Never stop**: Don't wait for human approval on routine decisions. If a skill or tool suggests collaboration, adapt and keep going. Find the best path forward autonomously. The human will see your progress reports and can redirect if needed.
- **Use whatever compute is available**: Adapt to the user's environment — local GPU, cluster job submission, cloud instances, or just CPU. If no GPU is available, use CPU and adjust experiment scale accordingly. Don't block on compute availability.

## Quality Standards

**Good agent behavior:**
- Hypotheses have mechanistic reasoning ("X because Y, predicting Z"), not just "try X"
- findings.md builds a coherent narrative, not a flat list of results
- Negative results are recorded with what they rule out
- The agent updates its model when experiments contradict expectations
- Progress reports tell a research story with compelling visualizations

**Bad agent behavior:**
- Pure hyperparameter sweeps without interpretation
- findings.md is just experiment logs copy-pasted
- Agent never revisits its assumptions after failures
- Optimizing metrics without understanding why changes work

## When to Use vs Alternatives

**Use autoresearch when:**
- You have a research question explorable through experiments
- There's a measurable proxy metric for inner loop optimization
- The real contribution requires synthesis beyond the metric
- You want continuous autonomous research operation

**Use individual domain skills instead when:**
- You have a specific one-off task (train a model, run eval, write a paper)
- No iterative experimentation needed

## Common Issues

**Inner loop stalls (no metric improvement)**
Run an outer loop. Is the metric the right one? Is the search space exhausted? Consider broadening or pivoting. Search literature for new approaches.

**Stuck and not making progress**
Don't keep trying random changes. Step back: search literature for related work, invoke `21-research-ideation/` brainstorming skills, or run an outer loop reflection. Being stuck means you need new information or a new perspective, not more experiments.

**Results contradict baseline expectations**
Investigate, don't ignore. Return to literature — your protocol might have an error, the published baseline may be wrong, or conditions differ. Update findings.md with what you learn.

**Agent loses context between ticks**
Ensure research-state.yaml and findings.md are updated after every action. These files are your memory across sessions.

**Can't find relevant papers**
Try multiple approaches in order: Exa MCP for broad search, Semantic Scholar for specific ML/AI paper lookup (`pip install semanticscholar`), arXiv for preprints (`pip install arxiv`). Check `20-ml-paper-writing` skill's `references/citation-workflow.md` for complete API code. Note: Google Scholar has no official API — use Semantic Scholar instead for programmatic search.
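A minimal sketch of the programmatic fallbacks, assuming the `semanticscholar` and `arxiv` packages mentioned above (exact client APIs can vary slightly between versions):

```python
# pip install semanticscholar arxiv
from itertools import islice

import arxiv
from semanticscholar import SemanticScholar

def search_papers(query: str, limit: int = 5) -> None:
    """Print candidate papers from Semantic Scholar, then recent arXiv matches."""
    sch = SemanticScholar()
    for paper in islice(sch.search_paper(query), limit):
        print(f"[S2]    {paper.title} ({paper.year})")

    for result in arxiv.Client().results(arxiv.Search(query=query, max_results=limit)):
        print(f"[arXiv] {result.title} ({result.entry_id})")

search_papers("grokking delayed generalization")
```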

**No GPU available**
Use CPU and scale experiments down. Many research tasks (analysis, interpretability, small model training) run fine on CPU. Adjust experiment design to fit available compute rather than blocking.

**Experiments take longer than /loop interval**
Normal. On the next tick, check if it finished. If not, keep waiting or do something else useful (update notes, search papers). Adjust interval if needed.

**Not sure when to conclude**
Three questions: Do you have a strongly supported finding? Can you explain WHY it works? Would findings.md make a convincing paper abstract? If yes to all: conclude.

## Advanced Topics

- **Detailed agent continuity**: [references/agent-continuity.md](references/agent-continuity.md)
- **Progress presentation templates**: [references/progress-reporting.md](references/progress-reporting.md)
- **Complete skill routing**: [references/skill-routing.md](references/skill-routing.md)


================================================
FILE: 0-autoresearch-skill/references/agent-continuity.md
================================================
# Agent Continuity: Keeping Research Running

Autonomous research requires agents that keep working continuously — hours, overnight, or days. This reference covers how to set up the wall-clock loop on each platform.

## MANDATORY: Set Up Before Anything Else

The wall-clock loop is the first thing you do when starting autoresearch. Without it, the agent does one cycle of work and stops. The loop fires every 20 minutes and simply tells the agent: "keep working, check if anything's wrong."

This is completely separate from the research inner/outer loops. The wall-clock loop is a fixed-interval prompt injection. The research loops run at experiment speed (minutes to hours per cycle). They will not sync and don't need to.

## Claude Code: /loop

**Run this as your very first action:**

```
/loop 20m Continue autoresearch. Read research-state.yaml and findings.md. Check: is the current experiment done? Are there errors? Is progress stalling? If everything is on track, keep working. If something is wrong, step back and fix it. Never idle.
```

### What Happens on Each Tick

The `/loop` message fires every 20 minutes. When it fires:

1. **Read state** — `research-state.yaml` and `findings.md` to remember where you are and what you've learned
2. **Check health** — is the current experiment running? Did it error? Is training diverging?
3. **If on track** → continue whatever you were doing (don't restart or re-plan)
4. **If stuck or broken** → step back, diagnose the issue, fix it, then continue
5. **If idle** → you shouldn't be. Pick the next hypothesis and start an experiment

The tick is a nudge, not a phase boundary. If the previous tick's work isn't done, just keep going.

### When Experiments Run Longer Than 20 Minutes

This is normal and expected. On the next tick:
- Check if the experiment finished
- If still running: monitor for errors, do lightweight work (update notes, search papers)
- Don't restart or duplicate the experiment

### Context Recovery

Your primary memory between ticks is `research-state.yaml` and `findings.md`. Keep them updated after every experiment and reflection. If you lose context, read both files plus `research-log.md` to reconstruct where you are.

## OpenClaw: Cron Job

**Set up a 20-minute cron job in the current session as your very first action.** Use the `cron.add` tool to create a recurring job bound to the current chat session:

```json
{
  "name": "autoresearch-loop",
  "schedule": { "kind": "every", "everyMs": 1200000 },
  "sessionTarget": "current",
  "payload": {
    "kind": "agentTurn",
    "message": "Continue autoresearch. Read research-state.yaml and findings.md. Check: is the current experiment done? Are there errors? Is progress stalling? If everything is on track, keep working. If something is wrong, step back and fix it. Never idle."
  }
}
```

Key details:
- `sessionTarget: "current"` binds the cron to the current chat session (resolved to `session:<sessionKey>` at creation time), so it maintains conversation context across ticks
- `everyMs: 1200000` = 20 minutes
- Verify with `cron.list` that the job is created and enabled
- To check run history later: `cron.runs` with the job ID

### Context Between Cron Ticks

OpenClaw cron invocations may start fresh each time. Your workspace files are your memory:

- `research-state.yaml` — where you are, what's active
- `findings.md` — what you've learned (read this every time!)
- `research-log.md` — what happened chronologically

Keep these updated after every action so the next cron tick can pick up seamlessly.

### Progress Reports

OpenClaw can't `open` HTML files locally like Claude Code can. When you have something to report:

1. Generate a PDF progress summary (use Python with reportlab, matplotlib, or similar)
2. Include: research question, key results, optimization trajectory plot, current understanding, next steps
3. Send it to the user via Telegram, WhatsApp, or Slack — whichever channel they use
4. When you get an exciting result or interesting plot, send it right away — don't wait for a full report

## Research State as Ground Truth

Both platforms share the same ground truth: the workspace files.

| File | Purpose | Update Frequency |
|---|---|---|
| `research-state.yaml` | Machine-readable state | After every experiment and reflection |
| `research-log.md` | Decision timeline | After every significant action |
| `findings.md` | Narrative understanding + project memory | After every outer loop |
| `experiments/*/results/` | Raw experimental data | After every experiment |

The wall-clock loop (`/loop` or cron) is just the trigger. The workspace files are the memory. Keep them current.


================================================
FILE: 0-autoresearch-skill/references/progress-reporting.md
================================================
# Progress Reporting: Research Presentations

When the research produces something worth sharing, create a compelling presentation — not a status dump, but a research story with visuals.

## When to Report

You decide when progress is meaningful enough to report. Consider reporting:

- After an outer loop reflection that identified a significant pattern
- When the optimization trajectory shows clear, sustained improvement
- After a pivot — explain why the direction changed
- Before requesting human input on a major decision
- When concluding the research, before paper writing

Report at most once per /loop tick or heartbeat cycle, and at least whenever you have something a human would find interesting.

## What Makes a Good Research Presentation

A good progress report reads like a research talk, not a database query. It should:

1. **Tell a story**: why we started, what we tried, what we found, what it means
2. **Show, don't just tell**: include plots, tables, comparisons — not just text
3. **Be selective**: highlight the interesting findings, don't exhaustively list every experiment
4. **End with direction**: what happens next and why

## Recommended Sections

Adapt these to what's compelling from your current research. Skip sections that aren't relevant. Add sections the research demands.

### 1. Research Question and Motivation
- What are we investigating and why does it matter?
- One paragraph, accessible to someone unfamiliar with the project

### 2. Approach
- What's our method? What are we optimizing?
- The two-loop architecture in one sentence

### 3. Optimization Trajectory (The Karpathy Plot)
- X-axis: experiment number or wall-clock time
- Y-axis: proxy metric value
- Show baseline as a horizontal line
- Annotate significant jumps with what change caused them
- This is often the most compelling visual — include it whenever possible

### 4. Key Findings
- The 2-3 most significant results with supporting evidence
- Include plots, metric tables, comparison charts
- Explain WHY results are significant, not just WHAT they are

### 5. What We Tried (Decision Map)
- A selective view of the hypothesis tree
- Focus on the reasoning: why each direction was chosen, what it taught us
- Include both successes and informative failures

### 6. Current Understanding
- The findings.md narrative, but presented compellingly
- What's our best explanation for the patterns we see?

### 7. Next Steps
- What experiments are planned and why
- What questions remain open
- Any decisions that need human input

## The Optimization Trajectory Plot

This is the signature visual of autoresearch — a chart showing metric improvement over experiments.

Minimal implementation (SVG-based, no dependencies):

```python
def generate_trajectory_svg(trajectory_data, width=800, height=400):
    """Generate an SVG optimization trajectory chart.

    trajectory_data: list of {"run": int, "metric": float, "label": str}
    """
    if not trajectory_data:
        return "<p>No experiments yet.</p>"

    metrics = [d["metric"] for d in trajectory_data]
    min_m, max_m = min(metrics), max(metrics)
    margin = (max_m - min_m) * 0.1 or 0.1
    y_min, y_max = min_m - margin, max_m + margin

    padding = 60
    plot_w = width - 2 * padding
    plot_h = height - 2 * padding
    n = len(trajectory_data)

    def x_pos(i):
        return padding + (i / max(n - 1, 1)) * plot_w

    def y_pos(v):
        return padding + plot_h - ((v - y_min) / (y_max - y_min)) * plot_h

    # Build SVG
    svg = f'<svg width="{width}" height="{height}" xmlns="http://www.w3.org/2000/svg">'
    svg += f'<rect width="{width}" height="{height}" fill="#1a1a2e" rx="8"/>'

    # Grid lines
    for i in range(5):
        y = padding + i * plot_h / 4
        val = y_max - i * (y_max - y_min) / 4
        svg += f'<line x1="{padding}" y1="{y}" x2="{width-padding}" y2="{y}" stroke="#333" stroke-dasharray="4"/>'
        svg += f'<text x="{padding-8}" y="{y+4}" fill="#888" text-anchor="end" font-size="11">{val:.3f}</text>'

    # Baseline line
    baseline = trajectory_data[0]["metric"]
    by = y_pos(baseline)
    svg += f'<line x1="{padding}" y1="{by}" x2="{width-padding}" y2="{by}" stroke="#ff6b6b" stroke-dasharray="6" opacity="0.7"/>'
    svg += f'<text x="{width-padding+5}" y="{by+4}" fill="#ff6b6b" font-size="10">baseline</text>'

    # Data line
    points = " ".join(f"{x_pos(i)},{y_pos(d['metric'])}" for i, d in enumerate(trajectory_data))
    svg += f'<polyline points="{points}" fill="none" stroke="#4ecdc4" stroke-width="2"/>'

    # Data points
    for i, d in enumerate(trajectory_data):
        cx, cy = x_pos(i), y_pos(d["metric"])
        svg += f'<circle cx="{cx}" cy="{cy}" r="4" fill="#4ecdc4"/>'

    # Title
    svg += f'<text x="{width/2}" y="24" fill="#eee" text-anchor="middle" font-size="14" font-weight="bold">Optimization Trajectory</text>'
    svg += f'<text x="{width/2}" y="{height-10}" fill="#888" text-anchor="middle" font-size="11">Experiment Run</text>'
    svg += '</svg>'
    return svg
```

Embed the SVG output directly in the HTML report. Annotate significant jumps with brief labels.
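A hypothetical end-to-end sketch: load the trajectory from `research-state.yaml`, render it with `generate_trajectory_svg`, and splice the result into the template's `{{TRAJECTORY_SVG}}` placeholder (the HTML template is introduced in the next section; paths are illustrative):

```python
import yaml
from pathlib import Path

with open("research-state.yaml") as f:
    state = yaml.safe_load(f)

# Map research-state.yaml trajectory entries onto the plot function's input format.
trajectory = [
    {"run": i, "metric": entry["metric_value"], "label": entry.get("change_summary", "")}
    for i, entry in enumerate(state["experiments"]["trajectory"], start=1)
]
svg = generate_trajectory_svg(trajectory)

template = Path("templates/progress-presentation.html").read_text()
Path("to_human").mkdir(exist_ok=True)
Path("to_human/progress-001.html").write_text(template.replace("{{TRAJECTORY_SVG}}", svg))
```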

## HTML Presentation Template

Use [templates/progress-presentation.html](../templates/progress-presentation.html) as a starting point. It provides:

- Clean, dark-themed styling suitable for research presentations
- Responsive layout
- Section scaffolding matching the recommended structure
- Placeholder for the trajectory chart

Replace placeholder content with your actual research data. Add, remove, or rearrange sections as the research demands. The template is a scaffold, not a constraint.

### Claude Code

Generate the HTML, then show it to the human:

```bash
open to_human/progress-001.html
```

### OpenClaw

Generate a PDF version. Options:
- Use Python `weasyprint` to convert HTML to PDF (see the sketch below)
- Use `matplotlib` to generate plots directly as PDF
- Create a simple markdown → PDF pipeline

Note the PDF path in HEARTBEAT.md so the human knows to look at it.
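A minimal sketch for the weasyprint option, assuming the HTML report already exists:

```python
# pip install weasyprint
from weasyprint import HTML

HTML(filename="to_human/progress-001.html").write_pdf("to_human/progress-001.pdf")
```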

## Presentation Quality Tips

- **One insight per section** — don't overload
- **Label axes and units** on all plots
- **Use color consistently** — one color for improvements, another for baselines
- **Include confidence intervals** or error bars where meaningful
- **Show the trajectory early** — it's the hook that tells the reader "this is working"
- **End with a clear next step** — the human should know what happens next without asking


================================================
FILE: 0-autoresearch-skill/references/skill-routing.md
================================================
# Skill Routing: When to Use Which Domain Skill

The autoresearch skill orchestrates — domain skills execute. This reference maps research activities to the skills library.

## Routing Principle

When you encounter a domain-specific task during research, search the skills library for the right tool. Read the SKILL.md of the relevant skill before starting — it contains workflows, common issues, and production-ready code examples.

## Complete Routing Map

### Data and Preprocessing

| Task | Skill | Location |
|---|---|---|
| Large-scale data processing | Ray Data | `05-data-processing/ray-data/` |
| Data curation and filtering | NeMo Curator | `05-data-processing/nemo-curator/` |
| Custom tokenizer training | HuggingFace Tokenizers | `02-tokenization/huggingface-tokenizers/` |
| Subword tokenization | SentencePiece | `02-tokenization/sentencepiece/` |

### Model Architecture and Training

| Task | Skill | Location |
|---|---|---|
| Large-scale pretraining | Megatron-Core | `01-model-architecture/megatron-core/` |
| Lightweight LLM training | LitGPT | `01-model-architecture/litgpt/` |
| State-space models | Mamba | `01-model-architecture/mamba/` |
| Linear attention models | RWKV | `01-model-architecture/rwkv/` |
| Small-scale pretraining | NanoGPT | `01-model-architecture/nanogpt/` |

### Fine-tuning

| Task | Skill | Location |
|---|---|---|
| Multi-method fine-tuning | Axolotl | `03-fine-tuning/axolotl/` |
| Template-based fine-tuning | LLaMA-Factory | `03-fine-tuning/llama-factory/` |
| Fast LoRA fine-tuning | Unsloth | `03-fine-tuning/unsloth/` |
| PyTorch-native fine-tuning | Torchtune | `03-fine-tuning/torchtune/` |

### Post-training (RL / Alignment)

| Task | Skill | Location |
|---|---|---|
| PPO, DPO, SFT pipelines | TRL | `06-post-training/trl/` |
| Group Relative Policy Optimization | GRPO | `06-post-training/grpo-rl-training/` |
| Scalable RLHF | OpenRLHF | `06-post-training/openrlhf/` |
| Reference-free alignment | SimPO | `06-post-training/simpo/` |

### Interpretability

| Task | Skill | Location |
|---|---|---|
| Transformer circuit analysis | TransformerLens | `04-mechanistic-interpretability/transformer-lens/` |
| Sparse autoencoder training | SAELens | `04-mechanistic-interpretability/saelens/` |
| Intervention experiments | NNsight | `04-mechanistic-interpretability/nnsight/` |
| Causal tracing | Pyvene | `04-mechanistic-interpretability/pyvene/` |

### Distributed Training

| Task | Skill | Location |
|---|---|---|
| ZeRO optimization | DeepSpeed | `08-distributed-training/deepspeed/` |
| Fully sharded data parallel | FSDP | `08-distributed-training/fsdp/` |
| Multi-GPU abstraction | Accelerate | `08-distributed-training/accelerate/` |
| Training framework | PyTorch Lightning | `08-distributed-training/pytorch-lightning/` |
| Distributed data + training | Ray Train | `08-distributed-training/ray-train/` |

### Evaluation

| Task | Skill | Location |
|---|---|---|
| Standard LLM benchmarks | lm-evaluation-harness | `11-evaluation/lm-eval-harness/` |
| NeMo-integrated evaluation | NeMo Evaluator | `11-evaluation/nemo-evaluator/` |
| Custom eval tasks | Inspect AI | `11-evaluation/inspect-ai/` |

### Inference and Serving

| Task | Skill | Location |
|---|---|---|
| High-throughput serving | vLLM | `12-inference-serving/vllm/` |
| NVIDIA-optimized inference | TensorRT-LLM | `12-inference-serving/tensorrt-llm/` |
| CPU / edge inference | llama.cpp | `12-inference-serving/llama-cpp/` |
| Structured generation serving | SGLang | `12-inference-serving/sglang/` |

### Experiment Tracking

| Task | Skill | Location |
|---|---|---|
| Full experiment tracking | Weights & Biases | `13-mlops/wandb/` |
| Open-source tracking | MLflow | `13-mlops/mlflow/` |
| Training visualization | TensorBoard | `13-mlops/tensorboard/` |

### Optimization Techniques

| Task | Skill | Location |
|---|---|---|
| Efficient attention | Flash Attention | `10-optimization/flash-attention/` |
| 4/8-bit quantization | bitsandbytes | `10-optimization/bitsandbytes/` |
| GPTQ quantization | GPTQ | `10-optimization/gptq/` |
| AWQ quantization | AWQ | `10-optimization/awq/` |
| GGUF format (llama.cpp) | GGUF | `10-optimization/gguf/` |
| PyTorch-native quantization | Quanto | `10-optimization/quanto/` |

### Safety and Alignment

| Task | Skill | Location |
|---|---|---|
| Constitutional AI training | Constitutional AI | `07-safety-alignment/constitutional-ai/` |
| Content safety classification | LlamaGuard | `07-safety-alignment/llamaguard/` |
| Guardrail pipelines | NeMo Guardrails | `07-safety-alignment/nemo-guardrails/` |
| Prompt injection detection | Prompt Guard | `07-safety-alignment/prompt-guard/` |

### Infrastructure

| Task | Skill | Location |
|---|---|---|
| Serverless GPU compute | Modal | `09-infrastructure/modal/` |
| Multi-cloud orchestration | SkyPilot | `09-infrastructure/skypilot/` |
| GPU cloud instances | Lambda Labs | `09-infrastructure/lambda-labs/` |

### Agents and RAG

| Task | Skill | Location |
|---|---|---|
| Agent pipelines | LangChain | `14-agents/langchain/` |
| Knowledge retrieval agents | LlamaIndex | `14-agents/llamaindex/` |
| Lightweight agents | Smolagents | `14-agents/smolagents/` |
| Claude-based agents | Claude Agent SDK | `14-agents/claude-agent-sdk/` |
| Vector store (local) | Chroma | `15-rag/chroma/` |
| Vector similarity search | FAISS | `15-rag/faiss/` |
| Text embeddings | Sentence Transformers | `15-rag/sentence-transformers/` |
| Managed vector DB | Pinecone | `15-rag/pinecone/` |
| Scalable vector DB | Milvus | `15-rag/milvus/` |

### Prompt Engineering and Structured Output

| Task | Skill | Location |
|---|---|---|
| Prompt optimization | DSPy | `16-prompt-engineering/dspy/` |
| Structured LLM output | Instructor | `16-prompt-engineering/instructor/` |
| Constrained generation | Guidance | `16-prompt-engineering/guidance/` |
| Grammar-based generation | Outlines | `16-prompt-engineering/outlines/` |

### Multimodal

| Task | Skill | Location |
|---|---|---|
| Vision-language models | CLIP | `18-multimodal/clip/` |
| Speech recognition | Whisper | `18-multimodal/whisper/` |
| Visual instruction tuning | LLaVA | `18-multimodal/llava/` |
| Vision-language (Qwen) | Qwen2-VL | `18-multimodal/qwen2-vl/` |
| Vision-language (Mistral) | Pixtral | `18-multimodal/pixtral/` |
| Visual understanding | Florence-2 | `18-multimodal/florence-2/` |
| Document retrieval | ColPali | `18-multimodal/colpali/` |

### Observability

| Task | Skill | Location |
|---|---|---|
| LLM tracing and debugging | LangSmith | `17-observability/langsmith/` |
| LLM observability platform | Phoenix | `17-observability/phoenix/` |

### Emerging Techniques

| Task | Skill | Location |
|---|---|---|
| Mixture of Experts training | MoE Training | `19-emerging-techniques/moe-training/` |
| Combining trained models | Model Merging | `19-emerging-techniques/model-merging/` |
| Extended context windows | Long Context | `19-emerging-techniques/long-context/` |
| Faster inference via drafting | Speculative Decoding | `19-emerging-techniques/speculative-decoding/` |
| Teacher-student compression | Knowledge Distillation | `19-emerging-techniques/knowledge-distillation/` |
| Reducing model size | Model Pruning | `19-emerging-techniques/model-pruning/` |

### Research Output

| Task | Skill | Location |
|---|---|---|
| Generate research ideas | Research Ideation | `21-research-ideation/` |
| Write publication-ready paper | ML Paper Writing | `20-ml-paper-writing/` |

## Common Research Workflows

### "I need to fine-tune a model and evaluate it"

1. Pick fine-tuning skill based on needs (Unsloth for speed, Axolotl for flexibility)
2. Use lm-evaluation-harness for standard benchmarks
3. Track with W&B or MLflow

### "I need to understand what the model learned"

1. Use TransformerLens for circuit-level analysis
2. Train SAEs with SAELens for feature-level understanding
3. Run interventions with NNsight or Pyvene

### "I need to do RL training"

1. Start with TRL for standard PPO/DPO
2. Use GRPO skill for DeepSeek-R1 style training
3. Scale with OpenRLHF if needed

### "I need to run experiments on cloud GPUs"

1. Modal for quick serverless runs
2. SkyPilot for multi-cloud optimization
3. Lambda Labs for dedicated instances

## Finding Skills

If you're not sure which skill to use:

```bash
# List available skill files (the path encodes category and skill name)
ls */*/SKILL.md | head -20

# Search skill descriptions for a keyword
grep -l "keyword" */*/SKILL.md
```

Or search the repository's README.md which lists all skills with descriptions.


================================================
FILE: 0-autoresearch-skill/templates/findings.md
================================================
# Research Findings

## Research Question

<!-- What are we trying to discover? One clear sentence. -->

## Current Understanding

<!-- Updated after each outer loop cycle. What do we know so far?
     What patterns explain our results? What's the mechanism?
     This section should read like the core argument of a paper. -->

## Key Results

<!-- Significant experimental findings. Include metrics, comparisons, and
     brief interpretation. Link to experiment directories for full details. -->

## Patterns and Insights

<!-- What emerges across multiple experiments? What types of changes
     consistently work or fail? Why? -->

## Lessons and Constraints

<!-- Specific actionable learnings that should guide future experiments.
     Things you tried that didn't work and WHY, so you don't repeat them.
     Constraints you discovered about the problem space.

     Examples:
     - Weight decay > 0.1 causes training instability at 125M param scale
     - SwiGLU and RoPE improvements stack because they're orthogonal (FFN vs positional)
     - Baseline only reproduces published numbers with batch_size=64, not 32
     - Sleep phases before memorization completion hurt — model needs memories to consolidate -->

## Open Questions

<!-- What remains unanswered? What would strengthen or challenge
     our current understanding? -->

## Optimization Trajectory

<!-- Summary of inner loop progress. How has the metric evolved?
     Note inflection points and what caused them. -->


================================================
FILE: 0-autoresearch-skill/templates/progress-presentation.html
================================================
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Research Progress</title>
    <style>
        * { margin: 0; padding: 0; box-sizing: border-box; }

        body {
            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif;
            background: #0d1117;
            color: #e6edf3;
            line-height: 1.6;
            padding: 2rem;
            max-width: 1100px;
            margin: 0 auto;
        }

        header {
            text-align: center;
            padding: 3rem 0 2rem;
            border-bottom: 1px solid #21262d;
            margin-bottom: 2.5rem;
        }

        header h1 {
            font-size: 2.2rem;
            font-weight: 700;
            color: #f0f6fc;
            margin-bottom: 0.5rem;
        }

        .subtitle {
            font-size: 1.15rem;
            color: #8b949e;
            font-style: italic;
            max-width: 700px;
            margin: 0 auto 1rem;
        }

        .meta {
            font-size: 0.85rem;
            color: #484f58;
        }

        .meta span {
            display: inline-block;
            margin: 0 0.5rem;
            padding: 0.15rem 0.6rem;
            background: #161b22;
            border: 1px solid #21262d;
            border-radius: 12px;
        }

        section {
            margin-bottom: 3rem;
        }

        section h2 {
            font-size: 1.4rem;
            font-weight: 600;
            color: #f0f6fc;
            margin-bottom: 1rem;
            padding-bottom: 0.5rem;
            border-bottom: 1px solid #21262d;
        }

        p, li { color: #c9d1d9; }

        .card {
            background: #161b22;
            border: 1px solid #21262d;
            border-radius: 8px;
            padding: 1.5rem;
            margin-bottom: 1rem;
        }

        .card h3 {
            font-size: 1.05rem;
            color: #58a6ff;
            margin-bottom: 0.5rem;
        }

        .result-grid {
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
            gap: 1rem;
            margin-bottom: 1.5rem;
        }

        .stat-card {
            background: #161b22;
            border: 1px solid #21262d;
            border-radius: 8px;
            padding: 1.2rem;
            text-align: center;
        }

        .stat-card .value {
            font-size: 2rem;
            font-weight: 700;
            color: #58a6ff;
        }

        .stat-card .label {
            font-size: 0.8rem;
            color: #8b949e;
            text-transform: uppercase;
            letter-spacing: 0.05em;
        }

        .stat-card.positive .value { color: #3fb950; }
        .stat-card.negative .value { color: #f85149; }

        table {
            width: 100%;
            border-collapse: collapse;
            margin: 1rem 0;
        }

        th {
            text-align: left;
            padding: 0.6rem 1rem;
            background: #161b22;
            color: #8b949e;
            font-size: 0.8rem;
            text-transform: uppercase;
            letter-spacing: 0.05em;
            border-bottom: 1px solid #21262d;
        }

        td {
            padding: 0.6rem 1rem;
            border-bottom: 1px solid #21262d;
            font-size: 0.95rem;
        }

        .badge {
            display: inline-block;
            padding: 0.15rem 0.5rem;
            border-radius: 10px;
            font-size: 0.75rem;
            font-weight: 600;
        }

        .badge-supported { background: #0d2818; color: #3fb950; border: 1px solid #1b4332; }
        .badge-refuted   { background: #2d1215; color: #f85149; border: 1px solid #4a1c20; }
        .badge-active    { background: #0c2d6b; color: #58a6ff; border: 1px solid #1158c7; }
        .badge-pending   { background: #1c1c1c; color: #8b949e; border: 1px solid #333; }

        .chart-container {
            background: #161b22;
            border: 1px solid #21262d;
            border-radius: 8px;
            padding: 1.5rem;
            text-align: center;
            margin: 1rem 0;
        }

        .next-steps {
            background: #0c2d6b22;
            border: 1px solid #1158c744;
            border-radius: 8px;
            padding: 1.5rem;
        }

        .next-steps h3 { color: #58a6ff; margin-bottom: 0.5rem; }
        .next-steps ul { padding-left: 1.5rem; }
        .next-steps li { margin-bottom: 0.3rem; }

        footer {
            text-align: center;
            padding: 2rem 0;
            color: #484f58;
            font-size: 0.8rem;
            border-top: 1px solid #21262d;
        }
    </style>
</head>
<body>

    <!--
        AGENT INSTRUCTIONS:
        This is a starting point. Fill in, rearrange, add, or remove sections
        based on what's compelling from your current research. The goal is a
        research story, not a status dashboard.

        Replace {{PLACEHOLDERS}} with actual content.
        Embed SVG charts inline (see progress-reporting.md for the trajectory plot function).
        Add additional sections as needed.
    -->

    <header>
        <h1>{{PROJECT_TITLE}}</h1>
        <p class="subtitle">{{RESEARCH_QUESTION}}</p>
        <p class="meta">
            <span>{{DATE}}</span>
            <span>{{N_EXPERIMENTS}} experiments</span>
            <span>Status: {{STATUS}}</span>
        </p>
    </header>

    <!-- Summary stats -->
    <section>
        <div class="result-grid">
            <div class="stat-card positive">
                <div class="value">{{BEST_METRIC}}</div>
                <div class="label">Best Metric</div>
            </div>
            <div class="stat-card">
                <div class="value">{{BASELINE_METRIC}}</div>
                <div class="label">Baseline</div>
            </div>
            <div class="stat-card positive">
                <div class="value">{{IMPROVEMENT}}</div>
                <div class="label">Improvement</div>
            </div>
            <div class="stat-card">
                <div class="value">{{N_HYPOTHESES}}</div>
                <div class="label">Hypotheses Tested</div>
            </div>
        </div>
    </section>

    <!-- Background and motivation -->
    <section id="background">
        <h2>Background & Motivation</h2>
        <div class="card">
            <!-- Why does this research matter? What gap are we addressing? -->
            <p>{{BACKGROUND_TEXT}}</p>
        </div>
    </section>

    <!-- Optimization trajectory - THE key visual -->
    <section id="trajectory">
        <h2>Optimization Trajectory</h2>
        <div class="chart-container">
            <!-- Embed SVG chart here. See references/progress-reporting.md
                 for the generate_trajectory_svg() function. -->
            {{TRAJECTORY_SVG}}
        </div>
    </section>

    <!-- Key findings -->
    <section id="findings">
        <h2>Key Findings</h2>
        <!-- Add cards for each significant finding -->
        <div class="card">
            <h3>{{FINDING_1_TITLE}}</h3>
            <p>{{FINDING_1_DESCRIPTION}}</p>
            <!-- Include inline plots, tables, or metrics as needed -->
        </div>
    </section>

    <!-- What was tried -->
    <section id="experiments">
        <h2>What We Tried</h2>
        <table>
            <thead>
                <tr>
                    <th>Hypothesis</th>
                    <th>Change</th>
                    <th>Result</th>
                    <th>Status</th>
                </tr>
            </thead>
            <tbody>
                <!-- Add rows for notable experiments -->
                <tr>
                    <td>{{H_ID}}</td>
                    <td>{{CHANGE_SUMMARY}}</td>
                    <td>{{METRIC_DELTA}}</td>
                    <td><span class="badge badge-supported">{{STATUS}}</span></td>
                </tr>
            </tbody>
        </table>
    </section>

    <!-- Current understanding -->
    <section id="understanding">
        <h2>Current Understanding</h2>
        <div class="card">
            <!-- The narrative from findings.md, but presented compellingly -->
            <p>{{CURRENT_UNDERSTANDING}}</p>
        </div>
    </section>

    <!-- Next steps -->
    <section id="next">
        <h2>Next Steps</h2>
        <div class="next-steps">
            <ul>
                <li>{{NEXT_STEP_1}}</li>
                <li>{{NEXT_STEP_2}}</li>
                <li>{{NEXT_STEP_3}}</li>
            </ul>
        </div>
    </section>

    <footer>
        Generated by Autoresearch | {{DATE}}
    </footer>

</body>
</html>


================================================
FILE: 0-autoresearch-skill/templates/research-log.md
================================================
# Research Log

Chronological record of research decisions and actions. Append-only.

| # | Date | Type | Summary |
|---|------|------|---------|
| | | | |

<!-- Entry types:
  bootstrap    — initial scoping, literature search, hypothesis formation
  inner-loop   — experiment run and result
  outer-loop   — synthesis, reflection, direction decision
  pivot        — change in research direction
  report       — progress presentation generated
  conclude     — decision to finalize and write paper

Example entries:
| 1 | 2026-03-15 | bootstrap | Searched Semantic Scholar + arXiv for efficient transformer architectures. Found 8 relevant papers. Gap: no systematic comparison of GLU variants on small models. Formed 3 hypotheses. Baseline: NanoGPT 5-min run, val_loss=4.82. |
| 2 | 2026-03-15 | inner-loop | H1 run_001: swapped ReLU for SwiGLU in FFN. 5-min training run. val_loss=4.61 (baseline 4.82, delta -0.21). Kept. |
| 3 | 2026-03-15 | inner-loop | H1 run_002: increased FFN hidden dim from 4x to 5.3x to match SwiGLU param count. val_loss=4.58 (-0.03 vs run_001). Marginal — SwiGLU benefit mostly from gating, not extra params. |
| 4 | 2026-03-15 | inner-loop | H1 run_003: tried GEGLU instead of SwiGLU. val_loss=4.63. Slightly worse than SwiGLU. SwiGLU wins for this scale. |
| 5 | 2026-03-15 | inner-loop | H2 run_004: replaced learned positional embeddings with RoPE. val_loss=4.55 (-0.06 vs SwiGLU baseline). Promising — stacks with SwiGLU. |
| 6 | 2026-03-15 | inner-loop | H2 run_005: RoPE + SwiGLU combined. val_loss=4.41 (-0.41 vs original baseline). Best so far. |
| 7 | 2026-03-16 | outer-loop | Reviewed 5 runs. Pattern: gating mechanisms (SwiGLU) and rotary embeddings (RoPE) give independent gains that stack. Combined improvement ~9%. But WHY do they stack? Hypothesis: they operate on orthogonal aspects (FFN expressiveness vs positional encoding). Direction: DEEPEN — test if adding RMSNorm also stacks independently. |
| 8 | 2026-03-16 | inner-loop | H3 run_006: replaced LayerNorm with RMSNorm. val_loss=4.39 (-0.02). Small gain. Stacks but diminishing returns on normalization. |
| 9 | 2026-03-17 | outer-loop | 8 runs complete. Optimization plateau around val_loss=4.38. The easy architectural wins (SwiGLU, RoPE) are captured. Searched literature on training dynamics — found papers on warmup schedules at small scale. Direction: BROADEN — shift from architecture to training recipe. |
| 10 | 2026-03-17 | report | Generated progress-001.html with trajectory plot showing 9% improvement from architectural changes. |

Example entries (discovery-type research — understanding grokking):
| 1 | 2026-03-20 | bootstrap | Searched literature on grokking and delayed generalization. Found Nanda et al. progress measures, Grokfast spectral filtering. Gap: no connection to memory consolidation theory from neuroscience. 3 hypotheses formed. |
| 2 | 2026-03-20 | inner-loop | H1 run_001: trained modular addition transformer to memorization (100% train acc, 0% test). Steps to memorize: 1200. Baseline established. |
| 3 | 2026-03-20 | inner-loop | H1 run_002: continued training with standard weight decay. Grokking at step 48000. Measured progress measure throughout — sharp transition at step 44000. |
| 4 | 2026-03-20 | inner-loop | H1 run_003: inserted "sleep phase" at step 20000 (elevated weight decay + oscillatory LR for 500 steps). Grokking now at step 31000. 35% acceleration. |
| 5 | 2026-03-20 | inner-loop | H1 run_004: sleep phase at step 10000. Grokking at step 27000. Earlier sleep = earlier grokking. |
| 6 | 2026-03-20 | inner-loop | H1 run_005: sleep phase at step 5000 (before full memorization). Grokking at step 38000. Too early hurts — model hadn't memorized enough for consolidation to work. |
| 7 | 2026-03-21 | outer-loop | Reviewed 5 runs. Clear pattern: sleep phases accelerate grokking but only AFTER memorization is complete. This matches memory consolidation theory exactly — you need memories formed before consolidation can reorganize them. Searched for neural slow-wave sleep literature. The weight decay + oscillatory LR during sleep phases mimics synaptic downscaling. Direction: DEEPEN — sweep sleep timing relative to memorization completion. |
| 8 | 2026-03-21 | inner-loop | H1.1 run_006-010: swept sleep insertion at 80%, 100%, 120%, 150%, 200% of memorization step. Sweet spot at 110-120%. Consistent across 3 seeds. |
| 9 | 2026-03-22 | outer-loop | 10 runs complete. The story is clear: neural networks "dream to learn" just like brains — consolidation after encoding, not during. Grokfast achieves similar acceleration through a different mechanism (gradient spectral filtering). Next: compare gradient spectra during our sleep phases vs Grokfast filtering to see if they converge on the same signal. Direction: BROADEN. |
| 10 | 2026-03-22 | report | Generated progress-001.html with sleep timing vs grokking step plot. Key visual: sweet spot curve mirrors neuroscience memory consolidation window. |
-->


================================================
FILE: 0-autoresearch-skill/templates/research-state.yaml
================================================
# Research State — Central Project Tracking
# Copy this template to your project root and fill in as you go.
# Updated by the agent after each experiment and reflection.

project:
  title: ""
  question: ""                    # The core research question
  status: active                  # active | paused | concluded
  started: ""                     # ISO date
  domain: ""                      # e.g., "mechanistic interpretability", "RL training"

literature:
  key_papers: []
  # - id: "liu2025superposition"
  #   title: "Superposition Yields Robust Neural Scaling"
  #   authors: "Liu et al."
  #   year: 2025
  #   relevance: "Proves ETF structure in LM heads"
  open_problems: []               # Gaps identified from literature
  evidence_gaps: []               # What's missing in the field

hypotheses:
  # List of all hypotheses, active and completed
  # - id: H1
  #   statement: "Testable claim with clear prediction"
  #   status: pending             # pending | active | supported | refuted | inconclusive
  #   motivation: "Why this is worth testing"
  #   parent: null                # null for root, parent ID (e.g., H1) for sub-hypotheses
  #   priority: medium            # high | medium | low

experiments:
  proxy_metric: ""                # What we're optimizing and how to compute it
  baseline_value: null            # Starting point
  best_value: null                # Best achieved so far
  total_runs: 0
  trajectory: []
  # - run_id: "run_001"
  #   hypothesis: "H1"
  #   metric_value: null
  #   delta: null                 # Change from baseline
  #   wall_time_min: null
  #   change_summary: ""
  #   timestamp: ""

outer_loop:
  cycle: 0                       # How many outer loop reflections so far
  last_direction: null            # deepen | broaden | pivot | conclude
  last_reflection: ""             # Brief summary of last reflection decision

workspace:
  # Track key resource locations
  findings: "findings.md"
  log: "research-log.md"
  literature_dir: "literature/"
  experiments_dir: "experiments/"
  to_human_dir: "to_human/"
  paper_dir: "paper/"


================================================
FILE: 01-model-architecture/.gitkeep
================================================
# Skills Coming Soon

This directory will contain high-quality AI research skills for model architecture.

See [CONTRIBUTING.md](../CONTRIBUTING.md) for how to contribute.


================================================
FILE: 01-model-architecture/litgpt/SKILL.md
================================================
---
name: implementing-llms-litgpt
description: Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral). Use when need clean model implementations, educational understanding of architectures, or production fine-tuning with LoRA/QLoRA. Single-file implementations, no abstraction layers.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, LitGPT, Lightning AI, LLM Implementation, LoRA, QLoRA, Fine-Tuning, Llama, Gemma, Phi, Mistral, Educational]
dependencies: [litgpt, torch, transformers]
---

# LitGPT - Clean LLM Implementations

## Quick start

LitGPT provides 20+ pretrained LLM implementations with clean, readable code and production-ready training workflows.

**Installation**:
```bash
pip install 'litgpt[extra]'
```

**Load and use any model**:
```python
from litgpt import LLM

# Load pretrained model
llm = LLM.load("microsoft/phi-2")

# Generate text
result = llm.generate(
    "What is the capital of France?",
    max_new_tokens=50,
    temperature=0.7
)
print(result)
```

**List available models**:
```bash
litgpt download list
```

## Common workflows

### Workflow 1: Fine-tune on custom dataset

Copy this checklist:

```
Fine-Tuning Setup:
- [ ] Step 1: Download pretrained model
- [ ] Step 2: Prepare dataset
- [ ] Step 3: Configure training
- [ ] Step 4: Run fine-tuning
```

**Step 1: Download pretrained model**

```bash
# Download Llama 3 8B
litgpt download meta-llama/Meta-Llama-3-8B

# Download Phi-2 (smaller, faster)
litgpt download microsoft/phi-2

# Download Gemma 2B
litgpt download google/gemma-2b
```

Models are saved to `checkpoints/` directory.

**Step 2: Prepare dataset**

LitGPT supports multiple formats:

**Alpaca format** (instruction-response):
```json
[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish: Hello, how are you?",
    "input": "",
    "output": "Hola, ¿cómo estás?"
  }
]
```

Save as `data/my_dataset.json`.

**Step 3: Configure training**

```bash
# Full fine-tuning (requires 40GB+ GPU for 7B models)
litgpt finetune \
  meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --train.max_steps 1000 \
  --train.learning_rate 2e-5 \
  --train.micro_batch_size 1 \
  --train.global_batch_size 16

# LoRA fine-tuning (efficient, 16GB GPU)
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --train.max_steps 1000 \
  --train.learning_rate 1e-4
```

**Step 4: Run fine-tuning**

Training saves checkpoints to `out/finetune/` automatically.

Monitor training:
```bash
# View logs
tail -f out/finetune/logs.txt

# TensorBoard (if using --train.logger_name tensorboard)
tensorboard --logdir out/finetune/lightning_logs
```

### Workflow 2: LoRA fine-tuning on single GPU

Most memory-efficient option.

```
LoRA Training:
- [ ] Step 1: Choose base model
- [ ] Step 2: Configure LoRA parameters
- [ ] Step 3: Train with LoRA
- [ ] Step 4: Merge LoRA weights (optional)
```

**Step 1: Choose base model**

For limited GPU memory (12-16GB):
- **Phi-2** (2.7B) - Best quality/size tradeoff
- **Llama 3.2 1B** - Smallest, fastest
- **Gemma 2B** - Good reasoning

**Step 2: Configure LoRA parameters**

```bash
# lora_r: rank (8-64; higher = more capacity). lora_alpha: scaling (typically 2×r).
# lora_dropout helps prevent overfitting. Query, value, and projection adapters
# usually suffice; key, MLP, and head adapters are rarely needed.
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --lora_alpha 32 \
  --lora_dropout 0.05 \
  --lora_query true \
  --lora_key false \
  --lora_value true \
  --lora_projection true \
  --lora_mlp false \
  --lora_head false
```

LoRA rank guide:
- `r=8`: Lightweight, 2-4MB adapters
- `r=16`: Standard, good quality
- `r=32`: High capacity, use for complex tasks
- `r=64`: Maximum quality, 4× larger adapters

**Step 3: Train with LoRA**

```bash
litgpt finetune_lora \
  microsoft/phi-2 \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --lora_r 16 \
  --train.epochs 3 \
  --train.learning_rate 1e-4 \
  --train.micro_batch_size 4 \
  --train.global_batch_size 32 \
  --out_dir out/phi2-lora

# Memory usage: ~8-12GB for Phi-2 with LoRA
```

**Step 4: Merge LoRA weights** (optional)

Merge LoRA adapters into base model for deployment:

```bash
litgpt merge_lora \
  out/phi2-lora/final \
  --out_dir out/phi2-merged
```

Now use merged model:
```python
from litgpt import LLM
llm = LLM.load("out/phi2-merged")
```

### Workflow 3: Pretrain from scratch

Train new model on your domain data.

```
Pretraining:
- [ ] Step 1: Prepare pretraining dataset
- [ ] Step 2: Configure model architecture
- [ ] Step 3: Set up multi-GPU training
- [ ] Step 4: Launch pretraining
```

**Step 1: Prepare pretraining dataset**

LitGPT expects tokenized data. Use `prepare_dataset.py`:

```bash
python scripts/prepare_dataset.py \
  --source_path data/my_corpus.txt \
  --checkpoint_dir checkpoints/tokenizer \
  --destination_path data/pretrain \
  --split train,val
```

**Step 2: Configure model architecture**

Edit config file or use existing:

```yaml
# config/pythia-160m.yaml
model_name: pythia-160m
block_size: 2048
vocab_size: 50304
n_layer: 12
n_head: 12
n_embd: 768
rotary_percentage: 0.25
parallel_residual: true
bias: true
```

**Step 3: Set up multi-GPU training**

```bash
# Single GPU
litgpt pretrain \
  --config config/pythia-160m.yaml \
  --data.data_dir data/pretrain \
  --train.max_tokens 10_000_000_000

# Multi-GPU with FSDP
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir data/pretrain \
  --devices 8 \
  --train.max_tokens 100_000_000_000
```

**Step 4: Launch pretraining**

For large-scale pretraining on cluster:

```bash
# Using SLURM
sbatch --nodes=8 --gpus-per-node=8 \
  pretrain_script.sh

# pretrain_script.sh content:
litgpt pretrain \
  --config config/pythia-1b.yaml \
  --data.data_dir /shared/data/pretrain \
  --devices 8 \
  --num_nodes 8 \
  --train.global_batch_size 512 \
  --train.max_tokens 300_000_000_000
```

### Workflow 4: Convert and deploy model

Export LitGPT models for production.

```
Model Deployment:
- [ ] Step 1: Test inference locally
- [ ] Step 2: Quantize model (optional)
- [ ] Step 3: Convert to GGUF (for llama.cpp)
- [ ] Step 4: Deploy with API
```

**Step 1: Test inference locally**

```python
from litgpt import LLM

llm = LLM.load("out/phi2-lora/final")

# Single generation
print(llm.generate("What is machine learning?"))

# Streaming
for token in llm.generate("Explain quantum computing", stream=True):
    print(token, end="", flush=True)

# Batch inference
prompts = ["Hello", "Goodbye", "Thank you"]
results = [llm.generate(p) for p in prompts]
```

**Step 2: Quantize model** (optional)

Reduce model size with minimal quality loss:

```bash
# 4-bit NF4 quantization (~75% smaller weights)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --dtype bfloat16 \
  --quantize bnb.nf4

# 4-bit NF4 with double quantization (extra savings)
litgpt convert_lit_checkpoint \
  out/phi2-lora/final \
  --quantize bnb.nf4-dq
```

**Step 3: Convert to GGUF** (for llama.cpp)

```bash
python scripts/convert_lit_checkpoint.py \
  --checkpoint_path out/phi2-lora/final \
  --output_path models/phi2.gguf \
  --model_name microsoft/phi-2
```

**Step 4: Deploy with API**

```python
from fastapi import FastAPI
from litgpt import LLM

app = FastAPI()
llm = LLM.load("out/phi2-lora/final")

@app.post("/generate")
def generate(prompt: str, max_tokens: int = 100):
    result = llm.generate(
        prompt,
        max_new_tokens=max_tokens,
        temperature=0.7
    )
    return {"response": result}

# Run: uvicorn api:app --host 0.0.0.0 --port 8000
```

## When to use vs alternatives

**Use LitGPT when:**
- Want to understand LLM architectures (clean, readable code)
- Need production-ready training recipes
- Educational purposes or research
- Prototyping new model ideas
- Lightning ecosystem user

**Use alternatives instead:**
- **Axolotl/TRL**: More fine-tuning features, YAML configs
- **Megatron-Core**: Maximum performance for >70B models
- **HuggingFace Transformers**: Broadest model support
- **vLLM**: Inference-only (no training)

## Common issues

**Issue: Out of memory during fine-tuning**

Use LoRA instead of full fine-tuning:
```bash
# Instead of litgpt finetune (requires 40GB+)
litgpt finetune_lora  # Only needs 12-16GB
```

Or keep the effective batch size while lowering per-step memory by accumulating gradients over smaller micro-batches:
```bash
litgpt finetune_lora \
  ... \
  --train.gradient_accumulation_iters 4  # Accumulate gradients
```

**Issue: Training too slow**

Enable Flash Attention (built-in, automatic on compatible hardware):
```python
# Already enabled by default on Ampere+ GPUs (A100, RTX 30/40 series)
# No configuration needed
```

Use smaller micro-batch and accumulate:
```bash
--train.micro_batch_size 1 \
--train.global_batch_size 32 \
--train.gradient_accumulation_iters 32  # Effective batch=32
```

**Issue: Model not loading**

Check model name:
```bash
# List all available models
litgpt download list

# Download if not exists
litgpt download meta-llama/Meta-Llama-3-8B
```

Verify checkpoints directory:
```bash
ls checkpoints/
# Should see: meta-llama/Meta-Llama-3-8B/
```

**Issue: LoRA adapters too large**

Reduce LoRA rank:
```bash
--lora_r 8  # Instead of 16 or 32
```

Apply LoRA to fewer layers:
```bash
# Keep query/value adapters; disable projection and MLP adapters
--lora_query true \
--lora_value true \
--lora_projection false \
--lora_mlp false
```

## Advanced topics

**Supported architectures**: See [references/supported-models.md](references/supported-models.md) for complete list of 20+ model families with sizes and capabilities.

**Training recipes**: See [references/training-recipes.md](references/training-recipes.md) for proven hyperparameter configurations for pretraining and fine-tuning.

**FSDP configuration**: See [references/distributed-training.md](references/distributed-training.md) for multi-GPU training with Fully Sharded Data Parallel.

**Custom architectures**: See [references/custom-models.md](references/custom-models.md) for implementing new model architectures in LitGPT style.

## Hardware requirements

- **GPU**: NVIDIA (CUDA 11.8+), AMD (ROCm), Apple Silicon (MPS)
- **Memory**:
  - Inference (Phi-2): 6GB
  - LoRA fine-tuning (7B): 16GB
  - Full fine-tuning (7B): 40GB+
  - Pretraining (1B): 24GB
- **Storage**: 5-50GB per model (depending on size)

## Resources

- GitHub: https://github.com/Lightning-AI/litgpt
- Docs: https://lightning.ai/docs/litgpt
- Tutorials: https://lightning.ai/docs/litgpt/tutorials
- Model zoo: 20+ pretrained architectures (Llama, Gemma, Phi, Qwen, Mistral, Mixtral, Falcon, etc.)




================================================
FILE: 01-model-architecture/litgpt/references/custom-models.md
================================================
# Custom Models

Guide to implementing custom model architectures in LitGPT.

## Overview

LitGPT's clean, single-file implementations make it easy to create custom architectures. You can extend the base `GPT` class or create entirely new models.

**Use cases**:
- Implementing new research architectures
- Adapting models for specific domains
- Experimenting with attention mechanisms
- Adding custom layers or components

## Key Files and Classes

### Core Architecture (`litgpt/model.py`)

**Main classes**:
- `GPT`: Top-level model class
- `Block`: Transformer block (attention + MLP)
- `CausalSelfAttention`: Attention mechanism
- `MLP`: Feed-forward network
- `RMSNorm` / `LayerNorm`: Normalization layers

**Configuration** (`litgpt/config.py`):
- `Config`: Base configuration dataclass
- Model-specific configs: `LlamaConfig`, `MistralConfig`, `PhiConfig`, etc.
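
Before writing custom components, it can help to sanity-check the base classes. A quick sketch, assuming `Config.from_name` resolves registered names such as `pythia-160m`:

```python
import torch
from litgpt.config import Config
from litgpt.model import GPT

config = Config.from_name("pythia-160m")   # assumed registered config name
model = GPT(config)

idx = torch.randint(0, config.padded_vocab_size, (1, 8))
with torch.no_grad():
    logits = model(idx)
print(logits.shape)  # (1, 8, padded_vocab_size)
```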

## Custom Architecture Workflow

### Step 1: Define Configuration

Create a `Config` dataclass with your model's hyperparameters:

```python
from dataclasses import dataclass
from litgpt.config import Config

@dataclass
class MyModelConfig(Config):
    """Configuration for my custom model."""
    # Standard parameters
    name: str = "my-model-7b"
    block_size: int = 4096
    vocab_size: int = 32000
    n_layer: int = 32
    n_head: int = 32
    n_embd: int = 4096

    # Custom parameters
    custom_param: float = 0.1
    use_custom_attention: bool = True

    # Optional: override defaults
    rope_base: int = 10000
    intermediate_size: int = 11008
```

### Step 2: Implement Custom Components

#### Option A: Custom Attention

```python
from litgpt.model import CausalSelfAttention
import torch
import torch.nn as nn

class CustomAttention(CausalSelfAttention):
    """Custom attention mechanism."""

    def __init__(self, config):
        super().__init__(config)
        # Add custom components
        self.custom_proj = nn.Linear(config.n_embd, config.n_embd)
        self.custom_param = config.custom_param

    def forward(self, x, mask=None, input_pos=None):
        B, T, C = x.size()

        # Combined QKV projection, then split into q, k, v
        # (assumes equal splits, i.e. n_query_groups == n_head)
        q, k, v = self.attn(x).split(C, dim=2)

        # Custom modification
        q = q + self.custom_proj(x) * self.custom_param

        # Rest of attention computation
        q = q.view(B, T, self.n_head, self.head_size)
        k = k.view(B, T, self.n_query_groups, self.head_size)
        v = v.view(B, T, self.n_query_groups, self.head_size)

        # Scaled dot-product attention
        y = self.scaled_dot_product_attention(q, k, v, mask=mask)

        y = y.reshape(B, T, C)
        return self.proj(y)
```

#### Option B: Custom MLP

```python
from litgpt.model import MLP

class CustomMLP(MLP):
    """Custom feed-forward network."""

    def __init__(self, config):
        super().__init__(config)
        # Add custom layers
        self.custom_layer = nn.Linear(config.intermediate_size, config.intermediate_size)

    def forward(self, x):
        x = self.fc_1(x)
        x = self.act(x)
        x = self.custom_layer(x)  # Custom modification
        x = self.fc_2(x)
        return x
```

#### Option C: Custom Block

```python
from litgpt.model import Block

class CustomBlock(Block):
    """Custom transformer block."""

    def __init__(self, config):
        super().__init__(config)
        # Replace attention or MLP
        self.attn = CustomAttention(config)
        # Or: self.mlp = CustomMLP(config)

        # Add custom components
        self.custom_norm = nn.LayerNorm(config.n_embd)

    def forward(self, x, input_pos=None, mask=None):
        # Custom forward pass
        h = self.norm_1(x)
        h = self.attn(h, mask=mask, input_pos=input_pos)
        x = x + h

        # Custom normalization
        x = x + self.custom_norm(x)

        x = x + self.mlp(self.norm_2(x))
        return x
```

### Step 3: Create Custom GPT Model

```python
import torch
import torch.nn as nn
from litgpt.model import GPT

class CustomGPT(GPT):
    """Custom GPT model."""

    def __init__(self, config: MyModelConfig):
        # Don't call super().__init__() - we reimplement
        nn.Module.__init__(self)
        self.config = config

        # Standard components
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.vocab_size, config.n_embd),
                h=nn.ModuleList(CustomBlock(config) for _ in range(config.n_layer)),
                ln_f=nn.LayerNorm(config.n_embd),
            )
        )

        # Custom components
        if config.use_custom_attention:
            self.custom_embedding = nn.Linear(config.n_embd, config.n_embd)

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """Initialize weights (required)."""
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, input_pos=None):
        """Forward pass (must match base signature)."""
        B, T = idx.size()

        # Token embeddings
        x = self.transformer.wte(idx)

        # Custom embedding modification
        if self.config.use_custom_attention:
            x = x + self.custom_embedding(x)

        # Transformer blocks
        for block in self.transformer.h:
            x = block(x, input_pos=input_pos)

        # Final norm + LM head
        x = self.transformer.ln_f(x)
        return self.lm_head(x)
```

### Step 4: Register Configuration

Add your config to `litgpt/config.py`:

```python
# In litgpt/config.py
configs = [
    # ... existing configs ...

    # My custom model
    dict(
        name="my-model-7b",
        hf_config=dict(org="myorg", name="my-model-7b"),
        block_size=4096,
        vocab_size=32000,
        n_layer=32,
        n_head=32,
        n_embd=4096,
        custom_param=0.1,
    ),
]
```

### Step 5: Use Your Custom Model

```python
from litgpt.api import LLM
from my_model import CustomGPT, MyModelConfig

# Initialize
config = MyModelConfig()
model = CustomGPT(config)

# Wrap with LLM API
llm = LLM(model=model, tokenizer_dir="path/to/tokenizer")

# Generate
result = llm.generate("Once upon a time", max_new_tokens=100)
print(result)
```

## Real Example: Adapter Fine-tuning

LitGPT's `Adapter` implementation shows a complete custom architecture:

### Adapter Configuration

```python
@dataclass
class Config(BaseConfig):
    """Adds adapter-specific parameters."""
    adapter_prompt_length: int = 10
    adapter_start_layer: int = 2
```

### Adapter GPT Model

```python
class GPT(BaseModel):
    """GPT model with adapter layers."""

    def __init__(self, config: Config):
        nn.Module.__init__(self)
        self.config = config

        # Standard components
        self.lm_head = nn.Linear(config.n_embd, config.padded_vocab_size, bias=False)
        self.transformer = nn.ModuleDict(
            dict(
                wte=nn.Embedding(config.padded_vocab_size, config.n_embd),
                h=nn.ModuleList(Block(config, i) for i in range(config.n_layer)),
                ln_f=config.norm_class(config.n_embd, eps=config.norm_eps),
            )
        )

        # Adapter-specific: gating factor
        self.gating_factor = torch.nn.Parameter(torch.zeros(1))
```

### Adapter Block

```python
class Block(BaseBlock):
    """Transformer block with adapter."""

    def __init__(self, config: Config, block_idx: int):
        super().__init__()
        self.norm_1 = config.norm_class(config.n_embd, eps=config.norm_eps)
        self.attn = CausalSelfAttention(config, block_idx)
        self.norm_2 = config.norm_class(config.n_embd, eps=config.norm_eps)
        self.mlp = config.mlp_class(config)

        # Adapter: add prefix for certain layers
        self.adapter_wte = (
            nn.Embedding(config.adapter_prompt_length, config.n_embd)
            if block_idx >= config.adapter_start_layer
            else None
        )
```

### Adapter Attention

```python
class CausalSelfAttention(BaseCausalSelfAttention):
    """Attention with adapter prompts."""

    def forward(self, x: torch.Tensor, ...) -> torch.Tensor:
        B, T, C = x.size()

        # Add adapter prefix if enabled
        if self.adapter_wte is not None:
            adapter_prompts = self.adapter_wte(
                torch.arange(self.adapter_prompt_length, device=x.device)
            )
            adapter_prompts = adapter_prompts.unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([adapter_prompts, x], dim=1)

        # Standard attention with gating
        q, k, v = self.attn(x).split(self.n_embd, dim=2)
        y = self.scaled_dot_product_attention(q, k, v, mask=mask)

        # Apply gating factor
        y = y * self.gating_factor

        return self.proj(y)
```

See full implementation: `litgpt/finetune/adapter.py`

## Real Example: AdapterV2

AdapterV2 shows custom linear layers:

### AdapterV2Linear

```python
class AdapterV2Linear(torch.nn.Module):
    """Linear layer with low-rank adapter."""

    def __init__(self, in_features, out_features, adapter_rank=8, **kwargs):
        super().__init__()
        self.linear = torch.nn.Linear(in_features, out_features, **kwargs)

        # Adapter: low-rank bottleneck
        self.adapter_down = torch.nn.Linear(in_features, adapter_rank, bias=False)
        self.adapter_up = torch.nn.Linear(adapter_rank, out_features, bias=False)

        # Initialize adapter_up to zero so the adapter starts as a no-op
        torch.nn.init.zeros_(self.adapter_up.weight)

    def forward(self, x):
        # Original linear transformation
        out = self.linear(x)

        # Add adapter contribution
        adapter_out = self.adapter_up(self.adapter_down(x))
        return out + adapter_out
```
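
A usage sketch for the `AdapterV2Linear` defined above, with illustrative shapes; only the adapter factors stay trainable:

```python
import torch

layer = AdapterV2Linear(in_features=512, out_features=512, adapter_rank=8)

# Freeze the base linear; only adapter_down / adapter_up remain trainable
for p in layer.linear.parameters():
    p.requires_grad_(False)

x = torch.randn(4, 512)
print(layer(x).shape)  # torch.Size([4, 512])

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 512 * 8 = 8192 adapter parameters
```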

See full implementation: `litgpt/finetune/adapter_v2.py`

## Custom Model Checklist

- [ ] Define `Config` dataclass with all hyperparameters
- [ ] Implement custom components (Attention, MLP, Block)
- [ ] Create custom `GPT` class
- [ ] Implement `_init_weights()` for proper initialization
- [ ] Implement `forward()` matching base signature
- [ ] Register configuration in `litgpt/config.py`
- [ ] Test with small model (100M params) first
- [ ] Verify training convergence
- [ ] Profile memory usage

## Testing Your Custom Model

### Unit Test

```python
import torch
from my_model import CustomGPT, MyModelConfig

def test_custom_model():
    """Test custom model forward pass."""
    config = MyModelConfig(
        n_layer=2,
        n_head=4,
        n_embd=128,
        vocab_size=1000,
        block_size=256,
    )

    model = CustomGPT(config)
    model.eval()

    # Test forward pass
    batch_size = 2
    seq_length = 16
    idx = torch.randint(0, config.vocab_size, (batch_size, seq_length))

    with torch.no_grad():
        logits = model(idx)

    assert logits.shape == (batch_size, seq_length, config.vocab_size)
    print("✓ Forward pass works")

if __name__ == "__main__":
    test_custom_model()
```

### Training Test

```python
from litgpt.api import LLM

def test_training():
    """Test custom model training."""
    config = MyModelConfig(n_layer=2, n_head=4, n_embd=128)
    model = CustomGPT(config)

    # Small dataset for testing
    data = [
        {"instruction": "Test", "input": "", "output": "OK"}
    ]

    # Should run without errors
    llm = LLM(model=model)
    # ... training code ...
    print("✓ Training works")
```

## Common Patterns

### Adding New Attention Mechanism

```python
import torch
import torch.nn as nn

class MyAttention(nn.Module):
    """Template for custom attention."""

    def __init__(self, config):
        super().__init__()
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.head_size = self.n_embd // self.n_head

        # Q, K, V projections
        self.q_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.k_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)
        self.v_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

        # Output projection
        self.out_proj = nn.Linear(config.n_embd, config.n_embd, bias=False)

    def forward(self, x, mask=None):
        B, T, C = x.size()

        # Project Q, K, V and move heads to dim 1: (B, n_head, T, head_size)
        q = self.q_proj(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_head, self.head_size).transpose(1, 2)

        # Custom attention computation goes here; causal SDPA is a placeholder
        # attn = custom_attention_function(q, k, v, mask)
        attn = torch.nn.functional.scaled_dot_product_attention(
            q, k, v, attn_mask=mask, is_causal=mask is None
        )

        # Merge heads and apply output projection
        out = self.out_proj(attn.transpose(1, 2).reshape(B, T, C))
        return out
```

### Adding Mixture of Experts

```python
class MoELayer(nn.Module):
    """Mixture of Experts layer."""

    def __init__(self, config):
        super().__init__()
        self.num_experts = config.num_experts
        self.top_k = config.moe_top_k

        # Router
        self.router = nn.Linear(config.n_embd, self.num_experts)

        # Experts
        self.experts = nn.ModuleList([
            MLP(config) for _ in range(self.num_experts)
        ])

    def forward(self, x):
        B, T, C = x.size()

        # Route tokens to experts
        router_logits = self.router(x)  # (B, T, num_experts)
        router_probs = torch.softmax(router_logits, dim=-1)

        # Select top-k experts
        top_k_probs, top_k_indices = torch.topk(router_probs, self.top_k, dim=-1)

        # Process through selected experts
        output = torch.zeros_like(x)
        for i in range(self.top_k):
            expert_idx = top_k_indices[:, :, i]
            expert_prob = top_k_probs[:, :, i:i+1]

            # Route to expert
            for expert_id in range(self.num_experts):
                mask = (expert_idx == expert_id)
                if mask.any():
                    expert_out = self.experts[expert_id](x[mask])
                    output[mask] += expert_out * expert_prob[mask]

        return output
```
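
A self-contained smoke test for the `MoELayer` above. The `MLP` here is a hypothetical stand-in so the snippet runs on its own (in LitGPT you would use `litgpt.model.MLP`), and `num_experts` / `moe_top_k` are custom config fields:

```python
import torch
import torch.nn as nn
from types import SimpleNamespace

class MLP(nn.Module):
    """Stand-in expert; replace with litgpt.model.MLP in a real model."""
    def __init__(self, config):
        super().__init__()
        self.fc_1 = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.fc_2 = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        return self.fc_2(torch.nn.functional.gelu(self.fc_1(x)))

config = SimpleNamespace(n_embd=64, num_experts=4, moe_top_k=2)
moe = MoELayer(config)  # MoELayer defined above
x = torch.randn(2, 16, config.n_embd)
assert moe(x).shape == x.shape
```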

### Adding Positional Encoding

```python
class CustomPositionalEncoding(nn.Module):
    """Custom positional encoding."""

    def __init__(self, config):
        super().__init__()
        self.n_embd = config.n_embd
        self.register_buffer(
            "pos_encoding",
            self._create_encoding(config.block_size, config.n_embd)
        )

    def _create_encoding(self, max_len, d_model):
        """Create positional encoding matrix."""
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * -(torch.log(torch.tensor(10000.0)) / d_model))

        encoding = torch.zeros(max_len, d_model)
        encoding[:, 0::2] = torch.sin(pos * div)
        encoding[:, 1::2] = torch.cos(pos * div)
        return encoding

    def forward(self, x):
        """Add positional encoding."""
        return x + self.pos_encoding[:x.size(1), :]
```

## Debugging Tips

1. **Start small**: Test with 2 layers, 128 hidden size
2. **Check shapes**: Print tensor shapes at each step
3. **Verify gradients**: Ensure all parameters have gradients
4. **Compare to base**: Run same config with base `GPT` model
5. **Profile memory**: Use `torch.cuda.memory_summary()`
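
A small sketch covering tips 3 and 5 (assumes a model and a completed `loss.backward()` call):

```python
import torch

def report_missing_grads(model: torch.nn.Module) -> None:
    """List trainable parameters that received no gradient after backward()."""
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(f"no gradient: {name}")

# report_missing_grads(model)
# print(torch.cuda.memory_summary())
```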

## References

- Base model: `litgpt/model.py`
- Configuration: `litgpt/config.py`
- Adapter example: `litgpt/finetune/adapter.py`
- AdapterV2 example: `litgpt/finetune/adapter_v2.py`
- LoRA example: `litgpt/finetune/lora.py`


================================================
FILE: 01-model-architecture/litgpt/references/distributed-training.md
================================================
# Distributed Training

Guide to FSDP (Fully Sharded Data Parallel) distributed training in LitGPT for scaling to multiple GPUs and nodes.

## Overview

LitGPT uses **Lightning Fabric** with **FSDP** to distribute training across multiple GPUs. FSDP shards model parameters, gradients, and optimizer states to enable training models larger than single-GPU memory.

**When to use FSDP**:
- Model doesn't fit on single GPU
- Want faster training with multi-GPU
- Training models >7B parameters
- Need to scale across multiple nodes

## Quick Start

### Single Node Multi-GPU

```bash
# Train Llama 2 7B on 4 GPUs
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --devices 4 \
  --data JSON \
  --data.json_path data/alpaca.json
```

FSDP is **automatically enabled** when `devices > 1`.

### Multi-Node Training

```bash
# Train on 2 nodes with 8 GPUs each (16 total)
litgpt finetune_lora meta-llama/Llama-2-70b-hf \
  --devices 8 \
  --num_nodes 2 \
  --data JSON \
  --data.json_path data/alpaca.json
```

## FSDP Configuration

### Default FSDP Strategy

When multiple devices are used, LitGPT applies this FSDP configuration:

```python
from lightning.fabric.strategies import FSDPStrategy
from litgpt.model import Block

strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="full",
    sharding_strategy="HYBRID_SHARD"
)
```

**Parameters**:
- `auto_wrap_policy={Block}`: Automatically wraps each transformer `Block` with FSDP
- `state_dict_type="full"`: Saves full model (assembled on rank 0) for easy deployment
- `sharding_strategy="HYBRID_SHARD"`: Shards parameters, gradients, and optimizer states

### Sharding Strategies

| Strategy | Shards | Communication | Use Case |
|----------|--------|---------------|----------|
| `FULL_SHARD` (ZeRO-3) | Params + Grads + Optim | All-gather before forward/backward | Maximum memory savings |
| `SHARD_GRAD_OP` (ZeRO-2) | Grads + Optim only | Reduce-scatter after backward | Faster than FULL_SHARD |
| `HYBRID_SHARD` (default) | All (hybrid across nodes) | Optimized for multi-node | Best for clusters |
| `NO_SHARD` | None | All-reduce gradients (DDP-equivalent) | Debugging, small models |

**Recommendation**: Use default `HYBRID_SHARD` for multi-node, or `FULL_SHARD` for single-node multi-GPU.

### State Dict Types

| Type | Behavior | Use Case |
|------|----------|----------|
| `full` (default) | Gathers all shards on rank 0, saves single file | Easy deployment, inference |
| `sharded` | Each rank saves its shard separately | Faster checkpointing, resume training |
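
A sketch of opting into sharded checkpoints, using the same `FSDPStrategy` constructor shown above:

```python
from lightning.fabric.strategies import FSDPStrategy
from litgpt.model import Block

strategy = FSDPStrategy(
    auto_wrap_policy={Block},
    state_dict_type="sharded",      # each rank writes its own shard (faster saves)
    sharding_strategy="FULL_SHARD",
)
```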

### Auto-Wrap Policy

FSDP wraps model components based on `auto_wrap_policy`:

```python
auto_wrap_policy={Block}  # Wrap each transformer block
```

This means each `Block` (transformer layer) is wrapped and sharded independently. For a 32-layer model on 4 GPUs, every block's parameters are split four ways, so each GPU holds roughly a quarter of each layer and gathers the rest on demand during forward and backward.

## Thunder FSDP (Advanced)

LitGPT includes an experimental **Thunder** extension with enhanced FSDP:

```bash
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --num_nodes 1 \
  --compiler thunder \
  --strategy fsdp
```

### Thunder FSDP Configuration

```python
from extensions.thunder.pretrain import ThunderFSDPStrategy

strategy = ThunderFSDPStrategy(
    sharding_strategy="ZERO3",
    bucketing_strategy="BLOCK",
    state_dict_type="full",
    jit=False,
)
```

**Additional Parameters**:
- `sharding_strategy`: `"ZERO3"` (full shard), `"ZERO2"` (grad/optim only)
- `bucketing_strategy`: `"BLOCK"` (combine ops per block), `"LAYER"` (per layer), `"NONE"` (no bucketing)
- `jit`: Whether to apply `thunder.jit(model)` for optimization
- `executors`: Tuple of Thunder executors to enable

**Bucketing Strategy**:
- `"BLOCK"` (default): Combines collective operations for layer blocks → fewer communication calls
- `"LAYER"`: Combines per layer class
- `"NONE"`: No bucketing → more fine-grained but more overhead

## Pretraining with FSDP

### Single Node

```bash
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --train.global_batch_size 512 \
  --train.micro_batch_size 8 \
  --data Alpaca2k
```

**Memory calculation**:
- TinyLlama 1.1B: ~4GB model + ~4GB gradients + ~8GB optimizer = 16GB per GPU without FSDP
- With FSDP on 8 GPUs: 16GB / 8 = 2GB per GPU ✅ Fits easily

### Multi-Node

```bash
# Launch on 4 nodes with 8 GPUs each (32 total)
litgpt pretrain llama-2-7b \
  --devices 8 \
  --num_nodes 4 \
  --train.global_batch_size 1024 \
  --train.micro_batch_size 2 \
  --data RedPajama
```

**Memory calculation**:
- Llama 2 7B: ~28GB model + ~28GB gradients + ~56GB optimizer = 112GB total
- With FSDP on 32 GPUs: 112GB / 32 = 3.5GB per GPU ✅
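
These estimates follow a simple rule of thumb (FP32 weights, FP32 gradients, Adam states roughly twice the weights, all sharded across the world size). A rough helper:

```python
def fsdp_memory_per_gpu_gb(params_billion: float, world_size: int, bytes_per_param: int = 4) -> float:
    """Rough per-GPU memory: (model + gradients + Adam states) / world_size."""
    model = params_billion * bytes_per_param   # GB of FP32 weights
    grads = model
    optim = 2 * model                          # Adam first and second moments
    return (model + grads + optim) / world_size

print(round(fsdp_memory_per_gpu_gb(7.0, world_size=32), 1))   # ~3.5 GB (Llama 2 7B on 32 GPUs)
print(round(fsdp_memory_per_gpu_gb(1.1, world_size=8), 1))    # ~2.2 GB (TinyLlama on 8 GPUs)
```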

## Fine-tuning with FSDP

### LoRA Fine-tuning (Recommended)

LoRA fine-tuning with FSDP for >7B models:

```bash
# Llama 2 70B LoRA on 8 GPUs
litgpt finetune_lora meta-llama/Llama-2-70b-hf \
  --devices 8 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 16 \
  --train.micro_batch_size 1 \
  --lora_r 8
```

**Why LoRA with FSDP**:
- Base model sharded with FSDP (memory efficient)
- Only LoRA adapters trained (fast)
- Best of both worlds for large models

### Full Fine-tuning

Full fine-tuning with FSDP:

```bash
# Llama 2 7B full fine-tune on 4 GPUs
litgpt finetune_full meta-llama/Llama-2-7b-hf \
  --devices 4 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 16 \
  --train.micro_batch_size 1 \
  --train.learning_rate 3e-5
```

## Mixed Precision

FSDP works with mixed precision for memory savings and speedup:

```bash
# BF16 mixed precision (recommended for A100/H100)
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --precision bf16-mixed

# FP16 mixed precision (V100 compatible)
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --precision 16-mixed
```

**Precision options**:
- `bf16-mixed`: BF16 for computation, FP32 for master weights (best for Ampere+)
- `16-mixed`: FP16 for computation, FP32 for master weights (V100)
- `32-true`: Full FP32 (debugging only, slow)

## Gradient Accumulation

Simulate larger batch sizes with gradient accumulation:

```bash
# Simulate global_batch_size=512 with micro_batch_size=2
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --train.global_batch_size 512 \
  --train.micro_batch_size 2
# Accumulates over 512/(8*2) = 32 steps per optimizer update
```

**Formula**:
```
Gradient accumulation steps = global_batch_size / (devices × micro_batch_size)
```
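
The same formula as a tiny helper:

```python
def grad_accum_steps(global_batch_size: int, devices: int, micro_batch_size: int) -> int:
    """Optimizer update happens once every this many micro-batches."""
    assert global_batch_size % (devices * micro_batch_size) == 0
    return global_batch_size // (devices * micro_batch_size)

print(grad_accum_steps(512, devices=8, micro_batch_size=2))  # 32
```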

## Memory Optimization

### Out of Memory? Try These

1. **Increase devices**:
   ```bash
   --devices 8  # Instead of 4
   ```

2. **Reduce micro batch size**:
   ```bash
   --train.micro_batch_size 1  # Instead of 2
   ```

3. **Lower precision**:
   ```bash
   --precision bf16-mixed  # Instead of 32-true
   ```

4. **Use FULL_SHARD**:
   ```python
   strategy = FSDPStrategy(
       sharding_strategy="FULL_SHARD"  # Maximum memory savings
   )
   ```

5. **Enable activation checkpointing** (implemented in model):
   ```python
   # Recomputes activations during backward pass
   # Trades compute for memory
   ```

6. **Use QLoRA**:
   ```bash
   litgpt finetune_lora meta-llama/Llama-2-7b-hf \
     --quantize bnb.nf4 \
     --devices 1  # May not need FSDP with quantization
   ```

## Checkpointing

### Save Checkpoints

FSDP automatically handles checkpoint saving:

```bash
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --out_dir checkpoints/tinyllama-pretrain
# Saves to: checkpoints/tinyllama-pretrain/final/lit_model.pth
```

With `state_dict_type="full"` (default), rank 0 assembles full model and saves single file.

### Resume Training

```bash
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --resume checkpoints/tinyllama-pretrain/
# Automatically loads latest checkpoint
```

### Convert to HuggingFace

```bash
python scripts/convert_lit_checkpoint.py \
  --checkpoint_path checkpoints/tinyllama-pretrain/final/lit_model.pth \
  --output_dir models/tinyllama-hf
```

## Performance Tuning

### Communication Backends

LitGPT uses NCCL for GPU communication:

```bash
# Default (NCCL auto-configured)
litgpt pretrain tiny-llama-1.1b --devices 8

# Explicit NCCL settings (advanced)
NCCL_DEBUG=INFO \
NCCL_IB_DISABLE=0 \
litgpt pretrain tiny-llama-1.1b --devices 8
```

**NCCL Environment Variables**:
- `NCCL_DEBUG=INFO`: Enable debug logging
- `NCCL_IB_DISABLE=0`: Use InfiniBand (if available)
- `NCCL_SOCKET_IFNAME=eth0`: Specify network interface

### Multi-Node Setup

**Option 1: SLURM**

```bash
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1

srun litgpt pretrain llama-2-7b \
  --devices 8 \
  --num_nodes 4 \
  --data RedPajama
```

**Option 2: torchrun**

```bash
# On each node, run:
torchrun \
  --nproc_per_node=8 \
  --nnodes=4 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  -m litgpt pretrain llama-2-7b
```

### Profiling

Enable profiling to identify bottlenecks:

```bash
litgpt pretrain tiny-llama-1.1b \
  --devices 8 \
  --train.max_steps 100 \
  --profile
# Generates profiling report
```

## Example Configurations

### Llama 2 7B on 4× A100 (40GB)

```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --devices 4 \
  --precision bf16-mixed \
  --train.global_batch_size 64 \
  --train.micro_batch_size 4 \
  --train.max_seq_length 2048 \
  --lora_r 8 \
  --data JSON \
  --data.json_path data/alpaca.json
```

**Memory per GPU**: ~20GB
**Throughput**: ~5 samples/sec

### Llama 2 70B on 8× A100 (80GB)

```bash
litgpt finetune_lora meta-llama/Llama-2-70b-hf \
  --devices 8 \
  --precision bf16-mixed \
  --train.global_batch_size 32 \
  --train.micro_batch_size 1 \
  --train.max_seq_length 2048 \
  --lora_r 8 \
  --data JSON \
  --data.json_path data/alpaca.json
```

**Memory per GPU**: ~70GB
**Throughput**: ~1 sample/sec

### Llama 3.1 405B on 64× H100 (80GB)

```bash
litgpt finetune_lora meta-llama/Llama-3.1-405B \
  --devices 8 \
  --num_nodes 8 \
  --precision bf16-mixed \
  --train.global_batch_size 128 \
  --train.micro_batch_size 1 \
  --train.max_seq_length 4096 \
  --lora_r 16 \
  --data JSON \
  --data.json_path data/alpaca.json
```

**Memory per GPU**: ~60GB
**Requires**: 64 H100 GPUs (8 nodes × 8 GPUs)

## Troubleshooting

### "CUDA out of memory"

1. Reduce `micro_batch_size`
2. Increase `devices` (more sharding)
3. Lower `max_seq_length`
4. Use `bf16-mixed` precision
5. Try QLoRA (`--quantize bnb.nf4`)

### "NCCL error" or Slow Communication

1. Check network connectivity between nodes
2. Enable InfiniBand: `NCCL_IB_DISABLE=0`
3. Verify NCCL version: `python -c "import torch; print(torch.cuda.nccl.version())"`
4. Test with NCCL tests: `$NCCL_HOME/build/all_reduce_perf -b 8 -e 128M`

### Training Slower Than Expected

1. Profile with `--profile`
2. Check GPU utilization: `nvidia-smi dmon`
3. Verify data loading isn't bottleneck
4. Increase `micro_batch_size` if memory allows
5. Use Thunder FSDP with bucketing

## References

- FSDP configuration: `litgpt/pretrain.py:setup()`
- Thunder FSDP: `extensions/thunder/pretrain.py`
- Memory optimization guide: `tutorials/oom.md`
- Lightning Fabric docs: https://lightning.ai/docs/fabric/


================================================
FILE: 01-model-architecture/litgpt/references/supported-models.md
================================================
# Supported Models

Complete list of model architectures supported by LitGPT with parameter sizes and variants.

## Overview

LitGPT supports **20+ model families** with **100+ model variants** ranging from 135M to 405B parameters.

**List all models**:
```bash
litgpt download list
```

**List pretrain-capable models**:
```bash
litgpt pretrain list
```

## Model Families

### Llama Family

**Llama 3, 3.1, 3.2, 3.3**:
- **Sizes**: 1B, 3B, 8B, 70B, 405B
- **Use Cases**: General-purpose, long-context (128K), multimodal
- **Best For**: Production applications, research, instruction following

**Code Llama**:
- **Sizes**: 7B, 13B, 34B, 70B
- **Use Cases**: Code generation, completion, infilling
- **Best For**: Programming assistants, code analysis

**Function Calling Llama 2**:
- **Sizes**: 7B
- **Use Cases**: Tool use, API integration
- **Best For**: Agents, function execution

**Llama 2**:
- **Sizes**: 7B, 13B, 70B
- **Use Cases**: General-purpose (predecessor to Llama 3)
- **Best For**: Established baselines, research comparisons

**Llama 3.1 Nemotron**:
- **Sizes**: 70B
- **Use Cases**: NVIDIA-optimized variant
- **Best For**: Enterprise deployments

**TinyLlama**:
- **Sizes**: 1.1B
- **Use Cases**: Edge devices, resource-constrained environments
- **Best For**: Fast inference, mobile deployment

**OpenLLaMA**:
- **Sizes**: 3B, 7B, 13B
- **Use Cases**: Open-source Llama reproduction
- **Best For**: Research, education

**Vicuna**:
- **Sizes**: 7B, 13B, 33B
- **Use Cases**: Chatbot, instruction following
- **Best For**: Conversational AI

**R1 Distill Llama**:
- **Sizes**: 8B, 70B
- **Use Cases**: Distilled reasoning models
- **Best For**: Efficient reasoning tasks

**MicroLlama**:
- **Sizes**: 300M
- **Use Cases**: Extremely small Llama variant
- **Best For**: Prototyping, testing

**Platypus**:
- **Sizes**: 7B, 13B, 70B
- **Use Cases**: STEM-focused fine-tune
- **Best For**: Science, math, technical domains

### Mistral Family

**Mistral**:
- **Sizes**: 7B, 123B
- **Use Cases**: Efficient open models, long-context
- **Best For**: Cost-effective deployments

**Mathstral**:
- **Sizes**: 7B
- **Use Cases**: Math reasoning
- **Best For**: Mathematical problem solving

**Mixtral MoE**:
- **Sizes**: 8×7B (47B total, 13B active), 8×22B (141B total, 39B active)
- **Use Cases**: Sparse mixture of experts
- **Best For**: High capacity with lower compute

### Falcon Family

**Falcon**:
- **Sizes**: 7B, 40B, 180B
- **Use Cases**: Open-source models from TII
- **Best For**: Multilingual applications

**Falcon 3**:
- **Sizes**: 1B, 3B, 7B, 10B
- **Use Cases**: Newer Falcon generation
- **Best For**: Efficient multilingual models

### Phi Family (Microsoft)

**Phi 1.5 & 2**:
- **Sizes**: 1.3B, 2.7B
- **Use Cases**: Small language models with strong performance
- **Best For**: Edge deployment, low-resource environments

**Phi 3 & 3.5**:
- **Sizes**: 3.8B
- **Use Cases**: Improved small models
- **Best For**: Mobile, browser-based applications

**Phi 4**:
- **Sizes**: 14B
- **Use Cases**: Medium-size high-performance model
- **Best For**: Balance of size and capability

**Phi 4 Mini Instruct**:
- **Sizes**: 3.8B
- **Use Cases**: Instruction-tuned variant
- **Best For**: Chat, task completion

### Gemma Family (Google)

**Gemma**:
- **Sizes**: 2B, 7B
- **Use Cases**: Google's open models
- **Best For**: Research, education

**Gemma 2**:
- **Sizes**: 2B, 9B, 27B
- **Use Cases**: Second generation improvements
- **Best For**: Enhanced performance

**Gemma 3**:
- **Sizes**: 1B, 4B, 12B, 27B
- **Use Cases**: Latest Gemma generation
- **Best For**: State-of-the-art open models

**CodeGemma**:
- **Sizes**: 7B
- **Use Cases**: Code-specialized Gemma
- **Best For**: Code generation, analysis

### Qwen Family (Alibaba)

**Qwen2.5**:
- **Sizes**: 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B
- **Use Cases**: General-purpose multilingual models
- **Best For**: Chinese/English applications

**Qwen2.5 Coder**:
- **Sizes**: 0.5B, 1.5B, 3B, 7B, 14B, 32B
- **Use Cases**: Code-specialized variants
- **Best For**: Programming in multiple languages

**Qwen2.5 Math**:
- **Sizes**: 1.5B, 7B, 72B
- **Use Cases**: Mathematical reasoning
- **Best For**: Math problems, STEM education

**QwQ & QwQ-Preview**:
- **Sizes**: 32B
- **Use Cases**: Question-answering focus
- **Best For**: Reasoning tasks

### Pythia Family (EleutherAI)

**Pythia**:
- **Sizes**: 14M, 31M, 70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, 12B
- **Use Cases**: Research, interpretability
- **Best For**: Scientific studies, ablations

### StableLM Family (Stability AI)

**StableLM**:
- **Sizes**: 3B, 7B
- **Use Cases**: Open models from Stability AI
- **Best For**: Research, commercial use

**StableLM Zephyr**:
- **Sizes**: 3B
- **Use Cases**: Instruction-tuned variant
- **Best For**: Chat applications

**StableCode**:
- **Sizes**: 3B
- **Use Cases**: Code generation
- **Best For**: Programming tasks

**FreeWilly2 (Stable Beluga 2)**:
- **Sizes**: 70B
- **Use Cases**: Large Stability AI model
- **Best For**: High-capability tasks

### Other Models

**Danube2**:
- **Sizes**: 1.8B
- **Use Cases**: Efficient small model
- **Best For**: Resource-constrained environments

**Dolly**:
- **Sizes**: 3B, 7B, 12B
- **Use Cases**: Databricks' instruction-following model
- **Best For**: Enterprise applications

**LongChat**:
- **Sizes**: 7B, 13B
- **Use Cases**: Extended context windows
- **Best For**: Long-document understanding

**Nous-Hermes**:
- **Sizes**: 7B, 13B, 70B
- **Use Cases**: Instruction-following fine-tune
- **Best For**: Task completion, reasoning

**OLMo**:
- **Sizes**: 1B, 7B
- **Use Cases**: Allen AI's fully open model
- **Best For**: Research transparency

**RedPajama-INCITE**:
- **Sizes**: 3B, 7B
- **Use Cases**: Open reproduction project
- **Best For**: Research, education

**Salamandra**:
- **Sizes**: 2B, 7B
- **Use Cases**: Multilingual European model
- **Best For**: European language support

**SmolLM2**:
- **Sizes**: 135M, 360M, 1.7B
- **Use Cases**: Ultra-small models
- **Best For**: Edge devices, testing

## Download Examples

**Download specific model**:
```bash
litgpt download meta-llama/Llama-3.2-1B
litgpt download microsoft/phi-2
litgpt download google/gemma-2-9b
```

**Download with HuggingFace token** (for gated models):
```bash
export HF_TOKEN=hf_...
litgpt download meta-llama/Llama-3.1-405B
```

## Model Selection Guide

### By Use Case

**General Chat/Instruction Following**:
- Small: Phi-2 (2.7B), TinyLlama (1.1B)
- Medium: Llama-3.1-8B, Mistral-7B
- Large: Llama-3.1-70B, Mixtral-8x22B

**Code Generation**:
- Small: Qwen2.5-Coder-3B
- Medium: CodeLlama-13B, CodeGemma-7B
- Large: CodeLlama-70B, Qwen2.5-Coder-32B

**Math/Reasoning**:
- Small: Qwen2.5-Math-1.5B
- Medium: Mathstral-7B, Qwen2.5-Math-7B
- Large: QwQ-32B, Qwen2.5-Math-72B

**Multilingual**:
- Small: SmolLM2-1.7B
- Medium: Qwen2.5-7B, Falcon-7B
- Large: Qwen2.5-72B

**Research/Education**:
- Pythia family (14M-12B for ablations)
- OLMo (fully open)
- TinyLlama (fast iteration)

### By Hardware

**Consumer GPU (8-16GB VRAM)**:
- Phi-2 (2.7B)
- TinyLlama (1.1B)
- Gemma-2B
- SmolLM2 family

**Single A100 (40-80GB)**:
- Llama-3.1-8B
- Mistral-7B
- CodeLlama-13B
- Gemma-9B

**Multi-GPU (200GB+ total)**:
- Llama-3.1-70B (TP=4)
- Mixtral-8x22B (TP=2)
- Falcon-40B

**Large Cluster**:
- Llama-3.1-405B (FSDP)
- Falcon-180B

## Model Capabilities

### Context Lengths

| Model | Context Window |
|-------|----------------|
| Llama 3.1 | 128K |
| Llama 3.2/3.3 | 128K |
| Mistral-123B | 128K |
| Mixtral | 32K |
| Gemma 2 | 8K |
| Phi-3 | 128K |
| Qwen2.5 | 32K |

### Training Data

- **Llama 3**: 15T tokens (multilingual)
- **Mistral**: Web data, code
- **Qwen**: Multilingual (Chinese/English focus)
- **Pythia**: The Pile (controlled training)

## References

- LitGPT GitHub: https://github.com/Lightning-AI/litgpt
- Model configs: `litgpt/config.py`
- Download tutorial: `tutorials/download_model_weights.md`


================================================
FILE: 01-model-architecture/litgpt/references/training-recipes.md
================================================
# Training Recipes

Complete hyperparameter configurations for LoRA, QLoRA, and full fine-tuning across different model sizes.

## Overview

LitGPT provides optimized training configurations in `config_hub/finetune/` for various model architectures and fine-tuning methods.

**Key Configuration Files**:
- `config_hub/finetune/*/lora.yaml` - LoRA fine-tuning
- `config_hub/finetune/*/qlora.yaml` - 4-bit quantized LoRA
- `config_hub/finetune/*/full.yaml` - Full fine-tuning

## LoRA Fine-tuning Recipes

### TinyLlama 1.1B LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 8
lr_warmup_steps: 10
epochs: 3
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
```

**Command**:
```bash
litgpt finetune_lora TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 8 \
  --train.lr_warmup_steps 10 \
  --train.epochs 3 \
  --train.max_seq_length 512 \
  --lora_r 8 \
  --lora_alpha 16
```

**Memory**: ~4GB VRAM
**Time**: ~30 minutes on RTX 3090

### Llama 2 7B LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
```

**Command**:
```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.lr_warmup_steps 10 \
  --train.epochs 4 \
  --lora_r 8 \
  --lora_alpha 16
```

**Memory**: ~16GB VRAM
**Gradient Accumulation**: 4 steps (8 / 2)
**Time**: ~6 hours on A100

### Llama 3 8B LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 2
max_seq_length: 512

# LoRA specific
lora_r: 8
lora_alpha: 16
lora_dropout: 0.05
```

**Command**:
```bash
litgpt finetune_lora meta-llama/Meta-Llama-3-8B \
  --data JSON \
  --data.json_path data/custom_dataset.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.lr_warmup_steps 10 \
  --train.epochs 2 \
  --lora_r 8
```

**Memory**: ~20GB VRAM
**Gradient Accumulation**: 8 steps
**Time**: ~8 hours on A100

### Mistral 7B LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
```

**Command**:
```bash
litgpt finetune_lora mistralai/Mistral-7B-v0.1 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 4 \
  --lora_r 8
```

**Memory**: ~16GB VRAM

### Phi-2 LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 4
lr_warmup_steps: 10
epochs: 1
max_seq_length: 512

lora_r: 8
lora_alpha: 16
```

**Command**:
```bash
litgpt finetune_lora microsoft/phi-2 \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 4 \
  --train.epochs 1 \
  --lora_r 8
```

**Memory**: ~8GB VRAM
**Time**: ~20 minutes on RTX 3090

### Falcon 7B LoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
```

**Command**:
```bash
litgpt finetune_lora tiiuae/falcon-7b \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.epochs 4 \
  --lora_r 8
```

**Memory**: ~18GB VRAM

### Gemma 7B LoRA

**Configuration**:
```yaml
global_batch_size: 6
micro_batch_size: 1
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
```

**Command**:
```bash
litgpt finetune_lora google/gemma-7b \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 6 \
  --train.micro_batch_size 1 \
  --train.lr_warmup_steps 200 \
  --train.epochs 2 \
  --lora_r 8
```

**Memory**: ~18GB VRAM
**Note**: Longer warmup (200 steps) for stability

## QLoRA Fine-tuning Recipes

QLoRA uses 4-bit quantization to reduce memory by ~75%.

### TinyLlama 1.1B QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 8
lr_warmup_steps: 10
epochs: 3
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Command**:
```bash
litgpt finetune_lora TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/alpaca_sample.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 8 \
  --train.epochs 3 \
  --lora_r 8
```

**Memory**: ~2GB VRAM (75% reduction)

### Llama 2 7B QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512
min_lr: 6.0e-5

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Command**:
```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 4 \
  --lora_r 8
```

**Memory**: ~6GB VRAM (consumer GPU friendly)

### Llama 3 8B QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Command**:
```bash
litgpt finetune_lora meta-llama/Meta-Llama-3-8B \
  --quantize bnb.nf4 \
  --data JSON \
  --data.json_path data/custom_dataset.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.epochs 2 \
  --lora_r 8
```

**Memory**: ~8GB VRAM

### Mistral 7B QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Memory**: ~6GB VRAM

### Phi-2 QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 4
lr_warmup_steps: 10
epochs: 1
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Memory**: ~3GB VRAM

### Falcon 7B QLoRA

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 10
epochs: 4
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Memory**: ~6GB VRAM

### Gemma 2B QLoRA

**Configuration**:
```yaml
global_batch_size: 6
micro_batch_size: 2
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Memory**: ~3GB VRAM

### Gemma 7B QLoRA

**Configuration**:
```yaml
global_batch_size: 6
micro_batch_size: 1
lr_warmup_steps: 200
epochs: 2
max_seq_length: 512

lora_r: 8
lora_alpha: 16
quantize: "bnb.nf4"
```

**Memory**: ~6GB VRAM

## Full Fine-tuning Recipes

Full fine-tuning updates all model parameters (requires more memory).

### TinyLlama 1.1B Full

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 2
lr_warmup_steps: 100
epochs: 3
max_seq_length: 512
learning_rate: 5e-5
```

**Command**:
```bash
litgpt finetune_full TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 2 \
  --train.lr_warmup_steps 100 \
  --train.epochs 3 \
  --train.learning_rate 5e-5
```

**Memory**: ~12GB VRAM
**Time**: ~4 hours on A100

### Phi-2 Full

**Configuration**:
```yaml
global_batch_size: 8
micro_batch_size: 1
lr_warmup_steps: 100
epochs: 2
max_seq_length: 512
learning_rate: 3e-5
```

**Command**:
```bash
litgpt finetune_full microsoft/phi-2 \
  --data JSON \
  --data.json_path data/alpaca.json \
  --train.global_batch_size 8 \
  --train.micro_batch_size 1 \
  --train.epochs 2 \
  --train.learning_rate 3e-5
```

**Memory**: ~24GB VRAM

## Common Hyperparameter Patterns

### Learning Rates

| Model Size | LoRA LR | Full Fine-tune LR |
|------------|---------|-------------------|
| <2B | 3e-4 | 5e-5 |
| 2-10B | 1e-4 | 3e-5 |
| 10-70B | 5e-5 | 1e-5 |
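
An illustrative helper that encodes this table as a starting point (not a LitGPT API; tune from here):

```python
def suggested_lr(params_billion: float, method: str = "lora") -> float:
    """Starting learning rate from the table above."""
    if params_billion < 2:
        return 3e-4 if method == "lora" else 5e-5
    if params_billion <= 10:
        return 1e-4 if method == "lora" else 3e-5
    return 5e-5 if method == "lora" else 1e-5

print(suggested_lr(7))            # 1e-4 for a 7B LoRA run
print(suggested_lr(70, "full"))   # 1e-5 for a 70B full fine-tune
```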

### LoRA Rank (r)

- **r=8**: Default, good balance (recommended)
- **r=16**: More capacity, 2× trainable params
- **r=32**: Maximum capacity, slower training
- **r=4**: Minimal, fastest training

**Rule of thumb**: Start with r=8, increase if underfitting.

### Batch Sizes

| GPU VRAM | Micro Batch | Global Batch |
|----------|-------------|--------------|
| 8GB | 1 | 8 |
| 16GB | 2 | 8-16 |
| 40GB | 4 | 16-32 |
| 80GB | 8 | 32-64 |

### Warmup Steps

- **Small models (<2B)**: 10-50 steps
- **Medium models (2-10B)**: 100-200 steps
- **Large models (>10B)**: 200-500 steps

### Epochs

- **Instruction tuning**: 1-3 epochs
- **Domain adaptation**: 3-5 epochs
- **Small datasets (<10K)**: 5-10 epochs

## Advanced Configurations

### Custom Learning Rate Schedule

```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --train.learning_rate 3e-4 \
  --train.lr_warmup_steps 100 \
  --train.min_lr 3e-6 \
  --train.lr_decay_iters 10000
```

### Gradient Accumulation

```bash
# Simulate global_batch_size=128 with 16GB GPU
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --train.global_batch_size 128 \
  --train.micro_batch_size 2
# Accumulates over 64 steps (128 / 2)
```

### Mixed Precision

```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --precision bf16-mixed  # BF16 mixed precision
# or
  --precision 16-mixed  # FP16 mixed precision
```

### Longer Context

```bash
litgpt finetune_lora meta-llama/Llama-3.1-8B \
  --train.max_seq_length 8192 \
  --train.micro_batch_size 1  # Reduce batch for memory
```

## Memory Optimization

### Out of Memory? Try These

1. **Enable quantization**:
   ```bash
   --quantize bnb.nf4  # 4-bit QLoRA
   ```

2. **Reduce batch size**:
   ```bash
   --train.micro_batch_size 1
   ```

3. **Lower LoRA rank**:
   ```bash
   --lora_r 4  # Instead of 8
   ```

4. **Use FSDP** (multi-GPU):
   ```bash
   litgpt finetune_lora meta-llama/Llama-2-7b-hf \
     --devices 4  # Use 4 GPUs with FSDP
   ```

5. **Gradient accumulation** (more steps, less memory per step):
   ```bash
   --train.gradient_accumulation_iters 16
   ```

## Data Format

LitGPT expects JSON data in instruction format:

```json
[
  {
    "instruction": "What is the capital of France?",
    "input": "",
    "output": "The capital of France is Paris."
  },
  {
    "instruction": "Translate to Spanish:",
    "input": "Hello world",
    "output": "Hola mundo"
  }
]
```

**Load custom data**:
```bash
litgpt finetune_lora meta-llama/Llama-2-7b-hf \
  --data JSON \
  --data.json_path data/my_dataset.json \
  --data.val_split_fraction 0.1  # 10% validation
```

## Merge and Deploy

After fine-tuning, merge LoRA weights:

```bash
litgpt merge_lora checkpoints/meta-llama/Llama-2-7b-hf/final_lora.pth
```

Generate with merged model:

```bash
litgpt generate checkpoints/meta-llama/Llama-2-7b-hf-merged/ \
  --prompt "What is machine learning?"
```

Or serve via API:

```bash
litgpt serve checkpoints/meta-llama/Llama-2-7b-hf-merged/
```

## References

- Configuration hub: `config_hub/finetune/`
- Fine-tuning tutorial: `tutorials/finetune_*.md`
- Memory guide: `tutorials/oom.md`


================================================
FILE: 01-model-architecture/mamba/SKILL.md
================================================
---
name: mamba-architecture
description: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster inference, million-token sequences, no KV cache. Selective SSM with hardware-aware design. Mamba-1 (d_state=16) and Mamba-2 (d_state=128, multi-head). Models 130M-2.8B on HuggingFace.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, Mamba, State Space Models, SSM, Linear Complexity, Long Context, Efficient Inference, Hardware-Aware, Alternative To Transformers]
dependencies: [mamba-ssm, torch, transformers, causal-conv1d]
---

# Mamba - Selective State Space Models

## Quick start

Mamba is a state-space model architecture achieving O(n) linear complexity for sequence modeling.

**Installation**:
```bash
# Install causal-conv1d (optional, for efficiency)
pip install causal-conv1d>=1.4.0

# Install Mamba
pip install mamba-ssm
# Or both together
pip install mamba-ssm[causal-conv1d]
```

**Prerequisites**: Linux, NVIDIA GPU, PyTorch 1.12+, CUDA 11.6+

**Basic usage** (Mamba block):
```python
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 64, 16
x = torch.randn(batch, length, dim).to("cuda")

model = Mamba(
    d_model=dim,      # Model dimension
    d_state=16,       # SSM state dimension
    d_conv=4,         # Conv1d kernel size
    expand=2          # Expansion factor
).to("cuda")

y = model(x)  # O(n) complexity!
assert y.shape == x.shape
```

## Common workflows

### Workflow 1: Language model with Mamba-2

**Complete LM with generation**:
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
from mamba_ssm.models.config_mamba import MambaConfig
import torch

# Configure Mamba-2 LM
config = MambaConfig(
    d_model=1024,           # Hidden dimension
    n_layer=24,             # Number of layers
    vocab_size=50277,       # Vocabulary size
    ssm_cfg=dict(
        layer="Mamba2",     # Use Mamba-2
        d_state=128,        # Larger state for Mamba-2
        headdim=64,         # Head dimension
        ngroups=1           # Number of groups
    )
)

model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)

# Generate text
input_ids = torch.randint(0, 1000, (1, 20), device="cuda", dtype=torch.long)
output = model.generate(
    input_ids=input_ids,
    max_length=100,
    temperature=0.7,
    top_p=0.9
)
```

### Workflow 2: Use pretrained Mamba models

**Load from HuggingFace**:
```python
from transformers import AutoTokenizer
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Load pretrained model
model_name = "state-spaces/mamba-2.8b"
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")  # Use compatible tokenizer
model = MambaLMHeadModel.from_pretrained(model_name, device="cuda", dtype=torch.float16)

# Generate
prompt = "The future of AI is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output_ids = model.generate(
    input_ids=input_ids,
    max_length=200,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.2
)
generated_text = tokenizer.decode(output_ids[0])
print(generated_text)
```

**Available models**:
- `state-spaces/mamba-130m`
- `state-spaces/mamba-370m`
- `state-spaces/mamba-790m`
- `state-spaces/mamba-1.4b`
- `state-spaces/mamba-2.8b`

### Workflow 3: Mamba-1 vs Mamba-2

**Mamba-1** (smaller state):
```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,
    d_state=16,      # Smaller state dimension
    d_conv=4,
    expand=2
).to("cuda")
```

**Mamba-2** (multi-head, larger state):
```python
from mamba_ssm import Mamba2

model = Mamba2(
    d_model=256,
    d_state=128,     # Larger state dimension
    d_conv=4,
    expand=2,
    headdim=64,      # Head dimension for multi-head
    ngroups=1        # Parallel groups
).to("cuda")
```

**Key differences**:
- **State size**: Mamba-1 (d_state=16) vs Mamba-2 (d_state=128)
- **Architecture**: Mamba-2 has multi-head structure
- **Normalization**: Mamba-2 uses RMSNorm
- **Distributed**: Mamba-2 supports tensor parallelism

### Workflow 4: Benchmark vs Transformers

**Generation speed comparison**:
```bash
# Benchmark Mamba
python benchmarks/benchmark_generation_mamba_simple.py \
  --model-name "state-spaces/mamba-2.8b" \
  --prompt "The future of machine learning is" \
  --topp 0.9 --temperature 0.7 --repetition-penalty 1.2

# Benchmark Transformer
python benchmarks/benchmark_generation_mamba_simple.py \
  --model-name "EleutherAI/pythia-2.8b" \
  --prompt "The future of machine learning is" \
  --topp 0.9 --temperature 0.7 --repetition-penalty 1.2
```

**Expected results**:
- **Mamba**: 5× faster inference
- **Memory**: No KV cache needed
- **Scaling**: Linear with sequence length

## When to use vs alternatives

**Use Mamba when**:
- Need long sequences (100K+ tokens)
- Want faster inference than Transformers
- Memory-constrained (no KV cache)
- Building streaming applications
- Linear scaling important

**Advantages**:
- **O(n) complexity**: Linear vs quadratic
- **5× faster inference**: No attention overhead
- **No KV cache**: Lower memory usage
- **Million-token sequences**: Hardware-efficient
- **Streaming**: Constant memory per token

**Use alternatives instead**:
- **Transformers**: Need best-in-class performance, have compute
- **RWKV**: Want RNN+Transformer hybrid
- **RetNet**: Need retention-based architecture
- **Hyena**: Want convolution-based approach

## Common issues

**Issue: CUDA out of memory**

Reduce batch size or use gradient checkpointing:
```python
model = MambaLMHeadModel(config, device="cuda", dtype=torch.float16)
model.gradient_checkpointing_enable()  # Enable checkpointing
```

**Issue: Slow installation**

If pip falls back to compiling from source, build against the already-installed PyTorch instead of pip's isolated build environment:
```bash
pip install mamba-ssm --no-build-isolation
```

**Issue: Missing causal-conv1d**

Install separately:
```bash
pip install "causal-conv1d>=1.4.0"
```

**Issue: Model not loading from HuggingFace**

Use `MambaLMHeadModel.from_pretrained` (not `AutoModel`):
```python
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-2.8b")
```

## Advanced topics

**Selective SSM mechanism**: See [references/architecture-details.md](references/architecture-details.md) for the state-space equations, input-dependent parameters, and how selectivity enables O(n) complexity.

**Benchmarks**: See [references/benchmarks.md](references/benchmarks.md) for throughput, memory, perplexity, and cost comparisons against Transformers.

**Training**: See [references/training-guide.md](references/training-guide.md) for distributed training, mixed precision, hyperparameters, and memory optimization.

## Hardware requirements

- **GPU**: NVIDIA with CUDA 11.6+
- **VRAM**:
  - 130M model: 2GB
  - 370M model: 4GB
  - 790M model: 8GB
  - 1.4B model: 14GB
  - 2.8B model: 28GB (FP16)
- **Inference**: 5× faster than Transformers
- **Memory**: No KV cache (lower than Transformers)

**Performance** (vs Transformers):
- **Speed**: 5× faster inference
- **Memory**: 50% less (no KV cache)
- **Scaling**: Linear vs quadratic

## Resources

- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Dec 2023)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (May 2024)
- GitHub: https://github.com/state-spaces/mamba ⭐ 13,000+
- Models: https://huggingface.co/state-spaces
- Docs: Repository README and wiki




================================================
FILE: 01-model-architecture/mamba/references/architecture-details.md
================================================
# Mamba Architecture Details

## Selective State Space Mechanism

Mamba's core innovation is the **Selective SSM (S6)** layer that makes state space model parameters input-dependent.

### How S6 Works

**Traditional SSMs** (non-selective):
```python
# Fixed A, B, C matrices for all inputs
h(t) = A * h(t-1) + B * x(t)  # State update
y(t) = C * h(t)                # Output
```

**Mamba's Selective SSM**:
```python
# Input-dependent parameters
B(t) = Linear_B(x(t))  # Selection mechanism
C(t) = Linear_C(x(t))  # Output projection
Δ(t) = Linear_Δ(x(t))  # Discretization step

# Selective state update
h(t) = discretize(A, Δ(t)) * h(t-1) + Δ(t) * B(t) * x(t)
y(t) = C(t) * h(t)
```

### Key Advantages

**1. Content-based reasoning**:
- Can selectively remember or forget based on input
- Addresses discrete modality weakness of traditional SSMs
- Example: Remembers important tokens, forgets padding

**2. Input-dependent selection**:
```python
# Mamba decides per token what to remember
if is_important(x(t)):
    Δ(t) = large_value   # Keep in state
else:
    Δ(t) = small_value   # Forget quickly
```

**3. No attention required**:
- Replaces O(n²) attention with O(n) state updates
- State dimension is constant (typically 16)
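
To make the O(n) claim concrete, here is a naive per-timestep reference loop for the selective recurrence above. This is an educational sketch with assumed tensor shapes, not the library's implementation: the actual package fuses this into a single CUDA kernel rather than looping in Python.

```python
import torch

def naive_selective_scan(x, A, B, C, delta):
    """Naive O(L) reference loop for the selective SSM recurrence (sketch only).
    Assumed shapes:
      x:     (batch, length, d_inner)
      A:     (d_inner, d_state)          negative real, shared across time
      B, C:  (batch, length, d_state)    input-dependent projections
      delta: (batch, length, d_inner)    input-dependent step sizes
    """
    b, L, d = x.shape
    n = A.shape[-1]
    h = torch.zeros(b, d, n, device=x.device, dtype=x.dtype)
    ys = []
    for t in range(L):
        dt = delta[:, t].unsqueeze(-1)                   # (b, d, 1)
        A_bar = torch.exp(dt * A)                        # ZOH-style discretization
        B_bar = dt * B[:, t].unsqueeze(1)                # broadcast over d_inner
        h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)    # state update
        y = (h * C[:, t].unsqueeze(1)).sum(-1)           # read-out: (b, d_inner)
        ys.append(y)
    return torch.stack(ys, dim=1)                        # (b, L, d_inner)
```

Because the state `h` has a fixed size (`d_inner × d_state`), memory per token is constant regardless of sequence length.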

## Model Configuration

### Core Parameters

```python
from mamba_ssm import Mamba

model = Mamba(
    d_model=256,      # Hidden dimension (256, 512, 768, 1024, 2048)
    d_state=16,       # SSM state dimension (fixed at 16 is optimal)
    d_conv=4,         # Local convolution width (4 is standard)
    expand=2,         # Expansion factor (1.5-2.0)
    dt_rank="auto",   # Rank of Δ projection (auto = d_model / 16)
    dt_min=0.001,     # Min Δ init (controls forgetting rate)
    dt_max=0.1,       # Max Δ init
    dt_init="random", # Δ initialization (random, constant)
    dt_scale=1.0,     # Δ scaling factor
    conv_bias=True,   # Use bias in convolution
    bias=False        # Use bias in linear projections
)
```

### Parameter Impact

**d_state** (SSM state dimension):
- Standard: 16 (optimal from ablations)
- Smaller (8): Faster but less capacity
- Larger (32, 64): Minimal improvement, 2× slower

**expand** (block expansion):
- Standard: 2.0
- Range: 1.5-2.0
- Controls inner dimension: d_inner = expand * d_model (see the sizing sketch below)

**d_conv** (convolution width):
- Standard: 4
- Local context window before SSM
- Helps with positional information

**dt_rank** (Δ projection rank):
- Auto: d_model / 16 (recommended)
- Controls Δ parameter efficiency
- Lower rank = more efficient but less expressive
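
As a quick sizing check, `expand` sets the inner width and therefore most of a block's parameter count. The sketch below is a back-of-envelope approximation: it counts only the two large projections and ignores the conv, Δ, A, and D parameters.

```python
# Approximate per-block parameter count for a Mamba layer (illustrative only)
d_model, expand = 1024, 2
d_inner = expand * d_model                  # inner dimension = 2048

in_proj = d_model * (2 * d_inner)           # x -> (x, z) gate branches
out_proj = d_inner * d_model                # project back to d_model
print(d_inner, in_proj + out_proj)          # 2048, ~6.3M parameters per block
```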

## Mamba Block Structure

```python
import torch.nn as nn

# Mamba block (replaces the Transformer block);
# RMSNorm, Embedding, LMHead are placeholders for your own implementations
class MambaBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.norm = RMSNorm(d_model)
        self.mamba = Mamba(d_model, d_state=16, d_conv=4, expand=2)

    def forward(self, x):
        return x + self.mamba(self.norm(x))  # Residual connection

# Full model (stack of Mamba blocks)
model = nn.Sequential(
    Embedding(...),
    *[MambaBlock(d_model) for _ in range(n_layers)],
    RMSNorm(d_model),
    LMHead(...)
)
```

**Key differences from Transformers**:
- No multi-head attention (MHA)
- No feedforward network (FFN)
- Single Mamba layer per block
- 2× more layers than equivalent Transformer

## Hardware-Aware Implementation

### Parallel Algorithm

Mamba uses a **scan-based parallel algorithm** for training:

```python
# Parallel mode (training): fused GPU kernel runs an associative scan
y = parallel_scan(A, B, C, x)  # O(n) work, O(log n) depth

# Sequential mode (inference)
# Constant memory RNN-style
h = 0
for x_t in sequence:
    h = A*h + B*x_t
    y_t = C*h
```

### Memory Efficiency

**Training**:
- Recomputes activations in backward pass
- Similar to FlashAttention strategy
- Memory: O(batch_size * seq_len * d_model)

**Inference**:
- RNN-style sequential processing
- State size: O(d_model * d_state) = constant
- No KV cache needed (huge advantage!)
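
A rough size comparison makes the point. This is a sketch with assumed dimensions loosely matching a 1.4B-scale model; actual numbers depend on implementation details.

```python
# Back-of-envelope inference-memory sketch (FP16 = 2 bytes per value); assumed dims, not measurements
d_model, d_state, expand, n_layer = 2048, 16, 2, 48
d_inner = expand * d_model

# Mamba recurrent state: SSM state + conv buffer per layer, independent of context length
mamba_state_bytes = n_layer * (d_inner * d_state + d_inner * 4) * 2

# Transformer KV cache at the same width: 2 (K and V) * n_layer * seq_len * d_model values
seq_len = 8192
kv_cache_bytes = 2 * n_layer * seq_len * d_model * 2

print(f"Mamba state: {mamba_state_bytes / 1e6:.1f} MB (constant in seq_len)")
print(f"KV cache:    {kv_cache_bytes / 1e9:.2f} GB (grows with seq_len)")
```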

### CUDA Kernel Optimizations

**Fused kernel operations** (all performed in a single GPU kernel):
- Discretization (continuous → discrete A, B)
- SSM recurrence (parallel scan)
- Convolution (efficient 1D conv)

## Layer Count Scaling

Mamba models use **2× layers** compared to Transformers:

| Model | d_model | n_layers | Params |
|-------|---------|----------|--------|
| Mamba-130M | 768 | 24 | 130M |
| Mamba-370M | 1024 | 48 | 370M |
| Mamba-790M | 1536 | 48 | 790M |
| Mamba-1.4B | 2048 | 48 | 1.4B |
| Mamba-2.8B | 2560 | 64 | 2.8B |

**Why 2× layers?**
- Mamba blocks are simpler (no MHA, no FFN)
- ~50% fewer parameters per layer
- Doubling layers matches compute budget

## Initialization Strategy

```python
import math
import torch
from einops import repeat

# Δ (discretization step) initialization
dt_init_floor = 1e-4
dt = torch.exp(
    torch.rand(d_inner) * (math.log(dt_max) - math.log(dt_min))
    + math.log(dt_min)
).clamp(min=dt_init_floor)

# A (state transition) initialization: S4D-real, stored in log space
A = repeat(torch.arange(1, d_state + 1, dtype=torch.float32), "n -> d n", d=d_inner)
A_log = torch.log(A)  # forward pass uses A = -exp(A_log), i.e. strictly negative values

# B and C are produced by input-dependent linear projections (see Selective SSM above),
# so they use the default nn.Linear initialization rather than fixed random matrices
```

**Critical for stability**:
- A must be negative (exponential decay)
- Δ in range [dt_min, dt_max]
- Random initialization helps diversity

## Resources

- Paper: https://arxiv.org/abs/2312.00752 (Mamba-1)
- Paper: https://arxiv.org/abs/2405.21060 (Mamba-2)
- GitHub: https://github.com/state-spaces/mamba
- Models: https://huggingface.co/state-spaces
- CUDA kernels: https://github.com/state-spaces/mamba/tree/main/csrc


================================================
FILE: 01-model-architecture/mamba/references/benchmarks.md
================================================
# Mamba Performance Benchmarks

## Inference Speed Comparison

### Throughput (tokens/sec)

**Mamba-1.4B vs Transformer-1.3B** on single A100 80GB:

| Sequence Length | Mamba-1.4B | Transformer-1.3B | Speedup |
|----------------|------------|------------------|---------|
| 512 | 8,300 | 6,200 | 1.3× |
| 1024 | 7,800 | 4,100 | 1.9× |
| 2048 | 7,200 | 2,300 | 3.1× |
| 4096 | 6,800 | 1,200 | 5.7× |
| 8192 | 6,400 | 600 | **10.7×** |
| 16384 | 6,100 | OOM | ∞ |

**Key insight**: Speedup grows with sequence length (Mamba O(n) vs Transformer O(n²))

### Latency (ms per token)

**Generation latency** (batch size 1, autoregressive):

| Model | First Token | Per Token | 100 Tokens Total |
|-------|-------------|-----------|------------------|
| Mamba-130M | 3 ms | 0.8 ms | 83 ms |
| Transformer-130M | 5 ms | 1.2 ms | 125 ms |
| Mamba-1.4B | 12 ms | 3.2 ms | 332 ms |
| Transformer-1.3B | 18 ms | 8.5 ms | 868 ms |
| Mamba-2.8B | 20 ms | 6.1 ms | 631 ms |
| Transformer-2.7B | 35 ms | 18.2 ms | 1855 ms |

**Mamba advantage**: Constant per-token latency regardless of context length

## Memory Usage

### Training Memory (BF16, per GPU)

**Mamba-1.4B** training memory breakdown:

| Sequence Length | Activations | Gradients | Optimizer | Total | vs Transformer |
|----------------|-------------|-----------|-----------|-------|----------------|
| 512 | 2.1 GB | 3.2 GB | 11.2 GB | 16.5 GB | 0.9× |
| 1024 | 3.8 GB | 3.2 GB | 11.2 GB | 18.2 GB | 0.6× |
| 2048 | 7.2 GB | 3.2 GB | 11.2 GB | 21.6 GB | 0.4× |
| 4096 | 14.1 GB | 3.2 GB | 11.2 GB | 28.5 GB | 0.25× |
| 8192 | 28.0 GB | 3.2 GB | 11.2 GB | 42.4 GB | 0.15× |

**Note**: Transformer OOMs at 8K sequence length on 40GB A100

### Inference Memory (FP16, batch size 1)

| Model | KV Cache (8K ctx) | State (Mamba) | Ratio |
|-------|------------------|---------------|-------|
| 130M | 2.1 GB | 0 MB | ∞ |
| 370M | 5.2 GB | 0 MB | ∞ |
| 1.4B | 19.7 GB | 0 MB | ∞ |
| 2.8B | 38.4 GB | 0 MB | ∞ |

**Mamba stores no KV cache** - constant memory per token!

Actual Mamba state size:
- 130M: ~3 MB (d_model × d_state × n_layers = 768 × 16 × 24)
- 2.8B: ~13 MB (2560 × 16 × 64)

## Language Modeling Benchmarks

### Perplexity on Common Datasets

**Models trained on The Pile (300B tokens)**:

| Model | Params | Pile (val) | WikiText-103 | C4 | Lambada |
|-------|--------|------------|--------------|-----|---------|
| Pythia | 160M | 29.6 | 28.4 | 23.1 | 51.2 |
| **Mamba** | **130M** | **28.1** | **26.7** | **21.8** | **48.3** |
| Pythia | 410M | 18.3 | 17.6 | 16.2 | 32.1 |
| **Mamba** | **370M** | **16.7** | **16.2** | **15.1** | **28.4** |
| Pythia | 1.4B | 10.8 | 10.2 | 11.3 | 15.2 |
| **Mamba** | **1.4B** | **9.1** | **9.6** | **10.1** | **12.8** |
| Pythia | 2.8B | 8.3 | 7.9 | 9.2 | 10.6 |
| **Mamba** | **2.8B** | **7.4** | **7.2** | **8.3** | **9.1** |

**Mamba consistently outperforms** Transformers of similar size by 10-20%

### Zero-Shot Task Performance

**Mamba-2.8B vs Transformer-2.7B** on common benchmarks:

| Task | Mamba-2.8B | Transformer-2.7B | Delta |
|------|------------|------------------|-------|
| HellaSwag | 61.3 | 58.7 | +2.6 |
| PIQA | 78.1 | 76.4 | +1.7 |
| ARC-Easy | 68.2 | 65.9 | +2.3 |
| ARC-Challenge | 42.7 | 40.1 | +2.6 |
| WinoGrande | 64.8 | 62.3 | +2.5 |
| OpenBookQA | 43.2 | 41.8 | +1.4 |
| BoolQ | 71.4 | 68.2 | +3.2 |
| MMLU (5-shot) | 35.2 | 33.8 | +1.4 |

**Average improvement**: +2.2 points across benchmarks

## Audio Modeling Benchmarks

### SC09 (Speech Commands)

**Task**: Audio classification (10 classes)

| Model | Params | Accuracy | Inference (ms) |
|-------|--------|----------|----------------|
| Transformer | 8.2M | 96.2% | 18 ms |
| S4 | 6.1M | 97.1% | 8 ms |
| **Mamba** | **6.3M** | **98.4%** | **6 ms** |

### LJSpeech (Speech Generation)

**Task**: Text-to-speech quality (MOS score)

| Model | Params | MOS ↑ | RTF ↓ |
|-------|--------|-------|-------|
| Transformer | 12M | 3.82 | 0.45 |
| Conformer | 11M | 3.91 | 0.38 |
| **Mamba** | **10M** | **4.03** | **0.21** |

**RTF** (Real-Time Factor): Lower is better (0.21 = 5× faster than real-time)

## Genomics Benchmarks

### Human Reference Genome (HG38)

**Task**: Next nucleotide prediction

| Model | Context Length | Perplexity | Throughput |
|-------|----------------|------------|------------|
| Transformer | 1024 | 3.21 | 1,200 bp/s |
| Hyena | 32768 | 2.87 | 8,500 bp/s |
| **Mamba** | **1M** | **2.14** | **45,000 bp/s** |

**Mamba handles million-length sequences** efficiently

## Scaling Laws

### Compute-Optimal Training

**FLOPs vs perplexity** (The Pile validation):

| Model Size | Training FLOPs | Mamba Perplexity | Transformer Perplexity |
|------------|----------------|------------------|------------------------|
| 130M | 6e19 | 28.1 | 29.6 |
| 370M | 3e20 | 16.7 | 18.3 |
| 790M | 8e20 | 12.3 | 13.9 |
| 1.4B | 2e21 | 9.1 | 10.8 |
| 2.8B | 6e21 | 7.4 | 8.3 |

**Scaling coefficient**: Mamba achieves same perplexity as Transformer with **0.8×** compute

### Parameter Efficiency

**Perplexity 10.0 target** on The Pile:

| Model Type | Parameters Needed | Memory (inference) |
|------------|-------------------|-------------------|
| Transformer | 1.6B | 3.2 GB |
| **Mamba** | **1.1B** | **2.2 GB** |

**Mamba needs ~30% fewer parameters** for same performance

## Long-Range Arena (LRA)

**Task**: Long-context understanding benchmarks

| Task | Length | Transformer | S4 | Mamba |
|------|--------|-------------|-----|-------|
| ListOps | 2K | 36.4% | 59.6% | **61.2%** |
| Text | 4K | 64.3% | 86.8% | **88.1%** |
| Retrieval | 4K | 57.5% | 90.9% | **92.3%** |
| Image | 1K | 42.4% | 88.7% | **89.4%** |
| PathFinder | 1K | 71.4% | 86.1% | **87.8%** |
| Path-X | 16K | OOM | 88.3% | **91.2%** |

**Average**: Mamba 85.0%, S4 83.4%, Transformer 54.4%

## Training Throughput

### Tokens/sec During Training

**8× A100 80GB** cluster, BF16, different sequence lengths:

| Model | Seq Len 512 | Seq Len 2K | Seq Len 8K | Seq Len 32K |
|-------|-------------|------------|------------|-------------|
| Transformer-1.3B | 180K | 52K | OOM | OOM |
| **Mamba-1.4B** | **195K** | **158K** | **121K** | **89K** |
| Transformer-2.7B | 92K | 26K | OOM | OOM |
| **Mamba-2.8B** | **98K** | **81K** | **62K** | **45K** |

**Mamba scales to longer sequences** without OOM

## Hardware Utilization

### GPU Memory Bandwidth

**Mamba-1.4B** inference on different GPUs:

| GPU | Memory BW | Tokens/sec | Efficiency |
|-----|-----------|------------|------------|
| A100 80GB | 2.0 TB/s | 6,800 | 85% |
| A100 40GB | 1.6 TB/s | 5,400 | 84% |
| V100 32GB | 900 GB/s | 3,100 | 86% |
| RTX 4090 | 1.0 TB/s | 3,600 | 90% |

**High efficiency**: Mamba is memory-bandwidth bound (good!)

### Multi-GPU Scaling

**Mamba-2.8B** training throughput:

| GPUs | Tokens/sec | Scaling Efficiency |
|------|------------|-------------------|
| 1× A100 | 12,300 | 100% |
| 2× A100 | 23,800 | 97% |
| 4× A100 | 46,100 | 94% |
| 8× A100 | 89,400 | 91% |
| 16× A100 | 172,000 | 88% |

**Near-linear scaling** up to 16 GPUs

## Cost Analysis

### Training Cost (USD)

**Training to The Pile perplexity 10.0** on cloud GPUs:

| Model | Cloud GPUs | Hours | Cost (A100) | Cost (H100) |
|-------|------------|-------|-------------|-------------|
| Transformer-1.6B | 8× A100 | 280 | $8,400 | $4,200 |
| **Mamba-1.1B** | **8× A100** | **180** | **$5,400** | **$2,700** |

**Savings**: 36% cost reduction vs Transformer

### Inference Cost (USD/million tokens)

**API-style inference** (batch size 1, 2K context):

| Model | Latency | Cost/M tokens | Quality (perplexity) |
|-------|---------|---------------|---------------------|
| Transformer-1.3B | 8.5 ms/tok | $0.42 | 10.8 |
| **Mamba-1.4B** | **3.2 ms/tok** | **$0.18** | **9.1** |

**Mamba provides**: 2.6× faster, 57% cheaper, better quality

## Resources

- Benchmarks code: https://github.com/state-spaces/mamba/tree/main/benchmarks
- Paper (Mamba-1): https://arxiv.org/abs/2312.00752 (Section 4: Experiments)
- Paper (Mamba-2): https://arxiv.org/abs/2405.21060 (Section 5: Experiments)
- Pretrained models: https://huggingface.co/state-spaces


================================================
FILE: 01-model-architecture/mamba/references/training-guide.md
================================================
# Mamba Training Guide

## Training from Scratch

### Setup Environment

```bash
# Install dependencies
pip install "torch>=1.12.0" --extra-index-url https://download.pytorch.org/whl/cu116
pip install packaging ninja
pip install "causal-conv1d>=1.1.0"
pip install mamba-ssm

# Verify CUDA
python -c "import torch; print(torch.cuda.is_available())"
```

### Basic Training Loop

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

# Model setup: a full LM (embedding + stacked Mamba blocks + LM head).
# A bare `Mamba` block maps (B, L, d_model) -> (B, L, d_model) and has no vocabulary head,
# so language-model training uses MambaLMHeadModel instead.
config = MambaConfig(d_model=512, n_layer=12, vocab_size=50277)
model = MambaLMHeadModel(config, device="cuda")

# Optimizer (same recipe as GPT)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.95),
    weight_decay=0.1
)

# Training loop
for batch in dataloader:
    inputs, targets = batch
    inputs, targets = inputs.cuda(), targets.cuda()

    # Forward
    logits = model(inputs).logits
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    # Backward
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```

## Distributed Training

### Single-Node Multi-GPU (DDP)

```python
import os
import torch
import torch.distributed as dist
from mamba_ssm import Mamba
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize process group
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Wrap model
model = Mamba(...).cuda()
model = DDP(model, device_ids=[local_rank])

# Train
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
for batch in dataloader:
    loss = compute_loss(model, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

**Launch**:
```bash
torchrun --nproc_per_node=8 train.py
```

### Multi-Node Training

```bash
# Node 0 (master)
torchrun --nproc_per_node=8 \
  --nnodes=4 --node_rank=0 \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  train.py

# Node 1-3 (workers)
torchrun --nproc_per_node=8 \
  --nnodes=4 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29500 \
  train.py
```

## Mixed Precision Training

### BF16 (Recommended)

```python
from torch.cuda.amp import autocast, GradScaler

# BF16 (no scaler needed on A100/H100)
for batch in dataloader:
    with autocast(dtype=torch.bfloat16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

### FP16 (with gradient scaling)

```python
scaler = GradScaler()

for batch in dataloader:
    with autocast(dtype=torch.float16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
```

## Hyperparameter Recommendations

### Learning Rate Schedule

```python
import math

# Cosine decay with warmup (GPT-3 style)
def get_lr(it, warmup_iters=2000, lr_decay_iters=600000):
    max_lr = 6e-4
    min_lr = 6e-5

    # Warmup
    if it < warmup_iters:
        return max_lr * it / warmup_iters

    # Decay
    if it > lr_decay_iters:
        return min_lr

    # Cosine
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (max_lr - min_lr)

# Apply in training loop
for it, batch in enumerate(dataloader):
    lr = get_lr(it)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
```

### Batch Size Recommendations

| Model Size | Per-GPU Batch | Gradient Accum | Effective Batch | GPUs |
|------------|---------------|----------------|-----------------|------|
| 130M | 32 | 4 | 1024 | 8 |
| 370M | 16 | 8 | 1024 | 8 |
| 790M | 8 | 8 | 512 | 8 |
| 1.4B | 4 | 16 | 512 | 8 |
| 2.8B | 2 | 16 | 256 | 8 |

```python
# Gradient accumulation
accumulation_steps = 8
optimizer.zero_grad()

for i, batch in enumerate(dataloader):
    loss = compute_loss(model, batch) / accumulation_steps
    loss.backward()

    if (i + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
```

### Optimizer Configuration

```python
# AdamW (recommended)
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,           # Peak learning rate
    betas=(0.9, 0.95), # Standard for LLMs
    eps=1e-8,
    weight_decay=0.1   # Important for generalization
)

# Weight decay exemptions (optional)
decay = set()
no_decay = set()
for name, param in model.named_parameters():
    if 'norm' in name or 'bias' in name:
        no_decay.add(param)
    else:
        decay.add(param)

optimizer = torch.optim.AdamW([
    {'params': list(decay), 'weight_decay': 0.1},
    {'params': list(no_decay), 'weight_decay': 0.0}
], lr=6e-4, betas=(0.9, 0.95))
```

## Memory Optimization

### Gradient Checkpointing

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from mamba_ssm import Mamba

class MambaBlock(nn.Module):
    def __init__(self, d_model, use_checkpoint=False):
        super().__init__()
        self.use_checkpoint = use_checkpoint
        self.norm = RMSNorm(d_model)  # RMSNorm: placeholder for your norm implementation
        self.mamba = Mamba(d_model)

    def forward(self, x):
        if self.use_checkpoint and self.training:
            return x + checkpoint(self._forward, x, use_reentrant=False)
        return x + self._forward(x)

    def _forward(self, x):
        return self.mamba(self.norm(x))

# Enable for training (MambaLM stands in for your full stack of MambaBlocks)
model = MambaLM(use_checkpoint=True)
```

**Memory savings**: ~30-40% with minimal speed impact

### Flash Attention Integration

Mamba's CUDA kernels already use flash-attention-style optimizations:
- Fused operations in single kernel
- Recomputation in backward pass
- No intermediate activation storage

## Long Context Training

### Sequence Length Progression

```python
# Start short, increase gradually
training_stages = [
    {'seq_len': 512,  'iters': 50000},
    {'seq_len': 1024, 'iters': 100000},
    {'seq_len': 2048, 'iters': 150000},
    {'seq_len': 4096, 'iters': 200000},
]

for stage in training_stages:
    dataloader = create_dataloader(seq_len=stage['seq_len'])
    train(model, dataloader, max_iters=stage['iters'])
```

### Memory Requirements (Batch Size 1)

| Sequence Length | 130M Model | 370M Model | 1.4B Model |
|----------------|------------|------------|------------|
| 2K | 4 GB | 8 GB | 24 GB |
| 4K | 5 GB | 10 GB | 32 GB |
| 8K | 6 GB | 14 GB | 48 GB |
| 16K | 8 GB | 20 GB | 64 GB |
| 32K | 12 GB | 32 GB | 96 GB |

**Mamba advantage**: Memory grows **linearly**, Transformers grow **quadratically**
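
The linear-vs-quadratic growth can be sanity-checked with a loose activation-size sketch. The constants below are illustrative only (BF16, one attention-score matrix per layer, ignoring heads, batch size, and flash-attention recomputation), so treat the absolute numbers as rough:

```python
# Loose growth comparison of activation footprints (BF16 = 2 bytes); illustrative only
d_model, n_layer = 2048, 48
for seq_len in (2_048, 8_192, 32_768):
    mamba_acts = n_layer * seq_len * d_model * 2      # O(L): per-token activations per layer
    attn_scores = n_layer * seq_len * seq_len * 2     # O(L^2): one score matrix per layer
    print(f"L={seq_len:>6}: Mamba ~{mamba_acts/1e9:.1f} GB, attention scores ~{attn_scores/1e9:.1f} GB")
```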

## Common Training Issues

### Issue: OOM during training

**Solution 1**: Reduce batch size
```python
per_gpu_batch = 8  # Reduce from 16
gradient_accumulation = 8  # Increase from 4
```

**Solution 2**: Enable gradient checkpointing
```python
model = MambaLM(use_checkpoint=True)
```

**Solution 3**: Use smaller sequence length
```python
seq_len = 1024  # Reduce from 2048
```

### Issue: Training unstable (loss spikes)

**Solution 1**: Check gradient norm
```python
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
print(f"Grad norm: {grad_norm}")  # Should be < 10
```

**Solution 2**: Lower learning rate
```python
max_lr = 3e-4  # Reduce from 6e-4
```

**Solution 3**: Check Δ initialization
```python
# Ensure dt_min, dt_max are reasonable
model = Mamba(
    d_model=512,
    dt_min=0.001,  # Not too small
    dt_max=0.1     # Not too large
)
```

### Issue: Slow training speed

**Solution 1**: Verify CUDA kernels installed
```python
import mamba_ssm
print(mamba_ssm.__version__)  # Should have CUDA kernels
```

**Solution 2**: Use BF16 on A100/H100
```python
with autocast(dtype=torch.bfloat16):  # Faster and more stable than FP16 on A100/H100
    loss = compute_loss(model, batch)
```

**Solution 3**: Increase batch size if possible
```python
per_gpu_batch = 16  # Increase from 8 (better GPU utilization)
```

## Checkpointing

### Save/Load Model

```python
# Save
checkpoint = {
    'model': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'iter': iteration,
    'config': model_config
}
torch.save(checkpoint, f'checkpoint_{iteration}.pt')

# Load
checkpoint = torch.load('checkpoint_100000.pt')
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
iteration = checkpoint['iter']
```

### Best Practices

```python
# Save every N iterations
if iteration % save_interval == 0:
    save_checkpoint(model, optimizer, iteration)

# Keep only last K checkpoints
checkpoints = sorted(glob.glob('checkpoint_*.pt'))
if len(checkpoints) > keep_last:
    for ckpt in checkpoints[:-keep_last]:
        os.remove(ckpt)
```

## Resources

- Training code: https://github.com/state-spaces/mamba/tree/main/benchmarks
- Pretrained models: https://huggingface.co/state-spaces
- CUDA installation: https://github.com/state-spaces/mamba#installation


================================================
FILE: 01-model-architecture/nanogpt/SKILL.md
================================================
---
name: nanogpt
description: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Model Architecture, NanoGPT, GPT-2, Educational, Andrej Karpathy, Transformer, Minimalist, From Scratch, Training]
dependencies: [torch, transformers, datasets, tiktoken, wandb]
---

# nanoGPT - Minimalist GPT Training

## Quick start

nanoGPT is a simplified GPT implementation designed for learning and experimentation.

**Installation**:
```bash
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

**Train on Shakespeare** (CPU-friendly):
```bash
# Prepare data
python data/shakespeare_char/prepare.py

# Train (5 minutes on CPU)
python train.py config/train_shakespeare_char.py

# Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Output**:
```
ROMEO:
What say'st thou? Shall I speak, and be a man?

JULIET:
I am afeard, and yet I'll speak; for thou art
One that hath been a man, and yet I know not
What thou art.
```

## Common workflows

### Workflow 1: Character-level Shakespeare

**Complete training pipeline**:
```bash
# Step 1: Prepare data (creates train.bin, val.bin)
python data/shakespeare_char/prepare.py

# Step 2: Train small model
python train.py config/train_shakespeare_char.py

# Step 3: Generate text
python sample.py --out_dir=out-shakespeare-char
```

**Config** (`config/train_shakespeare_char.py`):
```python
# Model config
n_layer = 6          # 6 transformer layers
n_head = 6           # 6 attention heads
n_embd = 384         # 384-dim embeddings
block_size = 256     # 256 char context

# Training config
batch_size = 64
learning_rate = 1e-3
max_iters = 5000
eval_interval = 500

# Hardware
device = 'cpu'  # Or 'cuda'
compile = False # Set True for PyTorch 2.0
```

**Training time**: ~5 minutes (CPU), ~1 minute (GPU)

### Workflow 2: Reproduce GPT-2 (124M)

**Multi-GPU training on OpenWebText**:
```bash
# Step 1: Prepare OpenWebText (takes ~1 hour)
python data/openwebtext/prepare.py

# Step 2: Train GPT-2 124M with DDP (8 GPUs)
torchrun --standalone --nproc_per_node=8 \
  train.py config/train_gpt2.py

# Step 3: Sample from trained model
python sample.py --out_dir=out
```

**Config** (`config/train_gpt2.py`):
```python
# GPT-2 (124M) architecture
n_layer = 12
n_head = 12
n_embd = 768
block_size = 1024
dropout = 0.0

# Training
batch_size = 12
gradient_accumulation_steps = 5 * 8  # Total batch ~0.5M tokens
learning_rate = 6e-4
max_iters = 600000
lr_decay_iters = 600000

# System
compile = True  # PyTorch 2.0
```

**Training time**: ~4 days (8× A100)

### Workflow 3: Fine-tune pretrained GPT-2

**Start from OpenAI checkpoint**:
```python
# In train.py or config
init_from = 'gpt2'  # Options: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```

```bash
# Model loads OpenAI weights automatically
python train.py config/finetune_shakespeare.py
```

**Example config** (`config/finetune_shakespeare.py`):
```python
# Start from GPT-2
init_from = 'gpt2'

# Dataset
dataset = 'shakespeare_char'
batch_size = 1
block_size = 1024

# Fine-tuning
learning_rate = 3e-5  # Lower LR for fine-tuning
max_iters = 2000
warmup_iters = 100

# Regularization
weight_decay = 1e-1
```

### Workflow 4: Custom dataset

**Train on your own text**:
```python
# data/custom/prepare.py
import numpy as np

# Load your data
with open('my_data.txt', 'r') as f:
    text = f.read()

# Create character mappings
chars = sorted(list(set(text)))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Tokenize
data = np.array([stoi[ch] for ch in text], dtype=np.uint16)

# Split train/val
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# Save
train_data.tofile('data/custom/train.bin')
val_data.tofile('data/custom/val.bin')
```

**Train**:
```bash
python data/custom/prepare.py
python train.py --dataset=custom
```

## When to use vs alternatives

**Use nanoGPT when**:
- Learning how GPT works
- Experimenting with transformer variants
- Teaching/education purposes
- Quick prototyping
- Limited compute (can run on CPU)

**Simplicity advantages**:
- **~300 lines**: Entire model in `model.py`
- **~300 lines**: Training loop in `train.py`
- **Hackable**: Easy to modify
- **No abstractions**: Pure PyTorch

**Use alternatives instead**:
- **HuggingFace Transformers**: Production use, many models
- **Megatron-LM**: Large-scale distributed training
- **LitGPT**: More architectures, production-ready
- **PyTorch Lightning**: Need high-level framework

## Common issues

**Issue: CUDA out of memory**

Reduce batch size or context length:
```python
batch_size = 1  # Reduce from 12
block_size = 512  # Reduce from 1024
gradient_accumulation_steps = 40  # Increase to maintain effective batch
```

**Issue: Training too slow**

Enable compilation (PyTorch 2.0+):
```python
compile = True  # 2× speedup
```

Use mixed precision:
```python
dtype = 'bfloat16'  # Or 'float16'
```

**Issue: Poor generation quality**

Train longer:
```python
max_iters = 10000  # Increase from 5000
```

Lower temperature:
```python
# In sample.py
temperature = 0.7  # Lower from 1.0
top_k = 200       # Add top-k sampling
```

**Issue: Can't load GPT-2 weights**

Install transformers:
```bash
pip install transformers
```

Check model name:
```python
init_from = 'gpt2'  # Valid: gpt2, gpt2-medium, gpt2-large, gpt2-xl
```

## Advanced topics

**Model architecture**: See [references/architecture.md](references/architecture.md) for GPT block structure, multi-head attention, and MLP layers explained simply.

**Training loop**: See [references/training.md](references/training.md) for learning rate schedule, gradient accumulation, and distributed data parallel setup.

**Data preparation**: See [references/data.md](references/data.md) for tokenization strategies (character-level vs BPE) and binary format details.

## Hardware requirements

- **Shakespeare (char-level)**:
  - CPU: 5 minutes
  - GPU (T4): 1 minute
  - VRAM: <1GB

- **GPT-2 (124M)**:
  - 1× A100: ~1 week
  - 8× A100: ~4 days
  - VRAM: ~16GB per GPU

- **GPT-2 Medium (350M)**:
  - 8× A100: ~2 weeks
  - VRAM: ~40GB per GPU

**Performance**:
- With `compile=True`: 2× speedup
- With `dtype=bfloat16`: 50% memory reduction

## Resources

- GitHub: https://github.com/karpathy/nanoGPT ⭐ 48,000+
- Video: "Let's build GPT" by Andrej Karpathy
- Paper: "Attention is All You Need" (Vaswani et al.)
- OpenWebText: https://huggingface.co/datasets/Skylion007/openwebtext
- Educational: Best for understanding transformers from scratch




================================================
FILE: 01-model-architecture/nanogpt/references/architecture.md
================================================
# NanoGPT Architecture

## Model Structure (~300 Lines)

NanoGPT implements a clean GPT-2 architecture in minimal code for educational purposes.

### Complete Model (model.py)

```python
import math
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head masked self-attention layer."""

    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0

        # Key, query, value projections for all heads (batched)
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd, bias=config.bias)
        # Output projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd, bias=config.bias)

        # Regularization
        self.attn_dropout = nn.Dropout(config.dropout)
        self.resid_dropout = nn.Dropout(config.dropout)

        self.n_head = config.n_head
        self.n_embd = config.n_embd
        self.dropout = config.dropout

        # Flash attention flag
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')

        if not self.flash:
            # Causal mask (lower triangular)
            self.register_buffer("bias", torch.tril(
                torch.ones(config.block_size, config.block_size)
            ).view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()  # batch, seq_len, embedding_dim

        # Calculate Q, K, V for all heads in batch
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)

        # Reshape for multi-head attention
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)  # (B, nh, T, hs)

        # Attention
        if self.flash:
            # Flash Attention (PyTorch 2.0+)
            y = torch.nn.functional.scaled_dot_product_attention(
                q, k, v,
                attn_mask=None,
                dropout_p=self.dropout if self.training else 0,
                is_causal=True
            )
        else:
            # Manual attention implementation
            att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
            att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float('-inf'))
            att = F.softmax(att, dim=-1)
            att = self.attn_dropout(att)
            y = att @ v  # (B, nh, T, hs)

        # Reassemble all head outputs
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        # Output projection
        y = self.resid_dropout(self.c_proj(y))
        return y


class MLP(nn.Module):
    """Feedforward network (2-layer with GELU activation)."""

    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd, bias=config.bias)
        self.gelu = nn.GELU()
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd, bias=config.bias)
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        x = self.dropout(x)
        return x


class Block(nn.Module):
    """Transformer block (attention + MLP with residuals)."""

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # Pre-norm + residual
        x = x + self.mlp(self.ln_2(x))   # Pre-norm + residual
        return x


@dataclass
class GPTConfig:
    """GPT model configuration."""
    block_size: int = 1024    # Max sequence length
    vocab_size: int = 50304   # GPT-2 vocab size (50257 rounded up for efficiency)
    n_layer: int = 12         # Number of layers
    n_head: int = 12          # Number of attention heads
    n_embd: int = 768         # Embedding dimension
    dropout: float = 0.0      # Dropout rate
    bias: bool = True         # Use bias in Linear and LayerNorm layers


class GPT(nn.Module):
    """GPT Language Model."""

    def __init__(self, config):
        super().__init__()
        assert config.vocab_size is not None
        assert config.block_size is not None
        self.config = config

        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),  # Token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),  # Position embeddings
            drop=nn.Dropout(config.dropout),
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

        # Weight tying (share embeddings and output projection)
        self.transformer.wte.weight = self.lm_head.weight

        # Initialize weights
        self.apply(self._init_weights)
        # Apply special scaled init to residual projections
        for pn, p in self.named_parameters():
            if pn.endswith('c_proj.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * config.n_layer))

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.config.block_size, f"Cannot forward sequence length {t}, max is {self.config.block_size}"

        # Generate position indices
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0)  # (1, t)

        # Forward the GPT model
        tok_emb = self.transformer.wte(idx)  # Token embeddings (b, t, n_embd)
        pos_emb = self.transformer.wpe(pos)  # Position embeddings (1, t, n_embd)
        x = self.transformer.drop(tok_emb + pos_emb)

        for block in self.transformer.h:
            x = block(x)

        x = self.transformer.ln_f(x)

        if targets is not None:
            # Training mode: compute loss
            logits = self.lm_head(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # Inference mode: only compute logits for last token
            logits = self.lm_head(x[:, [-1], :])  # (b, 1, vocab_size)
            loss = None

        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """Generate new tokens autoregressively."""
        for _ in range(max_new_tokens):
            # Crop context if needed
            idx_cond = idx if idx.size(1) <= self.config.block_size else idx[:, -self.config.block_size:]

            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Scale by temperature

            # Optionally crop logits to top k
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')

            # Sample from distribution
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)

            # Append to sequence
            idx = torch.cat((idx, idx_next), dim=1)

        return idx
```

## Key Design Decisions

### 1. Pre-Norm vs Post-Norm

**NanoGPT uses Pre-Norm** (LayerNorm before sub-layers):

```python
# Pre-norm (NanoGPT)
x = x + attn(ln(x))
x = x + mlp(ln(x))

# Post-norm (original Transformer)
x = ln(x + attn(x))
x = ln(x + mlp(x))
```

**Why Pre-Norm?**
- More stable training (no gradient explosion)
- Used in GPT-2, GPT-3
- Standard for large language models

### 2. Weight Tying

**Shared weights between embeddings and output**:

```python
self.transformer.wte.weight = self.lm_head.weight
```

**Why?**
- Reduces parameters: `vocab_size × n_embd` saved
- Improves training (same semantic space)
- Standard in GPT-2
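
For GPT-2 small dimensions, the saving is easy to estimate; this back-of-envelope sketch just multiplies the config values above:

```python
# Parameters saved by sharing wte.weight with lm_head.weight (GPT-2 small dims)
vocab_size, n_embd = 50304, 768
saved = vocab_size * n_embd
print(f"{saved / 1e6:.1f}M parameters saved")  # ~38.6M out of the 124M total
```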

### 3. Scaled Residual Initialization

```python
# Scale down residual projections by layer depth
std = 0.02 / math.sqrt(2 * n_layer)
torch.nn.init.normal_(c_proj.weight, mean=0.0, std=std)
```

**Why?**
- Prevents gradient explosion in deep networks
- Each residual path contributes ~equally
- From GPT-2 paper

### 4. Flash Attention

```python
if hasattr(torch.nn.functional, 'scaled_dot_product_attention'):
    # Use PyTorch 2.0 Flash Attention (2× faster!)
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
else:
    # Fallback to manual attention
    att = (q @ k.T) / sqrt(d)
    att = masked_fill(att, causal_mask, -inf)
    y = softmax(att) @ v
```

**Speedup**: 2× faster with same accuracy

## Model Sizes

| Model | n_layer | n_head | n_embd | Params | Config Name |
|-------|---------|--------|--------|--------|-------------|
| GPT-2 Small | 12 | 12 | 768 | 124M | `gpt2` |
| GPT-2 Medium | 24 | 16 | 1024 | 350M | `gpt2-medium` |
| GPT-2 Large | 36 | 20 | 1280 | 774M | `gpt2-large` |
| GPT-2 XL | 48 | 25 | 1600 | 1558M | `gpt2-xl` |

**NanoGPT default** (Shakespeare):
```python
config = GPTConfig(
    block_size=256,   # Short context for char-level
    vocab_size=65,    # Small vocab (a-z, A-Z, punctuation)
    n_layer=6,        # Shallow network
    n_head=6,
    n_embd=384,       # Small embeddings
    dropout=0.2       # Regularization
)
# Total: ~10M parameters
```
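
The "~10M parameters" figure follows from a standard back-of-envelope count: each block contributes roughly 12·n_embd² weights (4·n_embd² for attention, 8·n_embd² for the MLP), biases and LayerNorms are ignored, and the LM head is tied to the token embedding.

```python
# Rough parameter count for the default char-level config (approximation)
n_layer, n_embd, vocab_size, block_size = 6, 384, 65, 256

per_block = 12 * n_embd ** 2                        # attention + MLP weight matrices
embeddings = (vocab_size + block_size) * n_embd     # token + position embeddings
total = n_layer * per_block + embeddings
print(f"~{total / 1e6:.1f}M parameters")            # ≈ 10.7M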

## Attention Visualization

```python
# What each token attends to (lower triangular)
# Token t can only attend to tokens 0...t

Attention Pattern (causal mask):
    t=0  t=1  t=2  t=3
t=0  ✓    -    -    -
t=1  ✓    ✓    -    -
t=2  ✓    ✓    ✓    -
t=3  ✓    ✓    ✓    ✓

# Prevents "cheating" by looking at future tokens
```

## Residual Stream

**Information flow through residuals**:

```python
# Input
x = token_emb + pos_emb

# Block 1
x = x + attn_1(ln(x))   # Attention adds to residual
x = x + mlp_1(ln(x))    # MLP adds to residual

# Block 2
x = x + attn_2(ln(x))
x = x + mlp_2(ln(x))

# ... (repeat for all layers)

# Output
logits = lm_head(ln(x))
```

**Key insight**: Each layer refines the representation, residuals preserve gradients

## Tokenization

### Character-Level (Shakespeare)

```python
# data/shakespeare_char/prepare.py
text = open('input.txt', 'r').read()
chars = sorted(list(set(text)))  # ['!', ',', '.', 'A', 'B', ..., 'z']
vocab_size = len(chars)  # 65

stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)
```

### BPE (GPT-2)

```python
# data/openwebtext/prepare.py
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2 BPE tokenizer
vocab_size = enc.n_vocab  # 50257

# Encode
tokens = enc.encode_ordinary("Hello world")  # [15496, 995]

# Decode
text = enc.decode(tokens)  # "Hello world"
```

## Resources

- **GitHub**: https://github.com/karpathy/nanoGPT ⭐ 48,000+
- **Video**: "Let's build GPT" by Andrej Karpathy
- **Paper**: "Attention is All You Need" (Vaswani et al.)
- **Paper**: "Language Models are Unsupervised Multitask Learners" (GPT-2)
- **Code walkthrough**: https://github.com/karpathy/nanoGPT/blob/master/ARCHITECTURE.md


================================================
FILE: 01-model-architecture/nanogpt/references/data.md
================================================
# NanoGPT Data Preparation

## Data Format

NanoGPT uses **binary token files** for efficient loading:

```
dataset/
├── train.bin       # Training tokens (uint16 array)
├── val.bin         # Validation tokens (uint16 array)
└── meta.pkl        # Metadata (vocab_size, mappings)
```

**Why binary?**
- 100× faster than reading text files
- Memory-mapped loading (no RAM overhead)
- Simple format (just token IDs)
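
This is how the training script consumes the binary files in practice, a sketch modeled on nanoGPT's `get_batch`: memory-map the file, then sample random contiguous blocks.

```python
import numpy as np
import torch

# Memory-map train.bin: token IDs stay on disk, only sampled blocks are read into RAM
data = np.memmap('data/shakespeare_char/train.bin', dtype=np.uint16, mode='r')

block_size, batch_size = 256, 64
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([torch.from_numpy(data[i:i+block_size].astype(np.int64)) for i in ix])
y = torch.stack([torch.from_numpy(data[i+1:i+1+block_size].astype(np.int64)) for i in ix])
# x, y: (batch_size, block_size) inputs and next-token targets
```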

## Character-Level Tokenization

### Shakespeare Example

**Input text**:
```
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.
```

**Character vocabulary** (65 total):
```python
chars = ['\n', ' ', '!', ',', '.', ':', ';', '?', 'A', 'B', ..., 'z']
stoi = {'\n': 0, ' ': 1, '!': 2, ...}  # char → ID
itos = {0: '\n', 1: ' ', 2: '!', ...}  # ID → char
```

**Tokenization**:
```python
text = "First Citizen:"
tokens = [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 63, 43, 52, 10]
# F=18, i=47, r=56, s=57, t=58, ' '=1, C=15, ...
```

**Full preparation script**:

```python
# data/shakespeare_char/prepare.py
import os
import requests
import pickle
import numpy as np

# Download Shakespeare dataset
input_file = 'input.txt'
if not os.path.exists(input_file):
    url = 'https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt'
    with open(input_file, 'w') as f:
        f.write(requests.get(url).text)

# Load text
with open(input_file, 'r') as f:
    data = f.read()

print(f"Dataset size: {len(data):,} characters")

# Build vocabulary
chars = sorted(list(set(data)))
vocab_size = len(chars)
print(f"Vocabulary: {vocab_size} unique characters")
print(f"Characters: {''.join(chars[:20])}...")

# Create mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode full dataset
def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join([itos[i] for i in l])

# Split train/val (90/10)
n = len(data)
train_data = data[:int(n * 0.9)]
val_data = data[int(n * 0.9):]

# Tokenize
train_ids = encode(train_data)
val_ids = encode(val_data)

print(f"Train: {len(train_ids):,} tokens")
print(f"Val: {len(val_ids):,} tokens")

# Save as binary (uint16)
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)

train_ids.tofile('train.bin')
val_ids.tofile('val.bin')

# Save metadata
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}

with open('meta.pkl', 'wb') as f:
    pickle.dump(meta, f)

print("Saved train.bin, val.bin, meta.pkl")
```

**Output**:
```
Dataset size: 1,115,394 characters
Vocabulary: 65 unique characters
Characters:  !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Train: 1,003,854 tokens
Val: 111,540 tokens
Saved train.bin, val.bin, meta.pkl
```

### Custom Character Dataset

```python
# For your own text dataset
text = open('my_data.txt', 'r').read()

# Build vocab
chars = sorted(list(set(text)))
vocab_size = len(chars)

# Create mappings
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

# Encode
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Split and save
data = np.array(encode(text), dtype=np.uint16)
n = len(data)
train = data[:int(n*0.9)]
val = data[int(n*0.9):]

train.tofile('data/custom/train.bin')
val.tofile('data/custom/val.bin')

# Save meta
with open('data/custom/meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': vocab_size, 'itos': itos, 'stoi': stoi}, f)
```

## BPE (Byte Pair Encoding)

### OpenWebText with GPT-2 Tokenizer

**BPE advantages**:
- Handles rare words better (subword units)
- Standard for GPT-2, GPT-3
- Vocabulary: 50,257 tokens

**Preparation script**:

```python
# data/openwebtext/prepare.py
import os
import numpy as np
import tiktoken
from datasets import load_dataset
from tqdm import tqdm

# Number of workers for parallel processing
num_proc = 8
num_proc_load_dataset = num_proc

# Download OpenWebText dataset
dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)

# Use GPT-2 tokenizer
enc = tiktoken.get_encoding("gpt2")

def process(example):
    """Tokenize a single example."""
    ids = enc.encode_ordinary(example['text'])  # Tokenize
    ids.append(enc.eot_token)  # Add end-of-text token
    out = {'ids': ids, 'len': len(ids)}
    return out

# Tokenize entire dataset (parallel)
tokenized = dataset.map(
    process,
    remove_columns=['text'],
    desc="Tokenizing",
    num_proc=num_proc,
)

# Concatenate all into one big array
train_ids = np.concatenate([
    np.array(sample['ids'], dtype=np.uint16)
    for sample in tqdm(tokenized['train'], desc="Concatenating")
])

print(f"Total tokens: {len(train_ids):,}")  # ~9 billion tokens

# Save train.bin
train_ids.tofile(os.path.join(os.path.dirname(__file__), 'train.bin'))

# Create val.bin (sample from train)
# Take first 5000 documents for validation
val_ids = np.concatenate([
    np.array(sample['ids'], dtype=np.uint16)
    # assumed completion (the source is truncated here): mirror the train.bin
    # concatenation above, restricted to the first 5000 documents
    for sample in tqdm(tokenized['train'].select(range(5000)), desc="Concatenating val")
])
val_ids.tofile(os.path.join(os.path.dirname(__file__), 'val.bin'))
```
│   ├── crewai/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── flows.md
│   │       ├── tools.md
│   │       └── troubleshooting.md
│   ├── langchain/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── agents.md
│   │       ├── integration.md
│   │       └── rag.md
│   └── llamaindex/
│       ├── SKILL.md
│       └── references/
│           ├── agents.md
│           ├── data_connectors.md
│           └── query_engines.md
├── 15-rag/
│   ├── .gitkeep
│   ├── chroma/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── integration.md
│   ├── faiss/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── index_types.md
│   ├── pinecone/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── deployment.md
│   ├── qdrant/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── sentence-transformers/
│       ├── SKILL.md
│       └── references/
│           └── models.md
├── 16-prompt-engineering/
│   ├── .gitkeep
│   ├── dspy/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── examples.md
│   │       ├── modules.md
│   │       └── optimizers.md
│   ├── guidance/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── backends.md
│   │       ├── constraints.md
│   │       └── examples.md
│   ├── instructor/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── examples.md
│   │       ├── providers.md
│   │       └── validation.md
│   └── outlines/
│       ├── SKILL.md
│       └── references/
│           ├── backends.md
│           ├── examples.md
│           └── json_generation.md
├── 17-observability/
│   ├── .gitkeep
│   ├── langsmith/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── phoenix/
│       ├── SKILL.md
│       └── references/
│           ├── advanced-usage.md
│           └── troubleshooting.md
├── 18-multimodal/
│   ├── .gitkeep
│   ├── audiocraft/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── blip-2/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── clip/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── applications.md
│   ├── cosmos-policy/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── libero-commands.md
│   │       └── robocasa-commands.md
│   ├── llava/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── training.md
│   ├── openpi/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── checkpoints-and-env-map.md
│   │       ├── config-recipes.md
│   │       ├── pytorch-gotchas.md
│   │       ├── remote-client-pattern.md
│   │       └── training-debugging.md
│   ├── openvla-oft/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── aloha-workflow.md
│   │       ├── config-troubleshooting.md
│   │       ├── libero-workflow.md
│   │       └── paper-and-checkpoints.md
│   ├── segment-anything/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   ├── stable-diffusion/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── advanced-usage.md
│   │       └── troubleshooting.md
│   └── whisper/
│       ├── SKILL.md
│       └── references/
│           └── languages.md
├── 19-emerging-techniques/
│   ├── .gitkeep
│   ├── knowledge-distillation/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── minillm.md
│   ├── long-context/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── extension_methods.md
│   │       ├── fine_tuning.md
│   │       └── rope.md
│   ├── model-merging/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── evaluation.md
│   │       ├── examples.md
│   │       └── methods.md
│   ├── model-pruning/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── wanda.md
│   ├── moe-training/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── architectures.md
│   │       ├── inference.md
│   │       └── training.md
│   └── speculative-decoding/
│       ├── SKILL.md
│       └── references/
│           ├── lookahead.md
│           └── medusa.md
├── 20-ml-paper-writing/
│   ├── academic-plotting/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── data-visualization.md
│   │       ├── diagram-generation.md
│   │       └── style-guide.md
│   ├── ml-paper-writing/
│   │   ├── SKILL.md
│   │   ├── references/
│   │   │   ├── checklists.md
│   │   │   ├── citation-workflow.md
│   │   │   ├── reviewer-guidelines.md
│   │   │   ├── sources.md
│   │   │   └── writing-guide.md
│   │   └── templates/
│   │       ├── README.md
│   │       ├── aaai2026/
│   │       │   ├── README.md
│   │       │   ├── aaai2026-unified-supp.tex
│   │       │   ├── aaai2026-unified-template.tex
│   │       │   ├── aaai2026.bib
│   │       │   ├── aaai2026.bst
│   │       │   └── aaai2026.sty
│   │       ├── acl/
│   │       │   ├── README.md
│   │       │   ├── acl.sty
│   │       │   ├── acl_latex.tex
│   │       │   ├── acl_lualatex.tex
│   │       │   ├── acl_natbib.bst
│   │       │   ├── anthology.bib.txt
│   │       │   ├── custom.bib
│   │       │   └── formatting.md
│   │       ├── colm2025/
│   │       │   ├── README.md
│   │       │   ├── colm2025_conference.bib
│   │       │   ├── colm2025_conference.bst
│   │       │   ├── colm2025_conference.sty
│   │       │   ├── colm2025_conference.tex
│   │       │   ├── fancyhdr.sty
│   │       │   ├── math_commands.tex
│   │       │   └── natbib.sty
│   │       ├── iclr2026/
│   │       │   ├── fancyhdr.sty
│   │       │   ├── iclr2026_conference.bib
│   │       │   ├── iclr2026_conference.bst
│   │       │   ├── iclr2026_conference.sty
│   │       │   ├── iclr2026_conference.tex
│   │       │   ├── math_commands.tex
│   │       │   └── natbib.sty
│   │       ├── icml2026/
│   │       │   ├── algorithm.sty
│   │       │   ├── algorithmic.sty
│   │       │   ├── example_paper.bib
│   │       │   ├── example_paper.tex
│   │       │   ├── fancyhdr.sty
│   │       │   ├── icml2026.bst
│   │       │   └── icml2026.sty
│   │       └── neurips2025/
│   │           ├── Makefile
│   │           ├── extra_pkgs.tex
│   │           ├── main.tex
│   │           └── neurips.sty
│   ├── presenting-conference-talks/
│   │   ├── SKILL.md
│   │   └── references/
│   │       └── slide-templates.md
│   └── systems-paper-writing/
│       ├── SKILL.md
│       ├── references/
│       │   ├── checklist.md
│       │   ├── reviewer-guidelines.md
│       │   ├── section-blueprints.md
│       │   ├── systems-conferences.md
│       │   └── writing-patterns.md
│       └── templates/
│           ├── asplos2027/
│           │   ├── main.tex
│           │   └── references.bib
│           ├── nsdi2027/
│           │   ├── main.tex
│           │   ├── references.bib
│           │   └── usenix-2020-09.sty
│           ├── osdi2026/
│           │   ├── main.tex
│           │   ├── references.bib
│           │   └── usenix-2020-09.sty
│           └── sosp2026/
│               ├── main.tex
│               └── references.bib
├── 21-research-ideation/
│   ├── brainstorming-research-ideas/
│   │   └── SKILL.md
│   └── creative-thinking-for-research/
│       └── SKILL.md
├── 22-agent-native-research-artifact/
│   ├── compiler/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── ara-schema.md
│   │       ├── exploration-tree-spec.md
│   │       └── validation-checklist.md
│   ├── research-manager/
│   │   ├── SKILL.md
│   │   └── references/
│   │       ├── event-taxonomy.md
│   │       ├── provenance-tags.md
│   │       └── session-protocol.md
│   └── rigor-reviewer/
│       ├── SKILL.md
│       └── references/
│           └── review-dimensions.md
├── CITATION.cff
├── CLAUDE.md
├── CONTRIBUTING.md
├── LICENSE
├── README.md
├── WELCOME.md
├── anthropic_official_docs/
│   ├── best_practices.md
│   └── skills_overview.md
├── demos/
│   ├── README.md
│   ├── autoresearch-norm-heterogeneity/
│   │   └── README.md
│   ├── autoresearch-rl-brain-scan/
│   │   └── README.md
│   └── scientific-plotting-demo/
│       ├── README.md
│       └── figures/
│           ├── gen_fig_andes_architecture_gemini.py
│           ├── gen_fig_andes_workflow.py
│           └── gen_fig_experiment_results.py
├── dev_data/
│   ├── GITHUB_SKILLS_SYNC_SETUP.md
│   ├── PROJECT_ANALYSIS.md
│   ├── RESEARCH_QUESTIONNAIRE.md
│   ├── RESEARCH_QUESTIONNAIRE_PART1.md
│   ├── RESEARCH_QUESTIONNAIRE_PART2.md
│   ├── RESEARCH_QUESTIONNAIRE_PART3.md
│   ├── SCRAPING_STATUS.md
│   ├── SKILL_BUILD_PLAN.md
│   ├── SKILL_STRUCTURE_VERIFICATION.md
│   └── deep_research_report_1.md
├── docs/
│   ├── ROADMAP.md
│   ├── SKILL_CREATION_GUIDE.md
│   ├── SKILL_TEMPLATE.md
│   ├── npm-package-plan.md
│   ├── npm-package-ux-mockup.html
│   └── writing-assets/
│       ├── ML_paper_guide.md
│       └── ml_paper_writing_sources.md
├── package.json
├── packages/
│   └── ai-research-skills/
│       ├── .gitignore
│       ├── README.md
│       ├── bin/
│       │   └── cli.js
│       ├── package.json
│       └── src/
│           ├── agents.js
│           ├── ascii.js
│           ├── index.js
│           ├── installer.js
│           └── prompts.js
└── video-promo/
    └── ai-research-skills-promo/
        ├── .gitignore
        ├── package.json
        ├── remotion.config.ts
        ├── src/
        │   ├── AIResearchSkillsPromo.tsx
        │   ├── Root.tsx
        │   ├── components/
        │   │   ├── AgentDetection.tsx
        │   │   ├── CallToAction.tsx
        │   │   ├── CategorySelection.tsx
        │   │   ├── InstallProgress.tsx
        │   │   ├── OrchestraLogo.tsx
        │   │   ├── StatsDisplay.tsx
        │   │   ├── SuccessScreen.tsx
        │   │   └── Terminal.tsx
        │   └── index.ts
        └── tsconfig.json
SYMBOL INDEX (132 symbols across 19 files)

FILE: 06-post-training/grpo-rl-training/examples/reward_functions_library.py
  function exact_match_reward (line 21) | def exact_match_reward(prompts, completions, answer, **kwargs) -> List[f...
  function fuzzy_match_reward (line 33) | def fuzzy_match_reward(prompts, completions, answer, **kwargs) -> List[f...
  function numeric_correctness_reward (line 52) | def numeric_correctness_reward(prompts, completions, answer, tolerance=0...
  function code_execution_reward (line 76) | def code_execution_reward(prompts, completions, test_cases, **kwargs) ->...
  function strict_xml_format_reward (line 99) | def strict_xml_format_reward(completions, **kwargs) -> List[float]:
  function soft_xml_format_reward (line 111) | def soft_xml_format_reward(completions, **kwargs) -> List[float]:
  function json_format_reward (line 123) | def json_format_reward(completions, **kwargs) -> List[float]:
  function incremental_format_reward (line 144) | def incremental_format_reward(completions, tags=['reasoning', 'answer'],...
  function ideal_length_reward (line 173) | def ideal_length_reward(completions, ideal_tokens=100, **kwargs) -> List...
  function min_length_reward (line 192) | def min_length_reward(completions, min_tokens=50, **kwargs) -> List[float]:
  function max_length_penalty (line 209) | def max_length_penalty(completions, max_tokens=500, **kwargs) -> List[fl...
  function reasoning_quality_reward (line 228) | def reasoning_quality_reward(completions, **kwargs) -> List[float]:
  function citation_reward (line 251) | def citation_reward(completions, **kwargs) -> List[float]:
  function no_repetition_penalty (line 274) | def no_repetition_penalty(completions, **kwargs) -> List[float]:
  function math_problem_reward (line 297) | def math_problem_reward(prompts, completions, answer, **kwargs) -> List[...
  function code_generation_reward (line 309) | def code_generation_reward(prompts, completions, test_cases, **kwargs) -...
  function extract_answer (line 323) | def extract_answer(text: str) -> str:
  function extract_xml_tag (line 327) | def extract_xml_tag(text: str, tag: str) -> str:
  function extract_code_block (line 333) | def extract_code_block(text: str) -> str:
  function run_test_cases (line 339) | def run_test_cases(code: str, test_cases: List[tuple]) -> bool:
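
The entries above share one callable shape: each reward function takes the batch of completions (plus any dataset columns forwarded through **kwargs) and returns one float per completion. A minimal sketch of that pattern, assuming string completions and an `<answer>...</answer>` tag convention; it is illustrative only, not the library's actual implementation:

```python
import re
from typing import List


def toy_exact_match_reward(prompts, completions, answer, **kwargs) -> List[float]:
    """Illustrative reward: 1.0 when the text inside <answer>...</answer>
    matches the reference answer, else 0.0 (assumes string completions)."""
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        extracted = match.group(1).strip() if match else ""
        rewards.append(1.0 if extracted == str(ref).strip() else 0.0)
    return rewards


if __name__ == "__main__":
    print(toy_exact_match_reward(
        prompts=["What is 2+2?"],
        completions=["<reasoning>2+2=4</reasoning><answer>4</answer>"],
        answer=["4"],
    ))  # -> [1.0]
```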

FILE: 06-post-training/grpo-rl-training/templates/basic_grpo_training.py
  function get_dataset (line 39) | def get_dataset(split="train"):
  function extract_xml_tag (line 66) | def extract_xml_tag(text: str, tag: str) -> str:
  function extract_answer (line 72) | def extract_answer(text: str) -> str:
  function correctness_reward_func (line 78) | def correctness_reward_func(prompts, completions, answer, **kwargs):
  function format_reward_func (line 87) | def format_reward_func(completions, **kwargs):
  function incremental_format_reward_func (line 96) | def incremental_format_reward_func(completions, **kwargs):
  function setup_model_and_tokenizer (line 126) | def setup_model_and_tokenizer():
  function get_peft_config (line 140) | def get_peft_config():
  function main (line 155) | def main():
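
The template's symbols suggest the usual wiring: build a prompt dataset, define reward callables, then hand both to a trainer inside main(). A hedged sketch of that flow, assuming TRL's GRPOConfig/GRPOTrainer API with a reward_funcs argument; the model id, dataset, and reward below are placeholders, not the template's actual values:

```python
# Sketch only: assumes trl's GRPOConfig/GRPOTrainer accept a reward_funcs
# argument; model id, dataset, and reward below are placeholders.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer


def length_reward(completions, **kwargs):
    # Placeholder reward: mildly prefer shorter completions.
    return [-len(c) / 100.0 for c in completions]


my_dataset = Dataset.from_dict({"prompt": ["Explain GRPO in one sentence."] * 8})

config = GRPOConfig(
    output_dir="grpo-sketch",        # hypothetical output path
    per_device_train_batch_size=4,   # kept divisible by num_generations
    num_generations=4,               # completions sampled per prompt
    max_completion_length=64,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any small causal LM id
    reward_funcs=[length_reward],
    args=config,
    train_dataset=my_dataset,
)
trainer.train()
```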

FILE: demos/scientific-plotting-demo/figures/gen_fig_andes_architecture_gemini.py
  function generate_image (line 254) | def generate_image(prompt_text, attempt_num):
  function main (line 284) | def main():

FILE: demos/scientific-plotting-demo/figures/gen_fig_andes_workflow.py
  function draw_rounded_box (line 47) | def draw_rounded_box(ax, xy, width, height, label, facecolor, edgecolor,
  function draw_arrow (line 64) | def draw_arrow(ax, start, end, color="#2D3436", style="-|>", linewidth=1.2,
  function draw_circled_number (line 77) | def draw_circled_number(ax, xy, number, color="#F4A261", fontsize=8):
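
The three helpers indexed above hint at a small matplotlib vocabulary for block diagrams: rounded boxes, arrows, and numbered markers. A minimal sketch of the first two, assuming matplotlib's FancyBboxPatch and annotate; the function names, coordinates, labels, and colors here are invented for illustration rather than taken from the script:

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch


def sketch_rounded_box(ax, xy, width, height, label,
                       facecolor="#E8F4FD", edgecolor="#2D3436"):
    # Rounded rectangle with a centered text label.
    box = FancyBboxPatch(xy, width, height, boxstyle="round,pad=0.02",
                         facecolor=facecolor, edgecolor=edgecolor, linewidth=1.2)
    ax.add_patch(box)
    ax.text(xy[0] + width / 2, xy[1] + height / 2, label,
            ha="center", va="center", fontsize=9)


def sketch_arrow(ax, start, end, color="#2D3436"):
    # Arrow between two points in axes coordinates.
    ax.annotate("", xy=end, xytext=start,
                arrowprops=dict(arrowstyle="-|>", color=color, linewidth=1.2))


fig, ax = plt.subplots(figsize=(5, 2))
sketch_rounded_box(ax, (0.05, 0.3), 0.35, 0.4, "Client")
sketch_rounded_box(ax, (0.60, 0.3), 0.35, 0.4, "Scheduler")
sketch_arrow(ax, (0.40, 0.5), (0.60, 0.5))
ax.set(xlim=(0, 1), ylim=(0, 1))
ax.axis("off")
fig.savefig("workflow_sketch.png", dpi=200, bbox_inches="tight")
```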

FILE: demos/scientific-plotting-demo/figures/gen_fig_experiment_results.py
  function generate_cdf_data (line 61) | def generate_cdf_data(n=500, seed=42):
  function plot_cdf_panels (line 84) | def plot_cdf_panels():
  function plot_burst_intensity (line 137) | def plot_burst_intensity():
  function plot_summary_improvements (line 206) | def plot_summary_improvements():
  function plot_qoe_definition (line 275) | def plot_qoe_definition():
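
This module centers on empirical CDF panels (generate_cdf_data, plot_cdf_panels). An empirical CDF pairs the sorted samples with ranks i/n; a self-contained sketch with synthetic data, since the real figure's data, labels, and styling live in the script itself:

```python
import numpy as np
import matplotlib.pyplot as plt


def empirical_cdf(samples):
    # Sort samples and pair each with its cumulative rank i/n.
    x = np.sort(samples)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


rng = np.random.default_rng(42)  # seed value mirrors the indexed default
baseline = rng.lognormal(mean=2.0, sigma=0.5, size=500)
improved = rng.lognormal(mean=1.7, sigma=0.4, size=500)

fig, ax = plt.subplots(figsize=(4, 3))
for name, data in [("baseline", baseline), ("improved", improved)]:
    x, y = empirical_cdf(data)
    ax.step(x, y, where="post", label=name)
ax.set_xlabel("metric value (arbitrary units)")  # hypothetical axis label
ax.set_ylabel("CDF")
ax.legend(frameon=False)
fig.savefig("cdf_sketch.png", dpi=200, bbox_inches="tight")
```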

FILE: packages/ai-research-skills/src/agents.js
  constant SUPPORTED_AGENTS (line 14) | const SUPPORTED_AGENTS = [
  function detectAgents (line 93) | function detectAgents() {
  function buildLocalAgentTargets (line 119) | function buildLocalAgentTargets(agents, projectDir) {
  function detectLocalAgents (line 134) | function detectLocalAgents(projectDir) {
  function getAgentById (line 161) | function getAgentById(id) {
  function getSupportedAgentIds (line 169) | function getSupportedAgentIds() {

FILE: packages/ai-research-skills/src/ascii.js
  function showWelcome (line 18) | function showWelcome(skillCount = 98, categoryCount = 23, agentCount = 9) {
  function showAgentsDetected (line 36) | function showAgentsDetected(agents) {
  function showMenuHeader (line 57) | function showMenuHeader() {
  function showSuccess (line 69) | function showSuccess(skillCount, agents) {
  function showLocalSuccess (line 106) | function showLocalSuccess(skillCount, agents, projectDir) {
  function showNoAgents (line 143) | function showNoAgents() {

FILE: packages/ai-research-skills/src/index.js
  function sleep (line 48) | function sleep(ms) {
  function interactiveFlow (line 55) | async function interactiveFlow() {
  function commandMode (line 405) | async function commandMode(options) {
  function main (line 522) | async function main() {

FILE: packages/ai-research-skills/src/installer.js
  constant REPO_URL (line 8) | const REPO_URL = 'https://github.com/Orchestra-Research/AI-research-SKIL...
  constant CANONICAL_DIR (line 9) | const CANONICAL_DIR = join(homedir(), '.orchestra', 'skills');
  constant LOCK_FILE (line 10) | const LOCK_FILE = join(homedir(), '.orchestra', '.lock.json');
  constant LOCAL_LOCK_FILENAME (line 11) | const LOCAL_LOCK_FILENAME = '.orchestra-skills.json';
  function copyDirectoryContents (line 16) | function copyDirectoryContents(source, dest) {
  function ensureCanonicalDir (line 28) | function ensureCanonicalDir() {
  function readLock (line 41) | function readLock() {
  function writeLock (line 55) | function writeLock(data) {
  function downloadSkills (line 62) | async function downloadSkills(categories, spinner) {
  function createSymlinks (line 132) | function createSymlinks(agent, skills, spinner) {
  function downloadSpecificSkills (line 175) | async function downloadSpecificSkills(skillPaths, spinner) {
  function installSpecificSkills (line 243) | async function installSpecificSkills(skillPaths, agents) {
  function installSkills (line 279) | async function installSkills(categories, agents) {
  function listInstalledSkills (line 317) | function listInstalledSkills() {
  function getAllCategoryIds (line 387) | function getAllCategoryIds() {
  function getInstalledSkillPaths (line 418) | function getInstalledSkillPaths() {
  function updateInstalledSkills (line 453) | async function updateInstalledSkills(agents) {
  function uninstallAllSkills (line 496) | async function uninstallAllSkills(agents) {
  function uninstallSpecificSkills (line 544) | async function uninstallSpecificSkills(skillPaths, agents) {
  function getInstalledSkillsForSelection (line 620) | function getInstalledSkillsForSelection() {
  function getLocalLockPath (line 641) | function getLocalLockPath(projectDir) {
  function readLocalLock (line 648) | function readLocalLock(projectDir) {
  function writeLocalLock (line 663) | function writeLocalLock(projectDir, data) {
  function copySkillsToLocal (line 673) | function copySkillsToLocal(agent, skills, tempDir) {
  function installSkillsLocal (line 708) | async function installSkillsLocal(categories, agents, projectDir) {
  function installSpecificSkillsLocal (line 784) | async function installSpecificSkillsLocal(skillPaths, agents, projectDir) {
  function listLocalSkills (line 857) | function listLocalSkills(projectDir) {
  function getLocalSkillPaths (line 908) | function getLocalSkillPaths(projectDir) {
  function getLocalSkillsForSelection (line 922) | function getLocalSkillsForSelection(projectDir) {
  function updateLocalSkills (line 940) | async function updateLocalSkills(agents, projectDir) {
  function uninstallLocalSkills (line 955) | async function uninstallLocalSkills(skillPaths, agents, projectDir) {
  function uninstallAllLocalSkills (line 998) | async function uninstallAllLocalSkills(agents, projectDir) {

FILE: packages/ai-research-skills/src/prompts.js
  constant CATEGORIES (line 7) | const CATEGORIES = [
  constant INDIVIDUAL_SKILLS (line 36) | const INDIVIDUAL_SKILLS = [
  constant QUICK_START_SKILLS (line 82) | const QUICK_START_SKILLS = [
  function getTotalSkillCount (line 103) | function getTotalSkillCount() {
  function askMainMenuAction (line 110) | async function askMainMenuAction(projectDir) {
  function askSelectLocalAgents (line 137) | async function askSelectLocalAgents(agents) {
  function askLocalConfirmation (line 204) | async function askLocalConfirmation(skillCount, agents, projectDir, cate...
  function askUninstallChoice (line 272) | async function askUninstallChoice() {
  function askSelectSkillsToUninstall (line 297) | async function askSelectSkillsToUninstall(installedSkills) {
  function askConfirmUninstall (line 341) | async function askConfirmUninstall(count) {
  function askInstallChoice (line 364) | async function askInstallChoice() {
  function askCategories (line 409) | async function askCategories() {
  function askIndividualSkills (line 453) | async function askIndividualSkills() {
  function askConfirmation (line 497) | async function askConfirmation(skillCount, agents, selectedCategories, s...
  function askSelectAgents (line 563) | async function askSelectAgents(agents) {
  function askAfterAction (line 630) | async function askAfterAction() {
  function parseArgs (line 650) | function parseArgs(args) {

FILE: video-promo/ai-research-skills-promo/src/AIResearchSkillsPromo.tsx
  constant SCENE_TIMING (line 27) | const SCENE_TIMING = {

FILE: video-promo/ai-research-skills-promo/src/components/AgentDetection.tsx
  constant COLORS (line 12) | const COLORS = {
  constant AGENTS (line 18) | const AGENTS = [
  type AgentItemProps (line 26) | type AgentItemProps = {
  type AgentDetectionProps (line 88) | type AgentDetectionProps = {

FILE: video-promo/ai-research-skills-promo/src/components/CallToAction.tsx
  constant COLORS (line 14) | const COLORS = {
  type CallToActionProps (line 23) | type CallToActionProps = {

FILE: video-promo/ai-research-skills-promo/src/components/CategorySelection.tsx
  constant COLORS (line 12) | const COLORS = {
  constant CATEGORIES (line 21) | const CATEGORIES = [
  type CategoryItemProps (line 30) | type CategoryItemProps = {
  type CategorySelectionProps (line 115) | type CategorySelectionProps = {

FILE: video-promo/ai-research-skills-promo/src/components/InstallProgress.tsx
  constant COLORS (line 13) | const COLORS = {
  constant SKILL_NAMES (line 22) | const SKILL_NAMES = [
  type InstallProgressProps (line 38) | type InstallProgressProps = {

FILE: video-promo/ai-research-skills-promo/src/components/OrchestraLogo.tsx
  constant ORCHESTRA_ASCII (line 13) | const ORCHESTRA_ASCII = `
  type OrchestraLogoProps (line 22) | type OrchestraLogoProps = {

FILE: video-promo/ai-research-skills-promo/src/components/StatsDisplay.tsx
  constant COLORS (line 13) | const COLORS = {
  type StatItemProps (line 20) | type StatItemProps = {
  type StatsDisplayProps (line 95) | type StatsDisplayProps = {

FILE: video-promo/ai-research-skills-promo/src/components/SuccessScreen.tsx
  constant COLORS (line 15) | const COLORS = {
  constant EXAMPLE_PROMPTS (line 23) | const EXAMPLE_PROMPTS = [
  type SuccessScreenProps (line 29) | type SuccessScreenProps = {

FILE: video-promo/ai-research-skills-promo/src/components/Terminal.tsx
  constant COLORS (line 13) | const COLORS = {
  type TerminalProps (line 26) | type TerminalProps = {
  type CursorProps (line 134) | type CursorProps = {
  type TypewriterProps (line 167) | type TypewriterProps = {
  type CommandLineProps (line 201) | type CommandLineProps = {
  type ColoredTextProps (line 221) | type ColoredTextProps = {
Condensed preview — 499 files, each showing path, character count, and a content snippet (full structured content: 8,176K chars).
[
  {
    "path": ".claude-plugin/marketplace.json",
    "chars": 12081,
    "preview": "{\n  \"name\": \"ai-research-skills\",\n  \"owner\": {\n    \"name\": \"Orchestra Research\",\n    \"email\": \"zechen@orchestra-research"
  },
  {
    "path": ".github/workflows/claude.yml",
    "chars": 1109,
    "preview": "name: Claude Code\non:\n  issue_comment:\n    types: [created]\n  pull_request_review_comment:\n    types: [created]\n  issues"
  },
  {
    "path": ".github/workflows/publish-npm.yml",
    "chars": 2485,
    "preview": "name: Publish to npm\n\non:\n  push:\n    branches: [main]\n    paths:\n      - 'packages/ai-research-skills/**'\n\npermissions:"
  },
  {
    "path": ".github/workflows/sync-skills.yml",
    "chars": 7574,
    "preview": "name: Sync Skills to Orchestra\n\non:\n  push:\n    branches:\n      - main\n  workflow_dispatch: # Allow manual trigger\n\njobs"
  },
  {
    "path": ".gitignore",
    "chars": 1099,
    "preview": "# Python\n__pycache__/\n*.py[cod]\n*$py.class\n*.so\n\n# LaTeX auxiliary files\n*.aux\n*.bbl\n*.blg\n*.out\n*.fls\n*.fdb_latexmk\n*.s"
  },
  {
    "path": "0-autoresearch-skill/SKILL.md",
    "chars": 24691,
    "preview": "---\nname: autoresearch\ndescription: Orchestrates end-to-end autonomous AI research projects using a two-loop architectur"
  },
  {
    "path": "0-autoresearch-skill/references/agent-continuity.md",
    "chars": 4668,
    "preview": "# Agent Continuity: Keeping Research Running\n\nAutonomous research requires agents that keep working continuously — hours"
  },
  {
    "path": "0-autoresearch-skill/references/progress-reporting.md",
    "chars": 6447,
    "preview": "# Progress Reporting: Research Presentations\n\nWhen the research produces something worth sharing, create a compelling pr"
  },
  {
    "path": "0-autoresearch-skill/references/skill-routing.md",
    "chars": 8542,
    "preview": "# Skill Routing: When to Use Which Domain Skill\n\nThe autoresearch skill orchestrates — domain skills execute. This refer"
  },
  {
    "path": "0-autoresearch-skill/templates/findings.md",
    "chars": 1492,
    "preview": "# Research Findings\n\n## Research Question\n\n<!-- What are we trying to discover? One clear sentence. -->\n\n## Current Unde"
  },
  {
    "path": "0-autoresearch-skill/templates/progress-presentation.html",
    "chars": 8685,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width"
  },
  {
    "path": "0-autoresearch-skill/templates/research-log.md",
    "chars": 4977,
    "preview": "# Research Log\n\nChronological record of research decisions and actions. Append-only.\n\n| # | Date | Type | Summary |\n|---"
  },
  {
    "path": "0-autoresearch-skill/templates/research-state.yaml",
    "chars": 2100,
    "preview": "# Research State — Central Project Tracking\n# Copy this template to your project root and fill in as you go.\n# Updated b"
  },
  {
    "path": "01-model-architecture/.gitkeep",
    "chars": 172,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for model architecture.\n\nSee [CONTRIBU"
  },
  {
    "path": "01-model-architecture/litgpt/SKILL.md",
    "chars": 11005,
    "preview": "---\nname: implementing-llms-litgpt\ndescription: Implements and trains LLMs using Lightning AI's LitGPT with 20+ pretrain"
  },
  {
    "path": "01-model-architecture/litgpt/references/custom-models.md",
    "chars": 15651,
    "preview": "# Custom Models\n\nGuide to implementing custom model architectures in LitGPT.\n\n## Overview\n\nLitGPT's clean, single-file i"
  },
  {
    "path": "01-model-architecture/litgpt/references/distributed-training.md",
    "chars": 11086,
    "preview": "# Distributed Training\n\nGuide to FSDP (Fully Sharded Data Parallel) distributed training in LitGPT for scaling to multip"
  },
  {
    "path": "01-model-architecture/litgpt/references/supported-models.md",
    "chars": 7933,
    "preview": "# Supported Models\n\nComplete list of model architectures supported by LitGPT with parameter sizes and variants.\n\n## Over"
  },
  {
    "path": "01-model-architecture/litgpt/references/training-recipes.md",
    "chars": 10977,
    "preview": "# Training Recipes\n\nComplete hyperparameter configurations for LoRA, QLoRA, and full fine-tuning across different model "
  },
  {
    "path": "01-model-architecture/mamba/SKILL.md",
    "chars": 7360,
    "preview": "---\nname: mamba-architecture\ndescription: State-space model with O(n) complexity vs Transformers' O(n²). 5× faster infer"
  },
  {
    "path": "01-model-architecture/mamba/references/architecture-details.md",
    "chars": 5434,
    "preview": "# Mamba Architecture Details\n\n## Selective State Space Mechanism\n\nMamba's core innovation is the **Selective SSM (S6)** "
  },
  {
    "path": "01-model-architecture/mamba/references/benchmarks.md",
    "chars": 8078,
    "preview": "# Mamba Performance Benchmarks\n\n## Inference Speed Comparison\n\n### Throughput (tokens/sec)\n\n**Mamba-1.4B vs Transformer-"
  },
  {
    "path": "01-model-architecture/mamba/references/training-guide.md",
    "chars": 9011,
    "preview": "# Mamba Training Guide\n\n## Training from Scratch\n\n### Setup Environment\n\n```bash\n# Install dependencies\npip install torc"
  },
  {
    "path": "01-model-architecture/nanogpt/SKILL.md",
    "chars": 6744,
    "preview": "---\nname: nanogpt\ndescription: Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Cle"
  },
  {
    "path": "01-model-architecture/nanogpt/references/architecture.md",
    "chars": 11691,
    "preview": "# NanoGPT Architecture\n\n## Model Structure (~300 Lines)\n\nNanoGPT implements a clean GPT-2 architecture in minimal code f"
  },
  {
    "path": "01-model-architecture/nanogpt/references/data.md",
    "chars": 11186,
    "preview": "# NanoGPT Data Preparation\n\n## Data Format\n\nNanoGPT uses **binary token files** for efficient loading:\n\n```\ndataset/\n├──"
  },
  {
    "path": "01-model-architecture/nanogpt/references/training.md",
    "chars": 13520,
    "preview": "# NanoGPT Training Guide\n\n## Training Loop (~300 Lines)\n\nNanoGPT's `train.py` is a self-contained training script with m"
  },
  {
    "path": "01-model-architecture/rwkv/SKILL.md",
    "chars": 7085,
    "preview": "---\nname: rwkv-architecture\ndescription: RNN+Transformer hybrid with O(n) inference. Linear time, infinite context, no K"
  },
  {
    "path": "01-model-architecture/rwkv/references/architecture-details.md",
    "chars": 9296,
    "preview": "# RWKV Architecture Details\n\n## Time-Mixing and Channel-Mixing Blocks\n\nRWKV alternates between **Time-Mixing** (sequence"
  },
  {
    "path": "01-model-architecture/rwkv/references/rwkv7.md",
    "chars": 10421,
    "preview": "# RWKV-7: Latest Improvements (March 2025)\n\n## Overview\n\nRWKV-7 is the latest version released in March 2025, introducin"
  },
  {
    "path": "01-model-architecture/rwkv/references/state-management.md",
    "chars": 9488,
    "preview": "# RWKV State Management\n\n## Understanding RWKV State\n\nUnlike Transformers with KV cache, RWKV maintains a **fixed-size r"
  },
  {
    "path": "01-model-architecture/torchtitan/SKILL.md",
    "chars": 8927,
    "preview": "---\nname: distributed-llm-pretraining-torchtitan\ndescription: Provides PyTorch-native distributed LLM pretraining using "
  },
  {
    "path": "01-model-architecture/torchtitan/references/checkpoint.md",
    "chars": 4166,
    "preview": "# Checkpointing in TorchTitan\n\nTorchTitan uses PyTorch Distributed Checkpoint (DCP) for fault-tolerant, interoperable ch"
  },
  {
    "path": "01-model-architecture/torchtitan/references/custom-models.md",
    "chars": 7281,
    "preview": "# Adding Custom Models to TorchTitan\n\nThis guide explains how to add a new model to TorchTitan following the established"
  },
  {
    "path": "01-model-architecture/torchtitan/references/float8.md",
    "chars": 4055,
    "preview": "# Float8 Training in TorchTitan\n\nFloat8 training provides substantial speedups for models where GEMMs are large enough t"
  },
  {
    "path": "01-model-architecture/torchtitan/references/fsdp.md",
    "chars": 3888,
    "preview": "# FSDP2 in TorchTitan\n\n## Why FSDP2?\n\nFSDP2 is a rewrite of PyTorch's Fully Sharded Data Parallel (FSDP) API, removing t"
  },
  {
    "path": "02-tokenization/.gitkeep",
    "chars": 166,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for tokenization.\n\nSee [CONTRIBUTING.m"
  },
  {
    "path": "02-tokenization/huggingface-tokenizers/SKILL.md",
    "chars": 13643,
    "preview": "---\nname: huggingface-tokenizers\ndescription: Fast tokenizers optimized for research and production. Rust-based implemen"
  },
  {
    "path": "02-tokenization/huggingface-tokenizers/references/algorithms.md",
    "chars": 15053,
    "preview": "# Tokenization Algorithms Deep Dive\n\nComprehensive explanation of BPE, WordPiece, and Unigram algorithms.\n\n## Byte-Pair "
  },
  {
    "path": "02-tokenization/huggingface-tokenizers/references/integration.md",
    "chars": 15329,
    "preview": "# Transformers Integration\n\nComplete guide to using HuggingFace Tokenizers with the Transformers library.\n\n## AutoTokeni"
  },
  {
    "path": "02-tokenization/huggingface-tokenizers/references/pipeline.md",
    "chars": 16624,
    "preview": "# Tokenization Pipeline Components\n\nComplete guide to normalizers, pre-tokenizers, models, post-processors, and decoders"
  },
  {
    "path": "02-tokenization/huggingface-tokenizers/references/training.md",
    "chars": 14545,
    "preview": "# Training Custom Tokenizers\n\nComplete guide to training tokenizers from scratch.\n\n## Training workflow\n\n### Step 1: Cho"
  },
  {
    "path": "02-tokenization/sentencepiece/SKILL.md",
    "chars": 5609,
    "preview": "---\nname: sentencepiece\ndescription: Language-independent tokenizer treating text as raw Unicode. Supports BPE and Unigr"
  },
  {
    "path": "02-tokenization/sentencepiece/references/algorithms.md",
    "chars": 4182,
    "preview": "# Tokenization Algorithms\n\nBPE vs Unigram comparison and subword regularization.\n\n## BPE (Byte-Pair Encoding)\n\n### Algor"
  },
  {
    "path": "02-tokenization/sentencepiece/references/training.md",
    "chars": 6239,
    "preview": "# SentencePiece Training Guide\n\nComplete guide to training SentencePiece models.\n\n## Training workflow\n\n### Step 1: Prep"
  },
  {
    "path": "03-fine-tuning/axolotl/SKILL.md",
    "chars": 4787,
    "preview": "---\nname: axolotl\ndescription: Expert guidance for fine-tuning LLMs with Axolotl - YAML configs, 100+ models, LoRA/QLoRA"
  },
  {
    "path": "03-fine-tuning/axolotl/references/api.md",
    "chars": 120859,
    "preview": "# Axolotl - Api\n\n**Pages:** 150\n\n---\n\n## cli.cloud.modal_\n\n**URL:** https://docs.axolotl.ai/docs/api/cli.cloud.modal_.ht"
  },
  {
    "path": "03-fine-tuning/axolotl/references/dataset-formats.md",
    "chars": 45917,
    "preview": "# Axolotl - Dataset-Formats\n\n**Pages:** 9\n\n---\n\n## Custom Pre-Tokenized Dataset\n\n**URL:** https://docs.axolotl.ai/docs/d"
  },
  {
    "path": "03-fine-tuning/axolotl/references/index.md",
    "chars": 199,
    "preview": "# Axolotl Documentation Index\n\n## Categories\n\n### Api\n**File:** `api.md`\n**Pages:** 150\n\n### Dataset-Formats\n**File:** `"
  },
  {
    "path": "03-fine-tuning/axolotl/references/other.md",
    "chars": 140220,
    "preview": "# Axolotl - Other\n\n**Pages:** 26\n\n---\n\n## Mixed Precision Training\n\n**URL:** https://docs.axolotl.ai/docs/mixed_precisio"
  },
  {
    "path": "03-fine-tuning/llama-factory/SKILL.md",
    "chars": 2472,
    "preview": "---\nname: llama-factory\ndescription: Expert guidance for fine-tuning LLMs with LLaMA-Factory - WebUI no-code, 100+ model"
  },
  {
    "path": "03-fine-tuning/llama-factory/references/_images.md",
    "chars": 307,
    "preview": "# Llama-Factory -  Images\n\n**Pages:** 3\n\n---\n\n## \n\n**URL:** https://llamafactory.readthedocs.io/en/latest/_images/logo.p"
  },
  {
    "path": "03-fine-tuning/llama-factory/references/advanced.md",
    "chars": 27131,
    "preview": "# Llama-Factory - Advanced\n\n**Pages:** 14\n\n---\n\n## GPT-OSS¶\n\n**URL:** https://llamafactory.readthedocs.io/en/latest/adva"
  },
  {
    "path": "03-fine-tuning/llama-factory/references/getting_started.md",
    "chars": 8639,
    "preview": "# Llama-Factory - Getting Started\n\n**Pages:** 7\n\n---\n\n## Installation¶\n\n**URL:** https://llamafactory.readthedocs.io/en/"
  },
  {
    "path": "03-fine-tuning/llama-factory/references/index.md",
    "chars": 262,
    "preview": "# Llama-Factory Documentation Index\n\n## Categories\n\n###  Images\n**File:** `_images.md`\n**Pages:** 3\n\n### Advanced\n**File"
  },
  {
    "path": "03-fine-tuning/llama-factory/references/other.md",
    "chars": 1163,
    "preview": "# Llama-Factory - Other\n\n**Pages:** 1\n\n---\n\n## Welcome to LLaMA Factory!¶\n\n**URL:** https://llamafactory.readthedocs.io/"
  },
  {
    "path": "03-fine-tuning/peft/SKILL.md",
    "chars": 12210,
    "preview": "---\nname: peft-fine-tuning\ndescription: Parameter-efficient fine-tuning for LLMs using LoRA, QLoRA, and 25+ methods. Use"
  },
  {
    "path": "03-fine-tuning/peft/references/advanced-usage.md",
    "chars": 12541,
    "preview": "# PEFT Advanced Usage Guide\n\n## Advanced LoRA Variants\n\n### DoRA (Weight-Decomposed Low-Rank Adaptation)\n\nDoRA decompose"
  },
  {
    "path": "03-fine-tuning/peft/references/troubleshooting.md",
    "chars": 10344,
    "preview": "# PEFT Troubleshooting Guide\n\n## Installation Issues\n\n### bitsandbytes CUDA Error\n\n**Error**: `CUDA Setup failed despite"
  },
  {
    "path": "03-fine-tuning/unsloth/SKILL.md",
    "chars": 2307,
    "preview": "---\nname: unsloth\ndescription: Expert guidance for fast fine-tuning with Unsloth - 2-5x faster training, 50-80% less mem"
  },
  {
    "path": "03-fine-tuning/unsloth/references/index.md",
    "chars": 98,
    "preview": "# Unsloth Documentation Index\n\n## Categories\n\n### Llms-Txt\n**File:** `llms-txt.md`\n**Pages:** 136\n"
  },
  {
    "path": "03-fine-tuning/unsloth/references/llms-full.md",
    "chars": 1074743,
    "preview": "# Unsloth Docs\n\nTrain your own model with Unsloth, an open-source framework for LLM fine-tuning and reinforcement learni"
  },
  {
    "path": "03-fine-tuning/unsloth/references/llms-txt.md",
    "chars": 810855,
    "preview": "# Unsloth - Llms-Txt\n\n**Pages:** 136\n\n---\n\n## !pip install huggingface_hub hf_transfer\n\n**URL:** llms-txt#!pip-install-h"
  },
  {
    "path": "03-fine-tuning/unsloth/references/llms.md",
    "chars": 12507,
    "preview": "# Unsloth Documentation\n\n## Unsloth Documentation\n\n- [Unsloth Docs](/get-started/unsloth-docs.md): Train your own model "
  },
  {
    "path": "04-mechanistic-interpretability/nnsight/SKILL.md",
    "chars": 13056,
    "preview": "---\nname: nnsight-remote-interpretability\ndescription: Provides guidance for interpreting and manipulating neural networ"
  },
  {
    "path": "04-mechanistic-interpretability/nnsight/references/README.md",
    "chars": 1913,
    "preview": "# nnsight Reference Documentation\n\nThis directory contains comprehensive reference materials for nnsight.\n\n## Contents\n\n"
  },
  {
    "path": "04-mechanistic-interpretability/nnsight/references/api.md",
    "chars": 7160,
    "preview": "# nnsight API Reference\n\n## LanguageModel\n\nMain class for wrapping language models with intervention capabilities.\n\n### "
  },
  {
    "path": "04-mechanistic-interpretability/nnsight/references/tutorials.md",
    "chars": 7970,
    "preview": "# nnsight Tutorials\n\n## Tutorial 1: Basic Activation Analysis\n\n### Goal\nLoad a model, access internal activations, and a"
  },
  {
    "path": "04-mechanistic-interpretability/pyvene/SKILL.md",
    "chars": 14136,
    "preview": "---\nname: pyvene-interventions\ndescription: Provides guidance for performing causal interventions on PyTorch models usin"
  },
  {
    "path": "04-mechanistic-interpretability/pyvene/references/README.md",
    "chars": 2105,
    "preview": "# pyvene Reference Documentation\n\nThis directory contains comprehensive reference materials for pyvene.\n\n## Contents\n\n- "
  },
  {
    "path": "04-mechanistic-interpretability/pyvene/references/api.md",
    "chars": 7871,
    "preview": "# pyvene API Reference\n\n## IntervenableModel\n\nThe core class that wraps PyTorch models for intervention.\n\n### Basic Usag"
  },
  {
    "path": "04-mechanistic-interpretability/pyvene/references/tutorials.md",
    "chars": 10111,
    "preview": "# pyvene Tutorials\n\n## Tutorial 1: Basic Activation Patching\n\n### Goal\nSwap activations between two prompts to test caus"
  },
  {
    "path": "04-mechanistic-interpretability/saelens/SKILL.md",
    "chars": 12729,
    "preview": "---\nname: sparse-autoencoder-training\ndescription: Provides guidance for training and analyzing Sparse Autoencoders (SAE"
  },
  {
    "path": "04-mechanistic-interpretability/saelens/references/README.md",
    "chars": 2152,
    "preview": "# SAELens Reference Documentation\n\nThis directory contains comprehensive reference materials for SAELens.\n\n## Contents\n\n"
  },
  {
    "path": "04-mechanistic-interpretability/saelens/references/api.md",
    "chars": 6967,
    "preview": "# SAELens API Reference\n\n## SAE Class\n\nThe core class representing a Sparse Autoencoder.\n\n### Loading Pre-trained SAEs\n\n"
  },
  {
    "path": "04-mechanistic-interpretability/saelens/references/tutorials.md",
    "chars": 9379,
    "preview": "# SAELens Tutorials\n\n## Tutorial 1: Loading and Analyzing Pre-trained SAEs\n\n### Goal\nLoad a pre-trained SAE and analyze "
  },
  {
    "path": "04-mechanistic-interpretability/transformer-lens/SKILL.md",
    "chars": 12026,
    "preview": "---\nname: transformer-lens-interpretability\ndescription: Provides guidance for mechanistic interpretability research usi"
  },
  {
    "path": "04-mechanistic-interpretability/transformer-lens/references/README.md",
    "chars": 1643,
    "preview": "# TransformerLens Reference Documentation\n\nThis directory contains comprehensive reference materials for TransformerLens"
  },
  {
    "path": "04-mechanistic-interpretability/transformer-lens/references/api.md",
    "chars": 8349,
    "preview": "# TransformerLens API Reference\n\n## HookedTransformer\n\nThe core class for mechanistic interpretability, wrapping transfo"
  },
  {
    "path": "04-mechanistic-interpretability/transformer-lens/references/tutorials.md",
    "chars": 10140,
    "preview": "# TransformerLens Tutorials\n\n## Tutorial 1: Basic Activation Analysis\n\n### Goal\nUnderstand how to load models, cache act"
  },
  {
    "path": "05-data-processing/.gitkeep",
    "chars": 169,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for data processing.\n\nSee [CONTRIBUTIN"
  },
  {
    "path": "05-data-processing/nemo-curator/SKILL.md",
    "chars": 9319,
    "preview": "---\nname: nemo-curator\ndescription: GPU-accelerated data curation for LLM training. Supports text/image/video/audio. Fea"
  },
  {
    "path": "05-data-processing/nemo-curator/references/deduplication.md",
    "chars": 2137,
    "preview": "# Deduplication Guide\n\nComplete guide to exact, fuzzy, and semantic deduplication.\n\n## Exact deduplication\n\nRemove docum"
  },
  {
    "path": "05-data-processing/nemo-curator/references/filtering.md",
    "chars": 2349,
    "preview": "# Quality Filtering Guide\n\nComplete guide to NeMo Curator's 30+ quality filters.\n\n## Text-based filters\n\n### Word count\n"
  },
  {
    "path": "05-data-processing/ray-data/SKILL.md",
    "chars": 7302,
    "preview": "---\nname: ray-data\ndescription: Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports "
  },
  {
    "path": "05-data-processing/ray-data/references/integration.md",
    "chars": 1851,
    "preview": "# Ray Data Integration Guide\n\nIntegration with Ray Train and ML frameworks.\n\n## Ray Train integration\n\n### Basic trainin"
  },
  {
    "path": "05-data-processing/ray-data/references/transformations.md",
    "chars": 1660,
    "preview": "# Ray Data Transformations\n\nComplete guide to data transformations in Ray Data.\n\n## Core operations\n\n### Map batches (ve"
  },
  {
    "path": "06-post-training/grpo-rl-training/README.md",
    "chars": 3437,
    "preview": "# GRPO/RL Training Skill\n\n**Expert-level guidance for Group Relative Policy Optimization with TRL**\n\n## 📁 Skill Structur"
  },
  {
    "path": "06-post-training/grpo-rl-training/SKILL.md",
    "chars": 17184,
    "preview": "---\nname: grpo-rl-training\ndescription: Expert guidance for GRPO/RL fine-tuning with TRL for reasoning and task-specific"
  },
  {
    "path": "06-post-training/grpo-rl-training/examples/reward_functions_library.py",
    "chars": 11568,
    "preview": "\"\"\"\nGRPO Reward Functions Library\n===============================\n\nA collection of battle-tested reward functions for co"
  },
  {
    "path": "06-post-training/grpo-rl-training/templates/basic_grpo_training.py",
    "chars": 6122,
    "preview": "\"\"\"\nBasic GRPO Training Template\n=============================\n\nA minimal, production-ready template for GRPO training w"
  },
  {
    "path": "06-post-training/miles/SKILL.md",
    "chars": 8894,
    "preview": "---\nname: miles-rl-training\ndescription: Provides guidance for enterprise-grade RL training using miles, a production-re"
  },
  {
    "path": "06-post-training/miles/references/api-reference.md",
    "chars": 4137,
    "preview": "# miles API Reference\n\n## Overview\n\nmiles is an enterprise-grade RL framework built on slime, adding advanced features f"
  },
  {
    "path": "06-post-training/miles/references/troubleshooting.md",
    "chars": 5814,
    "preview": "# miles Troubleshooting Guide\n\n## FP8 Training Issues\n\n### Issue: FP8 Training Collapse\n\n**Symptoms**: Loss explodes, Na"
  },
  {
    "path": "06-post-training/openrlhf/SKILL.md",
    "chars": 8369,
    "preview": "---\nname: openrlhf-training\ndescription: High-performance RLHF framework with Ray+vLLM acceleration. Use for PPO, GRPO, "
  },
  {
    "path": "06-post-training/openrlhf/references/algorithm-comparison.md",
    "chars": 9651,
    "preview": "# Algorithm Comparison\n\nComplete guide to RL algorithms in OpenRLHF: PPO, REINFORCE++, GRPO, RLOO, and their variants.\n\n"
  },
  {
    "path": "06-post-training/openrlhf/references/custom-rewards.md",
    "chars": 15865,
    "preview": "# Custom Reward Functions\n\nComplete guide to implementing custom reward functions and agent RLHF in OpenRLHF.\n\n## Overvi"
  },
  {
    "path": "06-post-training/openrlhf/references/hybrid-engine.md",
    "chars": 7235,
    "preview": "# Hybrid Engine Architecture\n\nComplete guide to OpenRLHF's GPU resource sharing system for maximizing utilization during"
  },
  {
    "path": "06-post-training/openrlhf/references/multi-node-training.md",
    "chars": 11007,
    "preview": "# Multi-Node Training\n\nComplete guide to distributed Ray cluster training with OpenRLHF across multiple machines.\n\n## Ov"
  },
  {
    "path": "06-post-training/simpo/SKILL.md",
    "chars": 5913,
    "preview": "---\nname: simpo-training\ndescription: Simple Preference Optimization for LLM alignment. Reference-free alternative to DP"
  },
  {
    "path": "06-post-training/simpo/references/datasets.md",
    "chars": 10865,
    "preview": "# Datasets\n\nComplete guide to preference datasets for SimPO training.\n\n## Dataset Format\n\n### Required Fields\n\nPreferenc"
  },
  {
    "path": "06-post-training/simpo/references/hyperparameters.md",
    "chars": 8496,
    "preview": "# Hyperparameters\n\nComplete guide to SimPO hyperparameter selection and tuning.\n\n## Overview\n\nKey hyperparameters in Sim"
  },
  {
    "path": "06-post-training/simpo/references/loss-functions.md",
    "chars": 6997,
    "preview": "# Loss Functions\n\nComplete guide to SimPO loss functions and mathematical formulations.\n\n## Overview\n\nSimPO supports two"
  },
  {
    "path": "06-post-training/slime/SKILL.md",
    "chars": 11070,
    "preview": "---\nname: slime-rl-training\ndescription: Provides guidance for LLM post-training with RL using slime, a Megatron+SGLang "
  },
  {
    "path": "06-post-training/slime/references/api-reference.md",
    "chars": 11387,
    "preview": "# slime API Reference\n\n## Architecture Overview\n\nslime operates with a three-module architecture orchestrated by Ray:\n\n`"
  },
  {
    "path": "06-post-training/slime/references/troubleshooting.md",
    "chars": 7217,
    "preview": "# slime Troubleshooting Guide\n\n## Common Issues and Solutions\n\n### SGLang Issues\n\n#### Issue: SGLang Engine Crash\n\n**Sym"
  },
  {
    "path": "06-post-training/torchforge/SKILL.md",
    "chars": 9954,
    "preview": "---\nname: torchforge-rl-training\ndescription: Provides guidance for PyTorch-native agentic RL using torchforge, Meta's l"
  },
  {
    "path": "06-post-training/torchforge/references/api-reference.md",
    "chars": 7877,
    "preview": "# torchforge API Reference\n\n## Architecture Overview\n\ntorchforge implements a fully asynchronous RL system built on:\n\n- "
  },
  {
    "path": "06-post-training/torchforge/references/troubleshooting.md",
    "chars": 6709,
    "preview": "# torchforge Troubleshooting Guide\n\n## GPU Resource Issues\n\n### Issue: Not Enough GPUs\n\n**Symptoms**: \"Insufficient GPU "
  },
  {
    "path": "06-post-training/trl-fine-tuning/SKILL.md",
    "chars": 11447,
    "preview": "---\nname: fine-tuning-with-trl\ndescription: Fine-tune LLMs using reinforcement learning with TRL - SFT for instruction t"
  },
  {
    "path": "06-post-training/trl-fine-tuning/references/dpo-variants.md",
    "chars": 4288,
    "preview": "# DPO Variants\n\nComplete guide to Direct Preference Optimization loss variants in TRL.\n\n## Overview\n\nDPO optimizes model"
  },
  {
    "path": "06-post-training/trl-fine-tuning/references/online-rl.md",
    "chars": 1971,
    "preview": "# Online RL Methods\n\nGuide to online reinforcement learning with PPO, GRPO, RLOO, and OnlineDPO.\n\n## Overview\n\nOnline RL"
  },
  {
    "path": "06-post-training/trl-fine-tuning/references/reward-modeling.md",
    "chars": 2597,
    "preview": "# Reward Modeling\n\nGuide to training reward models with TRL for RLHF pipelines.\n\n## Overview\n\nReward models score comple"
  },
  {
    "path": "06-post-training/trl-fine-tuning/references/sft-training.md",
    "chars": 3236,
    "preview": "# SFT Training Guide\n\nComplete guide to Supervised Fine-Tuning (SFT) with TRL for instruction tuning and task-specific f"
  },
  {
    "path": "06-post-training/verl/SKILL.md",
    "chars": 9789,
    "preview": "---\nname: verl-rl-training\ndescription: Provides guidance for training LLMs with reinforcement learning using verl (Volc"
  },
  {
    "path": "06-post-training/verl/references/api-reference.md",
    "chars": 6941,
    "preview": "# verl API Reference\n\n## Core Classes\n\n### RayPPOTrainer\n\nThe central controller for the training loop. Manages resource"
  },
  {
    "path": "06-post-training/verl/references/troubleshooting.md",
    "chars": 6792,
    "preview": "# verl Troubleshooting Guide\n\n## Common Issues and Solutions\n\n### OOM (Out of Memory) Issues\n\n#### Issue: OOM During Rol"
  },
  {
    "path": "07-safety-alignment/.gitkeep",
    "chars": 170,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for safety alignment.\n\nSee [CONTRIBUTI"
  },
  {
    "path": "07-safety-alignment/constitutional-ai/SKILL.md",
    "chars": 8175,
    "preview": "---\nname: constitutional-ai\ndescription: Anthropic's method for training harmless AI through self-improvement. Two-phase"
  },
  {
    "path": "07-safety-alignment/llamaguard/SKILL.md",
    "chars": 9197,
    "preview": "---\nname: llamaguard\ndescription: Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety cate"
  },
  {
    "path": "07-safety-alignment/nemo-guardrails/SKILL.md",
    "chars": 7640,
    "preview": "---\nname: nemo-guardrails\ndescription: NVIDIA's runtime safety framework for LLM applications. Features jailbreak detect"
  },
  {
    "path": "07-safety-alignment/prompt-guard/SKILL.md",
    "chars": 9441,
    "preview": "---\nname: prompt-guard\ndescription: Meta's 86M prompt injection and jailbreak detector. Filters malicious prompts and th"
  },
  {
    "path": "08-distributed-training/accelerate/SKILL.md",
    "chars": 8334,
    "preview": "---\nname: huggingface-accelerate\ndescription: Simplest distributed training API. 4 lines to add distributed support to a"
  },
  {
    "path": "08-distributed-training/accelerate/references/custom-plugins.md",
    "chars": 11781,
    "preview": "# Custom Plugins for Accelerate\n\n## Overview\n\nAccelerate allows creating **custom plugins** to extend distributed traini"
  },
  {
    "path": "08-distributed-training/accelerate/references/megatron-integration.md",
    "chars": 11235,
    "preview": "# Megatron Integration with Accelerate\n\n## Overview\n\nAccelerate supports Megatron-LM for massive model training with ten"
  },
  {
    "path": "08-distributed-training/accelerate/references/performance.md",
    "chars": 12558,
    "preview": "# Accelerate Performance Tuning\n\n## Profiling\n\n### Basic Profiling\n\n```python\nfrom accelerate import Accelerator\nimport "
  },
  {
    "path": "08-distributed-training/deepspeed/SKILL.md",
    "chars": 144563,
    "preview": "---\nname: deepspeed\ndescription: Expert guidance for distributed training with DeepSpeed - ZeRO optimization stages, pip"
  },
  {
    "path": "08-distributed-training/deepspeed/references/08.md",
    "chars": 304,
    "preview": "# Deepspeed - 08\n\n**Pages:** 1\n\n---\n\n## DeepSpeed powers 8x larger MoE model training with high performance\n\n**URL:** ht"
  },
  {
    "path": "08-distributed-training/deepspeed/references/09.md",
    "chars": 27495,
    "preview": "# Deepspeed - 09\n\n**Pages:** 2\n\n---\n\n## DeepSpeed-MoE for NLG: Reducing the training cost of language models by 5 times\n"
  },
  {
    "path": "08-distributed-training/deepspeed/references/2020.md",
    "chars": 35764,
    "preview": "# Deepspeed - 2020\n\n**Pages:** 16\n\n---\n\n## DeepSpeed Microsoft Research Webinar is now on-demand\n\n**URL:** https://www.d"
  },
  {
    "path": "08-distributed-training/deepspeed/references/2023.md",
    "chars": 10357,
    "preview": "# Deepspeed - 2023\n\n**Pages:** 21\n\n---\n\n## DeepSpeed-VisualChat: Improve Your Chat Experience with Multi-Round Multi-Ima"
  },
  {
    "path": "08-distributed-training/deepspeed/references/assets.md",
    "chars": 2491,
    "preview": "# Deepspeed - Assets\n\n**Pages:** 29\n\n---\n\n## \n\n**URL:** https://www.deepspeed.ai/assets/images/zero1_dp8_1.5B_log.png\n\n-"
  },
  {
    "path": "08-distributed-training/deepspeed/references/index.md",
    "chars": 402,
    "preview": "# Deepspeed Documentation Index\n\n## Categories\n\n### 08\n**File:** `08.md`\n**Pages:** 1\n\n### 09\n**File:** `09.md`\n**Pages:"
  },
  {
    "path": "08-distributed-training/deepspeed/references/mii.md",
    "chars": 11030,
    "preview": "# Deepspeed - Mii\n\n**Pages:** 1\n\n---\n\n## DeepSpeed-MII: instant speedup on 24,000+ open-source DL models with up to 40x "
  },
  {
    "path": "08-distributed-training/deepspeed/references/other.md",
    "chars": 99401,
    "preview": "# Deepspeed - Other\n\n**Pages:** 15\n\n---\n\n## Training Overview and Features\n\n**URL:** https://www.deepspeed.ai/training/\n"
  },
  {
    "path": "08-distributed-training/deepspeed/references/tutorials.md",
    "chars": 453776,
    "preview": "# Deepspeed - Tutorials\n\n**Pages:** 59\n\n---\n\n## DeepNVMe\n\n**URL:** https://www.deepspeed.ai/tutorials/deepnvme/\n\n**Conte"
  },
  {
    "path": "08-distributed-training/megatron-core/SKILL.md",
    "chars": 9699,
    "preview": "---\nname: training-llms-megatron\ndescription: Trains large language models (2B-462B parameters) using NVIDIA Megatron-Co"
  },
  {
    "path": "08-distributed-training/megatron-core/references/benchmarks.md",
    "chars": 7345,
    "preview": "# Performance Benchmarks\n\nPerformance metrics and benchmarks for Megatron-Core across different model sizes and hardware"
  },
  {
    "path": "08-distributed-training/megatron-core/references/parallelism-guide.md",
    "chars": 9568,
    "preview": "# Parallelism Strategies Guide\n\nComplete guide to parallelism in Megatron-Core: when to use each strategy, performance c"
  },
  {
    "path": "08-distributed-training/megatron-core/references/production-examples.md",
    "chars": 10887,
    "preview": "# Production Examples\n\nReal-world production deployments using Megatron-Core and Megatron-LM.\n\n## Meta LLaMA 3\n\n### Over"
  },
  {
    "path": "08-distributed-training/megatron-core/references/training-recipes.md",
    "chars": 11584,
    "preview": "# Training Recipes\n\nComplete hyperparameter configurations and training recipes for GPT, LLaMA, and Mixtral models.\n\n## "
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/SKILL.md",
    "chars": 10947,
    "preview": "---\nname: pytorch-fsdp2\ndescription: Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, m"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_dcp_async_recipe.md",
    "chars": 798,
    "preview": "# Reference: Asynchronous Saving with Distributed Checkpoint (DCP) recipe\n\n**Source (official):** PyTorch Tutorials reci"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_dcp_overview.md",
    "chars": 1006,
    "preview": "# Reference: Distributed Checkpoint (DCP) overview (torch.distributed.checkpoint)\n\n**Source (official):** PyTorch docs —"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_dcp_recipe.md",
    "chars": 1031,
    "preview": "# Reference: Getting Started with Distributed Checkpoint (DCP) recipe\n\n**Source (official):** PyTorch Tutorials recipe —"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_ddp_notes.md",
    "chars": 455,
    "preview": "# Reference: Distributed Data Parallel (DDP) notes\n\n**Source (official):** PyTorch docs — “Distributed Data Parallel”  \n"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_device_mesh_tutorial.md",
    "chars": 1212,
    "preview": "# Reference: Getting Started with DeviceMesh (PyTorch tutorial)\n\n**Source (official):** PyTorch Recipes — “Getting Start"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_examples_fsdp2.md",
    "chars": 742,
    "preview": "# Reference: Official `pytorch/examples` FSDP2 scripts\n\n**Sources (official, code):**\n- `pytorch/examples` repository: h"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_fsdp1_api.md",
    "chars": 390,
    "preview": "# Reference: Fully Sharded Data Parallel (FSDP1) API\n\n**Source (official):** PyTorch docs — “Fully Sharded Data Parallel"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_fsdp2_tutorial.md",
    "chars": 2477,
    "preview": "# Reference: Getting Started with Fully Sharded Data Parallel (FSDP2) tutorial\n\n**Source (official):** PyTorch Tutorials"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_fully_shard_api.md",
    "chars": 2862,
    "preview": "# Reference: `torch.distributed.fsdp.fully_shard` API (FSDP2)\n\n**Source (official):** PyTorch docs — `torch.distributed."
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/pytorch_tp_tutorial.md",
    "chars": 984,
    "preview": "# Reference: Tensor Parallel (TP) tutorial (and how it composes with FSDP)\n\n**Source (official):** PyTorch Tutorials — “"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/ray_train_fsdp2_example.md",
    "chars": 586,
    "preview": "# Reference: Ray Train FSDP2 integration guide (third-party, useful patterns)\n\n**Source (third-party):** Ray docs — “Get"
  },
  {
    "path": "08-distributed-training/pytorch-fsdp2/references/torchtitan_fsdp_notes.md",
    "chars": 660,
    "preview": "# Reference: TorchTitan notes on FSDP/FSDP2 (production-oriented)\n\n**Source (official-ish, PyTorch org):** TorchTitan — "
  },
  {
    "path": "08-distributed-training/pytorch-lightning/SKILL.md",
    "chars": 9116,
    "preview": "---\nname: pytorch-lightning\ndescription: High-level PyTorch framework with Trainer class, automatic distributed training"
  },
  {
    "path": "08-distributed-training/pytorch-lightning/references/callbacks.md",
    "chars": 12031,
    "preview": "# PyTorch Lightning Callbacks\n\n## Overview\n\nCallbacks add functionality to training without modifying the LightningModul"
  },
  {
    "path": "08-distributed-training/pytorch-lightning/references/distributed.md",
    "chars": 10811,
    "preview": "# PyTorch Lightning Distributed Training\n\n## Distributed Strategies\n\nLightning supports multiple distributed strategies "
  },
  {
    "path": "08-distributed-training/pytorch-lightning/references/hyperparameter-tuning.md",
    "chars": 12515,
    "preview": "# Hyperparameter Tuning with PyTorch Lightning\n\n## Integration with Tuning Frameworks\n\nLightning integrates seamlessly w"
  },
  {
    "path": "08-distributed-training/ray-train/SKILL.md",
    "chars": 10692,
    "preview": "---\nname: ray-train\ndescription: Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFa"
  },
  {
    "path": "08-distributed-training/ray-train/references/multi-node.md",
    "chars": 13523,
    "preview": "# Ray Train Multi-Node Setup\n\n## Ray Cluster Architecture\n\nRay Train runs on a **Ray cluster** with one head node and mu"
  },
  {
    "path": "09-infrastructure/.gitkeep",
    "chars": 168,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for infrastructure.\n\nSee [CONTRIBUTING"
  },
  {
    "path": "09-infrastructure/lambda-labs/SKILL.md",
    "chars": 12123,
    "preview": "---\nname: lambda-labs-gpu-cloud\ndescription: Reserved and on-demand GPU cloud instances for ML training and inference. U"
  },
  {
    "path": "09-infrastructure/lambda-labs/references/advanced-usage.md",
    "chars": 15016,
    "preview": "# Lambda Labs Advanced Usage Guide\n\n## Multi-Node Distributed Training\n\n### PyTorch DDP across nodes\n\n```python\n# train_"
  },
  {
    "path": "09-infrastructure/lambda-labs/references/troubleshooting.md",
    "chars": 11679,
    "preview": "# Lambda Labs Troubleshooting Guide\n\n## Instance Launch Issues\n\n### No instances available\n\n**Error**: \"No capacity avai"
  },
  {
    "path": "09-infrastructure/modal/SKILL.md",
    "chars": 8553,
    "preview": "---\nname: modal-serverless-gpu\ndescription: Serverless GPU cloud platform for running ML workloads. Use when you need on"
  },
  {
    "path": "09-infrastructure/modal/references/advanced-usage.md",
    "chars": 10903,
    "preview": "# Modal Advanced Usage Guide\n\n## Multi-GPU Training\n\n### Single-node multi-GPU\n\n```python\nimport modal\n\napp = modal.App("
  },
  {
    "path": "09-infrastructure/modal/references/troubleshooting.md",
    "chars": 10516,
    "preview": "# Modal Troubleshooting Guide\n\n## Installation Issues\n\n### Authentication fails\n\n**Error**: `modal setup` doesn't comple"
  },
  {
    "path": "09-infrastructure/skypilot/SKILL.md",
    "chars": 9640,
    "preview": "---\nname: skypilot-multi-cloud-orchestration\ndescription: Multi-cloud orchestration for ML workloads with automatic cost"
  },
  {
    "path": "09-infrastructure/skypilot/references/advanced-usage.md",
    "chars": 7469,
    "preview": "# SkyPilot Advanced Usage Guide\n\n## Multi-Cloud Strategies\n\n### Cloud selection patterns\n\n```yaml\n# Prefer specific clou"
  },
  {
    "path": "09-infrastructure/skypilot/references/troubleshooting.md",
    "chars": 10493,
    "preview": "# SkyPilot Troubleshooting Guide\n\n## Installation Issues\n\n### Cloud credentials not found\n\n**Error**: `sky check` shows "
  },
  {
    "path": "10-optimization/.gitkeep",
    "chars": 166,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for optimization.\n\nSee [CONTRIBUTING.m"
  },
  {
    "path": "10-optimization/awq/SKILL.md",
    "chars": 8383,
    "preview": "---\nname: awq-quantization\ndescription: Activation-aware weight quantization for 4-bit LLM compression with 3x speedup a"
  },
  {
    "path": "10-optimization/awq/references/advanced-usage.md",
    "chars": 7983,
    "preview": "# AWQ Advanced Usage Guide\n\n## Quantization Algorithm Details\n\n### How AWQ Works\n\nAWQ (Activation-aware Weight Quantizat"
  },
  {
    "path": "10-optimization/awq/references/troubleshooting.md",
    "chars": 7733,
    "preview": "# AWQ Troubleshooting Guide\n\n## Installation Issues\n\n### CUDA Version Mismatch\n\n**Error**: `RuntimeError: CUDA error: no"
  },
  {
    "path": "10-optimization/bitsandbytes/SKILL.md",
    "chars": 10112,
    "preview": "---\nname: quantizing-models-bitsandbytes\ndescription: Quantizes LLMs to 8-bit or 4-bit for 50-75% memory reduction with "
  },
  {
    "path": "10-optimization/bitsandbytes/references/memory-optimization.md",
    "chars": 12608,
    "preview": "# Memory Optimization\n\nComplete guide to CPU offloading, gradient checkpointing, memory profiling, and advanced memory-s"
  },
  {
    "path": "10-optimization/bitsandbytes/references/qlora-training.md",
    "chars": 12003,
    "preview": "# QLoRA Training\n\nComplete guide to fine-tuning large language models using 4-bit quantization with QLoRA (Quantized Low"
  },
  {
    "path": "10-optimization/bitsandbytes/references/quantization-formats.md",
    "chars": 10242,
    "preview": "# Quantization Formats\n\nComplete guide to INT8, NF4, FP4 quantization formats, double quantization, and custom configura"
  },
  {
    "path": "10-optimization/flash-attention/SKILL.md",
    "chars": 10189,
    "preview": "---\nname: optimizing-attention-flash\ndescription: Optimizes transformer attention with Flash Attention for 2-4x speedup "
  },
  {
    "path": "10-optimization/flash-attention/references/benchmarks.md",
    "chars": 7079,
    "preview": "# Performance Benchmarks\n\n## Contents\n- Speed comparisons across GPUs\n- Memory usage analysis\n- Scaling with sequence le"
  },
  {
    "path": "10-optimization/flash-attention/references/transformers-integration.md",
    "chars": 7427,
    "preview": "# HuggingFace Transformers Integration\n\n## Contents\n- Enabling Flash Attention in Transformers\n- Supported model archite"
  },
  {
    "path": "10-optimization/gguf/SKILL.md",
    "chars": 10320,
    "preview": "---\nname: gguf-quantization\ndescription: GGUF format and llama.cpp quantization for efficient CPU/GPU inference. Use whe"
  },
  {
    "path": "10-optimization/gguf/references/advanced-usage.md",
    "chars": 10887,
    "preview": "# GGUF Advanced Usage Guide\n\n## Speculative Decoding\n\n### Draft Model Approach\n\n```bash\n# Use smaller model as draft for"
  },
  {
    "path": "10-optimization/gguf/references/troubleshooting.md",
    "chars": 8904,
    "preview": "# GGUF Troubleshooting Guide\n\n## Installation Issues\n\n### Build Fails\n\n**Error**: `make: *** No targets specified and no"
  },
  {
    "path": "10-optimization/gptq/SKILL.md",
    "chars": 11562,
    "preview": "---\nname: gptq\ndescription: Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying larg"
  },
  {
    "path": "10-optimization/gptq/references/calibration.md",
    "chars": 8164,
    "preview": "# GPTQ Calibration Guide\n\nComplete guide to calibration data selection and quantization process.\n\n## Calibration Data Se"
  },
  {
    "path": "10-optimization/gptq/references/integration.md",
    "chars": 2797,
    "preview": "# GPTQ Integration Guide\n\nIntegration with transformers, PEFT, vLLM, and other frameworks.\n\n## Transformers Integration\n"
  },
  {
    "path": "10-optimization/gptq/references/troubleshooting.md",
    "chars": 1899,
    "preview": "# GPTQ Troubleshooting Guide\n\nCommon issues and solutions for GPTQ quantization and inference.\n\n## Installation Issues\n\n"
  },
  {
    "path": "10-optimization/hqq/SKILL.md",
    "chars": 11466,
    "preview": "---\nname: hqq-quantization\ndescription: Half-Quadratic Quantization for LLMs without calibration data. Use when quantizi"
  },
  {
    "path": "10-optimization/hqq/references/advanced-usage.md",
    "chars": 14077,
    "preview": "# HQQ Advanced Usage Guide\n\n## Custom Backend Configuration\n\n### Backend Selection by Hardware\n\n```python\nfrom hqq.core."
  },
  {
    "path": "10-optimization/hqq/references/troubleshooting.md",
    "chars": 11086,
    "preview": "# HQQ Troubleshooting Guide\n\n## Installation Issues\n\n### Package Not Found\n\n**Error**: `ModuleNotFoundError: No module n"
  },
  {
    "path": "10-optimization/ml-training-recipes/SKILL.md",
    "chars": 11407,
    "preview": "---\r\nname: ml-training-recipes\r\ndescription: Battle-tested PyTorch training recipes for all domains — LLMs, vision, diff"
  },
  {
    "path": "10-optimization/ml-training-recipes/references/architecture.md",
    "chars": 10509,
    "preview": "# Architecture Patterns Reference\r\n\r\nDetailed code patterns for modern transformer architectures. Referenced from the ma"
  },
  {
    "path": "10-optimization/ml-training-recipes/references/biomedical.md",
    "chars": 21234,
    "preview": "# Biomedical & Pharmaceutical ML Reference\r\n\r\nModels, architectures, and training patterns specific to biomedical and ph"
  },
  {
    "path": "10-optimization/ml-training-recipes/references/domain-specific.md",
    "chars": 19599,
    "preview": "# Domain-Specific Training Patterns\r\n\r\nPatterns for vision, diffusion, and other non-LLM training scenarios. Referenced "
  },
  {
    "path": "10-optimization/ml-training-recipes/references/experiment-loop.md",
    "chars": 4558,
    "preview": "# Autonomous Experiment Loop (autoresearch pattern)\r\n\r\nA systematic workflow for rapid ML experimentation, drawn from Ka"
  },
  {
    "path": "10-optimization/ml-training-recipes/references/optimizers.md",
    "chars": 10879,
    "preview": "# Optimizer Patterns Reference\r\n\r\nDeep dive into optimizer configurations for modern LLM training. Referenced from the m"
  },
  {
    "path": "10-optimization/ml-training-recipes/references/scaling-and-selection.md",
    "chars": 16864,
    "preview": "# Scaling Laws & Architecture Selection Reference\r\n\r\nDetailed decision frameworks for choosing architectures based on da"
  },
  {
    "path": "11-evaluation/.gitkeep",
    "chars": 164,
    "preview": "# Skills Coming Soon\n\nThis directory will contain high-quality AI research skills for evaluation.\n\nSee [CONTRIBUTING.md]"
  },
  {
    "path": "11-evaluation/bigcode-evaluation-harness/SKILL.md",
    "chars": 11689,
    "preview": "---\nname: evaluating-code-models\ndescription: Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15"
  }
]

// ... and 299 more files (download for full content)

About this extraction

This document contains the full source code of the Orchestra-Research/AI-research-SKILLs GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 499 files (7.4 MB), approximately 2.0M tokens, and a symbol index with 132 extracted functions, classes, methods, constants, and types. Use it with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
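Because the file manifest above is a plain JSON array of {"path", "chars", "preview"} objects, it can be filtered programmatically before handing selected skills to a model. A minimal sketch, assuming the array has been saved locally as manifest.json (a hypothetical filename; GitExtract does not ship this helper):

```python
# Minimal sketch: summarize the GitExtract file manifest shown above.
# Assumes the JSON array of {"path", "chars", "preview"} objects was saved
# to manifest.json; the filename and this workflow are assumptions, not
# part of the GitExtract tool itself.
import json
from collections import defaultdict
from pathlib import Path

with open("manifest.json", encoding="utf-8") as f:
    entries = json.load(f)

# Group entries by their top-level directory, e.g. "08-distributed-training".
by_category = defaultdict(list)
for entry in entries:
    category = Path(entry["path"]).parts[0]
    by_category[category].append(entry)

# One line per category: file count and total size in characters.
for category, files in sorted(by_category.items()):
    total_chars = sum(f["chars"] for f in files)
    print(f"{category}: {len(files)} files, {total_chars:,} chars")

# List every SKILL.md under one category, largest first.
skills = [f for f in by_category["10-optimization"] if f["path"].endswith("SKILL.md")]
for f in sorted(skills, key=lambda f: f["chars"], reverse=True):
    print(f["path"], f["chars"])
```

A filter like this is useful for staying under a model's context limit: rather than pasting the full 2.0M-token extraction, select only the SKILL.md files (or one category's references) and include just those.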

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
