PorTAL: Portable Task Adapters for LLMs

@RampLabs
July 1, 2026

TL;DR

PorTAL is a base-agnostic hypernetwork architecture that enables the transfer of LLM task adaptations across different models, significantly reducing the cost and data required for fine-tuning.

Researcher: Ben Geist

Abstract

Parameter-efficient fine-tuning (e.g. LoRA) adapts a frozen LLM to a task, but the resulting adapter is locked to one base model. When a new model is released, the adaptation must be relearned from scratch. We study portable task adaptation: learning a task adaptation once, in a base-agnostic form, and transferring it to new frozen models by refitting only a small per model component. Concretely, we learn a base-agnostic task latent z_t and a hypernetwork decoder D_b that generates per-layer LoRA adapters for a frozen base. The decoder is made of a base-agnostic shared core and a thin per base converter. To port to a new base, we freeze z_t and the shared core and refit only the converter on a small amount of data.

This architecture, which we name PorTAL, recovers the accuracy lift of per task LoRA both within a model family and, more strikingly, across model families. We illustrate this by freezing a task latent and shared core decoder learned on Qwen3-1.7B and 4B, then refitting only a thin per base converter and recovering ~98% of LoRA's accuracy gain on an unseen Qwen3-8B, and ~94% on Gemma-3-4B. It far outperforms the current portable task adaptation methods: the Cross-LoRA baseline recovers only ~14% of the gain on the unseen Qwen3-8B, versus our 98%. Additionally the refit is data efficient: PorTAL reaches the from scratch LoRA accuracy plateau with roughly half as much calibration data, and at equal accuracy is consistently better calibrated (lower held-out log-loss) than a from scratch LoRA at every data size. This considerably cuts the FLOPs needed to fine-tune subsequent base models.

1. Introduction & Motivation

New language models arrive at an accelerating pace: the number of notable foundation models released per year rose from 2 in 2020 to 9 in 2021, 32 in 2022, and 149 in 2023 [1], and by 2024-2025 the cadence of SOTA releases had compressed so far that the SOTA model only held the top of the public leaderboard for ~35 days on average, down from nearly a year for GPT-4 [2].

Adapting a model to a task, however, is a per model cost that does not amortize across these releases. A fine-tune (full or LoRA) is locked to one base model's weight space; when the next model ships, the adaptation must be redone on the new base. Parameter efficient methods lowered the unit cost (a LoRA on a 7B model runs ~$1-3k vs ~$12k for full fine-tuning [3]) but not its structure: you still pay for data curation + a training run + evaluation once per (task, model), and full fine-tuning cost still scales with ever growing model size [4].

The result is that the cost of maintaining a portfolio of fine-tuned capabilities on the current frontier model roughly scales inversely with time between model releases. Re-tuning per model becomes the dominant, ever growing cost of keeping a system specialized while also gaining the raw intelligence of each newer, smarter base.

Our answer is to pay for task adaptation once and amortize it across every future base. Inspired by the Platonic Representation Hypothesis [5], we learn the adaptation in a base-agnostic form and carry it to each new model by refitting only a light per base map on a handful of examples.

2. Related work

Our contribution combines ideas from three lines of work, which we review here.

Single-base LoRA generation via hypernetworks

Text-to-LoRA [6], in-context SHINE [7], and Profile-to-PEFT [8] amortize per task or per user adaptation into a single forward pass, but target a fixed base and generalize across tasks or users, not across models (Text-to-LoRA explicitly leaves cross-model transfer open).

Cross-architecture LoRA generation

LoRAGen [9] uses a structural embedding (latent + module/layer embeddings) to emit LoRA for different bases, but is trained by reconstructing existing LoRAs; we share its decoder shape but train end-to-end on task loss, and crucially, freeze a shared task latent and a shared core, refitting only a thin per base converter to reach an unseen base.

Cross-model LoRA transfer

Cross-LoRA [10], LoRA-X [11], and CAST [12] target the same goal we do, but by translating one already trained adapter via subspace or activation manifold alignment. We instead learn a base-agnostic latent and recalibrate the converter per base. We find this small calibration step is important. Cross-LoRA, which transfers an existing adapter without refitting, recovers only ~14% of LoRA's lift on the unseen 8B, versus our ~98% (§6.2).

In short, single-base LoRA generation, cross-architecture generation, and cross-model transfer all have prior art. Our contribution combines them into one recipe that learns a shared task latent and core, freezes them, and refits only a thin per base converter to reach a new base. We frame this as a maintenance cost answer to an accelerating model release cadence, and show it dominates the cross-model transfer line empirically.

3. Background: LoRA and LoRA hypernetworks

LoRA [13]. For a frozen weight matrix, LoRA learns a low-rank update built from two small matrices A and B of rank r ; only these two matrices train:

$$\Delta W = \tfrac{\alpha}{r} B A, \qquad A \in \mathbb{R}^{r\times d_{in}},\; B \in \mathbb{R}^{d_{out}\times r},\; r \ll d, \qquad y = Wx + \tfrac{\alpha}{r} B(Ax)$$
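As a minimal NumPy sketch of the update above, the adapted output can be computed without materializing ΔW. The dimensions, the random weights, and the helper name `lora_forward` are illustrative assumptions, not values from this work.

```python
import numpy as np

def lora_forward(W, A, B, x, alpha, r):
    """Return W x + (alpha / r) * B (A x) for a frozen W and trainable A, B."""
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 48, 8, 16    # illustrative sizes only

W = rng.normal(size=(d_out, d_in))       # frozen base weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, shape r x d_in
B = np.zeros((d_out, r))                 # trainable, zero-init so delta W = 0 at start
x = rng.normal(size=d_in)

y0 = lora_forward(W, A, B, x, alpha, r)  # equals W @ x while B is all zeros

B = rng.normal(size=(d_out, r)) * 0.01   # stand-in for a trained B
delta_W = (alpha / r) * (B @ A)          # explicit low-rank update, rank at most r
y1 = lora_forward(W, A, B, x, alpha, r)

lora_params = A.size + B.size            # r * (d_in + d_out)
full_params = W.size                     # d_in * d_out
```

In this toy setting the adapter holds 896 parameters against 3,072 in W; the gap widens as d grows relative to r.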

LoRA hypernetworks. Rather than training A and B directly, a hypernetwork generates them from a conditioning input. Text-to-LoRA [6] trains a hypernetwork to emit a full LoRA for a single base model from a task description embedding, end-to-end through the frozen base. This trains one hypernetwork instead of a separate LoRA for every task, but it stays single-base, generalizing across tasks, not across models. Our design borrows the hypernetwork LoRA generation idea but targets a different goal, cross-base transfer of a shared, learned task representation.

4. Method

Design. Our goal is a task adaptation that is learned once and ported cheaply to new frozen models. We split the adapter generator into two parts: a large base-agnostic core decoder, shared across all models, that emits low-rank factors at a fixed core width d_c ; and a thin per base converter that conditions the shared core's inputs and projects its outputs to a specific model's dimensions. We train on one or more frozen bases, then port to an unseen model by refitting only this small per base converter.

This amortizes the learned adaptation into a shared representation and makes each new base cheap to support. By construction, the shared latent and core hold most of the parameters and absorb both the task representation and the bulk of its mapping into adapter space; only a small converter remains model-specific. We define the components below.

Setup. Let a frozen base b have transformer layers ℓ = 1, …, L_b with per-layer weight matrices W_{ℓ,m} at the adapted modules m ∈ {q_proj, v_proj} (we extend m to all attention and MLP projections in the full-module variant). Let θ_b denote the frozen base parameters.

Task latent. Each task t is mapped to a learned task latent z_t, a base-agnostic vector of dimension d_z = 256.

Decoder. Our hypernetwork D_b is composed of a base-agnostic core decoder and a thin per-base converter; it maps the task latent z_t and a per-layer embedding e_ℓ to each module's LoRA factors:

$$(A_{\ell,m},\, B_{\ell,m}) = D_b(z_t, e_\ell, m), \qquad A_{\ell,m}\in\mathbb{R}^{r\times d^{in}_\ell},\; B_{\ell,m}\in\mathbb{R}^{d^{out}_{\ell,m}\times r}$$

Internally, we condition a single shared trunk with FiLM. The trunk takes the per-layer embedding e_ℓ as input, while the task latent z_t scales and shifts its hidden features. This produces a per-layer hidden state:

$$h_\ell = \phi\big(W_2\,\big[(1+\gamma(z_t))\odot \psi(W_1[z_t; e_\ell]) + \beta(z_t)\big]\big).$$

Per-module heads then map this hidden state to core-width factors:

$$\hat A_{\ell,m} = \mathrm{Head}^{A}_{m}(h_\ell) \in \mathbb{R}^{r\times d_c}, \qquad \hat B_{\ell,m} = \mathrm{Head}^{B}_{m}(h_\ell) \in \mathbb{R}^{d_c\times r}.$$

Finally, an aligner projects them to the base's dimensions via per-module linear maps:

$$A_{\ell,m} = \hat A_{\ell,m}\,P^{in}_b, \qquad B_{\ell,m} = P^{out}_b\,\hat B_{\ell,m}.$$

The generated adapter is injected as a standard LoRA delta:

$$y_{\ell,m} = W_{\ell,m}\,x + \tfrac{\alpha}{r}\, B_{\ell,m}\,(A_{\ell,m}\,x).$$
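The decoder pipeline above (FiLM-conditioned trunk, per-module heads, aligner, LoRA injection) can be sketched in NumPy for a single layer and module. This is a shape-level illustration under stated assumptions, not the authors' implementation: only d_z = 256 and the zero initialization of the B-heads and FiLM γ, β come from the text. The hidden width `d_h`, embedding width `d_e`, core width value, the use of tanh for both ϕ and ψ, and the linear forms of γ, β, and the heads are all assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_e, d_h, d_c, r = 256, 32, 128, 64, 8   # d_z from the text; the rest are assumed
d_in, d_out, alpha = 96, 80, 16               # one hypothetical base module

def init(*shape):
    return rng.normal(size=shape) * 0.02

# Shared core (base-agnostic): trunk, FiLM generators, per-module heads.
W1, W2 = init(d_h, d_z + d_e), init(d_h, d_h)
W_gamma, W_beta = np.zeros((d_h, d_z)), np.zeros((d_h, d_z))  # zero-init FiLM
head_A = init(r * d_c, d_h)
head_B = np.zeros((d_c * r, d_h))                              # zero-init B head

# Per-base converter: layer embedding and aligner projections.
e_l = init(d_e)
P_in, P_out = init(d_c, d_in), init(d_out, d_c)

def decode(z_t, e_l):
    gamma, beta = W_gamma @ z_t, W_beta @ z_t
    pre = np.tanh(W1 @ np.concatenate([z_t, e_l]))    # psi(W1 [z_t; e_l])
    h = np.tanh(W2 @ ((1.0 + gamma) * pre + beta))    # FiLM, then phi(W2 ...)
    A_hat = (head_A @ h).reshape(r, d_c)              # core-width factor, r x d_c
    B_hat = (head_B @ h).reshape(d_c, r)              # core-width factor, d_c x r
    return A_hat @ P_in, P_out @ B_hat                # aligner to base dimensions

z_t = init(d_z)                                       # learned task latent
A, B = decode(z_t, e_l)

W = init(d_out, d_in)                                 # frozen base weight
x = rng.normal(size=d_in)
y = W @ x + (alpha / r) * (B @ (A @ x))               # standard LoRA injection
```

Because the B head starts at zero, the generated adapter leaves the base output unchanged at initialization, so training starts from the frozen model's behavior.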

Training. We train {z_t} and D_b while keeping the base parameters θ_b frozen. We minimize the gold-continuation NLL (loss only on answer tokens):

$$\min_{\{z_t\},\, D_b}\; \sum_{t}\, \mathbb{E}_{(x,y)\sim \mathcal{D}^{train}_t}\big[-\log p_{\,\theta_b\,\oplus\, D_b(z_t)}(y \mid x)\big].$$

Multi-task training uses balanced per task steps with EMA loss normalization to keep hard tasks from collapsing to chance.
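One way to read the balancing scheme is a round-robin task schedule in which each task's loss is divided by a running average of its own recent losses. The sketch below assumes that form. The 0.9 / 0.1 weights and the 1e-3 floor match Appendix A, while the class name, the seeding of the average, and the division itself are assumptions rather than details confirmed by this work.

```python
from itertools import cycle

class EmaLossNormalizer:
    """Divide each task's loss by an EMA of that task's past losses (assumed form)."""

    def __init__(self, decay=0.9, floor=1e-3):
        self.decay, self.floor = decay, floor
        self.ema = {}

    def __call__(self, task, loss):
        prev = self.ema.get(task, loss)  # seed the EMA with the first observed loss
        self.ema[task] = self.decay * prev + (1.0 - self.decay) * loss
        # In a real training loop the denominator would be treated as a constant
        # (no gradient flows through it); the floor avoids division by ~0.
        return loss / max(self.ema[task], self.floor)

normalize = EmaLossNormalizer()
tasks = ["easy_task", "hard_task"]                   # hypothetical task names
raw_losses = {"easy_task": 0.05, "hard_task": 2.0}   # made-up loss values

schedule = cycle(tasks)                              # balanced per task steps
scaled = {}
for _ in range(6):
    task = next(schedule)
    scaled[task] = normalize(task, raw_losses[task])
```

With steady losses both tasks end up with a scaled loss near 1, so a task with a large raw loss does not swamp the shared parameters and a task with a small one is not ignored.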


Multi-base training. When we train on several bases at once, a small base can dominate the shared latent's gradient. We apply gradient-norm balancing on z_t, rescaling each base's accumulated gradient to equal norm before the optimizer step, so every base contributes equally to the shared representation.
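A minimal sketch of this rescaling, assuming the common target is the mean of the per-base gradient norms (the text says only "equal norm", so the target, the `eps` guard, and the function name are assumptions):

```python
import numpy as np

def balance_gradients(grads_by_base, eps=1e-12):
    """Rescale each base's accumulated gradient on z_t to a common norm."""
    norms = {b: float(np.linalg.norm(g)) for b, g in grads_by_base.items()}
    target = sum(norms.values()) / len(norms)   # assumed target: mean norm
    return {b: g * (target / (norms[b] + eps)) for b, g in grads_by_base.items()}

# Made-up gradients pointing the same way but differing 10x in magnitude.
grads = {
    "qwen3-1.7b": np.array([3.0, 4.0]),   # norm 5.0
    "qwen3-4b": np.array([0.3, 0.4]),     # norm 0.5
}
balanced = balance_gradients(grads)
combined = sum(balanced.values())          # gradient handed to the optimizer step
```

After balancing, both bases contribute a gradient of norm 2.75, so neither dominates the update to the shared latent.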

Porting. Given an unseen base b', we freeze the core decoder and {z_t} and refit only the per-base converter {e_ℓ, P^in_b', P^out_b'} on a small calibration set:

$$\min_{\{e_\ell\},\, P^{in}_{b'}, P^{out}_{b'}}\; \sum_t \mathbb{E}_{(x,y)\sim \mathcal{D}^{port}_t}\big[-\log p_{\,\theta_{b'}\,\oplus\, D_{b'}(z_t)}(y\mid x)\big].$$
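In practice the porting objective amounts to choosing which parameter groups receive gradients. The sketch below shows that partition with plain strings. Every parameter name and prefix is hypothetical, since the original does not describe a naming scheme.

```python
# Hypothetical parameter names; only the grouping mirrors the text.
CONVERTER_PREFIXES = ("layer_emb.", "P_in.", "P_out.")

def split_for_porting(param_names):
    """Return (trainable, frozen) names when porting to an unseen base."""
    trainable = [n for n in param_names if n.startswith(CONVERTER_PREFIXES)]
    frozen = [n for n in param_names if not n.startswith(CONVERTER_PREFIXES)]
    return trainable, frozen

params = [
    "base.layers.0.q_proj.weight",  # frozen base parameters (theta_b')
    "task_latent.boolq",            # shared task latent z_t, frozen
    "core.trunk.W1",                # shared core decoder, frozen
    "core.trunk.W2",
    "core.film.gamma",
    "core.film.beta",
    "core.head_A.q_proj",
    "core.head_B.q_proj",
    "layer_emb.0",                  # per-base converter: e_l, refit
    "P_in.q_proj",                  # per-base converter: P_in, refit
    "P_out.q_proj",                 # per-base converter: P_out, refit
]
trainable, frozen = split_for_porting(params)
```

Only the three converter entries would be passed to the optimizer; everything else keeps the values learned on the seen bases.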


5. Experimental setup

Tasks (14, standard multiple-choice). TruthfulQA, RTE, CB, COPA, WiC, WSC (SuperGLUE + TruthfulQA; higher-headroom), and BoolQ, ARC-Easy, ARC-Challenge, HellaSwag, OpenBookQA, WinoGrande, CommonsenseQA, SciQ (broader/bigger-eval).

Metric. Length-normalized log-likelihood over choices (acc_norm); we also report held-out log-loss (token-mean NLL of the gold continuation). §6.1–6.3 use best-epoch held-out selection (per-epoch eval) while §6.4 uses final-epoch eval. All are 3-seed means ± std.
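Both metrics can be sketched as follows. The snippet assumes each answer choice already has a summed log-likelihood from the model and a length to normalize by. Whether that length is counted in tokens or bytes is not stated here, and the function names and numbers are illustrative.

```python
def pick_choice(logprobs, lengths):
    """Index of the choice with the highest length-normalized log-likelihood."""
    scores = [lp / n for lp, n in zip(logprobs, lengths)]
    return max(range(len(scores)), key=lambda i: scores[i])

def acc_norm(examples):
    """Fraction of examples whose normalized argmax equals the gold index."""
    hits = sum(pick_choice(ex["logprobs"], ex["lengths"]) == ex["gold"] for ex in examples)
    return hits / len(examples)

def token_mean_nll(gold_token_logprobs):
    """Held-out log-loss: mean negative log-likelihood over gold answer tokens."""
    return -sum(gold_token_logprobs) / len(gold_token_logprobs)

# Made-up scores. In the first example the raw log-likelihood favors choice 1
# (-9.0 > -12.0), but per-unit-length scores (-2.0 vs -3.0) favor choice 0.
examples = [
    {"logprobs": [-12.0, -9.0], "lengths": [6, 3], "gold": 0},
    {"logprobs": [-4.0, -10.0, -9.0], "lengths": [2, 4, 3], "gold": 1},
]
score = acc_norm(examples)
loss = token_mean_nll([-0.5, -1.5])
```

Normalizing by length keeps short answer choices from winning simply because they accumulate fewer log-probability terms.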

Data. Up to 2,000 examples/task — a hard cap applied to both source training and the per base converter refit. §6.1–6.3 fit on the full 2,000/task; the data-efficiency study (§6.4) shows far fewer suffices. Evaluation sets range from 56 (CB) to 1,000 (BoolQ/WinoGrande/CSQA/SciQ); ~7,200 eval examples total on the 14-task suite.

Models. Seen bases: Qwen3-1.7B, Qwen3-4B. Unseen bases: Qwen3-8B and Gemma-3-4B. Per task LoRA baselines: rank 16 on q/k/v/o + MLP. LoRA Hypernet/PorTAL (§6.1–6.3): rank 8 on q/v.

Experiments reported. (i) LoRA Hypernet vs per task LoRA; (ii) portability to unseen bases within and across families; and (iii) data efficiency of the converter refit.

6. Results

6.1 Source base

| Method | Avg acc_norm (14 tasks) |
|---|---|
| Base | 0.627 |
| Per task LoRA | 0.765 ± 0.003 |
| LoRA Hypernet (jointly train z_4B, D_4B) | 0.757 ± 0.003 |

We first confirm that a learned task latent z and a decoder, trained jointly on the source base, can match per task LoRAs trained independently on the same base. The generated LoRA Hypernet recovers ~94% of per task LoRA's lift on average and matches or beats it on 6/14 tasks (RTE, CB, COPA, WiC, ARC-Easy, CommonsenseQA).

6.2 Within-family portability

| Method (on unseen 8B) | Avg acc_norm | Recovered lift |
|---|---|---|
| Base-8B | 0.667 | |
| Per task 8B LoRA | 0.795 ± 0.004 | 100% |
| Cross-LoRA transfer | 0.685 ± 0.001 | ~14% |
| LoRA Hypernet (jointly train z_8B, D_8B) | 0.785 ± 0.002 | ~92% |
| PorTAL (frozen z_(1.7B+4B), refit D_8B) | 0.792 ± 0.004 | ~98% |

We then test portability directly. We freeze the latent and core decoder, learned jointly on Qwen3-1.7B and 4B, and refit only the thin converter on an unseen base. On an unseen Qwen3-8B this recovers ~98% of the per task LoRA's lift, far above the ~14% recovered by Cross-LoRA, the comparable cross-model transfer method. Interestingly, training the latent and decoder jointly on Qwen3-8B reaches 0.785 (~92%), statistically on par with the ported latent, but slightly lower. We attribute PorTAL’s slightly higher performance to mild regularization across the multiple seen bases.

6.3 Cross-family portability

| Unseen target | Base | Per task LoRA | PorTAL | Recovered lift |
|---|---|---|---|---|
| Gemma-3-4B | 0.595 | 0.778 ± 0.004 | 0.767 ± 0.004 | ~94% |

We then test cross-family transfer. We freeze the latent and core decoder trained on Qwen3-1.7B and 4B and refit the converter on Gemma-3-4B. This recovers ~94% of the from scratch LoRA's lift, so cross-family transfer gives up only about 6% of the gain.

6.4 Data efficiency

PorTAL amortizes task adaptation: a latent and core learned once on the seen bases should make every subsequent model cheap to adapt, so porting to a new base needs far less data than training a LoRA from scratch. We show this on the unseen Qwen3-8B, sweeping the per task set size for PorTAL q/v r8, PorTAL full r8, and per task r16-full LoRA. For PorTAL this set is the calibration set it refits the converter on; for the from scratch LoRA it is the training set.

Raw 14-task averages, base-8B acc 0.667 / log-loss 3.819:

[Two plots, not reproduced in this copy.]

In both plots, curves are a rolling average over a window of 3, and stars mark where each method first reaches per task LoRA's peak.

PorTAL is substantially more data efficient. It matches per task LoRA's best accuracy using roughly half the data, and consistently beats it in the high data range. Because the frozen base dominates per step cost, reaching the plateau with half the data roughly halves the adaptation FLOPs. PorTAL is also better calibrated, with lower held-out log-loss than from scratch LoRA at every data size.

Note: We compare against r16-full LoRA throughout because we found it to be the strongest per task LoRA configuration in our sweep.

7. Future work

Gradient competition on hard tasks. Under best epoch selection most tasks reach LoRA's lift, but a few harder commonsense and knowledge tasks underfit, the worst being OpenBookQA (~42% of lift), WinoGrande (~57%), and HellaSwag (~61%). These are the most distinct tasks, and because the rank-8 decoder is shared across the suite, their gradients are outweighed by the others and they remain under-fit. We hypothesize that the root cause is optimization, not limited adapter expressiveness, since neither a larger rank-16 adapter nor a larger task latent helped. In future work we hope to pursue better multi-task optimization, such as per task capacity or curriculum, or a small per task residual on top of the shared decoder.

Amortized text-description variant. A natural extension replaces the free per task latent with an encoder over a task description, z_t = E(emb(desc_t)), so a brand-new task could be adapted zero-shot from its description alone (à la Text-to-LoRA), with no per task training. We leave a full study to future work.

Other directions. Larger and instruction/generation tasks beyond multiple-choice; and theory on when a frozen latent suffices vs. when base-specific adaptation is required.


References

  1. Stanford HAI — AI Index Report 2024 (foundation-model release counts). https://www.deeplearning.ai/the-batch/stanford-ai-index-report-shows-the-state-of-ai-in-2024
  2. Chiang et al. — Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (ICML 2024). https://arxiv.org/abs/2403.04132. Turnover statistic (~35 days at #1) from the Arena Leaderboard Dataset, Arena (2025). https://arena.ai/blog/arena-leaderboard-dataset/
  3. Stanford HAI — AI Index Report 2025. https://hai.stanford.edu/ai-index/2025-ai-index-report
  4. Alloc Labs — The Hidden Cost of LLM Fine-Tuning. https://www.alloclabs.com/blog/hidden-cost-llm-finetuning
  5. Huh et al. — The Platonic Representation Hypothesis (2024). https://arxiv.org/abs/2405.07987
  6. Charakorn et al. — Text-to-LoRA: Instant Transformer Adaptation (ICML 2025). https://openreview.net/forum?id=zWskCdu3QA
  7. Liu et al. — SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA (2026). https://arxiv.org/abs/2602.06358
  8. Tan et al. — Instant Personalized LLM Adaptation via Hypernetwork (Profile-to-PEFT) (2025). https://arxiv.org/abs/2510.16282
  9. Huang et al. — LoRAGen: Structure-Aware LoRA Weight Generation. https://openreview.net/pdf?id=mrafO7aTYj
  10. Xia et al. — Cross-LoRA: A Data-Free LoRA Transfer Framework across Heterogeneous LLMs (2025). https://arxiv.org/abs/2508.05232
  11. Farhadzadeh et al. — LoRA-X: Bridging Foundation Models with Training-Free Cross-Model Adaptation (2025). https://arxiv.org/abs/2501.16559
  12. Al Kari — CAST: Activation Manifold Projection (Cartridge Activation Space Transfer) (2025). https://arxiv.org/abs/2510.17902
  13. Hu et al. — LoRA: Low-Rank Adaptation of Large Language Models (2021). https://arxiv.org/abs/2106.09685

Appendix

A. Training & hyperparameters

| Setting | Value |
|---|---|
| Optimizer | AdamW |
| LR (decoder / latent) | 1e-3 / 2e-3 |
| Epochs / batch size | 5 / 4 |
| Multi-task balancing | balanced per task steps + EMA loss-normalization (0.9 / 0.1) with a 1e-3 floor for stability |
| Per task LoRA baseline | peft, rank 16, alpha 32, lr 1e-4, 5 epochs (best-epoch selection), modules q/k/v/o + MLP |
| Initialization | B-heads and FiLM γ, β zero-initialized, so the generated adapter is the identity (ΔW = 0) at start |
| Hardware | single NVIDIA B200 (per run) |

B. Metrics

We report recovered lift while prior cross-model-transfer papers (Cross-LoRA, CAST) instead report retention. For a method m, unadapted base b, and from scratch per task LoRA L:

$$\text{recovered lift} = \frac{\mathrm{acc}_m - \mathrm{acc}_b}{\mathrm{acc}_L - \mathrm{acc}_b}, \qquad \text{retention} = \frac{\mathrm{acc}_m}{\mathrm{acc}_L}.$$

Retention is near 100% whenever there is little headroom, the regime those papers operate in (their trained LoRA adds only ~1% over base), so it is not discriminative. We evaluate in a higher headroom setting and therefore use recovered lift. For comparability, in retention terms our Cross-LoRA reimplementation scores ~86% (within CAST's reported 85-95% band) while recovering only ~14% of the lift, whereas our porting scores ~99% retention / ~98% recovered lift.
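The contrast between the two metrics can be checked directly on the §6.2 averages. The short script below uses the reported 8B accuracies (base 0.667, per task LoRA 0.795, Cross-LoRA 0.685, PorTAL 0.792); the function and variable names are our own for illustration.

```python
def recovered_lift(acc_method, acc_base, acc_lora):
    """Share of the LoRA gain over the unadapted base that a method recovers."""
    return (acc_method - acc_base) / (acc_lora - acc_base)

def retention(acc_method, acc_lora):
    """Method accuracy as a fraction of the from scratch LoRA accuracy."""
    return acc_method / acc_lora

acc_base, acc_lora = 0.667, 0.795                 # Qwen3-8B averages from §6.2
methods = {"Cross-LoRA transfer": 0.685, "PorTAL": 0.792}

report = {
    name: (recovered_lift(acc, acc_base, acc_lora), retention(acc, acc_lora))
    for name, acc in methods.items()
}

for name, (lift, kept) in report.items():
    print(f"{name}: recovered lift {lift:.1%}, retention {kept:.1%}")
```

Retention separates the two methods by roughly 13 points, while recovered lift separates them by roughly 84, which is why the latter is the more discriminative measure in a high-headroom setting.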

Cite this work

APA

Geist, B. (2026). PorTAL: Portable Task Adapters for LLMs. Ramp Labs. https://labs.ramp.com/research

BibTeX

@techreport{portal2026ramplabs,
  author = {Geist, Ben},
  title = {PorTAL: Portable Task Adapters for LLMs},
  year = {2026},
  month = {June},
  institution = {Ramp Labs},
  url = {https://labs.ramp.com/research}
}
