Proceedings of the International Workshop on Advances in Deep Learning for Image Analysis and Computer Vision (IWADIC 2025)

Adapter-Fusion: A Practical, Parameter-Efficient Framework for Composable Control in Text-to-Image Diffusion

Authors
Yunzhong Zheng1, *
1College of Art and Science, New York University, New York, 10003, USA
*Corresponding author. Email: Zyz5678@outlook.com
Available Online 24 April 2026.
DOI
10.2991/978-94-6239-648-7_92
Keywords
Diffusion Models; Controllable Generation; Parameter-Efficient Fine-Tuning (PEFT); LoRA; Multi-Control Composition
Abstract

The surge of text-to-image diffusion models marks an innovative step in the development of generative artificial intelligence, yet a lack of precise control remains a critical constraint when these models are deployed in production. Existing methods introduce single control modalities, and naïvely combining multiple adapters can cause "signal interference," in which the effects of different adapters degrade one another and worsen the result. This paper introduces Adapter-Fusion, a novel framework that aims to achieve both high-accuracy generation and computational efficiency. Adapter-Fusion adopts a frozen-backbone philosophy: it incorporates Control-LoRA, which governs spatial structure, and IP-Adapter, which governs stylistic content, without altering their pretrained weights. The central innovation is the "Composer" module, which deploys a gated LoRA-switching mechanism that predicts block-wise gating coefficients and routes the resulting signals into the layers of the U-Net, with the goal of decoupling spatial- and temporal-domain signals. Validation uses a synthetic dataset constructed specifically to isolate the interactions between different controls. Adapter-Fusion obtains a superior balance of precision, attaining high CLIP-I and CLIP-T scores while keeping RMSE robust, and it generates images at relatively high speed on consumer-level hardware, surpassing the guidance baselines. Adapter-Fusion is thus a practical solution for multi-modal control.
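As a rough illustration of the gated LoRA-switching idea described in the abstract, the sketch below shows one way a lightweight "Composer" network could predict per-block gating coefficients that scale low-rank adapter deltas added on top of a frozen layer. This is a minimal PyTorch sketch under stated assumptions: the class names (`LoRADelta`, `Composer`, `GatedBlock`), the two-branch spatial/style split, the MLP shape, and the sigmoid gating are all hypothetical choices for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class LoRADelta(nn.Module):
    """Low-rank update B(A(x)) applied alongside a frozen layer."""
    def __init__(self, dim_in, dim_out, rank=4):
        super().__init__()
        self.A = nn.Linear(dim_in, rank, bias=False)
        self.B = nn.Linear(rank, dim_out, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op, standard for LoRA

    def forward(self, x):
        return self.B(self.A(x))

class Composer(nn.Module):
    """Hypothetical gating network: maps a conditioning vector to
    per-block, per-adapter coefficients in (0, 1)."""
    def __init__(self, cond_dim, num_blocks, num_adapters=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cond_dim, 128), nn.SiLU(),
            nn.Linear(128, num_blocks * num_adapters),
        )
        self.num_blocks = num_blocks
        self.num_adapters = num_adapters

    def forward(self, cond):
        gates = torch.sigmoid(self.mlp(cond))
        return gates.view(-1, self.num_blocks, self.num_adapters)

class GatedBlock(nn.Module):
    """A frozen linear layer plus gated spatial/style LoRA branches."""
    def __init__(self, frozen_linear, rank=4):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad_(False)  # backbone stays untouched
        d_in, d_out = frozen_linear.in_features, frozen_linear.out_features
        self.spatial_lora = LoRADelta(d_in, d_out, rank)  # e.g. Control-LoRA role
        self.style_lora = LoRADelta(d_in, d_out, rank)    # e.g. IP-Adapter role

    def forward(self, x, g_spatial, g_style):
        # Gating lets the Composer attenuate one branch where the
        # two control signals would otherwise interfere.
        return (self.frozen(x)
                + g_spatial * self.spatial_lora(x)
                + g_style * self.style_lora(x))
```

In this sketch, setting a gate to zero exactly recovers the frozen backbone's output for that branch, which is one plausible way to realize the paper's stated goal of composing adapters without mutual degradation.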

Copyright
© 2026 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


Volume Title
Proceedings of the International Workshop on Advances in Deep Learning for Image Analysis and Computer Vision (IWADIC 2025)
Series
Advances in Computer Science Research
Publication Date
24 April 2026
ISBN
978-94-6239-648-7
ISSN
2352-538X
DOI
10.2991/978-94-6239-648-7_92

Cite this article

TY  - CONF
AU  - Yunzhong Zheng
PY  - 2026
DA  - 2026/04/24
TI  - Adapter-Fusion: A Practical, Parameter-Efficient Framework for Composable Control in Text-to-Image Diffusion
BT  - Proceedings of the International Workshop on Advances in Deep Learning for Image Analysis and Computer Vision (IWADIC 2025)
PB  - Atlantis Press
SP  - 853
EP  - 861
SN  - 2352-538X
UR  - https://doi.org/10.2991/978-94-6239-648-7_92
DO  - 10.2991/978-94-6239-648-7_92
ID  - Zheng2026
ER  -