Semantic Image Synthesis via Class-Adaptive Cross-Attention

Bibliographic Details
Main Authors: Tomaso Fontanini, Claudio Ferrari, Giuseppe Lisanti, Massimo Bertozzi, Andrea Prati
Format: Article
Language: English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects: Semantic image synthesis; cross-attention; image editing
Online Access: https://ieeexplore.ieee.org/document/10841835/
author Tomaso Fontanini
Claudio Ferrari
Giuseppe Lisanti
Massimo Bertozzi
Andrea Prati
collection DOAJ
description In semantic image synthesis, the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. As a result, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. In addition, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE to learn shape-style correlations and thus condition the image generation process. Our model inherits the versatility of SPADE while achieving state-of-the-art generation quality, improving the FID score by 5.6%, 1.4% and 3.4% on the CelebMask-HQ, ADE20K and DeepFashion datasets respectively, and also provides improved global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.
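The description above states the core mechanism only in prose; the following is a minimal, illustrative sketch of what a class-adaptive cross-attention conditioning block could look like in PyTorch, assuming spatial generator activations act as attention queries and one learned style code per semantic class acts as keys and values. The module name, tensor shapes, projection layers and residual connection are assumptions made for this example and are not taken from the authors' code, which is available at https://github.com/TFonta/CA2SIS.

import torch
import torch.nn as nn


class ClassAdaptiveCrossAttention(nn.Module):
    """Sketch of cross-attention conditioning: pixels attend over per-class
    style tokens, so styles are injected without SPADE's pixel-wise
    de-normalization parameters (illustrative, not the paper's exact design)."""

    def __init__(self, feat_dim: int, style_dim: int, attn_dim: int = 128):
        super().__init__()
        self.to_q = nn.Conv2d(feat_dim, attn_dim, kernel_size=1)  # queries from generator activations
        self.to_k = nn.Linear(style_dim, attn_dim)                # keys from class style codes
        self.to_v = nn.Linear(style_dim, feat_dim)                # values projected back to the feature dim
        self.scale = attn_dim ** -0.5

    def forward(self, feats: torch.Tensor, class_styles: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) generator activations; class_styles: (B, N, style_dim), one code per class
        b, c, h, w = feats.shape
        q = self.to_q(feats).flatten(2).transpose(1, 2)           # (B, H*W, attn_dim)
        k = self.to_k(class_styles)                               # (B, N, attn_dim)
        v = self.to_v(class_styles)                               # (B, N, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # each pixel weights the N class styles
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)      # fold attended styles back into the spatial grid
        return feats + out                                        # residual conditioning of the activations


if __name__ == "__main__":
    block = ClassAdaptiveCrossAttention(feat_dim=64, style_dim=256)
    feats = torch.randn(2, 64, 32, 32)     # dummy generator features
    styles = torch.randn(2, 19, 256)       # e.g. 19 face-parsing classes for a CelebA-style mask
    print(block(feats, styles).shape)      # torch.Size([2, 64, 32, 32])

Because each style in this sketch is attached to a class token rather than to mask-derived, pixel-wise de-normalization parameters, swapping a single token changes only the appearance of the corresponding semantic region, which mirrors, at a high level, the local style editing and shape-style decoupling described in the abstract.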
format Article
id doaj-art-b2ae68633daf4a45847e3354ddd1a040
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling Semantic Image Synthesis via Class-Adaptive Cross-Attention. Tomaso Fontanini (https://orcid.org/0000-0001-6595-4874), Claudio Ferrari (https://orcid.org/0000-0001-9465-6753), Giuseppe Lisanti (https://orcid.org/0000-0002-0785-9972), Massimo Bertozzi (https://orcid.org/0000-0003-1463-5384), Andrea Prati (https://orcid.org/0000-0002-1211-529X). IEEE Access, vol. 13, pp. 10326-10339, 2025-01-01. ISSN 2169-3536. DOI: 10.1109/ACCESS.2025.3529216. IEEE document 10841835. Affiliations: Fontanini, Ferrari, Bertozzi and Prati with the Department of Architecture and Engineering, University of Parma, Parma, Italy; Lisanti with the Department of Computer Science and Engineering, University of Bologna, Bologna, Italy. Online access: https://ieeexplore.ieee.org/document/10841835/. Subjects: Semantic image synthesis; cross-attention; image editing. Record doaj-art-b2ae68633daf4a45847e3354ddd1a040, indexed 2025-01-21T00:01:07Z.
title Semantic Image Synthesis via Class-Adaptive Cross-Attention
topic Semantic image synthesis
cross-attention
image editing
url https://ieeexplore.ieee.org/document/10841835/