Semantic Image Synthesis via Class-Adaptive Cross-Attention
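In semantic image synthesis, the state of the art is dominated by methods that use customized variants of SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE to learn shape-style correlations and thereby condition the image generation process. Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality, improving the FID score by 5.6%, 1.4% and 3.4% on the CelebMask-HQ, Ade20k and DeepFashion datasets respectively, and also improves global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.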
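As a rough illustration of the mechanism the abstract names, the sketch below shows plain scaled dot-product cross-attention in which each pixel feature (query) attends over a small set of per-class style codes (keys and values). It is not the authors' implementation, which is available from the repository cited in this record; the function name `cross_attention`, the variables `pixel_features`, `class_keys` and `class_styles`, and all toy values are assumptions made for this example only.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: N x d list (here: one feature vector per pixel, hypothetical)
    keys:    C x d list (here: one key per semantic class, hypothetical)
    values:  C x dv list (here: one style code per semantic class, hypothetical)
    Returns (N x dv outputs, N x C attention weights).
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Similarity of this pixel to every class key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        w = softmax(scores)
        # Weighted mix of the class style codes for this pixel.
        out = [sum(wc * v[j] for wc, v in zip(w, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data (assumed): 3 pixels, 2 classes, 3-dimensional style codes.
pixel_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
class_keys = [[4.0, 0.0], [0.0, 4.0]]
class_styles = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

out, attn = cross_attention(pixel_features, class_keys, class_styles)
for pixel, w in zip(pixel_features, attn):
    print(pixel, [round(x, 3) for x in w])
```

In this toy setup every pixel receives a soft mixture of all class styles instead of a single per-pixel modulation looked up from a mask, which is the kind of shape-style correlation the abstract attributes to cross-attention. How the published model builds its queries, keys and values is not stated in this record.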
Main Authors: | Tomaso Fontanini, Claudio Ferrari, Giuseppe Lisanti, Massimo Bertozzi, Andrea Prati |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Semantic image synthesis; cross-attention; image editing |
Online Access: | https://ieeexplore.ieee.org/document/10841835/ |
_version_ | 1832592949899165696 |
---|---|
author | Tomaso Fontanini; Claudio Ferrari; Giuseppe Lisanti; Massimo Bertozzi; Andrea Prati |
author_sort | Tomaso Fontanini |
collection | DOAJ |
description | In semantic image synthesis, the state of the art is dominated by methods that use customized variants of SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE to learn shape-style correlations and thereby condition the image generation process. Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality, improving the FID score by 5.6%, 1.4% and 3.4% on the CelebMask-HQ, Ade20k and DeepFashion datasets respectively, and also improves global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS. |
format | Article |
id | doaj-art-b2ae68633daf4a45847e3354ddd1a040 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-b2ae68633daf4a45847e3354ddd1a040; 2025-01-21T00:01:07Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; vol. 13; pp. 10326-10339; 10.1109/ACCESS.2025.3529216; 10841835; Semantic Image Synthesis via Class-Adaptive Cross-Attention; Tomaso Fontanini (https://orcid.org/0000-0001-6595-4874), Department of Architecture and Engineering, University of Parma, Parma, Italy; Claudio Ferrari (https://orcid.org/0000-0001-9465-6753), Department of Architecture and Engineering, University of Parma, Parma, Italy; Giuseppe Lisanti (https://orcid.org/0000-0002-0785-9972), Department of Computer Science and Engineering, University of Bologna, Bologna, Italy; Massimo Bertozzi (https://orcid.org/0000-0003-1463-5384), Department of Architecture and Engineering, University of Parma, Parma, Italy; Andrea Prati (https://orcid.org/0000-0002-1211-529X), Department of Architecture and Engineering, University of Parma, Parma, Italy; In semantic image synthesis the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE for learning shape-style correlations and so conditioning the image generation process. Our model inherits the versatility of SPADE, at the same time obtaining state-of-the-art generation quality improving FID score by 5.6%, 1.4% and 3.4% on CelebMask-HQ, Ade20k and DeepFashion datasets respectively, as well as improved global and local style transfer. Code and models available at https://github.com/TFonta/CA2SIS; https://ieeexplore.ieee.org/document/10841835/; Semantic image synthesis; cross-attention; image editing |
title | Semantic Image Synthesis via Class-Adaptive Cross-Attention |
topic | Semantic image synthesis; cross-attention; image editing |
url | https://ieeexplore.ieee.org/document/10841835/ |