Enhanced Deep Feature Embedding for Vehicle Re-Identification using Hybrid CNN-Transformer Architecture

Akram Jabbar Mohaisen; Huda Kadhim Tayyeh

doi:10.32792/universityofthi-qar.v21i2.500

Authors

Akram Jabbar Mohaisen, Informatics Institute for Postgraduate Studies, University of Information Technology and Communications, Baghdad, Iraq
Huda Kadhim Tayyeh, University of Information Technology and Communications, Baghdad, Iraq

Keywords:

Vehicle Re-Identification, Hybrid Architecture, ConvNeXt, Swin Transformer, PHE-Net

Abstract

Vehicle re-identification (Re-ID) plays a central role in modern intelligent surveillance systems, yet its performance remains highly sensitive to real-world imaging conditions, particularly variations in illumination and contrast. Although recent hybrid architectures have advanced representation learning by combining convolutional and transformer-based models, the influence of input quality on the resulting feature embeddings is often underestimated. To address this limitation, this paper introduces the Preprocessing-Enhanced Hybrid Embedding Network (PHE-Net), an end-to-end framework that integrates input enhancement with hybrid feature learning. The contribution is not a new preprocessing operator or a new backbone family, but a surveillance-oriented integration strategy that couples input-quality normalization with hybrid local-global embedding learning for vehicle re-identification. The preprocessing stage mitigates common visual degradations through Contrast Limited Adaptive Histogram Equalization (CLAHE) and gamma correction, and preserves vehicle geometry through aspect-ratio-aware resizing. By stabilizing the visual appearance of surveillance images before feature extraction, this stage provides a more reliable foundation for downstream embedding learning. For representation learning, PHE-Net combines the strong local inductive biases of ConvNeXt with the global context modeling capability of Swin Transformer blocks. This hybrid design lets the network jointly capture fine-grained texture details and long-range structural relationships, yielding a more expressive and discriminative vehicle representation. The model is trained using PK sampling and a joint optimization objective that combines an identity classification loss with a triplet loss, encouraging embeddings that are both class-discriminative and retrieval-friendly.
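The preprocessing ideas named above can be illustrated with a minimal, dependency-free sketch. It covers only gamma correction (as an 8-bit lookup table) and the geometry of an aspect-ratio-aware resize (scale to fit, then pad). CLAHE is omitted because it is normally taken from an image library such as OpenCV. The function names, the gamma value, and the 224x224 target size are illustrative assumptions and are not taken from the paper.

```python
def gamma_lut(gamma):
    """Build a 256-entry lookup table for out = 255 * (in / 255) ** gamma.

    gamma < 1 brightens dark regions, gamma > 1 darkens them.
    The value used by PHE-Net is not stated here; any value is an assumption.
    """
    return [round(255.0 * (v / 255.0) ** gamma) for v in range(256)]


def letterbox_geometry(width, height, target_w, target_h):
    """Compute an aspect-ratio-preserving resize followed by symmetric padding.

    Returns (new_w, new_h, (left, top, right, bottom)) so that the resized
    image plus padding exactly fills the target canvas without distortion.
    """
    scale = min(target_w / width, target_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x = target_w - new_w
    pad_y = target_h - new_h
    left = pad_x // 2
    top = pad_y // 2
    return new_w, new_h, (left, top, pad_x - left, pad_y - top)


if __name__ == "__main__":
    lut = gamma_lut(0.5)  # hypothetical gamma, chosen only for illustration
    print(lut[0], lut[255])
    print(letterbox_geometry(400, 200, 224, 224))
```

In a full pipeline the lookup table would be applied per pixel after contrast equalization, and the padding tuple would be passed to whichever image library performs the actual resize and border fill.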
Extensive experiments on the VeRi-776 benchmark validate the effectiveness of the proposed framework. PHE-Net achieves 98.20% Rank-1 accuracy and 82.60% mAP, demonstrating that explicitly coupling input enhancement with hybrid CNN–Transformer feature learning leads to more robust and reliable vehicle re-identification under challenging environmental conditions.
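As a hedged illustration of the training setup mentioned in the abstract, the sketch below shows one common way to form a PK batch (P identities, K images each) and to combine an identity loss with a margin-based triplet term. The function names, the margin of 0.3, the loss weight, and the sample-with-replacement fallback are assumptions made for this example and do not describe the authors' implementation.

```python
import random
from collections import defaultdict


def pk_batch(labels, P, K, rng=random):
    """Return P*K dataset indices: P identities with K images per identity.

    Identities with fewer than K images are sampled with replacement
    (an assumption; other fallbacks are possible).
    """
    by_id = defaultdict(list)
    for idx, label in enumerate(labels):
        by_id[label].append(idx)
    chosen_ids = rng.sample(sorted(by_id), P)
    batch = []
    for pid in chosen_ids:
        pool = by_id[pid]
        if len(pool) >= K:
            batch.extend(rng.sample(pool, K))
        else:
            batch.extend(rng.choices(pool, k=K))
    return batch


def triplet_loss(d_ap, d_an, margin=0.3):
    """Hinge on distances: the anchor-positive distance should be smaller
    than the anchor-negative distance by at least `margin`."""
    return max(0.0, d_ap - d_an + margin)


def joint_loss(id_loss, tri_loss, weight=1.0):
    """Weighted sum of identity classification loss and triplet loss."""
    return id_loss + weight * tri_loss


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
    print(pk_batch(labels, P=2, K=3, rng=random.Random(0)))
    print(joint_loss(2.0, triplet_loss(0.9, 0.2)))
```

Sampling K images per identity guarantees that every anchor in the batch has at least one positive, which is what makes the triplet term well defined for each batch.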