2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)


22-24 October 2025, Singapore



Tutorials


Tutorial 1: Recent Advances in End-to-End Learned Image and Video Coding


Time: Wednesday, 22 Oct 2025, 08:00-11:30am

Venue: Lotus I

Presenters: Prof. Heming Sun and Prof. Wen-Hsiao Peng

Prof. Heming Sun
Prof. Wen-Hsiao Peng

Part I: Overview of Learned Image/Video Coding (by Prof. Peng; 15 mins)

  • Introduction to end-to-end learned image and video coding
  • The rate-distortion performance of state-of-the-art (SOTA) learned image/video codecs
  • Standardization activities on neural image/video coding in JPEG and MPEG

Part II: End-to-End Learned Image Coding (by Prof. Sun; 70 mins)

  • Elements of end-to-end learned image coding
  • Review of a few notable tool features (e.g. fast context models)
  • Network pruning and quantization for learned image codecs
  • Implicit Neural Representation (INR)-based image coding systems
  • Real-time implementation of learned image codecs
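
The elements of end-to-end learned image coding listed above are all trained against a single rate-distortion Lagrangian, L = R + λD. A minimal NumPy sketch of that objective (the function name, the toy entropy model, and the λ value are illustrative assumptions, not code from the tutorial):

```python
import numpy as np

def rate_distortion_loss(x, x_hat, likelihoods, lam=0.01):
    """Rate-distortion Lagrangian L = R + lambda * D for a learned codec.

    x, x_hat    : original and reconstructed images (same shape)
    likelihoods : per-symbol probabilities assigned by the entropy model
    lam         : Lagrange multiplier trading rate against distortion
    """
    num_pixels = x.size
    # Rate: estimated bits per pixel from the entropy model, -log2 p(y)
    rate_bpp = -np.sum(np.log2(likelihoods)) / num_pixels
    # Distortion: mean squared error of the reconstruction
    distortion = np.mean((x - x_hat) ** 2)
    return rate_bpp + lam * distortion

# Toy example: a 4x4 "image" with a slightly offset reconstruction
rng = np.random.default_rng(0)
x = rng.random((4, 4))
x_hat = x + 0.01
p = np.full(16, 0.5)  # toy entropy model: p = 0.5 per symbol -> 1 bpp
loss = rate_distortion_loss(x, x_hat, p, lam=0.01)
print(round(loss, 6))  # 1.000001 (1.0 bpp + 0.01 * 1e-4 MSE)
```

Sweeping λ traces out the rate-distortion curve: larger λ favours fidelity at higher bitrates, smaller λ the reverse.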

Coffee Break (20 mins)

Part III: End-to-End Learned Video Coding (by Prof. Peng; 60 mins)

  • End-to-end learned video coding frameworks: residual coding, conditional coding, and conditional residual coding
  • Review of some notable systems
  • The explicit, implicit, and hybrid temporal buffering strategies
  • The rate-distortion-complexity trade-offs from the perspectives of coding frameworks and buffering strategies
  • Network quantization for learned video codecs

Part IV: Practical Implementation (30 mins)

  • Emerging learned coding techniques for 3D/4D Gaussian Splatting and multi-modal large language models
  • Open issues and concluding remarks

Tutorial 2: Deep Speaker Modeling: Theories, Applications and Practice


Time: Wednesday, 22 Oct 2025, 08:00-11:30am

Venue: Lotus II

Presenters: Shuai Wang, Yanmin Qian, and Haizhou Li

Shuai Wang
Yanmin Qian
Haizhou Li

Part I: Foundations and Recent Advances (60 mins)

  • Foundational theories and review of traditional methods in speaker modeling
  • Evolution of speaker representation techniques in the deep learning era
    • From i-vector to various deep speaker representations
    • Applications of self-supervised and semi-supervised learning in speaker modeling
    • Analysis of speaker representation capabilities in foundation speech models
    • Leveraging pretrained large models

Part II: Applications Beyond Recognition (60 mins)

  • Speaker-adaptive speech synthesis
    • Voice cloning technologies and ethical considerations
    • Speaker representation in few-shot and zero-shot speech synthesis
  • Personalized voice conversion systems
  • Speaker perception in multimodal human-computer interaction
  • Target speaker speech processing
    • Target speaker extraction
    • Target speaker speech recognition
    • Target speaker verification
    • Personalized VAD

Part III: Challenges and Countermeasures (30 mins)

  • Domain adaptation and domain-invariant learning
  • Privacy-preserving speaker representations
  • Robustness and adversarial attack defense
  • Computational efficiency and model compression
  • Explainability techniques for speaker models

Part IV: Practical Implementation (30 mins)

  • Introduction to tools and frameworks
    • Wespeaker toolkit for speaker embedding learning
    • Wesep toolkit for target speech extraction
  • Case studies and demonstrations
  • Interactive discussion and Q&A session
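
Independent of the specific toolkit, the scoring step shared by the verification tasks above is cosine similarity between speaker embeddings. A minimal sketch with NumPy (the embedding dimension, threshold, and synthetic embeddings are illustrative assumptions; real embeddings would come from a trained model such as one built with Wespeaker):

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

def verify(enroll, test, threshold=0.5):
    """Accept the trial as same-speaker if the score clears the threshold."""
    return cosine_score(enroll, test) >= threshold

# Synthetic 192-dim embeddings standing in for model outputs
rng = np.random.default_rng(42)
spk1 = rng.standard_normal(192)                  # enrolled speaker
spk1_b = spk1 + 0.1 * rng.standard_normal(192)   # same speaker, new utterance
spk2 = rng.standard_normal(192)                  # different speaker

print(verify(spk1, spk1_b))  # True  (high similarity)
print(verify(spk1, spk2))    # False (near-orthogonal embeddings)
```

In practice the threshold is calibrated on a development set (e.g. to the equal error rate operating point) rather than fixed a priori.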

Tutorial 3: From Detection to Direction: An Overview of Sound Event Localization and Detection


Time: Wednesday, 22 Oct 2025, 08:00-11:30am

Venue: Hibiscus III

Presenters: Jun Wei Yeow and Ee-Leng TAN

Jun Wei Yeow
Ee-Leng TAN

Part I: Overview of Sound Event Localization and Detection (SELD) (30 mins)

  • Introduction to SELD and its applications
  • History of SELD and its component tasks (Sound Event Detection and Sound Source Localization)
  • Recent advances and challenges in SELD
  • Publicly available SELD datasets

Part II: Core Technical Components of SELD (60 mins)

  • Spatial audio formats used for SELD, including First Order Ambisonics, microphone array signals, and binaural recordings
  • Contemporary feature extraction techniques that capture spatiotemporal cues needed for robust event detection and localization
  • Deep learning architectures designed for SELD, including convolutional recurrent networks (CRNNs), transformer-based models, and multi-branch or multi-task setups
  • Training strategies, such as multi-task learning (joint DOA and event classification), data augmentation for spatial audio, and domain adaptation techniques
  • Benchmark datasets and metrics, including a deep dive into the DCASE Challenge series as well as evaluation criteria such as localization errors, detection accuracies, and combined SELD scores
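
Among the evaluation criteria listed above, localization error is typically measured as the angle between the predicted and reference directions of arrival. A minimal sketch (function names are illustrative; the DCASE evaluation additionally associates predictions with reference events before averaging):

```python
import numpy as np

def doa_to_vector(azimuth_deg, elevation_deg):
    """Convert (azimuth, elevation) in degrees to a unit direction vector."""
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    return np.array([np.cos(el) * np.cos(az),
                     np.cos(el) * np.sin(az),
                     np.sin(el)])

def angular_error_deg(pred, ref):
    """Great-circle angle in degrees between predicted and reference DOAs,
    each given as an (azimuth, elevation) pair in degrees."""
    u, v = doa_to_vector(*pred), doa_to_vector(*ref)
    cos_angle = np.clip(np.dot(u, v), -1.0, 1.0)  # guard against rounding
    return float(np.degrees(np.arccos(cos_angle)))

# Prediction 20 degrees off in azimuth at zero elevation
print(round(angular_error_deg((30.0, 0.0), (50.0, 0.0)), 6))  # 20.0
```

Working on unit vectors rather than raw angle differences avoids azimuth wrap-around issues and correctly discounts azimuth error at high elevations.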

Coffee Break (30 mins)

Part III: Advanced and Emerging Topics (60 mins)

  • Semi-supervised and weakly labelled learning approaches
  • Robustness to reverberation, overlapping events, and unseen acoustic scenes
  • Multi-modal SELD systems that integrate complementary modalities, such as video recordings or motion sensors
  • Complementary performance gains from acoustic scene classification (ASC)


Part IV: Real-Time Implementation of SELD (40 mins)

  • Real-time constraints and considerations
  • Lightweight models suitable for real-time and edge applications
  • Discussion and Q&A session

Tutorial 4: Adaptive Sensor Networks in Digital Health


Time: Wednesday, 22 Oct 2025, 08:00-11:30am

Venue: Peony I

Presenter: Prof. Saeid Sanei

Prof. Saeid Sanei

This tutorial shows the importance of distributed networks and cooperation, concepts borrowed from the multi-agent communication systems domain, in modelling industrial, biological, and diagnostic systems. In many patient monitoring systems, such as multichannel electroencephalography (EEG), electromyography (EMG), and electrocardiography (ECG), as well as in industrial sensors such as smart meter networks, the sensor data can be aggregated in an adaptive manner. Adaptive cooperative networks are also used to model single- or multi-task systems, in which the agents have multiple targets. In industry, smart meter networks within a household area, together with the information transfer between the smart meters, can substantially reduce the peak energy supply. On the clinical side, an adaptive network can be devised to use multichannel EEG to translate brain function into body movement, or to model the link between two brains in a multi-subject (a.k.a. hyperscanning) scenario. It can be verified that distributed array processing (beamforming) improves the localization of brain responses to deep brain single-pulse electrical stimulation (SPES), applicable to the diagnosis of drug-resistant epileptic seizures. Likewise, a cooperative particle filtering approach can significantly enhance the identification and tracking of event-related potentials (ERPs) to monitor brain degenerative diseases, fatigue, or cognitive deterioration. The tutorial runs for three hours, with an approximately 30-minute tea/coffee break in between.


Part I: Outline of the Tutorial and the Material to Be Presented (by Prof. Saeid Sanei; ~3 hrs including tea/coffee break)

  • From adaptive filters to adaptive cooperative networks
  • Distributed sensor networks, definitions, examples, and applications
  • Adaptive cooperative network topologies
  • Single- and multi-task networks, optimizations, and applications
  • Estimation of network connectivity (information transfer) for accurate setting of the combination weights
  • Body sensor networks and their clinical applications
  • Cooperative systems in brain computer interfacing (BCI)
  • Distributed beamforming for seizure source localization of interictal epileptiform discharges and delayed/late brain responses to deep brain electrical stimulation
  • Distributed particle filtering for tracking the brain event related potentials
  • Distributed systems for crowd monitoring
  • Wider applications of cooperative systems (biological modelling, network security, and energy distribution)
  • Concluding remarks
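
The step "from adaptive filters to adaptive cooperative networks" in the outline above can be illustrated by diffusion LMS, the canonical adaptive cooperative algorithm: each node takes a local LMS step (adapt) and then averages estimates with its neighbours (combine). A minimal adapt-then-combine sketch with uniform combination weights (the network topology, noise level, and function names are illustrative assumptions):

```python
import numpy as np

def diffusion_lms(neighbors, sense, mu=0.05, iters=2000, dim=4, seed=0):
    """Adapt-then-combine (ATC) diffusion LMS over a cooperative network.

    neighbors : list of neighbor index lists (each including the node itself)
    sense     : callable(k, rng) -> (u, d), one regressor/measurement pair
    Uniform combination weights are assumed for simplicity.
    """
    rng = np.random.default_rng(seed)
    n = len(neighbors)
    w = np.zeros((n, dim))                      # local estimates
    for _ in range(iters):
        psi = np.empty_like(w)
        for k in range(n):                      # adapt: local LMS update
            u, d = sense(k, rng)
            psi[k] = w[k] + mu * (d - u @ w[k]) * u
        for k in range(n):                      # combine: average neighbours
            w[k] = psi[neighbors[k]].mean(axis=0)
    return w

# Toy network: 4 nodes in a ring, all sensing the same unknown vector w0
w0 = np.array([1.0, -0.5, 0.25, 2.0])
ring = [[0, 1, 3], [1, 0, 2], [2, 1, 3], [3, 2, 0]]

def measure(k, rng):
    u = rng.standard_normal(4)
    d = u @ w0 + 0.01 * rng.standard_normal()   # noisy linear measurement
    return u, d

w_est = diffusion_lms(ring, measure)
print(np.round(w_est.mean(axis=0), 2))  # close to w0
```

The combine step is what distinguishes the cooperative network from n independent adaptive filters: information diffuses through the topology, so every node benefits from all measurements without any central fusion node.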