Recent advances in AI music technology have largely focused on generating and manipulating raw audio. Diffusion models, neural synthesizers, and source separation systems have made waveform-level processing more powerful than ever. However, many creative tools for musicians still struggle with control, editability, and real-time interaction when operating directly on audio.
In this talk, we explore why symbolic representations—especially MIDI—remain a crucial layer for building practical and expressive AI music tools.
Rather than treating MIDI as a legacy protocol, we examine how it provides a powerful abstraction for structure, intent, and musical interaction. MIDI enables precise editing, controllable generation, and seamless integration with existing creative workflows. When combined with modern AI models, it becomes a flexible interface between human musical ideas and machine-generated content.
Through examples from tools we have researched, developed, and deployed, we show how MIDI can serve as the control plane for generative music systems. Topics include symbolic-to-audio pipelines, AI-assisted composition workflows, expressive performance data, and hybrid systems that combine MIDI reasoning with neural audio synthesis.
The talk argues that many future music tools will not be purely “audio AI,” but rather layered systems in which symbolic representations guide audio generation.