Comment by nowayno583

nowayno583 Sep 3, 2024 parent

Intuitively, audio is way more sensitive to phase and persistence because of the time domain. So maybe audio models look more like video models instead of image models?

I'm not really sure how current video generating models work, but maybe we could get some insight into them by looking at how current audio models work?

I think we are looking at an auto regression of auto regressions of sorts, where each PSD + phase is used to output the next, right? Probably with different sized windows of persistence as "tokens". But I'm a way out of my depth here!

bartwr Sep 3, 2024

It's the other way around - in hearing, phase is almost irrelevant. At medium frequencies, moving head by a few centimeters changes phase wand phase relationships of all frequencies - and we don't perceive it at all! Most audio synthesis methods work on variants of spectrograms and phase is approximated only later (mattering mostly for transients and rapid frequency content changes).

In images, scrambling phase yields a completely different image. A single edge will have the same spectral content as pink/brown~ish noise, but they look completely unlike one another.

nowayno583 OP Sep 3, 2024

Makes sense! My impression that phase matters from audio comes from when editing audio in a DAW or anything like that. We are very sensitive to sudden phase changes (which would be kind of like teleporting very fast from one point to another, from our heads point of view). Our ears kind of pick them up like sudden bursts of white noise (which also makes sense, given that they kind of look like an impulse when zoomed in a lot).

So when generating audio I think the next chunk needs to be continuous in phase to the last chunk, where in images a small discontinuity in phase would just result in a noisy patch in the image. That's why I think it should be somewhat like video models, where sudden, small phase changes from one frame to the next give that "AI graininess" that is so common in the current models

benanne Sep 4, 2024

I actually wrote down some thoughts about audio phase in a previous blog post: https://sander.ai/2020/03/24/audio-generation.html#motivatio...

I have an example audio clip in there where the phase information has been replaced with random noise, so you can perceive the effect. It certainly does matter perceptually, but it is tricky to model, and small "vocoder" models do a decent job of filling it in post-hoc.

nyanpasu64 Sep 3, 2024

(not the author) There's techniques for consistent-phase audio synthesis like phase vocoders, but they are beyond my current knowledge.

This item has no comments currently.

Preferences

Keyboard Shortcuts

Story Lists

Navigation

Miscellaneous