Classical timbre studies have modeled timbre as the integration of a limited number of auditory dimensions and proposed acoustic correlates for these dimensions to explain sound identification. Here, the goal was to reveal the time-frequency patterns subserving the identification of musical voices and instruments, without making any a priori assumption about these patterns. We adapted a “random search method” originally proposed in vision research. The method consists of synthesizing sounds by randomly selecting “auditory bubbles” (small time-frequency glimpses) from the original sounds’ spectrograms and then inverting the resulting sparsified representation back into a waveform. For each bubble selection, a decision procedure categorizes the resulting sound as a voice or an instrument. After hundreds of trials, the whole time-frequency space has been explored, and accumulating the bubbles that led to correct answers reveals the time-frequency patterns relevant to each category. We used this method with two decision procedures: human listeners and a decision algorithm based on auditory distances computed from spectro-temporal excitation patterns (STEPs). The patterns were strikingly similar for the two procedures: voice identification relied on higher frequencies (i.e., the formant region), whereas instrument identification relied on lower frequencies, particularly during the onset. Altogether, these results show that timbre can be analyzed as weighted time-frequency patterns corresponding to the cues that are important for sound identification.
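
For illustration only, the following minimal sketch outlines the core loop of such an auditory bubbles procedure under simplifying assumptions: Gaussian-shaped bubbles applied to a short-time Fourier spectrogram (the actual study may use different glimpse shapes and an auditory rather than Fourier representation), and a placeholder classify function standing in for the decision procedure (a listener's response or a STEP-based classifier). All function names, parameter values, and the aggregation rule shown here are illustrative assumptions, not the original method's specification.

    import numpy as np
    from scipy.signal import stft, istft

    rng = np.random.default_rng(0)

    def make_bubble_mask(n_freq, n_time, n_bubbles=20, sigma_f=3.0, sigma_t=3.0):
        # Sparse time-frequency mask built from randomly placed Gaussian "bubbles".
        f_idx = np.arange(n_freq)[:, None]   # (n_freq, 1)
        t_idx = np.arange(n_time)[None, :]   # (1, n_time)
        mask = np.zeros((n_freq, n_time))
        for _ in range(n_bubbles):
            f0 = rng.uniform(0, n_freq)
            t0 = rng.uniform(0, n_time)
            bubble = np.exp(-0.5 * (((f_idx - f0) / sigma_f) ** 2 +
                                    ((t_idx - t0) / sigma_t) ** 2))
            mask = np.maximum(mask, bubble)  # overlapping bubbles saturate at 1
        return mask

    def bubbled_sound(x, fs, nperseg=512):
        # Sparsify a sound: mask its STFT with random bubbles, then invert to a waveform.
        f, t, Z = stft(x, fs=fs, nperseg=nperseg)
        mask = make_bubble_mask(Z.shape[0], Z.shape[1])
        _, x_sparse = istft(Z * mask, fs=fs, nperseg=nperseg)
        return x_sparse, mask

    def classification_image(sounds, labels, classify, fs, n_trials=500):
        # Aggregate bubble masks over many trials; `classify` stands for the
        # decision procedure and returns a predicted label for a sparsified sound.
        num = None  # sum of masks on correctly categorized trials
        den = None  # sum of all masks (sampling density of the TF plane)
        for _ in range(n_trials):
            i = rng.integers(len(sounds))
            x_sparse, mask = bubbled_sound(sounds[i], fs)
            if num is None:
                num = np.zeros_like(mask)
                den = np.zeros_like(mask)
            den += mask
            if classify(x_sparse, fs) == labels[i]:
                num += mask
        return num / np.maximum(den, 1e-12)  # proportion correct per TF bin

In this sketch, the returned map is simply the proportion of correct categorizations obtained when each time-frequency bin was revealed; such a map plays the role of the "relevant time-frequency patterns" described above, computed separately for each category (voice or instrument).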