Guide to OST heuristics (UW-Madison Audapter)
Guide to Audapter Online Status Tracking (OST) capabilities
The following table lists the heuristics, the parameters they operate over, a description of their (actual) functionality, and examples of what they are good for. It covers both legacy heuristics and new heuristics introduced by Chris Naber at UW-Madison (denoted with *), which can be found in our fork of the Audapter repo (https://github.com/blab-lab/audapter_mex). Note that some of the functionality may differ from what is described in the manual, as the manual is not wholly accurate.
Some heuristics are more reliable than others. These are highlighted in green. Others can be used, but may not behave as you intended, for non-obvious reasons. These are highlighted in yellow. Some are buggy. These have dark red font.
Heuristic name | Param 1 | Param 2 | Param 3 | Function | Use case |
---|---|---|---|---|---|
ELAPSED_TIME | duration | NaN | -- | Waits for the given amount of time, then advances. Note: this heuristic increments the OST status value by only 1. | Setting a trigger after a fixed time, rather than in relation to a speech event. |
INTENSITY_RISE_HOLD | RMS | duration | -- | Checks if RMS is above threshold for given amount of time. Does NOT check for rise. | Onsets of vowels. |
INTENSITY_RISE_HOLD_POS_SLOPE | RMS | duration | -- | Checks if RMS is above threshold for given amount of time. DOES check for rise (positive slope). | Onsets of vowels. |
POS_INTENSITY_SLOPE_STRETCH | frames | delta | -- | Checks if RMS is rising above threshold and covers a certain amount of ground. | Onsets of vowels. Somewhat unintuitive. |
NEG_INTENSITY_SLOPE_STRETCH_SPAN | frames | delta | -- | Checks for negative slope for a number of frames and distance covered. | Offsets of vowels/onsets of stops after vowels. Somewhat unintuitive. |
*INTENSITY_SLOPE_BELOW_THRESH | RMS slope | duration | -- | Checks that the RMS slope is below a threshold for a given duration. | Can check for slow increases in RMS or rapid decreases in RMS, like a vowel to stop transition. |
INTENSITY_FALL | RMS | duration | -- | Checks for RMS below threshold for given duration. Does NOT check for fall. | Offsets of vowels/onsets of stops after vowels. |
*INTENSITY_BELOW_THRESH_NEG_SLOPE | RMS | duration | -- | Checks for RMS below threshold, with decreasing RMS (negative slope), for given duration. DOES check for fall. | Offsets of vowels/onsets of stops after vowels. |
INTENSITY_RATIO_RISE | ratio | duration | -- | Checks that ratio is over a particular threshold, for given duration. Is NOT actually a rise (just above threshold). | Onsets of fricatives. Can also be used to demarcate vowels from nasals, especially front vowels with high F2. |
*INTENSITY_RATIO_ABOVE_THRESH_WITH_RMS_FLOOR | ratio | duration | (RMS = 0.0003) | Checks that ratio is above a threshold for a given duration, and requires that the RMS itself be above 0.0003. | Onsets of fricatives, particularly if they are the first segment in an utterance. |
*INTENSITY_AND_RATIO_ABOVE_THRESH | RMS | ratio | duration | Checks that both RMS and ratio are above individually specified thresholds for a given duration. | Onsets of fricatives, particularly if they are the first segment in an utterance. Also good for demarcating the onset of a vowel after a nasal, with additional security from vowels being louder than nasals. More sophisticated version of the fixed RMS floor. |
*INTENSITY_AND_RATIO_BELOW_THRESH | RMS | ratio | duration | Checks that both RMS and ratio are below individually specified thresholds for a given duration. | The inverse of the "above thresh" heuristic, so good for finding the onset of a nasal or voiced oral stop after a vowel. This is more helpful than an RMS threshold alone because some people have dips in the RMS during the vowel itself that approach the level achieved during a voiced stop. |
INTENSITY_RATIO_FALL_HOLD | ratio | duration | -- | Checks for ratio below threshold for given duration. Does NOT check for fall in ratio. | Offsets of fricatives. Note that this does not check for fall, only threshold. |
*INTENSITY_RATIO_SLOPE_ABOVE_THRESH | ratio-slope | duration | -- | Checks that the ratio slope (i.e., change in ratio) is above a given threshold for a given duration. | Onsets of fricatives or front vowels, with attention paid to how quickly the segment begins. |
*INTENSITY_RATIO_SLOPE_BELOW_THRESH | ratio-slope | duration | -- | Checks that the ratio slope (i.e., change in ratio) is below a given threshold for a given duration. | Offsets of fricatives or front vowels, with attention paid to how quickly the segment stops. |
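
The difference between the "does NOT check for rise" and "DOES check for rise" rows can be illustrated with a minimal Python sketch. This is not Audapter's implementation: the function name `first_hold_frame`, the per-frame RMS lists, and the choice to express duration as a frame count rather than in seconds are all assumptions made for illustration, as is applying the slope check on every frame of the hold.

```python
from typing import Optional, Sequence


def first_hold_frame(
    rms: Sequence[float],
    threshold: float,
    hold_frames: int,
    require_rise: bool = False,
) -> Optional[int]:
    """Return the index of the frame on which the hold condition completes.

    A frame qualifies when its RMS exceeds `threshold`. With `require_rise`,
    it must also exceed the previous frame's RMS (positive slope).
    Hypothetical helper: duration is counted in frames, not seconds.
    """
    run = 0
    for i, value in enumerate(rms):
        above = value > threshold
        rising = i > 0 and value > rms[i - 1]
        if above and (rising or not require_rise):
            run += 1
            if run >= hold_frames:
                return i
        else:
            run = 0  # condition broken, start counting again
    return None


# Made-up trace: tracking starts mid-vowel, RMS already high and decaying.
mid_vowel = [0.050, 0.048, 0.046, 0.044, 0.042]
# Made-up trace: tracking starts in silence, then a vowel begins.
vowel_onset = [0.001, 0.001, 0.010, 0.025, 0.040, 0.055]

print(first_hold_frame(mid_vowel, 0.02, 3))                       # 2
print(first_hold_frame(mid_vowel, 0.02, 3, require_rise=True))    # None
print(first_hold_frame(vowel_onset, 0.02, 3))                     # 5
print(first_hold_frame(vowel_onset, 0.02, 3, require_rise=True))  # 5
```

In this sketch the threshold-only check fires immediately when tracking begins inside a vowel, whereas the slope-aware check fires only on a genuine onset.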
General notes
- A solid knowledge of speech acoustics is highly recommended before implementing OSTs. Topics to know:
  - Effects of vocal tract length
  - What sounds have energy in high frequencies (and where/how much)
  - Effects of voice quality
  - Effects of microphone placement relative to mouth (distance and location relative to the centerline)
- Using "ratio" heuristics (RMS ratio, i.e., the proportion of high-frequency energy to low-frequency energy)
  - These are good for:
    - Fricative boundaries, particularly sibilants
    - Distinguishing vowels (which have more energy in higher formants) from nasals (which have nasal antiformants) or fully voiced stops (whose energy is mostly in the voicing bar, with nothing above it)
  - Reasonable values
    - With UW's setup, we tend to use values like 0.4 to find fricative boundaries. Sibilants easily reach values above 1.
    - If a vowel is flanked by nasals or voiced stops, a lower threshold like 0.17 is typically appropriate for detecting the vowel.
  - Precautions
    - Most of the ratio heuristics, particularly the legacy heuristics, do not check the direction in which the threshold is crossed. Improper use of two ratio-based heuristics in a row can therefore lead to blowing through OST statuses.
    - Vowel ratios are less reliable than fricative ratios, due to vocal tract variability between speakers and between vowels (i.e., how high the formants get).
    - Room "silence" can include spikes in ratio that are not associated with speech onset, since any small increase in high-frequency energy when nothing has changed in the lower frequencies can drastically affect the ratio.
    - The Audapter manual does not describe ratio correctly. Segments with a lot of energy in high frequencies have a HIGH ratio.
- Using RMS heuristics
  - These are good for:
    - Finding vowels, particularly when framed by stops (or silence)
  - Reasonable values
    - With UW's setup, we tend to use values like 0.02 to find vowel onsets and offsets. You can of course be more aggressive to work within the confines of your particular target phrase.
  - Precautions
    - Many of the legacy heuristic names are not descriptive, e.g., INTENSITY_FALL does not check for a fall.
    - Speakers with strong prevoicing can trigger vowel statuses during the syllable onset in words like "bed", because the voicing during the closure also produces an increase in RMS.
    - Speakers with strong voicing during closure can also fail to trigger end-of-vowel statuses, because the voicing during closure keeps RMS relatively high.
    - The stretch/span heuristics are somewhat unintuitive but can be manageable with audapter_viewer. They are somewhat reliable but seem to be susceptible to natural variability and voice quality issues. This is because the slope can be very negative and fall for a while in the middle of a vowel if something happens to the voice quality, all while RMS values remain relatively high. The heuristics that take both slope and RMS into consideration are more reliable in this respect.
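
The precaution about blowing through OST statuses can be illustrated with a small Python sketch. It is a simplified stand-in, not Audapter's tracker: the `track_status` function, the rule tuples, and the ratio trace are hypothetical, durations are counted in frames, and each rule advances the status by 1 for simplicity. The thresholds 0.4 and 0.17 are the example values mentioned above.

```python
from typing import List, Sequence, Tuple

# Hypothetical rule format: (comparison, threshold, hold_frames).
# Only "above" and "below" threshold checks are modeled, with no slope check.
Rule = Tuple[str, float, int]


def track_status(ratio: Sequence[float], rules: Sequence[Rule]) -> List[int]:
    """Return the status after each frame.

    The status advances when the current rule has been satisfied on
    `hold_frames` consecutive frames.
    """
    status, run, out = 0, 0, []
    for value in ratio:
        if status < len(rules):
            kind, threshold, hold_frames = rules[status]
            met = value > threshold if kind == "above" else value < threshold
            run = run + 1 if met else 0
            if run >= hold_frames:
                status, run = status + 1, 0
        out.append(status)
    return out


# Made-up ratio trace: silence, a sibilant (frames 2-7), then a vowel.
ratio = [0.05, 0.05, 0.9, 1.1, 1.2, 1.2, 1.1, 0.9, 0.2, 0.2, 0.2, 0.2]

# Two "above threshold" rules in a row: the second is already true while the
# sibilant is still in progress, so it fires without waiting for the vowel.
blow_through = track_status(ratio, [("above", 0.4, 2), ("above", 0.17, 2)])

# A "below threshold" rule in between makes the tracker wait for the
# sibilant to end before it starts looking for the vowel.
guarded = track_status(
    ratio, [("above", 0.4, 2), ("below", 0.4, 2), ("above", 0.17, 2)]
)

print(blow_through)  # [0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2]
print(guarded)       # [0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 3]
```

In the first configuration the final status is reached at frame 5, in the middle of the sibilant. In the second it is reached only at frame 11, once the vowel has held for two frames.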
Last updated 7/21/2023 by RPK