A method note, written around a candidate match. CLR-001 calibrated our subtype detector against a canonical Phish reference (UIC '98 Bathtub Gin); the natural follow-up is to point that detector at unseen tape from a different band and see whether it surfaces analogs that hold up under listening review. We ran it across forty-eight Goose performances from late 2025 onward. This one, an eighteen-minute Prince cover from 4/24/2026, head in D♯ minor, came back with all three detectable subtypes firing, with the harmonic-distance metric pinned at the ceiling (1.000) for a sustained five-and-a-half-minute stretch. Sits in the same A + B + C bucket as the UIC reference. We re-listened. We're writing it up.
This is a methods follow-up to CLR-001. That dispatch took a long-canonical Type II, UIC '98 Bathtub Gin, and used it to anchor our detector chain: this is what an A + B + C performance looks like, here is each metric on it, here is the threshold each one had to cross. Useful by itself, but the more interesting question is whether a detector calibrated against one band's reference cases generalizes to a different band's catalog. That's a standard validation move in any pattern-detection project. This dispatch is that move.
The protocol. Take the same detector chain we anchored against UIC '98. Aim it at a corner of the catalog the original calibration didn't see, Goose 2025–2026, forty-eight performances we ingested via Archive.org for an ongoing discovery feed. Run the three chroma subtype detectors over each one. Look at the candidates that fire all three. Then sit down and listen to them, the way the dispatches always end: with ears.
One came back as a triple-fire: an eighteen-minute Prince cover from 4/24/2026, head in D♯ minor. We'd been listening with our ears, but not to this. The detector hit it harder than any of the Phish references it was anchored against, sustained harmonic-distance pinned at 1.000 from 6:30 to 12:00, chord-vocabulary divergence reaching 0.61. Those are bigger numbers than UIC '98, on those two axes, in our small sample. (UIC '98 still leads on the sustained-drone metric.) What does that mean? Probably nothing, on its own. Two recordings is two recordings. What we can say with a straight face is: the detector calibrated against a 1998 Phish reference put a 2026 Goose Prince cover next to it, in the same algorithmic bucket, on a regular tour night. That's the kind of pattern a method should produce if the method works. So we wrote it up.
Couple of caveats up front. This is forty-eight performances of one band over six months, against a reference set of twelve Bathtub Gins. Phish has played 2,000+ shows; Goose has played several hundred. Neither sample is anything like comprehensive. Three of the forty-eight Goose performances came back as 2-of-3 partials, close but missing one detector, which we list later as runners-up. There are absolutely more triple-fires sitting on tape we haven't pulled yet, in either band's catalog. None of this is a ranking. The dispatch walks through what the detector saw on this recording, minute by minute, with the figures styled the same way as CLR-001 so you can read the two side by side. Hit play and follow along.
Each cell is the dominant chord during one thirty-second window. Color = chord identity. The thin white ticks underneath are minute marks. Read it left-to-right with the audio. The composed Prince head sits in D♯ minor for the first two and a half minutes, D♯ minor and F♯ major cells alternating. After that the ribbon shifts into a tight C♯ minor / C♯ major palette and stays there for most of the next fifteen minutes, with brief excursions through E major and F♯ major before returning to D♯ minor at the end. Five distinct chords across the eighteen-minute performance, three of which (C♯ minor, C♯ major, E major) the head never visits. Click anywhere to seek the audio there.
Loudness over time. Different shape from a Phish Type II, the dynamic climax doesn't come during the deepest harmonic departure. It arrives near the end of the jam, around 16:00, after the band has already started working back toward home. The stretch where the harmonic-distance metric is highest (6:30 – 12:00) is actually quieter than the head was. The crescendo is a release, not a peak in itself.
A note before the detector-signal plot below: IWD4U lights up the first three subtypes on chord-only metrics. The fourth, originally written off as “not yet measurable,” now has a candidate non-chroma metric from § 04 below: longest contiguous jam stretch with no multi-feature regime change. IWD4U sustains a single texture for 10 minutes (2:30 to 12:30), clearing the provisional ≥ 8-minute threshold, so by this candidate metric it may be a four-of-four. The metric is provisional and small-sample (n=10 BTGs, replicated in § 04), and the threshold is not yet calibrated.
The jam ventures to a non-home key. Detected by checking, window by window, how far the tonal center has drifted from the head's key.
The jam stays near the home key but uses a chord palette the head never visits. Measured by the divergence between the head's and jam's chord histograms.
The band holds individual chords for ten to thirty-plus seconds. Measured by the ratio of average chord-segment length in the jam compared to the head.
A non-chroma metric: the longest contiguous jam stretch with no multi-feature regime change (MFCC novelty + onset density + centroid + spectral flux). The probe is the one used in § 04 below. Type II performances stay texture-committed for ≥ 8 minutes; baselines do not (4/4 vs 0/2 in our 10-performance sample).
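For concreteness, the two chord-sequence metrics (Subtypes B and C) can be sketched from per-window dominant-chord labels alone. This is an illustrative sketch, not the project's actual chain: the function names are hypothetical, the divergence here is plain total-variation distance (the dispatches don't name the divergence they use), and Subtype A is omitted because key-departure needs raw chroma vectors rather than chord labels.

```python
from collections import Counter

def chord_histogram(labels):
    """Normalized chord-occupancy histogram over a sequence of
    per-window dominant-chord labels (one label per 30 s window)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {chord: n / total for chord, n in counts.items()}

def vocab_divergence(head_labels, jam_labels):
    """Subtype B sketch: total-variation distance between the head's
    and jam's chord histograms. 0 = identical palettes, 1 = disjoint."""
    h = chord_histogram(head_labels)
    j = chord_histogram(jam_labels)
    return 0.5 * sum(abs(h.get(c, 0.0) - j.get(c, 0.0)) for c in set(h) | set(j))

def mean_segment_length(labels):
    """Average run length (in windows) of identical consecutive chords."""
    runs, run = [], 1
    for prev, cur in zip(labels, labels[1:]):
        if cur == prev:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    return sum(runs) / len(runs)

def drone_ratio(head_labels, jam_labels):
    """Subtype C sketch: how much longer the jam holds chords than the head."""
    return mean_segment_length(jam_labels) / mean_segment_length(head_labels)

# Toy sequences shaped like the IWD4U ribbon: a D#m/F# head, a C#-centered jam.
head = ["D#m", "F#", "D#m", "F#", "D#m"]
jam = ["C#m", "C#m", "C#m", "C#", "C#", "C#m", "E", "E"]
```

On these toy sequences the jam's palette is fully disjoint from the head's (`vocab_divergence` returns 1.0) and chords are held twice as long (`drone_ratio` returns 2.0); real heads and jams share chords, which is why real thresholds sit far lower.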
The pink line is Subtype A (key-departure). The blue line is Subtype B (chord-vocabulary divergence). Each has a dashed threshold drawn. The bottom strip shades pink / blue / orange when each subtype is firing.
A typographic transcription of the time-series above. Clock at left. Subtype state in the middle. Field note at right. The two intensity peaks fall at roughly 11:30 and 16:00, a textbook two-summit dynamic arc.
A reader of an early draft of this dispatch listened back and reported hearing things changing in the recording that the score didn't describe. That observation is a useful test of the framework. Our four-subtype taxonomy (A, B, C, plus the not-yet-built D) is built on chroma features alone. Chroma is a 12-bin pitch-class energy vector, harmony in, harmony out. By construction it cannot see timbre, brightness, instrument density, or rhythmic feel. So we ran a non-chroma probe over the same recording and looked for moments where a non-chroma feature moves while the binary A/B/C firing pills stay flat.
The probe is four standard audio features computed at the same 30-second resolution as the chroma data: MFCC novelty (frame-to-frame distance in MFCC space, a textural / timbral motion signal), onset density (note onsets per second, a playing-density signal), spectral centroid (the perceptual brightness of the sound, sensitive to instrument color and distortion), and spectral flux (total spectral change per window). Five boundaries in the recording show a non-chroma feature jumping more than one standard deviation while A/B/C does not change at all. The clearest one is at 12:00 to 12:30.
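A minimal sketch of that boundary test, assuming each feature has already been reduced to one value per 30-second window; the function name, the 1-s.d. rule applied to centered first differences, and the synthetic data are all illustrative.

```python
import numpy as np

def flagged_boundaries(features, z_thresh=1.0):
    """Flag window boundaries where any feature's frame-to-frame jump
    exceeds z_thresh standard deviations of that feature's jumps.
    features: dict mapping feature name -> per-window values (30 s windows).
    Returns sorted indices of windows where a new regime may begin."""
    hits = set()
    for series in features.values():
        diffs = np.diff(np.asarray(series, dtype=float))
        sd = diffs.std()
        if sd == 0:  # flat feature: nothing to flag
            continue
        z = np.abs(diffs - diffs.mean()) / sd
        hits.update(int(i) + 1 for i in np.flatnonzero(z > z_thresh))
    return sorted(hits)

# Synthetic example: a centroid that steps from ~1500 Hz to ~2200 Hz
# at window 8, the shape of the 12:00-12:30 boundary described above.
centroid = [1500] * 8 + [2200] * 8
```

`flagged_boundaries({"centroid": centroid})` flags the single step at window 8; in the real probe all four features are checked, and a boundary where several features jump together is the strongest regime-change candidate.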
Two regimes inside one supposedly-flat A+B firing zone. From 4:00 to 12:00 the band plays dense and lower-register: high MFCC novelty (texture moves a lot frame to frame), high onset density (5 to 6 onsets per second), centroid in the 1300 to 1800 Hz range. From 12:30 to 17:00 all three flip together: novelty drops by a third, onset density drops by a third, centroid jumps to 2000 to 2400 Hz. Same chord (C♯ minor, mostly), same firing state (A=1, B=1), completely different texture. The chord classifier shows key-departure (kd) dropping from 0.94 to 0.70 across the same boundary, so we know something changes; what we did not know before this probe was that three independent non-chroma features all pivot at the same window.
The other ABC-independent jumps the probe surfaced: a textural shift at 6:00 to 6:30 (MFCC novelty z=1.14), onset-density jumps at 6:30 to 7:00 (z=1.33) and 7:00 to 7:30 (z=1.19), and a centroid drop at 16:30 to 17:00 (z=1.36) as the band starts working back toward home. None of these move the existing A/B/C pills. All of them are things a careful listener notices.
What this was, on first writing: a single-performance demonstration that the chroma-only framework collapses real motion into a binary state, and that off-the-shelf timbre / density / brightness features expose at least some of the motion that gets collapsed.
Update, same day: we ran the probe on nine more BTG performances (the full reference set + partials + baselines from CLR-001's table). The expected result, “Type II performances have more non-chroma motion,” turned out backwards. Baselines have more regime changes per minute (0.41) than A+B+C performances (0.18). Direction surprise. The metric that does separate them is the inverse: longest contiguous jam stretch with no multi-feature regime change. Type II performances lock into a single textural feel for an extended period; baselines cycle between feels.
| Performance | Class | Jam length | Longest sustained-texture stretch | ≥ 8 min? |
|---|---|---|---|---|
| UIC '98 BTG | A+B+C | 22.6m | 13.5m | ✓ |
| VB '98 BTG | A+B+C | 13.5m | 11.5m | ✓ |
| ★IWD4U 2026-04-24 | A+B+C | 15.5m | 10.0m | ✓ |
| Cincinnati '03 BTG | C only † | 25.2m | 10.0m | ✓ |
| Bethel '24 BTG | A+B+C | 17.3m | 8.0m | ✓ |
| CMAC '14 BTG | A+B partial | 9.9m | 8.0m | ✓ |
| Sphere '24 BTG | baseline | 12.8m | 7.5m | ✗ |
| Worcester '95 BTG | C only | 9.6m | 6.0m | ✗ |
| Berkeley '10 BTG | A+B partial | 9.6m | 5.5m | ✗ |
| Alpine '24 BTG | baseline | 10.6m | 5.0m | ✗ |
†Cincinnati '03 is community-tagged Type II but our chord-only chain only catches its drone (C). It passes this metric, agreeing with the listener consensus against our chroma-only chain. That's a useful signal that the metric is pointing in the right direction.
At threshold ≥ 8 minutes, the metric agrees with all four A+B+C performances and disagrees with both baselines. Partials split. Subtype D candidate, by this name and shape: textural commitment, an axis somewhat orthogonal to A/B/C, capturing how long the band locks into a single feel during the jam. It is provisional, n=10 is small, threshold is not formally calibrated, and the metric is a sum of off-the-shelf features (MFCC novelty + onset density + centroid + spectral flux) that almost certainly leaves room for refinement. Logged in the research notebook as RQ-D.1 through RQ-D.4.
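Given flagged boundaries, the candidate Subtype D metric reduces to a longest-gap computation. A sketch under the same assumptions as the probe (30-second windows, hypothetical function names), with the provisional ≥ 8-minute cutoff:

```python
def longest_sustained_stretch_min(n_windows, change_indices, window_sec=30):
    """Longest contiguous run of windows containing no flagged
    regime change, returned in minutes."""
    boundaries = [0] + sorted(change_indices) + [n_windows]
    longest = max(b - a for a, b in zip(boundaries, boundaries[1:]))
    return longest * window_sec / 60

def passes_candidate_d(n_windows, change_indices, threshold_min=8.0):
    """Provisional Subtype D call: texture held for >= 8 minutes."""
    return longest_sustained_stretch_min(n_windows, change_indices) >= threshold_min

# A 15.5-minute jam (31 windows) with regime changes flagged at windows
# 5 and 25 sustains texture for 10 minutes, like the IWD4U row above.
```

Note that the threshold is encoded as a default argument, not a constant: the dispatch is explicit that ≥ 8 minutes is uncalibrated, so any reimplementation should keep it swappable.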
Era effect worth flagging: the 1990s Phish A+B+C performances (UIC, VB) sustain texture longer (13.5m, 11.5m) than the modern A+B+C performances, Phish's Bethel '24 (8.0m) and Goose's IWD4U (10.0m). Modern jams may move through textures by design even when triple-firing on harmony.
Where the metric breaks. We extended the probe to non-BTG performances, including the 1994 Bangor and Bozeman Tweezers (community-canonical “outside the box” cases) and the 2024-07-21 Great Woods Tweezer (the other uncatalogued match in CLR-001 § 05). Bozeman '94 passes (13.5m sustained); Great Woods '24 passes (11.8m). Bangor '94 fails (7.5m), even though listeners agree it is unmistakably Type II. Bangor's profile turns out to be a high regime-change rate (0.52/min, baseline-like) plus low harmonic concentration (the dominant pitch class shifts across 10 different keys with no clear home base, a chord-stability profile no other performance in the sample shares). So the metric we built captures one mode of Subtype D and misses another: Bangor is doing something neither our chroma chain nor our non-chroma probe captures cleanly. RQ-D.6 (per-window MERT, the pre-trained transformer audio embedding the project uses elsewhere for jam similarity) is the natural next probe; it might unify the two modes, or reveal a third axis we haven't named yet. Logged in the research notebook as RQ-D.10 (the Tweezer extension) and RQ-D.11 (the falsified D-shift hypothesis).
Four performances in our working sample cross all three thresholds. Three are the Phish Bathtub Gin reference cases CLR-001 was calibrated against, UIC '98 (the canonical anchor), Virginia Beach '98, Bethel '24. The fourth is this. Not “four out of two thousand”, that's not a number we have. It's four out of the ~120 performances we've ingested for method development, and the reference cases were specifically chosen because they were already known triple-fires. The point of the comparison is to look at the metric values side by side and see how the Goose performance shapes up against the cases the detector was anchored on. Different shape, same classification.
| Performance | Length | Head key | Jam key | Key dep. | Chord div. | Drone ratio | Notes |
|---|---|---|---|---|---|---|---|
| 1998 · 08 · 09 Virginia Beach · Phish · Bathtub Gin · jamcharts | 15:02 | G min | C maj | 0.24 | 0.12‡ | 2.72 | 1998-era jamchart classic |
| 1998 · 11 · 09 UIC · Phish · Bathtub Gin · canonical reference | 24:06 | G maj | C maj | 0.34 | 0.21 | 3.98 | The reference. → CLR-001 |
| 2024 · 08 · 11 Bethel · Phish · Bathtub Gin · recent | 18:49 | C maj | F maj | 0.43 | 0.11‡ | 1.57 | Modern A+B+C, twenty-six years later |
| ★2026 · 04 · 24 IWD4U · Goose · Prince cover · this dispatch | 18:00 | D♯ min | C♯ maj | 0.63 | 0.30 | 1.61 | Hits both A and B harder than the canonicals; weaker on drone. |
‡ Two of these chord-vocabulary cells (Virginia Beach 0.12, Bethel 0.11) sit below the headline 0.18 threshold on the aggregate measurement, but the per-window aggregator catches their B firings during specific passages. We surfaced the call here for the same reason CLR-001 surfaced it for Bethel: borderline measurement-method choices should be visible in the dispatch, not buried.
Read across the “Key dep.” column: 0.24, 0.34, 0.43, 0.63. Read “Chord div.”: 0.12, 0.21, 0.11, 0.30. The Goose performance lands at the high end of both, in this small reference set. Read “Drone ratio”: 2.72, 3.98, 1.57, 1.61, drone is where IWD4U is weakest. What this means is intentionally narrow: against the four performances we're comparing, the Goose recording sits high on the harmonic axes and moderate on the textural one. Same classification, different shape. Wider claims would need a wider sample.
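Reading the table as code, the aggregate-level call is three threshold comparisons. Only the 0.18 chord-divergence line is stated in these dispatches; the A and C cutoffs in this sketch are placeholders chosen so the toy reproduces the table's aggregate calls, not published values.

```python
def fired_subtypes(key_dep, chord_div, drone_ratio,
                   th_a=0.20, th_b=0.18, th_c=1.5):
    """Aggregate-level subtype call. th_b = 0.18 is the threshold named
    in the dispatch; th_a and th_c are illustrative placeholders."""
    fired = []
    if key_dep >= th_a:
        fired.append("A")
    if chord_div >= th_b:
        fired.append("B")
    if drone_ratio >= th_c:
        fired.append("C")
    return fired

# IWD4U's row (0.63, 0.30, 1.61): triple-fire on the aggregate numbers.
# Virginia Beach's row (0.24, 0.12, 2.72): B misses on the aggregate,
# the ‡ footnote case; its B firing is only caught per-window.
```

The per-window aggregator that rescues Virginia Beach and Bethel operates on 30-second windows rather than on these whole-jam aggregates, which is exactly the measurement-method choice the footnote flags.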
When we ran the calibrated detector chain over the forty-eight Goose 2025-26 performances we'd ingested, IWD4U was the only triple-fire. Three more came back as 2-of-3 partials, performances where two detectors crossed threshold and one didn't. We list them here for transparency, not as a leaderboard. Each is a candidate, not a finding, they need their own close listening review before we'd say anything stronger. (And the sample itself is a working slice: forty-eight performances over six months, against a detector calibrated on twelve Bathtub Gins. More tape will land on the bench, and the picture will move.)
2026 · 04 · 10 Asheville, 15:45
Aggregate harmonic-distance pegged at 1.000; highest chord-vocab divergence among the forty-eight (0.50). Misses on drone, the band keeps changing chords throughout. A + B, no C. A candidate, pending listening review.
2026 · 04 · 24, 11:24 · same night as IWD4U
From the same show as IWD4U, two slots later. D maj → G min, strong on the harmonic axes. A + B, no C. Worth flagging that the 4/24 show has multiple candidates in it; whether that pattern means anything (or whether it's coincidence in a small sample) is a question for a later dispatch.
2026 · 04 · 10 Asheville, 14:36
Highest drone ratio in our forty-eight-show Goose 2025-26 sample (2.29). A modest key shift still clears the A threshold, but chord-vocabulary divergence falls just below the 0.18 line. A + C, no B. Different texture from IWD4U, a candidate worth listening to with the drone hypothesis in mind.
A working notebook, written one performance at a time. Each entry develops or stress-tests a measurement; some will pan out, some won't, and the failures get filed too. The point is the method, not the catalog. A short list of what's on the bench next:
We post when something's worth filing, not on a schedule. If you want to hear about new dispatches, send a note: listen@zabriskie.app.