understanding birdcalls

self-supervised learning of bird behaviors and vocalizations

What do birds chirp about? This used to be an idle curiosity as I stared at my backyard bird feeder, awestruck by the complexity of their feeding, pecking, and vocalizations. What if we could use recent advances in self-supervised learning to associate vocalizations with behaviors, revealing patterns that are otherwise imperceptible to the human eye?

This is not a new idea. Organizations like Project CETI and Earth Species Project are making good headway in similar efforts to decode animal communication. In this project, I attempt some smaller-scale research in my backyard.

Research Plan

Meaning must be grounded in the real world. As a result, to understand birdcalls, we must first collect a large multimodal dataset.

Step 1: Capture a 4D light and sound field. Using a synchronized camera and microphone array, I record the full spatial audio-visual scene of the backyard. The goal is to be as rich as possible at the data capture stage. The more structure we can extract from sensors, the less the model has to infer from scratch. Postprocessing extracts per-animal tracks in both image space (bounding boxes, poses) and audio space (beamformed source locations, separated vocalizations), building a continuous, synchronized record of who is where, doing what, and saying what, at each moment in time.

Step 2: Train a self-supervised model. With a sufficiently rich multimodal dataset, the hope is that a self-supervised model can discover structure that is invisible to human observers. I expect that two pieces will be important here: (1) training a performant self-supervised model that minimizes some form of prediction error and (2) using mechanistic interpretability methods like sparse autoencoders and circuit discovery to understand what the model has learned. Could we recover concepts like “family”, “squirrel”, “food”, and derivative higher-order meanings like “hunger = want food”?
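As a sketch of piece (2): a sparse autoencoder rewrites a model's internal activations as sparse combinations of an overcomplete dictionary of features, which can then be inspected individually for human-interpretable meaning. A minimal numpy version of the forward pass and training objective might look like the following (all dimensions and the L1 coefficient are illustrative, not from any actual codebase):

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x into sparse feature codes, then reconstruct."""
    z = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU keeps most features silent
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse codes."""
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(z))
    return recon + sparsity
```

Training would minimize `sae_loss` over activations harvested from the bird model; a feature whose firing correlates with, say, squirrel appearances on camera would be a candidate “squirrel” concept.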

Goal: Discover something new. Decoding birdcalls would be a remarkable outcome, but the model isn’t necessarily limited to that. It might instead surface patterns in flock movement, social hierarchies, or territorial behavior. The goal of this project is to use the above data-driven self-supervised approach to discover any insight that is beyond the limits of current human knowledge.

Updates

Feb 27, 2026 — Understanding Birdcalls: Data Collection Prototype

This is the first update for the understanding birdcalls project. See the project page for more context.

Progress

So far, I’ve built a portable data collection platform and written some initial recording and processing scripts for it.

Hardware

The entire platform is built around the miniDSP UMA-16 v2, a 16-channel microphone array in a 4x4 uniform rectangular arrangement. It ships with a 1080p RGB camera, which lets the device double as an acoustic camera: images from the camera paired with beamformed sound from the microphones. While this is mostly plug-and-play with a laptop, I decided to connect it to a Raspberry Pi 5 with a touchscreen and mount it on a tripod for extra portability.

Full details and costs are below:

Data collection prototype

| Component | Description | Price |
| --- | --- | --- |
| miniDSP UMA-16 v2 | Acoustic camera with a 16-channel USB mic array and a 1080p camera. | $199.99 |
| Raspberry Pi 5 (8GB) | Main compute board responsible for running the recording pipeline. Data processing can be done offline if necessary. | $89.99 |
| Raspberry Pi Touch Display 2 | Touchscreen for monitoring and controlling recording sessions in the field without needing a separate laptop. | $59.99 |
| Anker 737 Power Bank | 140W, 24,000mAh portable battery. Powers everything outdoors for extended untethered sessions. | $109.99 |
| GeeekPi PD Power Expansion Board | Sits between the power bank and the RPi 5, providing USB-C PD negotiation (the RPi 5 requires an unconventional 5V/5A supply), an always-on switch, and startup/shutdown control. | $29.99 |
| Samsung 256GB PRO Ultimate microSDXC | High-speed storage. Estimated to hold up to 8 hours of recording at a time. | $49.99 |
| Raspberry Pi 5 Active Cooler | Keeps the RPi 5 from throttling during long outdoor recording sessions. | $9.99 |
| Amazon Basics 64-inch Tripod | Mounts and positions the rig in the field with adjustable height and angle. | $15.19 |
| M2.5 Standoff Kit (114pc) | Mechanical hardware for stacking and securing the boards together. | $11.95 |
| **Total** | | **$577.07** |

With taxes, tariffs, and shipping fees, my actual out-of-pocket came out to around $862.

To case the Raspberry Pi 5 along with the touchscreen and the power delivery board, I’ve used variants of the models by Chaddles McGee. Unfortunately, I can’t share my edits due to licensing, but huge thanks to them for making their models public.

Software

The code powering the device is available at https://github.com/tchittesh/nature-sense.

Currently, the data collection pipeline has three stages: calibrate, record, and reprocess.

Calibration (calibrate.py) is best run with the acoustic camera connected to a laptop. It displays a checkerboard pattern on the laptop screen and uses it to calibrate the intrinsics and distortion of the camera.

Recording (record.py) captures synchronized 16-channel audio from the miniDSP UMA-16 at 48 kHz and video at 60 FPS. Audio is stored in HDF5; video in MP4. A sync.csv file logs per-frame timestamps and detects sample-rate drift; one early session showed ~1.5% clock slew, which is caught and corrected for in postprocessing.
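To make the drift handling concrete, here is a minimal sketch of how a slew factor can be estimated and applied. The real sync.csv schema isn't shown here, so the inputs are assumed to be per-frame video timestamps and cumulative audio sample counts:

```python
import numpy as np

def estimate_clock_slew(frame_times_s, audio_sample_counts, sr=48000):
    """Fit audio_time ~= a * frame_time + b over a session.

    Returns (slew, offset): slew is the fractional rate error (e.g. 0.015
    for an audio clock running 1.5% fast), offset the constant misalignment.
    """
    audio_times = np.asarray(audio_sample_counts, dtype=float) / sr
    a, b = np.polyfit(np.asarray(frame_times_s, dtype=float), audio_times, 1)
    return a - 1.0, b

def corrected_sample_index(frame_time_s, slew, offset, sr=48000):
    """Map a video timestamp to the matching audio sample index."""
    return int(round(((1.0 + slew) * frame_time_s + offset) * sr))
```

A linear fit is enough as long as the drift is a constant rate mismatch between the two clocks; temperature-dependent wander would need a piecewise or windowed fit instead.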

Reprocessing (reprocess.py) runs time-domain acoustic beamforming over a 20×20 grid (5 m × 3 m at 2 m depth) targeting the 4000 Hz band, the frequency of the test tone I play from my phone. A visualization option overlays the resulting heatmap on the original video, with a green crosshair at the most likely sound source.
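For reference, here is a stripped-down sketch of time-domain delay-and-sum beamforming over a candidate grid, using fractional delays via linear interpolation. The 42 mm mic pitch is my assumption for illustration (the actual UMA-16 geometry should come from its datasheet), and everything else is simplified relative to the real script:

```python
import numpy as np

SR, C = 48000, 343.0  # sample rate (Hz), speed of sound (m/s)

def mic_positions(n=4, pitch=0.042):
    """n x n planar mic grid centered at the origin (pitch is an assumed value)."""
    idx = (np.arange(n) - (n - 1) / 2) * pitch
    xx, yy = np.meshgrid(idx, idx)
    return np.stack([xx.ravel(), yy.ravel(), np.zeros(n * n)], axis=1)

def delay_and_sum_power(signals, mics, point, sr=SR, c=C):
    """Advance each channel by its propagation delay from `point`, sum, return power."""
    t = np.arange(signals.shape[1]) / sr
    taus = np.linalg.norm(mics - point, axis=1) / c
    aligned = [np.interp(t + tau, t, sig) for sig, tau in zip(signals, taus)]
    return float(np.mean(np.sum(aligned, axis=0) ** 2))

def scan_grid(signals, mics, grid_points):
    """Evaluate beam power at every candidate point; return powers and the argmax."""
    powers = np.array([delay_and_sum_power(signals, mics, p) for p in grid_points])
    return powers, grid_points[int(np.argmax(powers))]
```

Beam power peaks where the hypothesized delays make all 16 channels line up coherently; the full pipeline does this over the 20×20 grid per frame rather than the small 1-D grid used here.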

A first test session captured about 50 seconds of synchronized audio and video (~169 MB audio, ~241 MB video). The phone, playing a 4000 Hz tone, is successfully tracked via beamforming, although there is a fair bit of noise, presumably due to indoor reflections and the limited region the beamforming search runs over.

Next Steps

Here’s what I want to tackle next:

  • attach the RPi and power bank to the tripod for more portability
  • collect and process actual birdcall data!

Some thoughts in the back of my mind:

  • How can I calibrate the extrinsics of the camera with respect to the microphone array? Will it also be important to fine-tune the microphone locations (relative to the expected CAD values) and gains as part of calibration?
  • Will reflections be a problem in my backyard? If so, how can I deal with them?
  • How will acoustic beamforming for sound separation compare with learned approaches like SAM-Audio, biodenoising, and BioCPPNet?

One thing I realized is that we don’t actually need to solve the full multi-source, unknown-frequency, unknown-location sound separation problem using beamforming alone. We can instead use the camera to detect and track all bird instances visually, and then use beamforming more simply to extract the directional audio corresponding to each track. This camera-guided approach sidesteps a hard blind source separation problem and replaces it with a much more tractable targeted one.
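A rough numpy sketch of this camera-guided extraction, assuming a far-field (plane-wave) model and, for now, a co-located and axis-aligned camera and mic array (the extrinsics question above still applies). All function names here are mine, not from the repo:

```python
import numpy as np

SR, C = 48000, 343.0  # sample rate (Hz), speed of sound (m/s)

def pixel_to_direction(u, v, K):
    """Back-project a pixel through intrinsics K to a unit ray in camera coords."""
    d = np.array([(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], 1.0])
    return d / np.linalg.norm(d)

def steer_far_field(signals, mic_xyz, direction, sr=SR, c=C):
    """Far-field delay-and-sum toward a unit `direction`.

    Assumes camera and mic array share an origin and orientation; under a
    plane-wave model, each mic hears the source offset by the projection of
    its position onto the look direction, divided by the speed of sound.
    """
    t = np.arange(signals.shape[1]) / sr
    taus = mic_xyz @ direction / c  # per-mic arrival-time offsets
    aligned = [np.interp(t - tau, t, sig) for sig, tau in zip(signals, taus)]
    return np.mean(aligned, axis=0)
```

Per tracked bird, the detector's box center gives (u, v), `pixel_to_direction` turns it into a look direction, and `steer_far_field` yields that bird's audio track, with no blind source separation needed.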