getscript — Technical Overview

Architecture, shared library design, and privacy model

What is getscript?

getscript is a free, open-source CLI that fetches transcripts from Apple Podcasts episodes and YouTube videos. It's the only open-source tool that can programmatically access Apple Podcasts transcripts via FairPlay/AMSMescal authentication. Output goes to stdout, so it can be piped into grep, jq, or awk, or redirected to files.

pip install getscript

# Fetch an Apple Podcasts transcript
getscript "https://podcasts.apple.com/us/podcast/the-daily/id1200361736?i=1000753754819"

# Fetch a YouTube transcript
getscript "https://youtube.com/watch?v=VIDEO_ID"

# JSON output
getscript EPISODE_ID --json | jq '.segments[].text'

# Search Apple Podcasts interactively (requires fzf)
getscript --search "artificial intelligence" --apple

Source code: github.com/outerbanks73/cli-tools

Architecture

getscript is a Python 3.10+ package with no heavy dependencies. Transcript fetching uses Apple's private AMP API (via a compiled Obj-C helper with AMSMescal/FairPlay authentication) for Apple Podcasts and youtube-transcript-api for YouTube.

getscript/
├── cli.py          # Entry point, argument parsing
├── detect.py       # URL/ID source detection (Apple vs YouTube)
├── apple.py        # Apple Podcasts (macOS, Obj-C bearer token, AMSMescal/FairPlay)
├── youtube.py      # YouTube transcript fetching (proxy, cookies)
├── output.py       # Formatters: text, JSON, Markdown, TTML
├── upload.py       # Shared library submission
├── config.py       # XDG config/cache, env vars
├── search.py       # iTunes Search API, YouTube API v3
├── picker.py       # Interactive fzf selection
├── progress.py     # TTY-aware spinner
└── completions.py  # bash/zsh/fish completions

Design Principles

  • Silence is golden. No banners, no welcome messages. Primary data goes to stdout, everything else to stderr.
  • Composable. Output works with Unix pipes. JSON, Markdown, and timestamped text formats available.
  • Fast startup. Heavy imports are lazy-loaded. Target: <100ms to first output.
  • Non-blocking uploads. Shared library submissions happen after output is written. Failures produce a stderr warning — never affect the transcript or exit code.
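
The non-blocking upload principle above can be sketched as follows. This is an illustrative outline, not getscript's actual implementation: `emit_and_submit` and `submit_fn` are hypothetical names, and the bounded join is an assumption about how the CLI avoids hanging on a slow network.

```python
import sys
import threading

def emit_and_submit(transcript_text: str, submit_fn) -> None:
    """Write the transcript to stdout first; submit afterwards so the
    user-visible output and exit code never depend on the upload."""
    sys.stdout.write(transcript_text)
    sys.stdout.flush()

    def _submit():
        try:
            submit_fn(transcript_text)
        except Exception as exc:
            # Any upload failure becomes a stderr warning only -- it can
            # never corrupt stdout or change the exit code.
            print(f"warning: shared-library upload failed: {exc}", file=sys.stderr)

    worker = threading.Thread(target=_submit, daemon=True)
    worker.start()
    worker.join(timeout=5)  # bounded wait so the CLI still exits promptly
```

Because the transcript is flushed before the submission starts, a consumer piping getscript into jq sees identical output whether the upload succeeds, fails, or times out.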

Shared Transcript Library

Every transcript fetched by getscript is automatically submitted to the Voxly shared transcript library. This creates a network effect: as more people use the tool, the library grows, and enrichments (AI summaries, embeddings, entity extraction) become available to all users — including the free tier.

Submission Pipeline

Submissions don't go directly into the canonical library. They go through a quarantine and verification pipeline:

CLI fetch → ingest-transcript Edge Function → transcript_submissions (quarantine)
                                                         │
                                              server-side re-fetch & verify
                                                         │
                                                    ┌────┴────┐
                                                    │         │
                                               accepted    rejected
                                                    │
                                            transcripts (canonical)
                                                    │
                                          transcript_provenance

Verification

A background worker processes the submission queue. For each pending submission, it:

  1. Re-fetches the transcript independently from the original source (YouTube captions API, Apple AMP API)
  2. Computes a content hash of the re-fetched text and compares it to the submitted content hash
  3. If the hashes match exactly (or match after normalizing whitespace), the submission is accepted and promoted to the canonical library
  4. If the hashes diverge significantly, the submission is rejected and the device is flagged
  5. If the source can't be re-fetched (geo-blocked, deleted, rate-limited), the submission stays pending with a lower confidence score

This means the CLI-submitted text is treated as a hint, not as truth. The server independently verifies every submission.
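
The hash-comparison step can be sketched like this. Function names (`normalize`, `verify`) are illustrative, and collapsing whitespace runs is an assumption about what "normalizing whitespace" means here:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace runs so formatting differences don't fail verification."""
    return re.sub(r"\s+", " ", text).strip()

def content_hash(text: str) -> str:
    """SHA-256 hex digest of the transcript text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify(submitted: str, refetched: str) -> str:
    """Accept if hashes match raw or after whitespace normalization."""
    if content_hash(submitted) == content_hash(refetched):
        return "accepted"
    if content_hash(normalize(submitted)) == content_hash(normalize(refetched)):
        return "accepted"
    return "rejected"
```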

Deduplication

Source URLs are hashed (SHA-256) to create a source_hash. When a submission arrives for a source that already exists in the canonical library, it's marked as a duplicate and a provenance record is created linking the new device to the existing transcript. The transcript is not re-inserted — but the new submission increases the confidence score.
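
A minimal sketch of the dedup key, assuming the hash is taken over the raw URL string:

```python
import hashlib

def source_hash(source_url: str) -> str:
    """SHA-256 of the source URL, used as the deduplication key."""
    return hashlib.sha256(source_url.encode("utf-8")).hexdigest()

# Two devices submitting the same URL produce the same key, so the
# second submission is marked duplicate instead of re-inserted.
assert source_hash("https://youtube.com/watch?v=abc") == \
       source_hash("https://youtube.com/watch?v=abc")
```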

Provenance Tracking

Every canonical transcript maintains a full provenance chain:

  • Who submitted it — anonymous device IDs (UUID v4), linked to user accounts if the submitter is logged in
  • When — first seen, last seen timestamps per device
  • How many — count of distinct devices that submitted the same source
  • Content integrity — SHA-256 content hash per submission, enabling cross-device comparison
  • Verification method — whether the transcript was verified by server re-fetch, consensus, or is unverified

Device Trust Scoring

Over time, each device builds a trust profile based on its submission history:

  • Accepted-to-rejected ratio (devices with consistent high-quality submissions earn higher trust)
  • Content hash consistency vs server re-fetch (do they submit unmodified transcripts?)
  • Volume patterns (distinguishes normal users from bots)

Trusted devices (e.g., 50+ accepted submissions, 0 rejections) can eventually bypass re-fetch verification, reducing server load while maintaining integrity.
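
A toy version of such a trust heuristic, purely illustrative (the real scoring formula isn't documented here; the Laplace smoothing and function names are assumptions). Only the 50-accepts/0-rejections bypass threshold comes from the text above:

```python
def trust_score(accepted: int, rejected: int) -> float:
    """Accepted-to-rejected ratio with Laplace smoothing,
    so brand-new devices start at a neutral 0.5."""
    return (accepted + 1) / (accepted + rejected + 2)

def can_skip_refetch(accepted: int, rejected: int) -> bool:
    """Trusted devices (50+ accepted, 0 rejected) may bypass re-fetch."""
    return accepted >= 50 and rejected == 0
```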

Privacy Model

getscript is designed for zero-friction usage. No account, no login, no API key required.

  • Device ID: A random UUID v4 generated on first run, stored at ~/.config/getscript/device.json. Not tied to any personal information.
  • IP addresses: Stored only as SHA-256 hashes for rate limiting. Raw IPs are never persisted.
  • Transcript content: Only publicly available transcripts (YouTube captions, Apple Podcasts) are submitted. The content is already public.
  • Opt-out: Use --no-upload or set GETSCRIPT_UPLOAD=0 to disable submissions entirely.
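
The device-ID and opt-out behavior described above could look roughly like this. The file location and env var come from this document; the function names and JSON shape of device.json are assumptions:

```python
import json
import os
import uuid
from pathlib import Path

def load_device_id(config_dir: Path) -> str:
    """Return the anonymous device ID, generating a random UUID v4 on
    first run (documented location: ~/.config/getscript/device.json)."""
    path = config_dir / "device.json"
    if path.exists():
        return json.loads(path.read_text())["device_id"]
    device_id = str(uuid.uuid4())  # random; not derived from personal data
    config_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"device_id": device_id}))
    return device_id

def uploads_enabled(no_upload_flag: bool = False) -> bool:
    """Respect both --no-upload and GETSCRIPT_UPLOAD=0."""
    if no_upload_flag:
        return False
    return os.environ.get("GETSCRIPT_UPLOAD", "1") != "0"
```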

Server Infrastructure

The ingest pipeline runs on Supabase infrastructure:

  • Edge Function (ingest-transcript) — Deno-based, handles payload validation, content hashing, dedup checks, and quarantine insertion. Rate limited to 30 requests/minute per IP.
  • PostgreSQL — transcript_submissions (quarantine), transcripts (canonical), transcript_provenance (tracking). All tables have Row Level Security enabled.
  • pgvector — Embedding storage for semantic search over the canonical library.
  • Background worker — Processes the submission queue: re-fetches transcripts, verifies content hashes, promotes or rejects, updates provenance.

Data Flow Diagram

┌──────────────────────────────────────────────────────────────┐
│  getscript CLI                                               │
│                                                              │
│  1. Detect source (Apple/YouTube) from URL or ID             │
│  2. Fetch transcript from origin (AMP API / captions API)    │
│  3. Format and write to stdout                               │
│  4. Submit to ingest-transcript Edge Function (async)        │
│     └─ payload: device_id, source_type, source_id,           │
│        source_url, title, full_text, segments, cli_version   │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│  Supabase Edge Function: ingest-transcript                   │
│                                                              │
│  • Validate payload (source_type, word count bounds)         │
│  • Compute source_hash (SHA-256 of source_url)               │
│  • Compute content_hash (SHA-256 of full_text)               │
│  • Hash IP for rate limiting                                 │
│  • Check for duplicate submission (same device + source)     │
│  • Check for existing canonical transcript                   │
│  • INSERT into transcript_submissions (quarantine)           │
│  • If canonical exists: mark duplicate, record provenance    │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│  Verification Worker (background)                            │
│                                                              │
│  • Poll pending submissions                                  │
│  • Re-fetch transcript from origin source                    │
│  • Compare content_hash(submitted) vs content_hash(refetch)  │
│  • Accept: promote to canonical transcripts table            │
│  • Reject: flag device, log rejection reason                 │
│  • Record provenance for accepted submissions                │
│  • Trigger enrichment pipeline (embeddings, summaries)       │
└──────────────────────────────────────────────────────────────┘
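
An example submission payload with the fields named in the first box of the diagram. All values are illustrative, and the two server-side hash derivations mirror the Edge Function steps shown above:

```python
import hashlib

# Illustrative payload; field names follow the data-flow diagram.
payload = {
    "device_id": "7f0c2a9e-3b1d-4e8f-9a6c-2d5b8e1f4a7c",
    "source_type": "youtube_transcript",
    "source_id": "dQw4w9WgXcQ",
    "source_url": "https://youtube.com/watch?v=dQw4w9WgXcQ",
    "title": "Example video",
    "full_text": "never gonna give you up never gonna let you down",
    "segments": [{"start": 0.0, "text": "never gonna give you up"}],
    "cli_version": "1.0.0",
}

# The Edge Function derives both hashes server-side; the client's
# submitted text is only a hint that gets independently verified.
source_hash = hashlib.sha256(payload["source_url"].encode()).hexdigest()
content_hash = hashlib.sha256(payload["full_text"].encode()).hexdigest()
```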

Database Schema

transcript_submissions

Quarantine table. Every CLI submission lands here first.

Column               Type   Description
id                   uuid   Primary key
device_id            uuid   Anonymous device fingerprint
source_type          text   youtube_transcript or podcast
source_id            text   Video ID or episode ID
source_url           text   Full source URL
source_hash          text   SHA-256 of source_url
content_hash         text   SHA-256 of full_text
status               text   pending / accepted / rejected / duplicate
verification_method  text   refetch_match / consensus / unverified
confidence           float  0.0 to 1.0
ip_hash              text   SHA-256 of submitter IP

transcript_provenance

Junction table tracking every device that submitted a given canonical transcript.

Column         Type         Description
transcript_id  uuid         FK to canonical transcript
device_id      uuid         Contributing device
content_hash   text         Hash submitted by this device
first_seen_at  timestamptz  First submission
last_seen_at   timestamptz  Most recent submission

Installation

# Install from PyPI
pip install getscript

# Or via Homebrew
brew install outerbanks73/tap/getscript

# Or install from source
git clone https://github.com/outerbanks73/cli-tools.git
cd cli-tools
pip install .

# Install man page
sudo cp man/getscript.1 /usr/local/share/man/man1/
man getscript

Requires Python 3.10+. Apple Podcasts transcripts require macOS 15.5+ with Xcode CLI tools.

Configuration

Config file: ~/.config/getscript/config.json

{
  "youtube_api_key": "YOUR_KEY",
  "output_format": "text",
  "timestamps": false,
  "search_limit": 10,
  "no_upload": false
}

Environment variables:

Variable                   Description
GETSCRIPT_YOUTUBE_API_KEY  YouTube API key (for --search)
GETSCRIPT_PROXY            Proxy URL for YouTube
GETSCRIPT_COOKIE_FILE      Netscape cookie file
GETSCRIPT_UPLOAD           Set to 0 to disable submissions
NO_COLOR                   Disable colors

Priority: config file < environment variables < CLI flags.
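
The documented precedence (config file < environment variables < CLI flags) can be sketched as a simple layered merge. `resolve_config` and the defaults dict are illustrative names, not getscript's real internals; only the keys and env vars come from the tables above:

```python
import json
import os
from pathlib import Path

# Defaults mirror the example config file above.
DEFAULTS = {"output_format": "text", "timestamps": False, "search_limit": 10}

def resolve_config(config_path: Path, cli_overrides: dict) -> dict:
    """Layered merge: defaults < config file < env vars < CLI flags."""
    settings = dict(DEFAULTS)
    if config_path.exists():
        settings.update(json.loads(config_path.read_text()))
    if "GETSCRIPT_UPLOAD" in os.environ:
        settings["no_upload"] = os.environ["GETSCRIPT_UPLOAD"] == "0"
    if "GETSCRIPT_YOUTUBE_API_KEY" in os.environ:
        settings["youtube_api_key"] = os.environ["GETSCRIPT_YOUTUBE_API_KEY"]
    # CLI flags win last; unset flags (None) don't override anything.
    settings.update({k: v for k, v in cli_overrides.items() if v is not None})
    return settings
```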