getscript — Technical Overview
Architecture, shared library design, and privacy model
What is getscript?
getscript is a free, open-source CLI that fetches transcripts from Apple Podcasts episodes and YouTube videos. It's the only open-source tool that can programmatically access Apple Podcasts transcripts via FairPlay/AMSMescal authentication. Output goes to stdout, designed for piping into grep, jq, awk, or redirecting to files.
```shell
pip install getscript

# Fetch an Apple Podcasts transcript
getscript "https://podcasts.apple.com/us/podcast/the-daily/id1200361736?i=1000753754819"

# Fetch a YouTube transcript
getscript "https://youtube.com/watch?v=VIDEO_ID"

# JSON output
getscript EPISODE_ID --json | jq '.segments[].text'

# Search Apple Podcasts interactively (requires fzf)
getscript --search "artificial intelligence" --apple
```
Source code: github.com/outerbanks73/cli-tools · Changelog · Contributing
Architecture
getscript is a Python 3.10+ package with zero heavy dependencies. Transcript fetching uses Apple's private AMP API (via a compiled Obj-C helper with AMSMescal/FairPlay authentication) for Apple Podcasts and youtube-transcript-api for YouTube.
```
getscript/
├── cli.py          # Entry point, argument parsing
├── detect.py       # URL/ID source detection (Apple vs YouTube)
├── apple.py        # Apple Podcasts (macOS, Obj-C bearer token, AMSMescal/FairPlay)
├── youtube.py      # YouTube transcript fetching (proxy, cookies)
├── output.py       # Formatters: text, JSON, Markdown, TTML
├── upload.py       # Shared library submission
├── config.py       # XDG config/cache, env vars
├── search.py       # iTunes Search API, YouTube API v3
├── picker.py       # Interactive fzf selection
├── progress.py     # TTY-aware spinner
└── completions.py  # bash/zsh/fish completions
```
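The source-detection step in detect.py has to accept full URLs from either service as well as bare IDs. A minimal sketch of that dispatch (the function name and heuristics here are illustrative, not the actual implementation):

```python
import re
from urllib.parse import urlparse, parse_qs

def detect_source(arg: str) -> tuple[str, str]:
    """Return (source_type, source_id) for a URL or bare ID. Illustrative only."""
    parsed = urlparse(arg)
    host = parsed.netloc.lower()
    if "podcasts.apple.com" in host:
        # The episode ID is carried in the ?i= query parameter
        episode_id = parse_qs(parsed.query).get("i", [""])[0]
        return ("podcast", episode_id)
    if "youtube.com" in host:
        video_id = parse_qs(parsed.query).get("v", [""])[0]
        return ("youtube_transcript", video_id)
    if "youtu.be" in host:
        return ("youtube_transcript", parsed.path.lstrip("/"))
    # Bare YouTube video IDs are 11 URL-safe characters
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", arg):
        return ("youtube_transcript", arg)
    raise ValueError(f"unrecognized source: {arg}")
```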
Design Principles
- Silence is golden. No banners, no welcome messages. Primary data goes to stdout, everything else to stderr.
- Composable. Output works with Unix pipes. JSON, Markdown, and timestamped text formats available.
- Fast startup. Heavy imports are lazy-loaded. Target: <100ms to first output.
- Non-blocking uploads. Shared library submissions happen after output is written. Failures produce a stderr warning — never affect the transcript or exit code.
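The non-blocking upload principle boils down to an ordering guarantee: stdout is written and flushed before any submission is attempted, and a submission failure only warns. A sketch of that contract (`submit` is a stand-in for the real upload call, not getscript's actual API):

```python
import sys

def emit_and_submit(transcript: str, submit) -> int:
    """Write the transcript to stdout first; only then attempt submission.
    A submission failure warns on stderr but never changes the exit code."""
    sys.stdout.write(transcript)
    sys.stdout.flush()  # transcript delivery is complete before any upload
    try:
        submit(transcript)
    except Exception as exc:
        print(f"warning: shared-library upload failed: {exc}", file=sys.stderr)
    return 0  # exit code reflects the fetch, not the upload
```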
Shared Transcript Library
Every transcript fetched by getscript is automatically submitted to the Voxly shared transcript library. This creates a network effect: as more people use the tool, the library grows, and enrichments (AI summaries, embeddings, entity extraction) become available to all users — including the free tier.
Submission Pipeline
Submissions don't go directly into the canonical library. They go through a quarantine and verification pipeline:
```
CLI fetch → ingest-transcript Edge Function → transcript_submissions (quarantine)
                                                   │
                                     server-side re-fetch & verify
                                                   │
                                              ┌────┴────┐
                                              │         │
                                          accepted   rejected
                                              │
                                  transcripts (canonical)
                                              │
                                  transcript_provenance
```

Verification
A background worker processes the submission queue. For each pending submission, it:
- Re-fetches the transcript independently from the original source (YouTube captions API, Apple AMP API)
- Computes a content hash of the re-fetched text and compares it to the submitted content hash
- If the hashes match (or are close after normalizing whitespace), the submission is accepted and promoted to the canonical library
- If the hashes diverge significantly, the submission is rejected and the device is flagged
- If the source can't be re-fetched (geo-blocked, deleted, rate-limited), the submission stays pending with a lower confidence score
This means the CLI-submitted text is treated as a hint, not as truth. The server independently verifies every submission.
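The hash check above can be sketched as follows. The normalization rule (collapsing whitespace runs before hashing) is an assumption about what "close after normalizing whitespace" means; the real worker may use a different tolerance:

```python
import hashlib

def content_hash(text: str) -> str:
    # Collapse all whitespace runs so formatting differences don't change the hash
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def verify(submitted_text: str, refetched_text: str) -> str:
    """Return the status a verification worker might assign. Illustrative."""
    if content_hash(submitted_text) == content_hash(refetched_text):
        return "accepted"
    return "rejected"
```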
Deduplication
Source URLs are hashed (SHA-256) to create a source_hash. When a submission arrives for a source that already exists in the canonical library, it's marked as a duplicate and a provenance record is created linking the new device to the existing transcript. The transcript is not re-inserted — but the new submission increases the confidence score.
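The dedup key and lookup described above can be sketched like this (the function names are illustrative; note that hashing the raw URL means two different URLs for the same episode would not collide, which is why the canonical source ID also matters):

```python
import hashlib

def source_hash(source_url: str) -> str:
    # SHA-256 of the raw URL string, used as the dedup key
    return hashlib.sha256(source_url.encode("utf-8")).hexdigest()

def classify_submission(source_url: str, canonical_hashes: set[str]) -> str:
    """'duplicate' if the source already exists in the canonical library,
    otherwise 'pending' for the quarantine queue. Sketch of the dedup check only."""
    return "duplicate" if source_hash(source_url) in canonical_hashes else "pending"
```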
Provenance Tracking
Every canonical transcript maintains a full provenance chain:
- Who submitted it — anonymous device IDs (UUID v4), linked to user accounts if the submitter is logged in
- When — first seen, last seen timestamps per device
- How many — count of distinct devices that submitted the same source
- Content integrity — SHA-256 content hash per submission, enabling cross-device comparison
- Verification method — whether the transcript was verified by server re-fetch, consensus, or is unverified
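The per-device first-seen/last-seen bookkeeping amounts to an upsert keyed by (transcript, device). A sketch of that behavior, using an in-memory dict in place of the real `transcript_provenance` table:

```python
from datetime import datetime, timezone

def record_provenance(store: dict, transcript_id: str,
                      device_id: str, content_hash: str) -> dict:
    """Upsert a provenance row keyed by (transcript_id, device_id).
    first_seen_at is set once; last_seen_at advances on every submission."""
    now = datetime.now(timezone.utc)
    key = (transcript_id, device_id)
    row = store.get(key)
    if row is None:
        row = {"content_hash": content_hash, "first_seen_at": now, "last_seen_at": now}
        store[key] = row
    else:
        row["last_seen_at"] = now
        row["content_hash"] = content_hash  # keep the latest hash for comparison
    return row
```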
Device Trust Scoring
Over time, each device builds a trust profile based on its submission history:
- Accepted-to-rejected ratio (devices with consistent high-quality submissions earn higher trust)
- Content hash consistency vs server re-fetch (do they submit unmodified transcripts?)
- Volume patterns (distinguishes normal users from bots)
Trusted devices (e.g., 50+ accepted submissions, 0 rejections) can eventually bypass re-fetch verification, reducing server load while maintaining integrity.
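The bypass rule above is a simple threshold; a score combining it with the acceptance ratio might look like this (the Laplace-smoothed ratio is an assumption for illustration, not the actual scoring formula):

```python
def can_skip_refetch(accepted: int, rejected: int) -> bool:
    """The bypass rule stated above: 50+ accepted submissions, zero rejections.
    Real scoring would also weigh hash consistency and volume patterns."""
    return accepted >= 50 and rejected == 0

def trust_score(accepted: int, rejected: int) -> float:
    # Laplace-smoothed acceptance ratio: new devices start at 0.5,
    # and a single rejection meaningfully dents a small history
    return (accepted + 1) / (accepted + rejected + 2)
```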
Privacy Model
getscript is designed for zero-friction usage. No account, no login, no API key required.
- Device ID: A random UUID v4 generated on first run, stored at `~/.config/getscript/device.json`. Not tied to any personal information.
- IP addresses: Stored only as SHA-256 hashes for rate limiting. Raw IPs are never persisted.
- Transcript content: Only publicly available transcripts (YouTube captions, Apple Podcasts) are submitted. The content is already public.
- Opt-out: Use `--no-upload` or set `GETSCRIPT_UPLOAD=0` to disable submissions entirely.
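The device-ID behavior can be sketched in a few lines (illustrative; error handling and file permissions omitted):

```python
import json
import uuid
from pathlib import Path

def load_or_create_device_id(config_dir: Path) -> str:
    """Read the device ID from device.json, generating it on first run.
    The ID is a random UUID v4, not derived from any hardware identifier."""
    path = config_dir / "device.json"
    if path.exists():
        return json.loads(path.read_text())["device_id"]
    device_id = str(uuid.uuid4())
    config_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"device_id": device_id}))
    return device_id
```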
Server Infrastructure
The ingest pipeline runs on Supabase infrastructure:
- Edge Function (`ingest-transcript`) — Deno-based, handles payload validation, content hashing, dedup checks, and quarantine insertion. Rate limited to 30 requests/minute per IP.
- PostgreSQL — `transcript_submissions` (quarantine), `transcripts` (canonical), `transcript_provenance` (tracking). All tables have Row Level Security enabled.
- pgvector — Embedding storage for semantic search over the canonical library.
- Background worker — Processes the submission queue: re-fetches transcripts, verifies content hashes, promotes or rejects, updates provenance.
Data Flow Diagram
```
┌──────────────────────────────────────────────────────────────┐
│                        getscript CLI                         │
│                                                              │
│  1. Detect source (Apple/YouTube) from URL or ID             │
│  2. Fetch transcript from origin (AMP API / captions API)    │
│  3. Format and write to stdout                               │
│  4. Submit to ingest-transcript Edge Function (async)        │
│     └─ payload: device_id, source_type, source_id,           │
│        source_url, title, full_text, segments, cli_version   │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│          Supabase Edge Function: ingest-transcript           │
│                                                              │
│  • Validate payload (source_type, word count bounds)         │
│  • Compute source_hash (SHA-256 of source_url)               │
│  • Compute content_hash (SHA-256 of full_text)               │
│  • Hash IP for rate limiting                                 │
│  • Check for duplicate submission (same device + source)     │
│  • Check for existing canonical transcript                   │
│  • INSERT into transcript_submissions (quarantine)           │
│  • If canonical exists: mark duplicate, record provenance    │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│               Verification Worker (background)               │
│                                                              │
│  • Poll pending submissions                                  │
│  • Re-fetch transcript from origin source                    │
│  • Compare content_hash(submitted) vs content_hash(refetch)  │
│  • Accept: promote to canonical transcripts table            │
│  • Reject: flag device, log rejection reason                 │
│  • Record provenance for accepted submissions                │
│  • Trigger enrichment pipeline (embeddings, summaries)       │
└──────────────────────────────────────────────────────────────┘
```

Database Schema
transcript_submissions
Quarantine table. Every CLI submission lands here first.
| Column | Type | Description |
|---|---|---|
| id | uuid | Primary key |
| device_id | uuid | Anonymous device fingerprint |
| source_type | text | youtube_transcript or podcast |
| source_id | text | Video ID or episode ID |
| source_url | text | Full source URL |
| source_hash | text | SHA-256 of source_url |
| content_hash | text | SHA-256 of full_text |
| status | text | pending / accepted / rejected / duplicate |
| verification_method | text | refetch_match / consensus / unverified |
| confidence | float | 0.0 to 1.0 |
| ip_hash | text | SHA-256 of submitter IP |
transcript_provenance
Junction table tracking every device that submitted a given canonical transcript.
| Column | Type | Description |
|---|---|---|
| transcript_id | uuid | FK to canonical transcript |
| device_id | uuid | Contributing device |
| content_hash | text | Hash submitted by this device |
| first_seen_at | timestamptz | First submission |
| last_seen_at | timestamptz | Most recent submission |
Installation
```shell
# Install from PyPI
pip install getscript

# Or via Homebrew
brew install outerbanks73/tap/getscript

# Or install from source
git clone https://github.com/outerbanks73/cli-tools.git
cd cli-tools
pip install .

# Install man page
sudo cp man/getscript.1 /usr/local/share/man/man1/
man getscript
```
Requires Python 3.10+. Apple Podcasts transcripts require macOS 15.5+ with Xcode CLI tools.
Configuration
Config file: ~/.config/getscript/config.json
```json
{
  "youtube_api_key": "YOUR_KEY",
  "output_format": "text",
  "timestamps": false,
  "search_limit": 10,
  "no_upload": false
}
```

Environment variables:
| Variable | Description |
|---|---|
| GETSCRIPT_YOUTUBE_API_KEY | YouTube API key (for --search) |
| GETSCRIPT_PROXY | Proxy URL for YouTube |
| GETSCRIPT_COOKIE_FILE | Netscape cookie file |
| GETSCRIPT_UPLOAD | Set to 0 to disable submissions |
| NO_COLOR | Disable colors |
Priority: config file < environment variables < CLI flags.
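The precedence chain can be sketched for a single setting, the upload toggle. This is an illustrative model, assuming the env var and flag only disable (per the table above) and do not re-enable over a config-file `no_upload`:

```python
import json
import os
from pathlib import Path

def resolve_upload_enabled(config_path: Path, argv_no_upload: bool) -> bool:
    """Resolve one setting with the stated precedence:
    config file < environment variable < CLI flag."""
    enabled = True  # built-in default
    if config_path.exists():
        enabled = not json.loads(config_path.read_text()).get("no_upload", False)
    if os.environ.get("GETSCRIPT_UPLOAD") == "0":
        enabled = False  # env var overrides the config file
    if argv_no_upload:
        enabled = False  # --no-upload always wins
    return enabled
```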