getscript — Technical Overview

Architecture, shared library design, and privacy model

What is getscript?

getscript is a free, open-source CLI that fetches transcripts from Apple Podcasts episodes and YouTube videos. It's the only open-source tool that can programmatically access Apple Podcasts transcripts via FairPlay/AMSMescal authentication. Output goes to stdout, so it can be piped into grep, jq, or awk, or redirected to files.

pip install getscript

# Fetch an Apple Podcasts transcript
getscript "https://podcasts.apple.com/us/podcast/the-daily/id1200361736?i=1000753754819"

# Fetch a YouTube transcript
getscript "https://youtube.com/watch?v=VIDEO_ID"

# JSON output
getscript EPISODE_ID --json | jq '.segments[].text'

# Search Apple Podcasts interactively (requires fzf)
getscript --search "artificial intelligence" --apple

Source code: github.com/outerbanks73/cli-tools

Architecture

getscript is a Python 3.10+ package with no heavy dependencies. Transcript fetching uses Apple's private AMP API (via a compiled Obj-C helper with AMSMescal/FairPlay authentication) for Apple Podcasts and youtube-transcript-api for YouTube.

getscript/
├── cli.py          # Entry point, argument parsing
├── detect.py       # URL/ID source detection (Apple vs YouTube)
├── apple.py        # Apple Podcasts (macOS, Obj-C bearer token, AMSMescal/FairPlay)
├── youtube.py      # YouTube transcript fetching (proxy, cookies)
├── output.py       # Formatters: text, JSON, Markdown, TTML
├── upload.py       # Shared library submission
├── config.py       # XDG config/cache, env vars
├── search.py       # iTunes Search API, YouTube API v3
├── picker.py       # Interactive fzf selection
├── progress.py     # TTY-aware spinner
└── completions.py  # bash/zsh/fish completions

Design Principles

  • Silence is golden. No banners, no welcome messages. Primary data goes to stdout, everything else to stderr.
  • Composable. Output works with Unix pipes. JSON, Markdown, and timestamped text formats available.
  • Fast startup. Heavy imports are lazy-loaded. Target: <100ms to first output.
  • Non-blocking uploads. Shared library submissions happen after output is written. Failures produce a stderr warning — never affect the transcript or exit code.
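
The non-blocking upload principle above can be sketched as follows. This is an illustrative outline, not getscript's actual implementation: `emit_and_submit` and `submit_fn` are hypothetical names, and the bounded join is an assumption about how the CLI avoids hanging on a slow network.

```python
import sys
import threading

def emit_and_submit(transcript_text: str, submit_fn) -> None:
    """Write the transcript to stdout first; submit afterwards so the
    user-visible output and exit code never depend on the upload."""
    sys.stdout.write(transcript_text)
    sys.stdout.flush()

    def _submit():
        try:
            submit_fn(transcript_text)
        except Exception as exc:
            # Any upload failure becomes a stderr warning only -- it can
            # never corrupt stdout or change the exit code.
            print(f"warning: shared-library upload failed: {exc}", file=sys.stderr)

    worker = threading.Thread(target=_submit, daemon=True)
    worker.start()
    worker.join(timeout=5)  # bounded wait so the CLI still exits promptly
```

Because the transcript is flushed before the submission starts, a consumer piping getscript into jq sees identical output whether the upload succeeds, fails, or times out.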

Shared Transcript Library

Every transcript fetched by getscript is automatically submitted to the Voxly shared transcript library. This creates a network effect: as more people use the tool, the library grows, and enrichments (AI summaries, embeddings, entity extraction) become available to all users — including the free tier.

Submission Pipeline

Submissions don't go directly into the canonical library. They go through a quarantine and verification pipeline:

CLI fetch → ingest-transcript Edge Function → transcript_submissions (quarantine)
                                                         │
                                              server-side re-fetch & verify
                                                         │
                                                    ┌────┴────┐
                                                    │         │
                                               accepted    rejected
                                                    │
                                            transcripts (canonical)
                                                    │
                                          transcript_provenance

Verification

A background worker processes the submission queue. For each pending submission, it:

  1. Re-fetches the transcript independently from the original source (YouTube captions API, Apple AMP API)
  2. Computes a content hash of the re-fetched text and compares it to the submitted content hash
  3. If the hashes match exactly (or match after normalizing whitespace), the submission is accepted and promoted to the canonical library
  4. If the hashes diverge significantly, the submission is rejected and the device is flagged
  5. If the source can't be re-fetched (geo-blocked, deleted, rate-limited), the submission stays pending with a lower confidence score

This means the CLI-submitted text is treated as a hint, not as truth. The server independently verifies every submission.
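
The hash-comparison step can be sketched like this. Function names (`normalize`, `verify`) are illustrative, and collapsing whitespace runs is an assumption about what "normalizing whitespace" means here:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace runs so formatting differences don't fail verification."""
    return re.sub(r"\s+", " ", text).strip()

def content_hash(text: str) -> str:
    """SHA-256 hex digest of the transcript text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def verify(submitted: str, refetched: str) -> str:
    """Accept if hashes match raw or after whitespace normalization."""
    if content_hash(submitted) == content_hash(refetched):
        return "accepted"
    if content_hash(normalize(submitted)) == content_hash(normalize(refetched)):
        return "accepted"
    return "rejected"
```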

Deduplication

Source URLs are hashed (SHA-256) to create a source_hash. When a submission arrives for a source that already exists in the canonical library, it's marked as a duplicate and a provenance record is created linking the new device to the existing transcript. The transcript is not re-inserted — but the new submission increases the confidence score.
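
A minimal sketch of the dedup key, assuming the hash is taken over the raw URL string:

```python
import hashlib

def source_hash(source_url: str) -> str:
    """SHA-256 of the source URL, used as the deduplication key."""
    return hashlib.sha256(source_url.encode("utf-8")).hexdigest()

# Two devices submitting the same URL produce the same key, so the
# second submission is marked duplicate instead of re-inserted.
assert source_hash("https://youtube.com/watch?v=abc") == \
       source_hash("https://youtube.com/watch?v=abc")
```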

Provenance Tracking

Every canonical transcript maintains a full provenance chain:

  • Who submitted it — anonymous device IDs (UUID v4), linked to user accounts if the submitter is logged in
  • When — first seen, last seen timestamps per device
  • How many — count of distinct devices that submitted the same source
  • Content integrity — SHA-256 content hash per submission, enabling cross-device comparison
  • Verification method — whether the transcript was verified by server re-fetch, consensus, or is unverified

Device Trust Scoring

Over time, each device builds a trust profile based on its submission history:

  • Accepted-to-rejected ratio (devices with consistent high-quality submissions earn higher trust)
  • Content hash consistency vs server re-fetch (do they submit unmodified transcripts?)
  • Volume patterns (distinguishes normal users from bots)

Trusted devices (e.g., 50+ accepted submissions, 0 rejections) can eventually bypass re-fetch verification, reducing server load while maintaining integrity.
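
A toy version of such a trust heuristic, purely illustrative (the real scoring formula isn't documented here; the Laplace smoothing and function names are assumptions). Only the 50-accepts/0-rejections bypass threshold comes from the text above:

```python
def trust_score(accepted: int, rejected: int) -> float:
    """Accepted-to-rejected ratio with Laplace smoothing,
    so brand-new devices start at a neutral 0.5."""
    return (accepted + 1) / (accepted + rejected + 2)

def can_skip_refetch(accepted: int, rejected: int) -> bool:
    """Trusted devices (50+ accepted, 0 rejected) may bypass re-fetch."""
    return accepted >= 50 and rejected == 0
```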

Privacy Model

getscript is designed for zero-friction usage. No account, no login, no API key required.

  • Device ID: A random UUID v4 generated on first run, stored at ~/.config/getscript/device.json. Not tied to any personal information.
  • IP addresses: Stored only as SHA-256 hashes for rate limiting. Raw IPs are never persisted.
  • Transcript content: Only publicly available transcripts (YouTube captions, Apple Podcasts) are submitted. The content is already public.
  • Opt-out: Use --no-upload or set GETSCRIPT_UPLOAD=0 to disable submissions entirely.
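
The device-ID and opt-out behavior described above could look roughly like this. The file location and env var come from this document; the function names and JSON shape of device.json are assumptions:

```python
import json
import os
import uuid
from pathlib import Path

def load_device_id(config_dir: Path) -> str:
    """Return the anonymous device ID, generating a random UUID v4 on
    first run (documented location: ~/.config/getscript/device.json)."""
    path = config_dir / "device.json"
    if path.exists():
        return json.loads(path.read_text())["device_id"]
    device_id = str(uuid.uuid4())  # random; not derived from personal data
    config_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"device_id": device_id}))
    return device_id

def uploads_enabled(no_upload_flag: bool = False) -> bool:
    """Respect both --no-upload and GETSCRIPT_UPLOAD=0."""
    if no_upload_flag:
        return False
    return os.environ.get("GETSCRIPT_UPLOAD", "1") != "0"
```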

Server Infrastructure

The ingest pipeline runs on Supabase infrastructure:

  • Edge Function (ingest-transcript) — Deno-based, handles payload validation, content hashing, dedup checks, and quarantine insertion. Rate limited to 30 requests/minute per IP.
  • PostgreSQL — transcript_submissions (quarantine), transcripts (canonical), transcript_provenance (tracking). All tables have Row Level Security enabled.
  • pgvector — Embedding storage for semantic search over the canonical library.
  • Background worker — Processes the submission queue: re-fetches transcripts, verifies content hashes, promotes or rejects, updates provenance.

Data Flow Diagram

┌──────────────────────────────────────────────────────────────┐
│  getscript CLI                                               │
│                                                              │
│  1. Detect source (Apple/YouTube) from URL or ID             │
│  2. Fetch transcript from origin (AMP API / captions API)    │
│  3. Format and write to stdout                               │
│  4. Submit to ingest-transcript Edge Function (async)        │
│     └─ payload: device_id, source_type, source_id,           │
│        source_url, title, full_text, segments, cli_version   │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│  Supabase Edge Function: ingest-transcript                   │
│                                                              │
│  • Validate payload (source_type, word count bounds)         │
│  • Compute source_hash (SHA-256 of source_url)               │
│  • Compute content_hash (SHA-256 of full_text)               │
│  • Hash IP for rate limiting                                 │
│  • Check for duplicate submission (same device + source)     │
│  • Check for existing canonical transcript                   │
│  • INSERT into transcript_submissions (quarantine)           │
│  • If canonical exists: mark duplicate, record provenance    │
└──────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌──────────────────────────────────────────────────────────────┐
│  Verification Worker (background)                            │
│                                                              │
│  • Poll pending submissions                                  │
│  • Re-fetch transcript from origin source                    │
│  • Compare content_hash(submitted) vs content_hash(refetch)  │
│  • Accept: promote to canonical transcripts table            │
│  • Reject: flag device, log rejection reason                 │
│  • Record provenance for accepted submissions                │
│  • Trigger enrichment pipeline (embeddings, summaries)       │
└──────────────────────────────────────────────────────────────┘
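
An example submission payload with the fields named in the first box of the diagram. All values are illustrative, and the two server-side hash derivations mirror the Edge Function steps shown above:

```python
import hashlib

# Illustrative payload; field names follow the data-flow diagram.
payload = {
    "device_id": "7f0c2a9e-3b1d-4e8f-9a6c-2d5b8e1f4a7c",
    "source_type": "youtube_transcript",
    "source_id": "dQw4w9WgXcQ",
    "source_url": "https://youtube.com/watch?v=dQw4w9WgXcQ",
    "title": "Example video",
    "full_text": "never gonna give you up never gonna let you down",
    "segments": [{"start": 0.0, "text": "never gonna give you up"}],
    "cli_version": "1.0.0",
}

# The Edge Function derives both hashes server-side; the client's
# submitted text is only a hint that gets independently verified.
source_hash = hashlib.sha256(payload["source_url"].encode()).hexdigest()
content_hash = hashlib.sha256(payload["full_text"].encode()).hexdigest()
```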

Database Schema

transcript_submissions

Quarantine table. Every CLI submission lands here first.

Column               Type   Description
id                   uuid   Primary key
device_id            uuid   Anonymous device fingerprint
source_type          text   youtube_transcript or podcast
source_id            text   Video ID or episode ID
source_url           text   Full source URL
source_hash          text   SHA-256 of source_url
content_hash         text   SHA-256 of full_text
status               text   pending / accepted / rejected / duplicate
verification_method  text   refetch_match / consensus / unverified
confidence           float  0.0 to 1.0
ip_hash              text   SHA-256 of submitter IP

transcript_provenance

Junction table tracking every device that submitted a given canonical transcript.

Column         Type         Description
transcript_id  uuid         FK to canonical transcript
device_id      uuid         Contributing device
content_hash   text         Hash submitted by this device
first_seen_at  timestamptz  First submission
last_seen_at   timestamptz  Most recent submission

Installation

# Install from PyPI
pip install getscript

# Or via Homebrew
brew install outerbanks73/tap/getscript

# Or install from source
git clone https://github.com/outerbanks73/cli-tools.git
cd cli-tools
pip install .

# Install man page
sudo cp man/getscript.1 /usr/local/share/man/man1/
man getscript

Requires Python 3.10+. Apple Podcasts transcripts require macOS 15.5+ with Xcode CLI tools.

Configuration

Config file: ~/.config/getscript/config.json

{
  "youtube_api_key": "YOUR_KEY",
  "output_format": "text",
  "timestamps": false,
  "search_limit": 10,
  "no_upload": false
}

Environment variables:

Variable                   Description
GETSCRIPT_YOUTUBE_API_KEY  YouTube API key (for --search)
GETSCRIPT_PROXY            Proxy URL for YouTube
GETSCRIPT_COOKIE_FILE      Netscape cookie file
GETSCRIPT_UPLOAD           Set to 0 to disable submissions
NO_COLOR                   Disable colors

Priority: config file < environment variables < CLI flags.
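
The documented precedence (config file < environment variables < CLI flags) can be sketched as a simple layered merge. `resolve_config` and the defaults dict are illustrative names, not getscript's real internals; only the keys and env vars come from the tables above:

```python
import json
import os
from pathlib import Path

# Defaults mirror the example config file above.
DEFAULTS = {"output_format": "text", "timestamps": False, "search_limit": 10}

def resolve_config(config_path: Path, cli_overrides: dict) -> dict:
    """Layered merge: defaults < config file < env vars < CLI flags."""
    settings = dict(DEFAULTS)
    if config_path.exists():
        settings.update(json.loads(config_path.read_text()))
    if "GETSCRIPT_UPLOAD" in os.environ:
        settings["no_upload"] = os.environ["GETSCRIPT_UPLOAD"] == "0"
    if "GETSCRIPT_YOUTUBE_API_KEY" in os.environ:
        settings["youtube_api_key"] = os.environ["GETSCRIPT_YOUTUBE_API_KEY"]
    # CLI flags win last; unset flags (None) don't override anything.
    settings.update({k: v for k, v in cli_overrides.items() if v is not None})
    return settings
```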