Browse the plugins that ArchiveBox, abx-dl, and other tools in the abx-ecosystem provide to extract content from websites.
yt-dlp
ytdlp
Download video and audio media with metadata, subtitles, thumbnails, and description sidecars.
No plugin dependencies declared.
YTDLP_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=ytdlp 'https://example.com'
Runtime plugins execute while archiving a URL.
YTDLP_ENABLED
Enable video/audio downloading with yt-dlp
true
YTDLP_BINARY
Path to yt-dlp binary
"yt-dlp"
YTDLP_NODE_BINARY
Path to Node.js binary for yt-dlp JS runtime
"node"
YTDLP_TIMEOUT
Timeout for yt-dlp downloads in seconds
3600
YTDLP_COOKIES_FILE
Path to cookies file
""
YTDLP_MAX_SIZE
Maximum file size for yt-dlp downloads
"750m"
YTDLP_CHECK_SSL_VALIDITY
Whether to verify SSL certificates
true
YTDLP_ARGS
Default yt-dlp arguments
[
"--restrict-filenames",
"--trim-filenames=128",
"--write-description",
"--write-info-json",
"--write-thumbnail",
"--write-sub",
"--write-auto-subs",
"--convert-subs=srt",
"--yes-playlist",
"--continue",
"--no-abort-on-error",
"--ignore-errors",
"--geo-bypass",
"--add-metadata",
"--no-progress",
"--remote-components=ejs:github",
"-o",
"%(title)s.%(ext)s"
]
YTDLP_ARGS_EXTRA
Extra arguments to append to yt-dlp command
[]
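The usage line for this plugin sets one variable inline. A minimal sketch of extending that pattern to other scalar settings follows. It assumes every setting listed above can be overridden per invocation through the environment in the same way as YTDLP_ENABLED; the values 200m and 600 are arbitrary examples, not recommendations. The command is only printed, so the sketch runs without ArchiveBox installed.

```shell
# Illustrative values only; the defaults listed above are "750m" and 3600.
max_size="200m"
timeout_s=600
url="https://example.com"

# Build the command line as text; nothing is executed here.
cmd="YTDLP_ENABLED=true YTDLP_MAX_SIZE=$max_size YTDLP_TIMEOUT=$timeout_s archivebox add '$url'"
printf '%s\n' "$cmd"
```

Copying the printed line into a terminal would run the real command, provided the assumption about environment overrides holds for your installation.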
gallery-dl
gallerydl
Download image and media galleries along with metadata sidecars from supported sites.
No plugin dependencies declared.
GALLERYDL_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=gallerydl 'https://example.com'
Runtime plugins execute while archiving a URL.
GALLERYDL_ENABLED
Enable gallery downloading with gallery-dl
true
GALLERYDL_BINARY
Path to gallery-dl binary
"gallery-dl"
GALLERYDL_TIMEOUT
Timeout for gallery downloads in seconds
3600
GALLERYDL_COOKIES_FILE
Path to cookies file
""
GALLERYDL_CHECK_SSL_VALIDITY
Whether to verify SSL certificates
true
GALLERYDL_ARGS
Default gallery-dl arguments
[
"--write-metadata",
"--write-info-json"
]
GALLERYDL_ARGS_EXTRA
Extra arguments to append to gallery-dl command
[]
forum-dl
forumdl
Download forum threads and exports in JSONL, WARC, and mailbox-style archive formats.
No plugin dependencies declared.
FORUMDL_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=forumdl 'https://example.com'
Runtime plugins execute while archiving a URL.
FORUMDL_ENABLED
Enable forum downloading with forum-dl
true
FORUMDL_BINARY
Path to forum-dl binary
"forum-dl"
FORUMDL_TIMEOUT
Timeout for forum downloads in seconds
3600
FORUMDL_OUTPUT_FORMAT
Output format for forum downloads
"jsonl"
FORUMDL_ARGS
Default forum-dl arguments
[]
FORUMDL_ARGS_EXTRA
Extra arguments to append to forum-dl command
[]
Git
git
Clone git repositories from supported repository URLs into the snapshot output directory.
No plugin dependencies declared.
GIT_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=git 'https://example.com'
Runtime plugins execute while archiving a URL.
GIT_ENABLED
Enable git repository cloning
true
GIT_BINARY
Path to git binary
"git"
GIT_TIMEOUT
Timeout for git operations in seconds
120
GIT_DOMAINS
Comma-separated list of domains to treat as git repositories
"github.com,gitlab.com,bitbucket.org,gist.github.com,codeberg.org,gitea.com,git.sr.ht"
GIT_ARGS
Default git arguments
[
"clone",
"--depth=1",
"--recursive"
]
GIT_ARGS_EXTRA
Extra arguments to append to git command
[]
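GIT_DOMAINS is a comma-separated allow-list of hosts. The sketch below shows one way such a list could be checked against a URL. The helper name is_git_url is hypothetical, and exact host matching is an assumption; the plugin's real matching rules are not described here and may differ (for example, in how subdomains are handled).

```shell
# Default value of GIT_DOMAINS as listed above.
GIT_DOMAINS="github.com,gitlab.com,bitbucket.org,gist.github.com,codeberg.org,gitea.com,git.sr.ht"

# is_git_url is a hypothetical helper, not part of the plugin.
# It returns 0 when the URL's host exactly matches one entry in GIT_DOMAINS.
is_git_url() {
    host=${1#*://}      # drop the scheme
    host=${host%%/*}    # drop the path
    host=${host%%:*}    # drop any port
    case ",$GIT_DOMAINS," in
        *",$host,"*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_git_url 'https://github.com/example/repo'; then echo "github.com: git"; fi
if ! is_git_url 'https://example.com/page'; then echo "example.com: not git"; fi
```

Wrapping the list in commas before matching avoids false positives on partial names, such as matching "hub.com" inside "github.com".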
wget
wget
Archive pages and their requisites with wget, optionally writing WARC captures.
No plugin dependencies declared.
WGET_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=wget 'https://example.com'
Runtime plugins execute while archiving a URL.
WGET_ENABLED
Enable wget archiving
true
WGET_WARC_ENABLED
Save WARC archive file
true
WGET_BINARY
Path to wget binary
"wget"
WGET_TIMEOUT
Timeout for wget in seconds
60
WGET_USER_AGENT
User agent string for wget
""
WGET_COOKIES_FILE
Path to cookies file
""
WGET_CHECK_SSL_VALIDITY
Whether to verify SSL certificates
true
WGET_ARGS
Default wget arguments
[
"--no-verbose",
"--adjust-extension",
"--convert-links",
"--force-directories",
"--backup-converted",
"--span-hosts",
"--no-parent",
"--page-requisites",
"--restrict-file-names=windows",
"--tries=2",
"-e",
"robots=off"
]
WGET_ARGS_EXTRA
Extra arguments to append to wget command
[]
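WGET_ARGS_EXTRA is described as being appended to the wget command. A rough sketch of the resulting argument order follows, assuming the extras come directly after the defaults and the URL goes last. The --wait=1 entry is an arbitrary example of an extra flag, and any flags the plugin derives at runtime from the other settings (timeout, user agent, WARC) are deliberately left out. The command is printed rather than executed.

```shell
# Defaults copied from WGET_ARGS above.
set -- --no-verbose --adjust-extension --convert-links --force-directories \
       --backup-converted --span-hosts --no-parent --page-requisites \
       --restrict-file-names=windows --tries=2 -e robots=off

# Hypothetical WGET_ARGS_EXTRA entry, appended after the defaults.
set -- "$@" --wait=1

# Binary first, URL last. Join into one string and print; nothing is fetched.
set -- wget "$@" 'https://example.com'
cmd="$*"
printf '%s\n' "$cmd"
```

Because later wget options generally override earlier ones, appending extras after the defaults lets a user change a default such as --tries without editing WGET_ARGS itself.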
Archive.org
archivedotorg
Submit URLs to the Internet Archive Wayback Machine and save the resulting archive link.
No plugin dependencies declared.
No binary dependencies declared.
ARCHIVEDOTORG_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=archivedotorg 'https://example.com'
Runtime plugins execute while archiving a URL.
ARCHIVEDOTORG_ENABLED
Submit URLs to archive.org Wayback Machine
true
ARCHIVEDOTORG_TIMEOUT
Timeout for archive.org submission in seconds
60
ARCHIVEDOTORG_USER_AGENT
User agent string
""
Favicon
favicon
Fetch and save the site favicon or touch icon.
No plugin dependencies declared.
No binary dependencies declared.
FAVICON_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=favicon 'https://example.com'
Runtime plugins execute while archiving a URL.
FAVICON_ENABLED
Enable favicon downloading
true
FAVICON_TIMEOUT
Timeout for favicon fetch in seconds
30
FAVICON_USER_AGENT
User agent string
""
Modal Closer
modalcloser
Automatically dismiss dialogs, cookie banners, and framework modals while the page is being archived.
No output mimetypes declared.
MODALCLOSER_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=modalcloser 'https://example.com'
Runtime plugins execute while archiving a URL.
MODALCLOSER_ENABLED
Enable automatic modal and dialog closing
true
MODALCLOSER_TIMEOUT
Delay before auto-closing dialogs (ms)
1250
MODALCLOSER_POLL_INTERVAL
How often to check for CSS modals (ms)
500
Console Log
consolelog
Capture browser console messages emitted while the page loads.
CONSOLELOG_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=consolelog 'https://example.com'
Runtime plugins execute while archiving a URL.
CONSOLELOG_ENABLED
Enable console log capture
true
CONSOLELOG_TIMEOUT
Timeout for console log capture in seconds
30
DNS
dns
Record DNS activity observed while loading the page in Chrome.
DNS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=dns 'https://example.com'
Runtime plugins execute while archiving a URL.
DNS_ENABLED
Enable DNS traffic recording during page load
true
DNS_TIMEOUT
Timeout for DNS recording in seconds
30
SSL
ssl
Capture TLS certificate and connection metadata for the loaded page.
SSL_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=ssl 'https://example.com'
Runtime plugins execute while archiving a URL.
SSL_ENABLED
Enable SSL certificate capture
true
SSL_TIMEOUT
Timeout for SSL capture in seconds
30
Responses
responses
Capture HTTP response metadata for requests made during page load.
RESPONSES_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=responses 'https://example.com'
Runtime plugins execute while archiving a URL.
RESPONSES_ENABLED
Enable HTTP response capture
true
RESPONSES_TIMEOUT
Timeout for response capture in seconds
30
Redirects
redirects
Capture the redirect chain encountered while loading the page.
REDIRECTS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=redirects 'https://example.com'
Runtime plugins execute while archiving a URL.
REDIRECTS_ENABLED
Enable redirect chain capture
true
REDIRECTS_TIMEOUT
Timeout for redirect capture in seconds
30
Static File
staticfile
Detect and download static-file responses directly when a URL resolves to a non-HTML asset.
STATICFILE_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=staticfile 'https://example.com'
Runtime plugins execute while archiving a URL.
STATICFILE_ENABLED
Enable static file detection
true
STATICFILE_TIMEOUT
Timeout for static file detection in seconds
30
Headers
headers
Capture HTTP headers for the main document response.
HEADERS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=headers 'https://example.com'
Runtime plugins execute while archiving a URL.
HEADERS_ENABLED
Enable HTTP headers capture
true
HEADERS_TIMEOUT
Timeout for headers capture in seconds
30
Chrome
chrome
Launch and manage a shared Chromium session for browser-driven plugins.
No plugin dependencies declared.
CHROME_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=chrome 'https://example.com'
Runtime plugins execute while archiving a URL.
CHROME_ENABLED
Enable Chromium browser integration for archiving
true
CHROME_BINARY
Path to Chromium binary
"chromium"
CHROME_NODE_BINARY
Path to Node.js binary (for Puppeteer)
"node"
CHROME_TIMEOUT
Timeout for Chrome operations in seconds
60
CHROME_HEADLESS
Run Chrome in headless mode
true
CHROME_SANDBOX
Enable Chrome sandbox (disable in Docker with --no-sandbox)
true
CHROME_RESOLUTION
Browser viewport resolution (width,height)
"1440,2000"
CHROME_USER_DATA_DIR
Path to Chrome user data directory for persistent sessions (derived from ACTIVE_PERSONA if not set)
""
CHROME_USER_AGENT
User agent string for Chrome
""
CHROME_ARGS
Default Chrome command-line arguments (static flags only; dynamic arguments such as --user-data-dir are added at runtime)
[
"--no-first-run",
"--no-default-browser-check",
"--disable-default-apps",
"--disable-sync",
"--disable-infobars",
"--disable-blink-features=AutomationControlled",
"--disable-component-update",
"--disable-domain-reliability",
"--disable-breakpad",
"--disable-client-side-phishing-detection",
"--disable-hang-monitor",
"--disable-speech-synthesis-api",
"--disable-speech-api",
"--disable-print-preview",
"--disable-notifications",
"--disable-desktop-notifications",
"--disable-popup-blocking",
"--disable-prompt-on-repost",
"--disable-external-intent-requests",
"--disable-session-crashed-bubble",
"--disable-search-engine-choice-screen",
"--disable-datasaver-prompt",
"--ash-no-nudges",
"--hide-crash-restore-bubble",
"--suppress-message-center-popups",
"--noerrdialogs",
"--no-pings",
"--silent-debugger-extension-api",
"--deny-permission-prompts",
"--safebrowsing-disable-auto-update",
"--metrics-recording-only",
"--password-store=basic",
"--use-mock-keychain",
"--disable-cookie-encryption",
"--font-render-hinting=none",
"--force-color-profile=srgb",
"--disable-partial-raster",
"--disable-skia-runtime-opts",
"--disable-2d-canvas-clip-aa",
"--enable-webgl",
"--hide-scrollbars",
"--export-tagged-pdf",
"--generate-pdf-document-outline",
"--disable-lazy-loading",
"--disable-renderer-backgrounding",
"--disable-background-networking",
"--disable-background-timer-throttling",
"--disable-backgrounding-occluded-windows",
"--disable-ipc-flooding-protection",
"--disable-extensions-http-throttling",
"--disable-field-trial-config",
"--disable-back-forward-cache",
"--autoplay-policy=no-user-gesture-required",
"--disable-gesture-requirement-for-media-playback",
"--lang=en-US,en;q=0.9",
"--log-level=2",
"--enable-logging=stderr"
]
CHROME_ARGS_EXTRA
Extra arguments to append to Chrome command (for user customization)
[]
CHROME_PAGELOAD_TIMEOUT
Timeout for page navigation/load in seconds
60
CHROME_WAIT_FOR
Page load completion condition (domcontentloaded, load, networkidle0, networkidle2)
"networkidle2"
CHROME_DELAY_AFTER_LOAD
Extra delay in seconds after page load completes before archiving (useful for JS-heavy SPAs)
0
CHROME_CHECK_SSL_VALIDITY
Whether to verify SSL certificates (disable for self-signed certs)
true
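CHROME_RESOLUTION packs the viewport into a single "width,height" string. The sketch below splits that value with plain shell parameter expansion. How the plugin itself applies the value is not stated here; the --window-size line is shown only because it is a standard Chromium switch that takes the same comma-separated form.

```shell
CHROME_RESOLUTION="1440,2000"    # default listed above

width=${CHROME_RESOLUTION%%,*}   # text before the comma
height=${CHROME_RESOLUTION##*,}  # text after the comma

echo "viewport: ${width}x${height}"
echo "flag: --window-size=${width},${height}"
```

The same split applies to SCREENSHOT_RESOLUTION and PDF_RESOLUTION, which use the same format.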
SEO
seo
Capture SEO-related metadata such as meta tags and Open Graph fields.
SEO_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=seo 'https://example.com'
Runtime plugins execute while archiving a URL.
SEO_ENABLED
Enable SEO metadata capture
true
SEO_TIMEOUT
Timeout for SEO capture in seconds
30
Accessibility
accessibility
Capture the browser accessibility tree for the archived page.
ACCESSIBILITY_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=accessibility 'https://example.com'
Runtime plugins execute while archiving a URL.
ACCESSIBILITY_ENABLED
Enable accessibility tree capture
true
ACCESSIBILITY_TIMEOUT
Timeout for accessibility capture in seconds
30
Infinite Scroll
infiniscroll
Expand infinite-scroll pages and load additional content before downstream capture plugins run.
No output mimetypes declared.
INFINISCROLL_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=infiniscroll 'https://example.com'
Runtime plugins execute while archiving a URL.
INFINISCROLL_ENABLED
Enable infinite scroll page expansion
true
INFINISCROLL_TIMEOUT
Maximum timeout for scrolling in seconds
120
INFINISCROLL_SCROLL_DELAY
Delay between scrolls in milliseconds
2000
INFINISCROLL_SCROLL_DISTANCE
Distance to scroll per step in pixels
1600
INFINISCROLL_SCROLL_LIMIT
Maximum number of scroll steps
10
INFINISCROLL_MIN_HEIGHT
Minimum page height to scroll to in pixels
16000
INFINISCROLL_EXPAND_DETAILS
Expand <details> elements
true
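The default scroll settings above relate to each other by simple arithmetic. The back-of-envelope sketch below works it out, assuming one delay per scroll step and ignoring page load time; the plugin's actual stopping logic is not described here.

```shell
# Defaults copied from the list above.
scroll_delay_ms=2000
scroll_distance_px=1600
scroll_limit=10
min_height_px=16000
timeout_s=120

max_scrolled_px=$((scroll_distance_px * scroll_limit))
total_delay_s=$((scroll_delay_ms * scroll_limit / 1000))

echo "distance covered: ${max_scrolled_px}px (min height: ${min_height_px}px)"
echo "time spent waiting: ${total_delay_s}s (timeout: ${timeout_s}s)"
```

With the defaults, ten steps of 1600 px cover exactly the 16000 px minimum height, and the accumulated delay of about 20 seconds sits well inside the 120-second timeout. Raising INFINISCROLL_SCROLL_LIMIT or INFINISCROLL_SCROLL_DELAY substantially may call for a larger INFINISCROLL_TIMEOUT.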
Claude Chrome
claudechrome
Use Claude computer-use to interact with pages in Chrome via CDP screenshots and the Anthropic API.
CLAUDECHROME_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=claudechrome 'https://example.com'
Runtime plugins execute while archiving a URL.
CLAUDECHROME_ENABLED
Enable Claude for Chrome browser extension for AI-driven page interaction
false
CLAUDECHROME_PROMPT
Prompt for Claude to execute on the page. Claude can click buttons, fill forms, download files, and interact with any page element.
"Look at the current page. If there are any \"expand\", \"show more\", \"load more\", or similar buttons/links, click them all to reveal hidden content. Report what you did."
CLAUDECHROME_TIMEOUT
Timeout for Claude for Chrome operations in seconds
120
CLAUDECHROME_MODEL
Claude model to use (e.g. sonnet, opus, haiku). Availability depends on your plan.
"sonnet"
CLAUDECHROME_MAX_ACTIONS
Maximum number of agentic loop iterations (screenshots + actions) per page
15
ANTHROPIC_API_KEY
Anthropic API key for Claude for Chrome authentication
""
SingleFile
singlefile
Save a complete page as a single self-contained HTML file using the SingleFile extension or CLI.
SINGLEFILE_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=singlefile 'https://example.com'
Runtime plugins execute while archiving a URL.
SINGLEFILE_ENABLED
Enable SingleFile archiving
true
SINGLEFILE_BINARY
Path to single-file binary
"single-file"
SINGLEFILE_NODE_BINARY
Path to Node.js binary
"node"
SINGLEFILE_CHROME_BINARY
Path to Chromium binary
""
SINGLEFILE_TIMEOUT
Timeout for SingleFile in seconds
60
SINGLEFILE_USER_AGENT
User agent string
""
SINGLEFILE_COOKIES_FILE
Path to cookies file
""
SINGLEFILE_CHECK_SSL_VALIDITY
Whether to verify SSL certificates
true
SINGLEFILE_CHROME_ARGS
Chrome command-line arguments for SingleFile
[]
SINGLEFILE_ARGS
Default single-file arguments
[
"--browser-headless"
]
SINGLEFILE_ARGS_EXTRA
Extra arguments to append to single-file command
[]
Screenshot
screenshot
Capture a PNG screenshot of the rendered page.
SCREENSHOT_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=screenshot 'https://example.com'
Runtime plugins execute while archiving a URL.
SCREENSHOT_ENABLED
Enable screenshot capture
true
SCREENSHOT_TIMEOUT
Timeout for screenshot capture in seconds
60
SCREENSHOT_RESOLUTION
Screenshot resolution (width,height)
"1440,2000"
pdf
Render the current page to PDF using the shared Chrome session.
PDF_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=pdf 'https://example.com'
Runtime plugins execute while archiving a URL.
PDF_ENABLED
Enable PDF generation
true
PDF_TIMEOUT
Timeout for PDF generation in seconds
60
PDF_RESOLUTION
PDF page resolution (width,height)
"1440,2000"
DOM
dom
Save the fully rendered DOM HTML from the live page.
DOM_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=dom 'https://example.com'
Runtime plugins execute while archiving a URL.
DOM_ENABLED
Enable DOM capture
true
DOM_TIMEOUT
Timeout for DOM capture in seconds
60
Title
title
Capture the final document title from the rendered page.
TITLE_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=title 'https://example.com'
Runtime plugins execute while archiving a URL.
TITLE_ENABLED
Enable title extraction
true
TITLE_TIMEOUT
Timeout for title extraction in seconds
30
Readability
readability
Extract article HTML, text, and metadata using Mozilla Readability.
No plugin dependencies declared.
READABILITY_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=readability 'https://example.com'
Runtime plugins execute while archiving a URL.
READABILITY_ENABLED
Enable Readability text extraction
true
READABILITY_BINARY
Path to readability-extractor binary
"readability-extractor"
READABILITY_TIMEOUT
Timeout for Readability in seconds
30
READABILITY_ARGS
Default Readability arguments
[]
READABILITY_ARGS_EXTRA
Extra arguments to append to Readability command
[]
Defuddle
defuddle
Extract cleaned article HTML, text, and metadata from archived HTML using Defuddle.
No plugin dependencies declared.
DEFUDDLE_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=defuddle 'https://example.com'
Runtime plugins execute while archiving a URL.
DEFUDDLE_ENABLED
Enable Defuddle text extraction
true
DEFUDDLE_BINARY
Path to defuddle binary
"defuddle"
DEFUDDLE_TIMEOUT
Timeout for Defuddle in seconds
30
DEFUDDLE_ARGS
Default Defuddle arguments
[]
DEFUDDLE_ARGS_EXTRA
Extra arguments to append to Defuddle command
[]
Mercury
mercury
Extract article HTML, text, and metadata using the Postlight Mercury parser.
No plugin dependencies declared.
MERCURY_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=mercury 'https://example.com'
Runtime plugins execute while archiving a URL.
MERCURY_ENABLED
Enable Mercury text extraction
true
MERCURY_BINARY
Path to Mercury/Postlight parser binary
"postlight-parser"
MERCURY_TIMEOUT
Timeout for Mercury in seconds
30
MERCURY_ARGS
Default Mercury parser arguments
[]
MERCURY_ARGS_EXTRA
Extra arguments to append to Mercury parser command
[]
Claude Code Extract
claudecodeextract
Use Claude Code to generate clean Markdown from snapshot extractor outputs.
CLAUDECODEEXTRACT_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=claudecodeextract 'https://example.com'
Runtime plugins execute while archiving a URL.
CLAUDECODEEXTRACT_ENABLED
Enable Claude Code AI extraction
false
CLAUDECODEEXTRACT_TIMEOUT
Timeout for Claude Code extraction in seconds
120
CLAUDECODEEXTRACT_PROMPT
Custom prompt for Claude Code extraction. Use this to define what Claude should extract or generate from the snapshot.
"Read all the previously extracted outputs in this snapshot directory (readability/, mercury/, defuddle/, htmltotext/, dom/, singlefile/, etc.). Using the best available source, generate a clean, well-formatted Markdown representation of the page content. Save the output as content.md in your output directory."
CLAUDECODEEXTRACT_MODEL
Claude model to use for extraction (e.g. sonnet, opus, haiku)
"sonnet"
CLAUDECODEEXTRACT_MAX_TURNS
Maximum number of agentic turns for extraction
10
HTML to Text
htmltotext
Convert archived HTML from other extractors into plain text for indexing and analysis.
No plugin dependencies declared.
No binary dependencies declared.
HTMLTOTEXT_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=htmltotext 'https://example.com'
Runtime plugins execute while archiving a URL.
HTMLTOTEXT_ENABLED
Enable HTML to text conversion
true
HTMLTOTEXT_TIMEOUT
Timeout for HTML to text conversion in seconds
30
Trafilatura
trafilatura
Extract article content from archived HTML into text, markdown, HTML, CSV, JSON, and XML formats.
No plugin dependencies declared.
TRAFILATURA_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=trafilatura 'https://example.com'
Runtime plugins execute while archiving a URL.
TRAFILATURA_ENABLED
Enable Trafilatura extraction
true
TRAFILATURA_BINARY
Path to trafilatura binary
"trafilatura"
TRAFILATURA_TIMEOUT
Timeout for Trafilatura in seconds
30
TRAFILATURA_OUTPUT_TXT
Write plain text output (content.txt)
true
TRAFILATURA_OUTPUT_MARKDOWN
Write markdown output (content.md)
true
TRAFILATURA_OUTPUT_HTML
Write HTML output (content.html)
true
TRAFILATURA_OUTPUT_CSV
Write CSV output (content.csv)
false
TRAFILATURA_OUTPUT_JSON
Write JSON output (content.json)
false
TRAFILATURA_OUTPUT_XML
Write XML output (content.xml)
false
TRAFILATURA_OUTPUT_XMLTEI
Write XML TEI output (content.xmltei)
false
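Seven boolean settings control which Trafilatura output files are written. The sketch below derives the resulting file list from the defaults above. The setting-to-filename pairs come from the descriptions in this list; the loop is illustrative and is not the plugin's code.

```shell
# Defaults copied from the list above.
TRAFILATURA_OUTPUT_TXT=true
TRAFILATURA_OUTPUT_MARKDOWN=true
TRAFILATURA_OUTPUT_HTML=true
TRAFILATURA_OUTPUT_CSV=false
TRAFILATURA_OUTPUT_JSON=false
TRAFILATURA_OUTPUT_XML=false
TRAFILATURA_OUTPUT_XMLTEI=false

outputs=""
# Each pair is SETTING_SUFFIX:filename, as documented above.
for pair in TXT:content.txt MARKDOWN:content.md HTML:content.html \
            CSV:content.csv JSON:content.json XML:content.xml XMLTEI:content.xmltei; do
    suffix=${pair%%:*}
    file=${pair#*:}
    # Look up the value of TRAFILATURA_OUTPUT_<suffix>.
    eval "enabled=\${TRAFILATURA_OUTPUT_${suffix}}"
    if [ "$enabled" = "true" ]; then
        outputs="${outputs:+$outputs }$file"
    fi
done

echo "files written by default: $outputs"
```

Setting one of the false entries to true before the loop (for example TRAFILATURA_OUTPUT_JSON=true) adds the matching file to the list.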
papers-dl
papersdl
Fetch downloadable academic papers from paper URLs and DOI targets.
No plugin dependencies declared.
PAPERSDL_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=papersdl 'https://example.com'
Runtime plugins execute while archiving a URL.
PAPERSDL_ENABLED
Enable paper downloading with papers-dl
true
PAPERSDL_BINARY
Path to papers-dl binary
"papers-dl"
PAPERSDL_TIMEOUT
Timeout for paper downloads in seconds
300
PAPERSDL_ARGS
Default papers-dl arguments
[
"fetch"
]
PAPERSDL_ARGS_EXTRA
Extra arguments to append to papers-dl command
[]
Parse HTML URLs
parse_html_urls
Parse HTML documents and emit discovered links as JSONL snapshot records.
No plugin dependencies declared.
No binary dependencies declared.
PARSE_HTML_URLS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_html_urls 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_HTML_URLS_ENABLED
Enable HTML URL parsing
true
Parse Text URLs
parse_txt_urls
Parse plain text documents and emit discovered URLs as JSONL snapshot records.
No plugin dependencies declared.
No binary dependencies declared.
PARSE_TXT_URLS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_txt_urls 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_TXT_URLS_ENABLED
Enable plain text URL parsing
true
Parse RSS URLs
parse_rss_urls
Parse RSS and Atom feeds and emit discovered entry URLs as JSONL snapshot records.
No plugin dependencies declared.
No binary dependencies declared.
PARSE_RSS_URLS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_rss_urls 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_RSS_URLS_ENABLED
Enable RSS/Atom feed URL parsing
true
Parse Netscape URLs
parse_netscape_urls
Parse Netscape bookmark HTML exports and emit discovered URLs as JSONL snapshot records.
No plugin dependencies declared.
No binary dependencies declared.
PARSE_NETSCAPE_URLS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_netscape_urls 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_NETSCAPE_URLS_ENABLED
Enable Netscape bookmarks HTML URL parsing
true
Parse JSONL URLs
parse_jsonl_urls
Parse JSONL bookmark exports and emit discovered URLs as JSONL snapshot records.
No plugin dependencies declared.
No binary dependencies declared.
PARSE_JSONL_URLS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_jsonl_urls 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_JSONL_URLS_ENABLED
Enable JSON Lines URL parsing
true
Parse DOM Outlinks
parse_dom_outlinks
Extract crawlable links from the rendered DOM and emit them as JSONL records.
PARSE_DOM_OUTLINKS_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=parse_dom_outlinks 'https://example.com'
Runtime plugins execute while archiving a URL.
PARSE_DOM_OUTLINKS_ENABLED
Enable DOM outlinks parsing from archived pages
true
PARSE_DOM_OUTLINKS_TIMEOUT
Timeout for DOM outlinks parsing in seconds
30
SQLite Search
search_backend_sqlite
Index archived snapshot content into a SQLite FTS database for local search.
No plugin dependencies declared.
No binary dependencies declared.
archivebox add 'https://example.com'
abx-dl dl --plugins=search_backend_sqlite 'https://example.com'
Runtime plugins execute while archiving a URL.
SEARCH_BACKEND_SQLITE_DB
SQLite FTS database filename
"search.sqlite3"
SEARCH_BACKEND_SQLITE_SEPARATE_DATABASE
Use separate database file for FTS index
true
SEARCH_BACKEND_SQLITE_TOKENIZERS
FTS5 tokenizer configuration
"porter unicode61 remove_diacritics 2"
Sonic Search
search_backend_sonic
Index archived snapshot content into a Sonic search backend.
No plugin dependencies declared.
No binary dependencies declared.
No output mimetypes declared.
archivebox add 'https://example.com'
abx-dl dl --plugins=search_backend_sonic 'https://example.com'
Runtime plugins execute while archiving a URL.
SEARCH_BACKEND_SONIC_HOST_NAME
Sonic server hostname
"127.0.0.1"
SEARCH_BACKEND_SONIC_PORT
Sonic server port
1491
SEARCH_BACKEND_SONIC_PASSWORD
Sonic server password
"SecretPassword"
SEARCH_BACKEND_SONIC_COLLECTION
Sonic collection name
"archivebox"
SEARCH_BACKEND_SONIC_BUCKET
Sonic bucket name
"snapshots"
Claude Code Cleanup
claudecodecleanup
Use Claude Code to deduplicate and clean up redundant snapshot extractor outputs.
CLAUDECODECLEANUP_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=claudecodecleanup 'https://example.com'
Runtime plugins execute while archiving a URL.
CLAUDECODECLEANUP_ENABLED
Enable Claude Code AI cleanup of snapshot files
false
CLAUDECODECLEANUP_TIMEOUT
Timeout for Claude Code cleanup in seconds
120
CLAUDECODECLEANUP_PROMPT
Custom prompt for Claude Code cleanup. Defines what Claude should clean up and how to determine which duplicates to keep.
"Analyze all the extractor output directories in this snapshot. Look for duplicate or redundant outputs across plugins (e.g. multiple HTML extractions, multiple text extractions, multiple URL extraction outputs, etc.). For each group of similar outputs, inspect the content and determine which version is the best quality. Delete the inferior/redundant versions, keeping only the best one. Also remove any unnecessary temporary files, empty directories, or incomplete outputs. Write a summary of what you cleaned up to cleanup_report.txt in your output directory."
CLAUDECODECLEANUP_MODEL
Claude model to use for cleanup (e.g. sonnet, opus, haiku)
"sonnet"
CLAUDECODECLEANUP_MAX_TURNS
Maximum number of agentic turns for cleanup
15
Hashes
hashes
Generate a hash manifest for files produced in the snapshot directory.
No plugin dependencies declared.
No binary dependencies declared.
HASHES_ENABLED=true archivebox add 'https://example.com'
abx-dl dl --plugins=hashes 'https://example.com'
Runtime plugins execute while archiving a URL.
HASHES_ENABLED
Enable Merkle tree hash generation
true
HASHES_TIMEOUT
Timeout for Merkle tree generation in seconds
30
npm
npm
Install binaries from npm packages and expose Node module paths.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install npm
Setup plugins install dependencies or prepare shared runtime state.
Claude Code
claudecode
Run Claude Code AI agent on snapshots to extract, analyze, or transform archived content.
No plugin dependencies declared.
CLAUDECODE_ENABLED=true archivebox init --setup
abx-dl plugins --install claudecode
Setup plugins install dependencies or prepare shared runtime state.
CLAUDECODE_ENABLED
Enable Claude Code AI agent integration. Controls the crawl-time Claude binary install hook; child plugins still need the claudecode plugin installed and a working Claude binary.
false
CLAUDECODE_BINARY
Path to Claude Code CLI binary
"claude"
CLAUDECODE_TIMEOUT
Timeout for Claude Code operations in seconds
120
ANTHROPIC_API_KEY
Anthropic API key for Claude Code authentication
""
CLAUDECODE_MODEL
Claude model to use (e.g. sonnet, opus, haiku)
"sonnet"
CLAUDECODE_MAX_TURNS
Maximum number of agentic turns per invocation
10
ripgrep Search
search_backend_ripgrep
Search archived snapshot files directly with ripgrep instead of maintaining an index.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install search_backend_ripgrep
Setup plugins install dependencies or prepare shared runtime state.
RIPGREP_BINARY
Path to ripgrep binary
"rg"
RIPGREP_TIMEOUT
Search timeout in seconds
90
RIPGREP_ARGS
Default ripgrep arguments
[
"--files-with-matches",
"--no-messages",
"--ignore-case"
]
RIPGREP_ARGS_EXTRA
Extra arguments to append to ripgrep command
[]
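The settings above suggest how a search command might be assembled. The sketch below is an assumption about that assembly, not the plugin's actual code: the argument order, the `--` separator, and the function names `build_rg_command` and `search` are hypothetical. Only the binary name, the default arguments, and the 90-second timeout come from the listing above.

```python
import subprocess

# Defaults copied from RIPGREP_ARGS above.
DEFAULT_ARGS = ("--files-with-matches", "--no-messages", "--ignore-case")


def build_rg_command(query, search_dir, binary="rg", args=DEFAULT_ARGS, extra_args=()):
    """Assemble the argv list: binary, default args, extra args, then pattern and path.

    The "--" ends option parsing so a query starting with "-" is read as a pattern.
    """
    return [binary, *args, *extra_args, "--", query, str(search_dir)]


def search(query, search_dir, timeout=90, **kwargs):
    """Run ripgrep and return the paths of files containing a match.

    Raises subprocess.TimeoutExpired if the search exceeds the timeout.
    """
    cmd = build_rg_command(query, search_dir, **kwargs)
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # With --files-with-matches, ripgrep prints one matching file path per line.
    # Exit code 1 just means "no matches", so stdout is simply empty in that case.
    return [line for line in proc.stdout.splitlines() if line]
```

Because nothing is indexed ahead of time, every query rescans the snapshot files, which is why a search timeout is part of the configuration.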
Puppeteer
puppeteer
Install and manage Chromium through the Puppeteer toolchain.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install puppeteer
Setup plugins install dependencies or prepare shared runtime state.
config.json schema.
uBlock Origin
ublock
Install the uBlock Origin extension to block ads, trackers, and other page clutter during archiving.
No output mimetypes declared.
UBLOCK_ENABLED=true archivebox init --setup
abx-dl plugins --install ublock
Setup plugins install dependencies or prepare shared runtime state.
UBLOCK_ENABLED
Enable uBlock Origin browser extension for ad blocking
true
I Still Don't Care About Cookies
istilldontcareaboutcookies
Install the I Still Don't Care About Cookies extension to dismiss cookie banners during archiving.
No output mimetypes declared.
ISTILLDONTCAREABOUTCOOKIES_ENABLED=true archivebox init --setup
abx-dl plugins --install istilldontcareaboutcookies
Setup plugins install dependencies or prepare shared runtime state.
on_Crawl__81_install_istilldontcareaboutcookies_extension.js
ISTILLDONTCAREABOUTCOOKIES_ENABLED
Enable I Still Don't Care About Cookies browser extension
true
2Captcha
twocaptcha
Install and configure the 2Captcha extension to solve CAPTCHAs during browser-based archiving.
No output mimetypes declared.
TWOCAPTCHA_ENABLED=true archivebox init --setup
abx-dl plugins --install twocaptcha
Setup plugins install dependencies or prepare shared runtime state.
TWOCAPTCHA_ENABLED
Enable the 2Captcha browser extension for automatic CAPTCHA solving
true
TWOCAPTCHA_API_KEY
2Captcha API key for the CAPTCHA solving service (available from https://2captcha.com)
""
TWOCAPTCHA_RETRY_COUNT
Number of times to retry CAPTCHA solving on error
3
TWOCAPTCHA_RETRY_DELAY
Delay in seconds between CAPTCHA solving retries
5
TWOCAPTCHA_TIMEOUT
Timeout for CAPTCHA solving in seconds
60
TWOCAPTCHA_AUTO_SUBMIT
Automatically submit forms after CAPTCHA is solved
false
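The retry settings above can be read as a simple retry loop. The sketch below illustrates one plausible reading and is not how the extension itself is implemented: `solve_with_retries` and the `solve_once` callback are hypothetical, and it assumes TWOCAPTCHA_RETRY_COUNT counts retries after the first attempt and that TWOCAPTCHA_TIMEOUT applies to each attempt.

```python
import time
from typing import Callable, Optional


def solve_with_retries(
    solve_once: Callable[[float], str],
    retry_count: int = 3,      # TWOCAPTCHA_RETRY_COUNT default
    retry_delay: float = 5.0,  # TWOCAPTCHA_RETRY_DELAY default, in seconds
    timeout: float = 60.0,     # TWOCAPTCHA_TIMEOUT default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call solve_once(timeout) until it succeeds or the retries are used up.

    Returns the solver's result, or None if every attempt raised an error.
    """
    attempts = 1 + retry_count  # assumption: one initial try plus N retries
    for attempt in range(attempts):
        try:
            return solve_once(timeout)
        except Exception:
            # Wait between attempts, but not after the final failure.
            if attempt < attempts - 1:
                sleep(retry_delay)
    return None
```

Under this reading, the defaults allow up to four attempts with three five-second pauses between them.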
pip
pip
Install Python-based binaries into a managed virtual environment.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install pip
Setup plugins install dependencies or prepare shared runtime state.
config.json schema.
Homebrew
brew
Install binaries through the Homebrew package manager.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install brew
Setup plugins install dependencies or prepare shared runtime state.
config.json schema.
APT
apt
Install binaries through the Debian and Ubuntu APT package manager.
No plugin dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install apt
Setup plugins install dependencies or prepare shared runtime state.
config.json schema.
Custom
custom
Install binaries using an arbitrary custom shell command.
No plugin dependencies declared.
No binary dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install custom
Setup plugins install dependencies or prepare shared runtime state.
config.json schema.
Environment
env
Discover binaries that are already available on the system PATH.
No plugin dependencies declared.
No binary dependencies declared.
No output mimetypes declared.
archivebox init --setup
abx-dl plugins --install env
Setup plugins install dependencies or prepare shared runtime state.
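Discovering binaries that are already on the system PATH, as the env plugin is described as doing, can be sketched with the standard library. This is an illustration of the general technique, not the plugin's code: `discover_binaries` is a hypothetical helper, and the optional `path` argument exists only so the lookup can be pointed at a specific directory list.

```python
import shutil
from typing import Dict, Iterable, Optional


def discover_binaries(names: Iterable[str], path: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Return the resolved location of each binary, or None if it is not found.

    When path is None, shutil.which searches the PATH environment variable.
    """
    return {name: shutil.which(name, path=path) for name in names}
```

For example, `discover_binaries(["rg", "node", "yt-dlp"])` would report which of the binaries named elsewhere in this reference are already installed, with `None` for any that are missing.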
config.json schema.
Base
base
Provide shared utilities, helpers, and test support used by other plugins.
No plugin dependencies declared.
No binary dependencies declared.
No output mimetypes declared.
archivebox add 'https://example.com'
abx-dl plugins base
Utility plugins are typically consumed indirectly, so the example shows the closest inspection workflow.
config.json schema.
Media
media
Provide a shared namespace for media-related plugin outputs and helpers.
No plugin dependencies declared.
No binary dependencies declared.
No output mimetypes declared.
archivebox add 'https://example.com'
abx-dl plugins media
Utility plugins are typically consumed indirectly, so the example shows the closest inspection workflow.
config.json schema.