Configurable website scraper library in TypeScript. Consumers provide a DownloadOptions config (which includes a ProcessingLifeCycle) and instantiate a downloader to recursively scrape websites to local disk.
- Configurable processing pipeline with hook arrays at every stage
- Single-thread and multi-thread (native
worker_threads) downloaders - HTML, CSS, SVG, and sitemap parsing with automatic link discovery
- CSS
url()extraction and rewriting srcset, Open Graph meta tags, inline styles, and SVGxlink:hrefsupport- Automatic URL-to-relative-path rewriting so saved sites work offline
- Streaming download support for large binary resources
- PQueue-based concurrency with runtime adjustment
- URL deduplication with configurable search-param stripping
- Configurable retry with exponential backoff, jitter, and
Retry-Afterheader support - Local
file://source support for re-processing previously saved sites - Pluggable logging with dedicated categories (
skip,retry,error,notFound, etc.)
npm install website-scrap-engineRequires Node.js >= 18.17.0.
The downloader takes a path (or file:// URL) to a module that default-exports a DownloadOptions object. This pattern allows worker threads to independently load the same configuration.
Step 1: Create an options module (e.g. my-options.js)
import {lifeCycle, options, resource} from 'website-scrap-engine';
const {defaultLifeCycle} = lifeCycle;
const {defaultDownloadOptions} = options;
const {ResourceType} = resource;
const lc = defaultLifeCycle();
// Example: skip binary resources deeper than depth 2
lc.processBeforeDownload.push((res) => {
if (res.depth > 2 && res.type === ResourceType.Binary) return;
return res;
});
export default defaultDownloadOptions({
...lc,
localRoot: '/path/to/save',
maxDepth: 3,
initialUrl: ['https://example.com'],
});Step 2: Create and run the downloader
import path from 'path';
import {downloader} from 'website-scrap-engine';
const {SingleThreadDownloader} = downloader;
const d = new SingleThreadDownloader(
'file://' + path.resolve('my-options.js')
);
d.start();
d.onIdle().then(() => d.dispose());For CPU-intensive workloads, use MultiThreadDownloader instead (see Multi-Thread Processing).
You can also pass override options as the second argument to the downloader constructor, which are merged into the options module's export:
new SingleThreadDownloader('file://' + path.resolve('my-options.js'), {
localRoot: '/different/path',
concurrency: 8,
});The library provides adapter functions in lifeCycle.adapter for common customization patterns:
| Adapter | Stage | Description |
|---|---|---|
skipProcess(fn) |
linkRedirect | Skip URLs matching a predicate |
dropResource(fn) |
processBeforeDownload | Mark matching resources as discard-only (replace link but don't download) |
preProcess(fn) |
processBeforeDownload | Inspect/modify resources before download |
requestRedirect(fn) |
processBeforeDownload | Rewrite the download URL |
localUrlMounts(mounts) |
download | Read matched HTTP(S) URL prefixes from local directories before HTTP fallback |
redirectFilter(fn) |
processAfterDownload | Rewrite or discard redirect URLs |
processHtml(fn) |
processAfterDownload | Transform the parsed HTML (cheerio $) |
processHtmlAsync(fn) |
processAfterDownload | Async version of processHtml |
wrapLegacyGenerateSavePath(fn) |
generateSavePath | Adapt an old full save-path generator to the hook array |
import {lifeCycle} from 'website-scrap-engine';
const lc = lifeCycle.defaultLifeCycle();
// Skip all URLs containing "/api/"
lc.linkRedirect.push(lifeCycle.adapter.skipProcess(
(url) => url.includes('/api/')
));
// Drop images from download but still rewrite their links
lc.processBeforeDownload.push(lifeCycle.adapter.dropResource(
(res) => res.type === ResourceType.Binary && res.url.endsWith('.png')
));Use lifeCycle.adapter.localUrlMounts() to mount static local directories over
HTTP(S) URL prefixes during the download stage. Add it before
downloadResource so local hits short-circuit network fetches while misses can
fall back to HTTP.
const lc = lifeCycle.defaultLifeCycle();
lc.download.unshift(lifeCycle.adapter.localUrlMounts([
{
root: '/mnt/e/example-overlay',
urlPrefix: 'https://example.com/assets/',
priority: 10,
notFound: 'fallback',
},
]));Matching is static prefix routing: higher priority wins, and equal priority
uses the longest URL prefix. For dynamic routing, regexp matching, or custom
overlay logic, use a lifecycle function to rewrite res.downloadLink before
download. Mounted source directories must not overlap the scrape output path;
that aliasing is undefined behavior and is not checked.
Use lifeCycle.generateSavePath to transform where resources are written before
the Resource object is created. The first hook receives the built-in save path;
each later hook receives the previous hook's output. Return undefined to
discard the resource before download.
const lc = lifeCycle.defaultLifeCycle();
lc.generateSavePath.push((savePath, ctx) => {
if (ctx.uri.hostname() === 'cdn.example.com') {
return savePath.replace('cdn.example.com', 'assets');
}
return savePath;
});Resources are processed through a sequential pipeline of hook arrays. Each stage is an array of functions executed in order. Returning void/undefined from any function discards the resource from that stage onward.
init (once per downloader/worker startup)
|
v
URL
|
v
1. linkRedirect -----> skip or redirect URLs before processing
|
v
2. detectResourceType -> determine type (Html, Css, Binary, Svg, SiteMap, etc.)
|
v
3. generateSavePath --> compute/transform the local save path
|
v
4. createResource ----> build a Resource with relative replacement paths
|
v
5. processBeforeDownload -> filter/modify resources; link replacement in parent happens after this
|
v
6. download ----------> fetch resource via HTTP (loop ends early once body is set)
|
v
7. processAfterDownload -> parse content, discover child resources via submit() callback
|
v
8. saveToDisk --------> write to local filesystem
|
v
dispose (once per downloader shutdown / worker exit)
Consumers extend the pipeline by prepending or appending functions to any stage array via defaultLifeCycle(). See Usage for examples.
| Stage | Default handlers |
|---|---|
| linkRedirect | skipLinks - filters out non-HTTP URI schemes (mailto, javascript, data, etc.) |
| detectResourceType | detectResourceType - infers type from element/context |
| generateSavePath | none by default - an empty array uses the built-in URL-to-path mapping |
| createResource | createResource - builds Resource with resolved URL, save path, and replace path |
| download | downloadResource, downloadStreamingResource, readOrCopyLocalResource |
| processAfterDownload | processRedirectedUrl, processHtml, processHtmlMetaRefresh, processSvg, processCss, processSiteMap |
| saveToDisk | saveHtmlToDisk, saveResourceToDisk |
Defined in ResourceType enum:
| Type | Encoding | Description |
|---|---|---|
Binary |
null | Not parsed, saved as-is |
Html |
utf8 | Parsed with cheerio, links discovered and rewritten |
Css |
utf8 | CSS url() references extracted and rewritten |
CssInline |
utf8 | Inline <style> blocks and style attributes |
SiteMap |
utf8 | URLs discovered but not rewritten |
Svg |
utf8 | Parsed with cheerio (same as HTML) |
StreamingBinary |
null | Streamed directly to disk, for large files |
The scraper discovers linked resources from HTML using configurable source definitions. The defaults cover:
- Images:
img[src],img[srcset],picture source[srcset] - Styles:
link[rel="stylesheet"],<style>blocks,[style]attributes - Scripts:
script[src] - Links:
a[href],frame[src],iframe[src] - Media:
video[src],video[poster],audio[src],source[src],track[src] - SVG:
*[xlink:href],*[href] - Meta:
meta[property="og:image"],og:audio,og:videoand their variants - Other:
embed[src],object[data],input[src],[background],link[rel*="icon"],link[rel*="preload"]
Override via options.sources with an array of {selector, attr, type} definitions.
Resource(src/resource.ts) - Central data object carrying URL, save path, replacement path, body, and metadata.RawResourceis the serializable subset used for cross-thread communication.PipelineExecutor(interface insrc/life-cycle/pipeline-executor.ts, impl insrc/downloader/pipeline-executor-impl.ts) - Orchestrates life cycle execution.createAndProcessResource()runs stages 1-5 in one call.AbstractDownloader(src/downloader/main.ts) - Base class with PQueue-based concurrency, URL deduplication, and the download loop.SingleThreadDownloader(src/downloader/single.ts) - Runs all pipeline stages in the main thread.MultiThreadDownloader(src/downloader/multi.ts) - Downloads in main thread, sends to worker pool for post-processing.
Use multi-thread processing when post-download work (HTML/CSS parsing, link discovery) is CPU-intensive.
Main thread:
- Runs the download queue with PQueue concurrency control
- Executes stages 1-6 (linkRedirect through download)
- Transfers downloaded resources to worker threads
- Receives discovered child resources back and enqueues non-duplicates
Worker threads:
- Receive downloaded resources from the main thread
- Execute stages 7-8 (processAfterDownload + saveToDisk)
- Parse HTML/CSS/SVG, discover child resources
- Run stages 1-5 on discovered children to prepare them
- Send prepared child resources back to the main thread as
RawResource[]
Worker count defaults to Math.min(concurrency, workerCount). The worker pool uses a 2-pass water-fill algorithm to balance tasks across workers by load.
Custom worker implementations should use workerData.workerChannels.taskPort
for task/result messages and workerData.workerChannels.logPort for worker
logs. The default parentPort is reserved for lightweight worker control
messages.
On close, custom workers should close both transferred ports and send a
{type: 'closed'} control message on parentPort so the pool can drain queued
logs before disposal completes.
The library exposes dedicated logger categories through a pluggable logger
interface. The default logger writes to console; a log4js adapter is available
for file-based logging.
| Logger | Purpose |
|---|---|
skip |
Resources filtered/discarded at any pipeline stage |
skipExternal |
External resources skipped by scope |
retry |
HTTP retry attempts with backoff details |
error |
Download and processing errors |
notFound |
404 responses |
request / response |
HTTP request/response logging |
complete |
Successfully processed resources |
mkdir |
Directory creation |
adjustConcurrency |
Runtime concurrency changes |
Configure logging via options.createLogger and options.logSubDir.
- cheerio - HTML/SVG parsing and manipulation
- got - HTTP client with retry logic
- p-queue - Download concurrency control
- urijs - URL resolution and path generation
- srcset -
srcsetattribute parsing
ISC