v0.2.0
The offline mirror: crawl Douban into a local store you can query without the network.
This release adds the mirror: a crawler that builds a local copy of Douban on disk. It keeps both the raw page bytes and a normalized record per entity, paces each host politely, and resumes cleanly after an interruption. The seven lookup commands from v0.1.0 are unchanged.
The full guide is in Mirror Douban offline. The short version is three steps: seed the frontier with URLs to visit, crawl them, then export what you collected.
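The three steps can be pictured with a small in-memory sketch. It is not the mirror's code: the `Frontier` type, its methods, and the fake link graph are hypothetical stand-ins for the on-disk frontier and real network fetches, assumed here only to show how seeding, draining, and collecting fit together.

```go
package main

import "fmt"

// Frontier is a hypothetical in-memory stand-in for the mirror's frontier.
type Frontier struct {
	pending []string
	seen    map[string]bool
}

func NewFrontier() *Frontier { return &Frontier{seen: map[string]bool{}} }

// Seed queues every URL that has not been queued before.
func (f *Frontier) Seed(urls ...string) {
	for _, u := range urls {
		if !f.seen[u] {
			f.seen[u] = true
			f.pending = append(f.pending, u)
		}
	}
}

// Crawl drains the pending queue. fetch returns a record for the URL plus
// the links found on it; new links are enqueued, known ones are ignored.
func (f *Frontier) Crawl(fetch func(url string) (record string, links []string)) []string {
	var records []string
	for len(f.pending) > 0 {
		u := f.pending[0]
		f.pending = f.pending[1:]
		rec, links := fetch(u)
		records = append(records, rec)
		f.Seed(links...)
	}
	return records
}

func main() {
	// Fake link graph standing in for real pages.
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"c"},
		"c": {"a"},
	}

	f := NewFrontier()
	f.Seed("a") // step 1: seed
	records := f.Crawl(func(u string) (string, []string) { // step 2: crawl
		return "record:" + u, graph[u]
	})
	for _, r := range records { // step 3: export
		fmt.Println(r)
	}
}
```

Because the `seen` set outlives the queue, a link that points back to an already visited page does not cause a second fetch.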
New commands
- `seed sitemap | ids | url | list` fills the frontier from the sitemap (banded by entity type), a contiguous id range, or explicit URLs.
- `crawl` drains the pending frontier: it fetches each URL through the source that serves it, archives the raw bytes, records the normalized entity, and enqueues the entity links it finds. It is resumable and never silently caps.
- `export` streams the normalized records as JSONL, one object per line.
- `info` reports the data dir, record and frontier counts by type, and disk use.
- `queue` lists frontier rows filtered by status and type.
- `reset-failed` requeues transient failures for another pass.
What it captures
Two sources feed the mirror. Raw HTML serves books, music, games, drama, doulists, groups, people, and most other surfaces. The signed Frodo app API serves the movie, TV, and celebrity detail that the desktop site seals behind a security challenge. Every URL is stored twice: the raw bytes are kept verbatim and gzipped so nothing is lost, and a normalized record goes into SQLite so the catalog is uniform and queryable.
In a clean run the mirror now captures every entity type with nothing blocked and nothing failed. The only skips are genuine 404s, such as the events product Douban retired.
The fixes that got coverage to zero blocked
- Frodo answers only the app. The host rejects any request whose User-Agent does not identify the app, returning `invalid_apikey` regardless of the key or the IP. The crawler now always sends the app User-Agent for Frodo requests, so movie, TV, celebrity, and personage detail come back as data instead of a block. The key, secret, and User-Agent stay overridable via flags or `DOUBAN_FRODO_KEY` / `DOUBAN_FRODO_SECRET` / `DOUBAN_FRODO_UA` so they keep working if Douban rotates them.
- Movie and TV share one URL space. A series id requested on the movie endpoint comes back asking for the TV endpoint. The crawler now retries the TV endpoint automatically, so series land as records rather than blocks.
- Sub-pages collapse to the subject. A subject's `/comments`, `/reviews`, and `/new_review` pages all resolve to one canonical URL, so each entity is fetched once and the login-gated action pages are never requested.
- A wall is recorded as blocked. An HTTP 403 is now recorded as `blocked` rather than `failed`, so `reset-failed` does not churn on it forever.
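Two of these fixes, sub-page collapsing and the 403 handling, are easy to picture in code. The sketch below is a guess at the shape of the logic, not the crawler's implementation. The `canonical` and `classify` helpers, the `done` and `skipped` status names, the example host, and the choice to drop the query string and keep a trailing slash are all assumptions; only `blocked`, `failed`, and the three sub-page names come from the notes above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// subPages lists the trailing path segments that collapse to the subject.
var subPages = map[string]bool{"comments": true, "reviews": true, "new_review": true}

// canonical drops a trailing sub-page segment so every sub-page maps to the
// subject URL. Dropping the query and fragment is an assumption of this sketch.
func canonical(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.RawQuery, u.Fragment, u.RawPath = "", "", ""
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if n := len(parts); n > 1 && subPages[parts[n-1]] {
		parts = parts[:n-1]
	}
	if parts[0] != "" { // leave an empty or root path untouched
		u.Path = "/" + strings.Join(parts, "/") + "/"
	}
	return u.String(), nil
}

// classify maps an HTTP status to a frontier status. Only "failed" rows are
// meant to be picked up again by a requeue pass.
func classify(status int) string {
	switch {
	case status >= 200 && status < 300:
		return "done" // hypothetical name
	case status == http.StatusForbidden:
		return "blocked" // a wall: retrying would not help
	case status == http.StatusNotFound:
		return "skipped" // hypothetical name for a genuine 404
	default:
		return "failed" // treated as transient in this sketch
	}
}

func main() {
	// Hypothetical URLs on a placeholder host.
	for _, raw := range []string{
		"https://example.invalid/subject/123/comments?start=20",
		"https://example.invalid/subject/123/reviews",
		"https://example.invalid/subject/123/",
	} {
		c, err := canonical(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(c)
	}
	fmt.Println(classify(403), classify(404), classify(503))
}
```

All three example URLs print the same canonical form, which is what lets the frontier fetch each entity once. Keeping `blocked` distinct from `failed` is what stops a requeue pass from retrying a 403 indefinitely.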
Build and install
Pure Go, with pure-Go SQLite, so CGO_ENABLED=0 builds stay clean and the
binary has no runtime dependencies. Archives, Linux packages, and a GHCR image
ship with the release as before.