douban

v0.2.0

The offline mirror: crawl Douban into a local store you can query without the network.

This release adds the mirror: a crawler that builds a local copy of Douban on disk. It keeps both the raw page bytes and a normalized record per entity, paces each host politely, and resumes cleanly after an interruption. The seven lookup commands from v0.1.0 are unchanged.

The full guide is in Mirror Douban offline. The short version is three steps: seed the frontier with URLs to visit, crawl them, then export what you collected.

New commands

  • seed sitemap | ids | url | list fills the frontier from the sitemap (banded by entity type), a contiguous id range, or explicit URLs.
  • crawl drains the pending frontier: it fetches each URL through the source that serves it, archives the raw bytes, records the normalized entity, and enqueues the entity links it finds. It is resumable and never silently caps.
  • export streams the normalized records as JSONL, one object per line.
  • info reports the data dir, record and frontier counts by type, and disk use.
  • queue lists frontier rows filtered by status and type.
  • reset-failed requeues transient failures for another pass.

What it captures

Two sources feed the mirror. Raw HTML serves books, music, games, drama, doulists, groups, people, and most other surfaces. The signed Frodo app API serves the movie, TV, and celebrity detail that the desktop site seals behind a security challenge. Every URL is stored twice: the raw bytes are gzipped and kept verbatim so nothing is lost, and a normalized record goes into SQLite so the catalog is uniform and queryable.
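The raw half of that double storage can be pictured as a lossless gzip round trip. This is only a sketch using Go's standard library: the function names archive and restore are hypothetical, and it does not show how the crawler actually lays files out on disk or writes the SQLite record.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// archive gzips the raw response body without altering it.
func archive(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed data and the gzip footer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// restore returns the original bytes from an archived body.
func restore(archived []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(archived))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

func main() {
	page := []byte("<html><title>example</title></html>")
	gz, err := archive(page)
	if err != nil {
		panic(err)
	}
	back, err := restore(gz)
	if err != nil {
		panic(err)
	}
	// The restored bytes match the fetched bytes exactly.
	fmt.Println(bytes.Equal(page, back)) // prints "true"
}
```

Keeping the untouched bytes means a record can be re-normalized later without fetching the page again.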

In a clean run the mirror now captures every entity type with nothing blocked and nothing failed. The only skips are genuine 404s, such as the events product Douban retired.

The fixes that got coverage to zero blocked

  • Frodo answers only the app. The host rejects any request whose User-Agent does not identify the app, returning invalid_apikey regardless of the key or the IP. The crawler now always sends the app User-Agent for Frodo requests, so movie, TV, celebrity, and personage detail come back as data instead of a block. The key, secret, and User-Agent stay overridable via flags or DOUBAN_FRODO_KEY / DOUBAN_FRODO_SECRET / DOUBAN_FRODO_UA so they keep working if Douban rotates them.
  • Movie and TV share one URL space. A series id requested on the movie endpoint comes back asking for the TV endpoint. The crawler now retries the TV endpoint automatically, so series land as records rather than blocks.
  • Sub-pages collapse to the subject. A subject's /comments, /reviews, and /new_review pages all resolve to one canonical URL, so each entity is fetched once and the login-gated action pages are never requested.
  • A wall is recorded as blocked. An HTTP 403 is now recorded as blocked rather than failed, so reset-failed does not churn on it forever.
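Two of these fixes are easy to picture in code: collapsing sub-pages to the subject URL, and sorting HTTP outcomes into separate statuses. The sketch below is an illustration under assumptions, not the crawler's implementation. It assumes subject URLs have the shape /subject/<id>/<sub-page>, and the names canonicalSubject, classify, and the status values other than "blocked" and "failed" are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// subPages lists the subject sub-pages named in the release notes.
var subPages = map[string]bool{"comments": true, "reviews": true, "new_review": true}

// canonicalSubject collapses /subject/<id>/<sub-page> to /subject/<id>/.
// URLs that do not match that assumed shape are returned as parsed.
func canonicalSubject(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	parts := strings.Split(strings.Trim(u.Path, "/"), "/")
	if len(parts) == 3 && parts[0] == "subject" && subPages[parts[2]] {
		u.Path = "/subject/" + parts[1] + "/"
		u.RawPath = ""
		u.RawQuery = "" // paging parameters belong to the sub-page, not the subject
		u.Fragment = ""
	}
	return u.String(), nil
}

// status is a hypothetical frontier status.
type status string

const (
	statusDone    status = "done"
	statusSkipped status = "skipped" // genuine 404: nothing to fetch
	statusBlocked status = "blocked" // 403 wall: retrying will not help
	statusFailed  status = "failed"  // treated as transient and retryable
)

// classify maps an HTTP status code to a frontier status.
func classify(code int) status {
	switch {
	case code >= 200 && code < 300:
		return statusDone
	case code == http.StatusNotFound:
		return statusSkipped
	case code == http.StatusForbidden:
		return statusBlocked
	default:
		return statusFailed
	}
}

func main() {
	u, err := canonicalSubject("https://book.douban.com/subject/123/comments?start=20")
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
	fmt.Println(classify(403), classify(404), classify(502))
}
```

Keeping blocked apart from failed is what lets a requeue pass touch only the rows that might succeed on a second attempt.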

Build and install

Pure Go, with pure-Go SQLite, so CGO_ENABLED=0 builds stay clean and the binary has no runtime dependencies. Archives, Linux packages, and a GHCR image ship with the release as before.