Repository: BurntSushi/imdb-rename
Branch: master
Commit: f4180e5d89b5
Files: 46
Total size: 289.3 KB
Directory structure:
gitextract_oovomjyk/
├── .github/
│ ├── FUNDING.yml
│ └── workflows/
│ └── ci.yml
├── .gitignore
├── COPYING
├── Cargo.toml
├── LICENSE-MIT
├── README.md
├── UNLICENSE
├── data/
│ ├── eval/
│ │ └── truth.toml
│ └── test/
│ └── small/
│ ├── title.akas.tsv
│ ├── title.basics.tsv
│ ├── title.episode.tsv
│ └── title.ratings.tsv
├── imdb-eval/
│ ├── COPYING
│ ├── Cargo.toml
│ ├── LICENSE-MIT
│ ├── README.md
│ ├── UNLICENSE
│ └── src/
│ ├── eval.rs
│ ├── logger.rs
│ └── main.rs
├── imdb-index/
│ ├── COPYING
│ ├── Cargo.toml
│ ├── LICENSE-MIT
│ ├── README.md
│ ├── UNLICENSE
│ └── src/
│ ├── error.rs
│ ├── index/
│ │ ├── aka.rs
│ │ ├── episode.rs
│ │ ├── id.rs
│ │ ├── mod.rs
│ │ ├── names.rs
│ │ ├── rating.rs
│ │ ├── tests.rs
│ │ └── writer.rs
│ ├── lib.rs
│ ├── record.rs
│ ├── scored.rs
│ ├── search.rs
│ └── util.rs
├── rustfmt.toml
└── src/
├── download.rs
├── logger.rs
├── main.rs
├── rename.rs
└── util.rs
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/FUNDING.yml
================================================
github: [BurntSushi]
================================================
FILE: .github/workflows/ci.yml
================================================
name: ci
on:
  pull_request:
  push:
    branches:
    - master
  schedule:
  - cron: '00 01 * * *'

# This section is needed to drop the write-all permissions that are granted on
# the `schedule` event. By specifying any permission explicitly, all others are
# set to none. Following the principle of least privilege restricts the damage
# a compromised workflow can do (because of an injection or a compromised
# third party tool or action). Currently the workflow doesn't need any
# permission except for pulling the code. Adding labels to issues, commenting
# on pull requests, etc. may need additional permissions:
#
# Syntax for this section:
# https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#permissions
#
# Reference for how to assign permissions on a job-by-job basis:
# https://docs.github.com/en/actions/using-jobs/assigning-permissions-to-jobs
#
# Reference for available permissions that we can enable if needed:
# https://docs.github.com/en/actions/security-guides/automatic-token-authentication#permissions-for-the-github_token
permissions:
  # to fetch code (actions/checkout)
  contents: read

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        include:
        - build: stable
          os: ubuntu-latest
          rust: stable
        - build: beta
          os: ubuntu-latest
          rust: beta
        - build: nightly
          os: ubuntu-latest
          rust: nightly
        - build: macos
          os: macos-latest
          rust: stable
        - build: win-msvc
          os: windows-latest
          rust: stable
        - build: win-gnu
          os: windows-latest
          rust: stable-x86_64-gnu
    env:
      RUSTFLAGS: -D warnings
      RUST_BACKTRACE: 1
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4
    - name: Install Rust
      uses: dtolnay/rust-toolchain@master
      with:
        toolchain: ${{ matrix.rust }}
    - run: cargo build --all --verbose
    - run: cargo doc --all --verbose
    - run: cargo test --all --verbose
  rustfmt:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout repository
      uses: actions/checkout@v4
    - name: Install Rust
      uses: dtolnay/rust-toolchain@master
      with:
        toolchain: stable
        components: rustfmt
    - name: Check formatting
      run: cargo fmt --all --check
================================================
FILE: .gitignore
================================================
/target
/imdb-eval/target
/imdb-index/target
**/*.rs.bk
tags
/tmp
================================================
FILE: COPYING
================================================
This project is dual-licensed under the Unlicense and MIT licenses.
You may use this code under the terms of either license.
================================================
FILE: Cargo.toml
================================================
[package]
name = "imdb-rename"
version = "0.1.6" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A command line utility for searching IMDb and renaming your media files.
"""
documentation = "https://github.com/BurntSushi/imdb-rename"
homepage = "https://github.com/BurntSushi/imdb-rename"
repository = "https://github.com/BurntSushi/imdb-rename"
readme = "README.md"
keywords = ["imdb", "movie", "index", "search", "name"]
license = "Unlicense/MIT"
edition = "2021"
[workspace]
members = ["imdb-eval", "imdb-index"]
[dependencies]
anyhow = "1.0.75"
bstr = { version = "1.8.0", default-features = false, features = ["std"] }
clap = { version = "2.34.0", default-features = false }
flate2 = "1.0.28"
imdb-index = { version = "0.1.4", path = "imdb-index" }
lazy_static = "1.4.0"
log = { version = "0.4.20", features = ["std"] }
regex = "1.10.2"
tabwriter = "1.3.0"
ureq = { version = "2.9.1", default-features = false, features = ["tls"] }
walkdir = "2.4.0"
[profile.release]
debug = true
================================================
FILE: LICENSE-MIT
================================================
The MIT License (MIT)
Copyright (c) 2015 Andrew Gallant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: README.md
================================================
imdb-rename
===========
A command line tool to rename media files based on titles from IMDb.
imdb-rename downloads the official IMDb data set and creates a local index to
use for fast fuzzy searching.
[Build status](https://travis-ci.org/BurntSushi/imdb-rename)
[Build status (Windows)](https://ci.appveyor.com/project/BurntSushi/imdb-rename)
[crates.io](https://crates.io/crates/imdb-rename)
Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).
### Installation
**[Archives of precompiled binaries for imdb-rename are available for Windows,
macOS and Linux.](https://github.com/BurntSushi/imdb-rename/releases)**
Otherwise, users are expected to compile imdb-rename from source:
```
$ git clone https://github.com/BurntSushi/imdb-rename
$ cd imdb-rename
$ cargo build --release
$ ./target/release/imdb-rename --help
```
Alternatively, if you have
[Cargo installed](https://rustup.rs),
then you can install imdb-rename directly from
[crates.io](https://crates.io):
```
$ cargo install imdb-rename
```
imdb-rename's minimum supported Rust version is **1.28.0**.
#### Arch Linux
An AUR package is available: [imdb-rename](https://aur.archlinux.org/packages/imdb-rename/).
### Quick example
Ever since Season 1 of The Simpsons came out on DVD, I've been collecting them
and ripping them onto my hard drive. My process is somewhat manual, but I
wind up with a directory that looks like this:
```
S18E01.mkv S18E05.mkv S18E09.mkv S18E13.mkv S18E17.mkv S18E21.mkv
S18E02.mkv S18E06.mkv S18E10.mkv S18E14.mkv S18E18.mkv S18E22.mkv
S18E03.mkv S18E07.mkv S18E11.mkv S18E15.mkv S18E19.mkv
S18E04.mkv S18E08.mkv S18E12.mkv S18E16.mkv S18E20.mkv
```
It would be much nicer if these files had their proper episode titles.
imdb-rename can rename these files automatically using episode titles from
IMDb:
```
$ imdb-rename -q 'the simpsons {show}' *.mkv
```
This command ran a query with the `-q` flag to identify the TV show, provided
the files to rename, and... presto!
```
S18E01 - The Mook, the Chef, the Wife and Her Homer.mkv
S18E02 - Jazzy & The Pussycats.mkv
S18E03 - Please Homer, Don't Hammer 'Em.mkv
S18E04 - Treehouse of Horror XVII.mkv
S18E05 - G.I. (Annoyed Grunt).mkv
S18E06 - Moe'N'a Lisa.mkv
S18E07 - Ice Cream of Margie: With the Light Blue Hair.mkv
S18E08 - The Haw-Hawed Couple.mkv
S18E09 - Kill Gil, Vol. 1 & 2.mkv
S18E10 - The Wife Aquatic.mkv
S18E11 - Revenge Is a Dish Best Served Three Times.mkv
S18E12 - Little Big Girl.mkv
S18E13 - Springfield Up.mkv
S18E14 - Yokel Chords.mkv
S18E15 - Rome-old and Juli-eh.mkv
S18E16 - Homerazzi.mkv
S18E17 - Marge Gamer.mkv
S18E18 - The Boys of Bummer.mkv
S18E19 - Crook and Ladder.mkv
S18E20 - Stop or My Dog Will Shoot.mkv
S18E21 - 24 Minutes.mkv
S18E22 - You Kent Always Say What You Want.mkv
```
### Fancier example
imdb-rename isn't limited to just renaming TV episodes based on season/episode
numbers. It can also perform a fuzzy match based on the contents of the
file name. For example, given this file:
```
Thor.Ragnarok.2017.1080p.WEB-DL.DD5.1.H264-FGT.mkv
```
We can "clean it up" and rename it to a nice title like so:
```
$ imdb-rename Thor.Ragnarok.2017.1080p.WEB-DL.DD5.1.H264-FGT.mkv
```
which gives us:
```
Thor: Ragnarok (2017).mkv
```
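The gist of that cleanup can be sketched in a few lines of Rust. This is a
hypothetical illustration of turning a release-style file name into a search
query (`filename_to_query` is not part of imdb-rename's actual parsing logic):

```rust
/// Turn a "scene" release file name into a search query: split on dots and
/// stop at the first token that looks like a year. Everything after the year
/// (resolution, codec, group tag) is noise for search purposes.
fn filename_to_query(name: &str) -> String {
    let stem = name.strip_suffix(".mkv").unwrap_or(name);
    let mut words = Vec::new();
    for tok in stem.split('.') {
        let is_year = tok.len() == 4 && tok.chars().all(|c| c.is_ascii_digit());
        words.push(tok.to_string());
        if is_year {
            // Keep the year itself: it helps disambiguate remakes.
            break;
        }
    }
    words.join(" ")
}

fn main() {
    let q = filename_to_query("Thor.Ragnarok.2017.1080p.WEB-DL.DD5.1.H264-FGT.mkv");
    assert_eq!(q, "Thor Ragnarok 2017");
    println!("{}", q);
}
```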
### Freeform searching
We can also use imdb-rename to search IMDb, which is the default behavior
when a `-q/--query` is provided without any file names:
```
$ imdb-rename -q 'homey loves flanders'
# score id kind title year tv
1 1.000 tt0773646 tvEpisode Homer Loves Flanders 1994 S05E16 The Simpsons
2 0.646 tt2101691 tvEpisode Tiny Loves Flowers N/A S02E08 Dinosaur Train
3 0.568 tt3203408 tvEpisode Courtney Loves Love 2014 S01E05 Courtney Loves Dallas
4 0.561 tt1722576 short In Flanders Fields 2010
5 0.561 tt2253780 tvSeries In Vlaamse Velden 2014
6 0.555 tt4528474 video My Lovely Homeland 2011
7 0.551 tt0220646 tvMovie Moll Flanders 1975
[... results truncated ...]
```
Notice that our query had a typo in it. imdb-rename does its best to find the
most relevant results. It is also fast. Even though the above query searches
through all 6 million names in IMDb, it runs in under 100ms. This is thanks to
an inverted index that is memory-mapped from disk.
### How does it work?
imdb-rename works by downloading
[approved datasets from IMDb](https://www.imdb.com/interfaces/),
and creating an inverted index based on ngrams extracted
from the names in IMDb's data. The inverted index provides a
quick way to search and rank results using techniques from
[information retrieval](https://nlp.stanford.edu/IR-book/)
such as
[Okapi-BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
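The two core ideas can be sketched in a few lines of Rust. This is a toy
illustration of character ngram extraction and the BM25 term score, not the
actual implementation in imdb-index (which has its own ngram types,
normalization and on-disk index format):

```rust
/// Extract lowercase character ngrams of the given size from a name.
fn ngrams(name: &str, size: usize) -> Vec<String> {
    let chars: Vec<char> = name.to_lowercase().chars().collect();
    if chars.len() < size {
        return vec![chars.iter().collect()];
    }
    chars.windows(size).map(|w| w.iter().collect()).collect()
}

/// A toy Okapi BM25 score for a single term, given its term frequency (tf)
/// and document frequency (df). k1 and b are the standard free parameters.
fn bm25_term(tf: f64, df: f64, num_docs: f64, doc_len: f64, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // The "+ 1.0" keeps the idf positive even for very common terms.
    let idf = ((num_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_len))
}

fn main() {
    let grams = ngrams("Thor: Ragnarok", 3);
    assert!(grams.contains(&"tho".to_string()));
    // A rare ngram is worth more than one appearing in half the corpus.
    assert!(
        bm25_term(1.0, 10.0, 1_000_000.0, 14.0, 20.0)
            > bm25_term(1.0, 500_000.0, 1_000_000.0, 14.0, 20.0)
    );
    println!("{} trigrams, e.g. {:?}", grams.len(), &grams[..3]);
}
```

The inverted index then maps each ngram to the list of names containing it, so
a query only has to visit the postings for its own ngrams rather than scan all
names.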
### Motivation
My motivation for building this tool is somewhat idiosyncratic, but three-fold:
1. I find it very convenient to have a tool to rename media files
automatically. imdb-rename is my third iteration on this tool. The first was
an unpublished hodgepodge of Python scripts and a MySQL database. The
second was a
[Go program with a PostgreSQL database](https://github.com/BurntSushi/goim).
The Go program served me well, but IMDb retired their old data format, which
required me to build a new tool for the new data format.
2. I've been working on a low-level information retrieval library off-and-on
for a couple years, and initially built this tool on top of that library as
a form of dogfooding. It didn't work out as well as I'd hoped, so I scrapped
the generic library and built out a specific solution tailored to IMDb. I'm
no longer dogfooding directly, but I've established a useful baseline.
3. I want more people to learn about information retrieval, and I believe this
tool can serve to teach others. In particular, imdb-rename is a complete
end-to-end information retrieval system that is fast, solves a real problem,
is only a few thousand lines of code and comes with a built-in
evaluation that is easy to run.
This tool is perhaps a bit over-engineered, but I had fun with it. Believe it
or not, parts of imdb-rename are intentionally simple at the cost of both query
speed and size on disk!
### Evaluation
It is possible to run an evaluation to compare the various parameters available
for searching. The evaluation system is available as a separate tool called
imdb-eval, which is included in this repository. To use it, we must first build
it:
```
$ git clone https://github.com/BurntSushi/imdb-rename
$ cd imdb-rename
$ cargo build --release --all
$ ./target/release/imdb-eval --help
```
Running an evaluation is simple. We can run an evaluation on all combinations
of scorer and similarity function, along with ngram sizes of 3 and 4 like so:
(This will use truth data that is built into the `imdb-eval` binary.)
```
$ ./target/release/imdb-eval --ngram-size 3 --ngram-size 4 | tee eval.csv
```
This will output the results of running a search on every item in the truth
data. The results include the rank of the expected answer. The results can be
summarized into a single score called the
[Mean Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank)
(which is itself a specific instance of MAP, or mean average precision)
with the `--summarize` flag like so:
```
$ ./target/release/imdb-eval --summarize eval.csv
```
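As a quick illustration, MRR is easy to compute once you have the rank of the
expected answer for each query. This Rust sketch is illustrative only;
`mean_reciprocal_rank` is not part of imdb-eval's API:

```rust
/// Mean Reciprocal Rank over a set of queries: for each query, take the
/// reciprocal of the (1-based) rank at which the expected answer appeared,
/// or 0 if it never appeared, then average over all queries.
fn mean_reciprocal_rank(ranks: &[Option<usize>]) -> f64 {
    let total: f64 = ranks
        .iter()
        .map(|r| r.map_or(0.0, |rank| 1.0 / rank as f64))
        .sum();
    total / ranks.len() as f64
}

fn main() {
    // Three queries: answer found at rank 1, at rank 2, and not found at all.
    let ranks = [Some(1), Some(2), None];
    let mrr = mean_reciprocal_rank(&ranks);
    assert!((mrr - 0.5).abs() < 1e-9);
    println!("MRR = {}", mrr);
}
```

A perfect run (every expected answer ranked first) gives an MRR of 1.0, so
higher is better when comparing parameter combinations.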
If you have [xsv](https://github.com/BurntSushi/xsv) installed, then the
results can be easily sorted and formatted:
```
$ ./target/release/imdb-eval --summarize eval.csv | xsv sort -R -s mrr | xsv table
```
If you want to tweak the truth data, then you might consider starting with the
bundled truth data (assuming you're at the root of the imdb-rename repository):
```
$ $EDITOR data/eval/truth.toml
$ ./target/release/imdb-eval --ngram-size 3 --ngram-size 4 --truth data/eval/truth.toml
```
### What does this tool not do?
imdb-rename is a tool for renaming media files, and to the extent that searching
IMDb facilitates renaming files, it is also a search tool. There is no
intent to develop this further to explore all IMDb data, such as cast/crew
information.
Folks interested in building a different type of IMDb tool may be interested
in the [`imdb-index`](https://docs.rs/imdb-index) crate, which provides
programmatic access to the index created by imdb-rename.
### IMDb licensing
The data used by imdb-rename is retrieved from
[IMDb datasets](https://www.imdb.com/interfaces/).
In particular, imdb-rename will never scrape imdb.com, and only uses the data
provided by IMDb in the `tsv` files.
Additionally, imdb-rename must only be used for non-commercial and personal
use.
================================================
FILE: UNLICENSE
================================================
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <http://unlicense.org/>
================================================
FILE: data/eval/truth.toml
================================================
[[task]]
query = "the matrix"
answer = "tt0133093"
[[task]]
query = "homey the clown"
answer = "tt0701128"
[[task]]
query = "homer loves"
answer = "tt0773646"
[[task]]
query = "the matrix: revolutions"
answer = "tt0242653"
[[task]]
query = "troy"
answer = "tt0332452"
[[task]]
query = "o"
answer = "tt0184791"
[[task]]
query = "love and basketball"
answer = "tt0199725"
[[task]]
query = "the last one"
answer = "tt0583434"
[[task]]
query = "pre-destination"
answer = "tt2397535"
[[task]]
query = "1 magic christmas"
answer = "tt0089731"
[[task]]
query = "xmen the last stand"
answer = "tt0376994"
[[task]]
query = "todliche aura"
answer = "tt0583427"
[[task]]
query = "her"
answer = "tt1798709"
[[task]]
query = "its a wonderful life"
answer = "tt0038650"
[[task]]
query = "jason born"
answer = "tt4196776"
[[task]]
query = "cpt america first avenger"
answer = "tt0458339"
[[task]]
query = "batman vs superman dawn justice"
answer = "tt2975590"
[[task]]
query = "nightmare before christmas"
answer = "tt0107688"
[[task]]
query = "the man from earth"
answer = "tt0756683"
[[task]]
query = "amazing spiderman 2"
answer = "tt1872181"
[[task]]
query = "the revanant"
answer = "tt1663202"
[[task]]
query = "imaginarium of dr"
answer = "tt1054606"
[[task]]
query = "the dark night"
answer = "tt0468569"
[[task]]
query = "the simpsons"
answer = "tt0462538"
[[task]]
query = "into the bad lands"
answer = "tt3865236"
[[task]]
query = "south park bigger"
answer = "tt0158983"
[[task]]
query = "game of shadows sherlock"
answer = "tt1515091"
[[task]]
query = "ragnarok"
answer = "tt3501632"
[[task]]
query = "riddick"
answer = "tt0296572"
[[task]]
query = "voyage dawn treader"
answer = "tt0980970"
[[task]]
query = "phenomonon"
answer = "tt0117333"
[[task]]
query = "ratchet and clank"
answer = "tt2865120"
[[task]]
query = "spiderman homecoming"
answer = "tt2250912"
[[task]]
query = "sixth sense"
answer = "tt0167404"
[[task]]
query = "there will be blood"
answer = "tt0469494"
[[task]]
query = "gangs new york"
answer = "tt0217505"
[[task]]
query = "first avenger"
answer = "tt0458339"
[[task]]
query = "good shepherd"
answer = "tt0343737"
[[task]]
query = "gone with the wind"
answer = "tt0031381"
[[task]]
query = "bourne identity"
answer = "tt0258463"
[[task]]
query = "seinfeld"
answer = "tt0098904"
[[task]]
query = "lincoln"
answer = "tt0443272"
[[task]]
query = "sherlock"
answer = "tt1475582"
[[task]]
query = "skinner's badass song"
answer = "tt0777150"
[[task]]
query = "flying hellish"
answer = "tt0778451"
[[task]]
query = "springfield files"
answer = "tt0701263"
[[task]]
query = "shot mr burns"
answer = "tt0701295"
[[task]]
query = "camp krusty"
answer = "tt0701142"
[[task]]
query = "the monorail"
answer = "tt0701173"
[[task]]
query = "king homer"
answer = "tt0701144"
[[task]]
query = "mr. plow"
answer = "tt0701184"
================================================
FILE: data/test/small/title.akas.tsv
================================================
titleId ordering title region language types attributes isOriginalTitle
tt0096697 10 Simpsonovi SI \N imdbDisplay \N 0
tt0096697 11 Simpsonovi RS \N imdbDisplay \N 0
tt0096697 12 The Simpsons US \N \N \N 0
tt0096697 13 Gia Dinh Simpsons VN \N imdbDisplay \N 0
tt0096697 14 Simpsonovci SK \N \N \N 0
tt0096697 15 Os Simpsons BR \N \N \N 0
tt0096697 16 Simpsons SE \N imdbDisplay \N 0
tt0096697 17 Simpsoni HR \N \N \N 0
tt0096697 18 Simpsoni LV \N imdbDisplay \N 0
tt0096697 19 Die Simpsons XWG \N \N \N 0
tt0096697 1 Los Simpson MX \N \N \N 0
tt0096697 20 Simpsonovi CSHH \N imdbDisplay \N 0
tt0096697 21 Семейство Симпсън BG bg \N \N 0
tt0096697 22 Els Simpson ES ca imdbDisplay \N 0
tt0096697 23 The Simpsons GR \N \N \N 0
tt0096697 24 Сiмпсони UA \N \N \N 0
tt0096697 25 Simpsonid EE \N \N \N 0
tt0096697 26 Los Simpson ES \N imdbDisplay \N 0
tt0096697 27 Simpsonowie PL \N imdbDisplay \N 0
tt0096697 28 Os Simpsons PT \N \N \N 0
tt0096697 29 I Simpson IT \N \N \N 0
tt0096697 2 The Simpsons \N \N original \N 1
tt0096697 30 Les Simpson CA fr \N dubbed version 0
tt0096697 31 Simpsons NO \N \N \N 0
tt0096697 32 A Simpson család HU \N \N \N 0
tt0096697 33 Al shamshoon EG ar \N dubbed version 0
tt0096697 34 Die Simpsons DE \N imdbDisplay \N 0
tt0096697 35 Familia Simpson RO \N \N \N 0
tt0096697 36 Los Simpson PE \N imdbDisplay \N 0
tt0096697 37 Simpsonai LT \N imdbDisplay \N 0
tt0096697 38 Les Simpson FR \N \N \N 0
tt0096697 3 Los Simpson AR \N \N \N 0
tt0096697 4 Симпсоны RU \N \N \N 0
tt0096697 5 Los Simpson VE \N \N \N 0
tt0096697 6 Simpson Ailesi TR tr imdbDisplay \N 0
tt0096697 7 Simpsons DK \N \N \N 0
tt0096697 8 Simpsonit FI \N \N \N 0
tt0096697 9 Simpsonovi CZ \N imdbDisplay \N 0
================================================
FILE: data/test/small/title.basics.tsv
================================================
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0348034 tvEpisode Simpsons Roasting on an Open Fire Simpsons Roasting on an Open Fire 0 1989 \N 30 Animation,Comedy
tt0701059 tvEpisode Bart the General Bart the General 0 1990 \N 30 Animation,Comedy
tt0701060 tvEpisode Bart the Murderer Bart the Murderer 0 1991 \N 30 Animation,Comedy
tt0701062 tvEpisode Bart vs. Thanksgiving Bart vs. Thanksgiving 0 1990 \N 23 Animation,Comedy
tt0701063 tvEpisode Bart's Dog Gets an F Bart's Dog Gets an F 0 1991 \N 23 Animation,Comedy
tt0701064 tvEpisode Bart's Friend Falls in Love Bart's Friend Falls in Love 0 1992 \N 30 Animation,Comedy
tt0701070 tvEpisode Black Widower Black Widower 0 1992 \N 30 Animation,Comedy
tt0701076 tvEpisode Brother, Can You Spare Two Dimes? Brother, Can You Spare Two Dimes? 0 1992 \N 30 Animation,Comedy
tt0701077 tvEpisode Brush with Greatness Brush with Greatness 0 1991 \N 30 Animation,Comedy
tt0701082 tvEpisode Colonel Homer Colonel Homer 0 1992 \N 30 Animation,Comedy
tt0701084 tvEpisode Dancin' Homer Dancin' Homer 0 1990 \N 30 Animation,Comedy
tt0701098 tvEpisode Flaming Moe's Flaming Moe's 0 1991 \N 30 Animation,Comedy
tt0701110 tvEpisode Homer Defined Homer Defined 0 1991 \N 30 Animation,Comedy
tt0701114 tvEpisode Homer at the Bat Homer at the Bat 0 1992 \N 30 Animation,Comedy
tt0701123 tvEpisode Homer's Night Out Homer's Night Out 0 1990 \N 30 Animation,Comedy
tt0701124 tvEpisode Homer's Odyssey Homer's Odyssey 0 1990 \N 30 Animation,Comedy
tt0701140 tvEpisode Itchy and Scratchy and Marge Itchy and Scratchy and Marge 0 1990 \N 23 Animation,Comedy
tt0701147 tvEpisode Krusty Gets Busted Krusty Gets Busted 0 1990 \N 30 Animation,Comedy
tt0701152 tvEpisode Life on the Fast Lane Life on the Fast Lane 0 1990 \N 30 Animation,Comedy
tt0701153 tvEpisode Like Father, Like Clown Like Father, Like Clown 0 1991 \N 30 Animation,Comedy
tt0701161 tvEpisode Lisa's Pony Lisa's Pony 0 1991 \N 30 Animation,Comedy
tt0701164 tvEpisode Lisa's Substitute Lisa's Substitute 0 1991 \N 30 Animation,Comedy
tt0701178 tvEpisode Moaning Lisa Moaning Lisa 0 1990 \N 30 Animation,Comedy
tt0701183 tvEpisode Mr. Lisa Goes to Washington Mr. Lisa Goes to Washington 0 1991 \N 30 Animation,Comedy
tt0701191 tvEpisode Oh Brother, Where Art Thou? Oh Brother, Where Art Thou? 0 1991 \N 23 Animation,Comedy
tt0701192 tvEpisode Old Money Old Money 0 1991 \N 23 Animation,Comedy
tt0701195 tvEpisode One Fish, Two Fish, Blowfish, Blue Fish One Fish, Two Fish, Blowfish, Blue Fish 0 1991 \N 23 Animation,Comedy
tt0701200 tvEpisode Radio Bart Radio Bart 0 1992 \N 30 Animation,Comedy
tt0701204 tvEpisode Separate Vocations Separate Vocations 0 1992 \N 30 Animation,Comedy
tt0701211 tvEpisode Simpson and Delilah Simpson and Delilah 0 1990 \N 23 Animation,Comedy
tt0701215 tvEpisode Some Enchanted Evening Some Enchanted Evening 0 1990 \N 30 Animation,Comedy
tt0701217 tvEpisode Stark Raving Dad Stark Raving Dad 0 1991 \N 30 Animation,Comedy
tt0701228 tvEpisode The Call of the Simpsons The Call of the Simpsons 0 1990 \N 30 Animation,Comedy
tt0701232 tvEpisode The Crepes of Wrath The Crepes of Wrath 0 1990 \N 30 Animation,Comedy
tt0701254 tvEpisode The Otto Show The Otto Show 0 1992 \N 30 Animation,Comedy
tt0701269 tvEpisode The Way We Was The Way We Was 0 1991 \N 23 Animation,Comedy
tt0701275 tvEpisode Three Men and a Comic Book Three Men and a Comic Book 0 1991 \N 30 Animation,Comedy
tt0701278 tvEpisode Treehouse of Horror Treehouse of Horror 0 1990 \N 30 Animation,Comedy
tt0756398 tvEpisode The Telltale Head The Telltale Head 0 1990 \N 30 Animation,Comedy
tt0756399 tvEpisode There's No Disgrace Like Home There's No Disgrace Like Home 0 1990 \N 30 Animation,Comedy
tt0756593 tvEpisode Bart the Genius Bart the Genius 0 1990 \N 30 Animation,Comedy
tt0757017 tvEpisode Bart Gets Hit by a Car Bart Gets Hit by a Car 0 1991 \N 23 Animation,Comedy
tt0757023 tvEpisode Two Cars in Every Garage and Three Eyes on Every Fish Two Cars in Every Garage and Three Eyes on Every Fish 0 1990 \N 23 Animation,Comedy
tt0759267 tvEpisode Treehouse of Horror II Treehouse of Horror II 0 1991 \N 30 Animation,Comedy
tt0763024 tvEpisode Bart Gets an F Bart Gets an F 0 1990 \N 30 Animation,Comedy
tt0763042 tvEpisode When Flanders Failed When Flanders Failed 0 1991 \N 30 Animation,Comedy
tt0766140 tvEpisode The War of the Simpsons The War of the Simpsons 0 1991 \N 30 Animation,Comedy
tt0767438 tvEpisode Bart the Daredevil Bart the Daredevil 0 1990 \N 23 Animation,Comedy
tt0767440 tvEpisode Blood Feud Blood Feud 0 1991 \N 30 Animation,Comedy
tt0767442 tvEpisode Dead Putting Society Dead Putting Society 0 1990 \N 30 Animation,Comedy
tt0767443 tvEpisode Homer vs. Lisa and the 8th Commandment Homer vs. Lisa and the 8th Commandment 0 1991 \N 23 Animation,Comedy
tt0767445 tvEpisode Principal Charming Principal Charming 0 1991 \N 23 Animation,Comedy
tt0768553 tvEpisode Bart the Lover Bart the Lover 0 1992 \N 30 Animation,Comedy
tt0768554 tvEpisode Dog of Death Dog of Death 0 1992 \N 30 Animation,Comedy
tt0768555 tvEpisode Homer Alone Homer Alone 0 1992 \N 30 Animation,Comedy
tt0768556 tvEpisode I Married Marge I Married Marge 0 1991 \N 30 Animation,Comedy
tt0768557 tvEpisode Lisa the Greek Lisa the Greek 0 1992 \N 30 Animation,Comedy
tt0768558 tvEpisode Saturdays of Thunder Saturdays of Thunder 0 1991 \N 30 Animation,Comedy
tt0769743 tvEpisode Burns Verkaufen der Kraftwerk Burns Verkaufen der Kraftwerk 0 1991 \N 30 Animation,Comedy
================================================
FILE: data/test/small/title.episode.tsv
================================================
tconst parentTconst seasonNumber episodeNumber
tt0348034 tt0096697 1 1
tt0701059 tt0096697 1 5
tt0701060 tt0096697 3 4
tt0701062 tt0096697 2 7
tt0701063 tt0096697 2 16
tt0701064 tt0096697 3 23
tt0701070 tt0096697 3 21
tt0701076 tt0096697 3 24
tt0701077 tt0096697 2 18
tt0701082 tt0096697 3 20
tt0701084 tt0096697 2 5
tt0701098 tt0096697 3 10
tt0701110 tt0096697 3 5
tt0701114 tt0096697 3 17
tt0701123 tt0096697 1 10
tt0701124 tt0096697 1 3
tt0701140 tt0096697 2 9
tt0701147 tt0096697 1 12
tt0701152 tt0096697 1 9
tt0701153 tt0096697 3 6
tt0701161 tt0096697 3 8
tt0701164 tt0096697 2 19
tt0701178 tt0096697 1 6
tt0701183 tt0096697 3 2
tt0701191 tt0096697 2 15
tt0701192 tt0096697 2 17
tt0701195 tt0096697 2 11
tt0701200 tt0096697 3 13
tt0701204 tt0096697 3 18
tt0701211 tt0096697 2 2
tt0701215 tt0096697 1 13
tt0701217 tt0096697 3 1
tt0701228 tt0096697 1 7
tt0701232 tt0096697 1 11
tt0701254 tt0096697 3 22
tt0701269 tt0096697 2 12
tt0701275 tt0096697 2 21
tt0701278 tt0096697 2 3
tt0756398 tt0096697 1 8
tt0756399 tt0096697 1 4
tt0756593 tt0096697 1 2
tt0757017 tt0096697 2 10
tt0757023 tt0096697 2 4
tt0759267 tt0096697 3 7
tt0763024 tt0096697 2 1
tt0763042 tt0096697 3 3
tt0766140 tt0096697 2 20
tt0767438 tt0096697 2 8
tt0767440 tt0096697 2 22
tt0767442 tt0096697 2 6
tt0767443 tt0096697 2 13
tt0767445 tt0096697 2 14
tt0768553 tt0096697 3 16
tt0768554 tt0096697 3 19
tt0768555 tt0096697 3 15
tt0768556 tt0096697 3 12
tt0768557 tt0096697 3 14
tt0768558 tt0096697 3 9
tt0769743 tt0096697 3 11
================================================
FILE: data/test/small/title.ratings.tsv
================================================
tconst averageRating numVotes
tt0000001 5.8 1356
tt0000002 6.5 157
tt0000003 6.6 939
tt0000004 6.4 93
tt0000005 6.2 1630
tt0000006 5.6 79
tt0000007 5.5 546
tt0000008 5.6 1454
tt0000009 5.4 62
tt0000010 6.9 4880
tt0000011 5.4 193
tt0000012 7.4 8102
tt0000013 5.7 1239
tt0000014 7.2 3542
tt0000015 6.2 606
tt0000016 5.9 922
tt0000017 4.8 181
tt0000018 5.5 389
tt0000019 6.7 12
tt0000020 5.1 219
tt0000022 5.1 703
tt0000023 5.7 875
tt0000024 5.8 18
tt0000025 5.0 14
tt0000026 5.7 1086
================================================
FILE: imdb-eval/COPYING
================================================
This project is dual-licensed under the Unlicense and MIT licenses.
You may use this code under the terms of either license.
================================================
FILE: imdb-eval/Cargo.toml
================================================
[package]
name = "imdb-eval"
version = "0.1.2"
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A command line utility for evaluating the IMDb name index.
"""
documentation = "https://github.com/BurntSushi/imdb-rename"
homepage = "https://github.com/BurntSushi/imdb-rename"
repository = "https://github.com/BurntSushi/imdb-rename"
readme = "README.md"
keywords = ["imdb", "index", "search", "name", "evaluation"]
license = "Unlicense/MIT"
edition = "2021"
[dependencies]
anyhow = "1.0.75"
clap = { version = "2.34.0", default-features = false }
csv = "1.3.0"
imdb-index = { version = "0.1.4", path = "../imdb-index" }
lazy_static = "1.4.0"
log = { version = "0.4.20", features = ["std"] }
serde = { version = "1.0.193", features = ["derive"] }
toml = "0.8.8"
================================================
FILE: imdb-eval/LICENSE-MIT
================================================
The MIT License (MIT)
Copyright (c) 2015 Andrew Gallant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: imdb-eval/README.md
================================================
imdb-eval
=========
A command line tool for evaluating imdb-rename's search functionality.
[Build status](https://travis-ci.org/BurntSushi/imdb-rename)
[Build status (Windows)](https://ci.appveyor.com/project/BurntSushi/imdb-rename)
[crates.io](https://crates.io/crates/imdb-rename)
### Installation
No release binaries are provided for imdb-eval. Instead, users should compile
it from source:
```
$ git clone https://github.com/BurntSushi/imdb-rename
$ cd imdb-rename
$ cargo build --release --all
$ ./target/release/imdb-eval --help
```
For more details on how to use imdb-eval, please see imdb-rename's README.
================================================
FILE: imdb-eval/UNLICENSE
================================================
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <http://unlicense.org/>
================================================
FILE: imdb-eval/src/eval.rs
================================================
use std::collections::BTreeMap;
use std::fmt;
use std::fs::File;
use std::io::Read;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};
use std::vec;
use imdb_index::{
Index, IndexBuilder, MediaEntity, NameScorer, NgramType, Query, Searcher,
Similarity,
};
use lazy_static::lazy_static;
use serde::{Deserialize, Serialize};
/// The default truth data used in an evaluation. It's small enough that we
/// embed it directly into the binary.
const TRUTH_DATA: &str = include_str!("../../data/eval/truth.toml");
lazy_static! {
/// A structured representation of the default truth data.
static ref TRUTH: Truth = toml::from_str(TRUTH_DATA).unwrap();
}
/// The truth data for our evaluation.
///
/// The truth data consists of a set of information needs that we call "tasks."
#[derive(Clone, Debug, Deserialize)]
struct Truth {
#[serde(rename = "task")]
tasks: Vec<Task>,
}
/// A task or "information need" defined by the truth data. Each task
/// corresponds to a query that we feed to the name index, and each task has a
/// single correct answer.
#[derive(Clone, Debug, Deserialize)]
struct Task {
query: String,
answer: String,
}
impl Truth {
/// Load truth data from the given TOML file.
fn from_path<P: AsRef<Path>>(path: P) -> anyhow::Result<Truth> {
let path = path.as_ref();
let mut contents = String::new();
File::open(path)?.read_to_string(&mut contents)?;
Ok(toml::from_str(&contents)?)
}
}
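// As a sketch of the layout that `Truth` deserializes (a TOML array of
// tables named `task`, per the serde rename above), a truth file looks
// like the following. The query/answer values are illustrative only:
//
// ```toml
// # Each [[task]] pairs a free-form query with the IMDb title ID
// # expected as its single correct answer.
// [[task]]
// query = "the matrix 1999"
// answer = "tt0133093"
//
// [[task]]
// query = "some other film"
// answer = "tt0000001"
// ```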
/// A specification for running an evaluation. Fundamentally, a specification
/// describes the thing we want to evaluate, where the thing we want to
/// evaluate is a specific configuration of how we build *and* search an IMDb
/// index.
///
/// A specification describes both how the index should be built and how
/// queries should be generated. Specifications with equivalent index settings
/// may reuse the same on-disk index. For example, the ngram size and type are
/// index settings, but the similarity function, name scorer and result size
/// are all query time settings.
///
/// A specification cannot itself produce a complete query. Namely, a
/// specification requires an information need (called a "task") to construct
/// a query specific to that need. The results of that query are then compared
/// with that information need's answer to determine the score, which is,
/// invariably, a reflection of how well the configuration given by this
/// specification performs.
#[derive(Clone, Debug, Eq, PartialEq)]
pub struct Spec {
result_size: usize,
ngram_size: usize,
ngram_type: NgramType,
sim: Similarity,
scorer: Option<NameScorer>,
}
impl Spec {
/// Create a new spec using a default configuration.
pub fn new() -> Spec {
Spec {
result_size: 30,
ngram_size: 3,
ngram_type: NgramType::default(),
sim: Similarity::None,
scorer: Some(NameScorer::OkapiBM25),
}
}
/// Set the result size for this specification.
///
/// This returns an error if the given size is less than `1`.
pub fn with_result_size(
mut self,
result_size: usize,
) -> anyhow::Result<Spec> {
if result_size < 1 {
anyhow::bail!(
"result size {} is invalid, must be greater than 0",
result_size
);
}
self.result_size = result_size;
Ok(self)
}
/// Set the ngram size for this specification.
///
/// This returns an error if the given size is less than `2`.
pub fn with_ngram_size(
mut self,
ngram_size: usize,
) -> anyhow::Result<Spec> {
if ngram_size < 2 {
anyhow::bail!(
"ngram size {} is invalid, must be greater than 1",
ngram_size,
);
}
self.ngram_size = ngram_size;
Ok(self)
}
/// Set the ngram type for this specification.
pub fn with_ngram_type(mut self, ngram_type: NgramType) -> Spec {
self.ngram_type = ngram_type;
self
}
/// Set the similarity ranker function for this specification.
pub fn with_similarity(mut self, sim: Similarity) -> Spec {
self.sim = sim;
self
}
/// Set the name scorer for this specification.
///
/// Note that if the given scorer is `None`, then an evaluation will likely
/// be quite slow, since each information need will result in an exhaustive
/// search of the corpus.
pub fn with_scorer(mut self, scorer: Option<NameScorer>) -> Spec {
self.scorer = scorer;
self
}
/// Evaluate this specification against the built-in truth data.
pub fn evaluate<P1: AsRef<Path>, P2: AsRef<Path>>(
&self,
data_dir: P1,
eval_dir: P2,
) -> anyhow::Result<Evaluation> {
let searcher = Searcher::new(self.index(data_dir, eval_dir)?);
Ok(Evaluation {
evaluator: Evaluator { spec: self, searcher },
tasks: TRUTH.clone().tasks.into_iter(),
})
}
/// Evaluate this specification against a set of truth data at the given
/// file path.
pub fn evaluate_with<P1: AsRef<Path>, P2: AsRef<Path>, P3: AsRef<Path>>(
&self,
data_dir: P1,
eval_dir: P2,
truth_path: P3,
) -> anyhow::Result<Evaluation> {
let searcher = Searcher::new(self.index(data_dir, eval_dir)?);
Ok(Evaluation {
evaluator: Evaluator { spec: self, searcher },
tasks: Truth::from_path(truth_path)?.tasks.into_iter(),
})
}
/// Create a query derived from this specification and a particular
/// information need or "task."
fn query(&self, task: &Task) -> Query {
Query::new()
.name(&task.query)
.name_scorer(self.scorer.clone())
.similarity(self.sim.clone())
.size(self.result_size)
}
/// Either open or create an index suitable for this specification.
///
/// If no index exists in the expected sub-directory of `eval_dir`, then
/// a new index is created.
fn index<P1: AsRef<Path>, P2: AsRef<Path>>(
&self,
data_dir: P1,
eval_dir: P2,
) -> anyhow::Result<Index> {
let index_dir = self.index_dir(eval_dir.as_ref());
Ok(if index_dir.exists() {
Index::open(data_dir, index_dir)?
} else {
IndexBuilder::new()
.ngram_size(self.ngram_size)
.ngram_type(self.ngram_type)
.create(data_dir, index_dir)?
})
}
/// The sub-directory of `eval_dir` in which to store this specification's
/// index.
fn index_dir<P: AsRef<Path>>(&self, eval_dir: P) -> PathBuf {
eval_dir.as_ref().join(self.index_name())
}
/// The expected name of the index for this evaluation specification.
///
/// The name of the index is derived specifically from this specification's
/// index-time settings, such as the ngram size. This permits multiple
/// distinct specifications to reuse the same index.
fn index_name(&self) -> String {
format!("ngram-{}_ngram-type-{}", self.ngram_size, self.ngram_type)
}
}
impl Default for Spec {
fn default() -> Spec {
Spec::new()
}
}
impl fmt::Display for Spec {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
let scorer = match self.scorer {
None => "none".to_string(),
Some(ref scorer) => scorer.to_string(),
};
write!(
f,
"size-{}_ngram-{}_ngram-type-{}_sim-{}_scorer-{}",
self.result_size,
self.ngram_size,
self.ngram_type,
self.sim,
scorer,
)
}
}
/// A summary of the results of evaluating every information need or "task" for
/// a single evaluation specification. The summary boils the quality of the
/// specification down to two figures: the mean reciprocal rank and the ratio
/// of tasks that produced an answer.
///
/// The mean reciprocal rank measures the average precision of the
/// specification. That is, it measures how well we answer the following
/// question: "If your search produced the correct answer, how highly was it
/// ranked?"
///
/// The ratio of tasks that produced an answer measures how well we answer the
/// following question: "Of the searches run, how many of them produced the
/// correct result at any rank?"
///
/// Implicit in the evaluation is the notion of a bounded number of results.
/// That is, every specification dictates the maximum number of results
/// returned by a search. If the answer isn't in that result set, then we stop
/// there and declare that the answer wasn't found.
///
/// The reason for using two different scores is so that they counterbalance
/// each other. Namely, a specification that does really well on a smaller
/// number of results might end up with a higher MRR than other specifications,
/// but will have a lower ratio of successful searches.
#[derive(Debug, Deserialize, Serialize)]
pub struct Summary {
/// The specification name that this result is summarizing.
pub name: String,
/// Mean reciprocal rank.
pub mrr: f64,
/// The ratio of tasks that found an answer. The higher the better.
pub found: f64,
}
impl Summary {
/// Returns a group of summaries for all distinct specifications found
/// in the batch of results given.
///
/// If no results are given, then no summaries are returned.
pub fn from_task_results(results: &[TaskResult]) -> Vec<Summary> {
let mut grouped: BTreeMap<&str, Vec<&TaskResult>> = BTreeMap::new();
for result in results {
grouped.entry(&result.name).or_insert(vec![]).push(result);
}
let mut summaries = vec![];
for results in grouped.values() {
summaries.push(Summary::from_same_task_results(results));
}
summaries
}
/// Returns a summary for a single group of task results. All the results
/// given must have the same name, otherwise this panics. This also panics
/// if the given results are empty.
fn from_same_task_results(results: &[&TaskResult]) -> Summary {
assert!(!results.is_empty());
assert!(results.iter().all(|r| results[0].name == r.name));
let mut precision_sum = 0.0;
let mut found = 0u64;
for r in results {
precision_sum += r.rank.map_or(0.0, |rank| 1.0 / (rank as f64));
if r.rank.is_some() {
found += 1;
}
}
Summary {
name: results[0].name.clone(),
mrr: precision_sum / (results.len() as f64),
found: (found as f64) / (results.len() as f64),
}
}
}
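The two summary figures described on `Summary` can be sketched in isolation. The `summarize` helper below is hypothetical (not part of this crate) and operates on a plain list of answer ranks, with `None` meaning the answer never appeared in the results:

```rust
// Hypothetical standalone sketch of the MRR and found-ratio arithmetic
// performed by `Summary::from_same_task_results` above.
fn summarize(ranks: &[Option<u64>]) -> (f64, f64) {
    let mut precision_sum = 0.0;
    let mut found = 0u64;
    for r in ranks {
        // A missing answer contributes a reciprocal rank of 0.
        precision_sum += r.map_or(0.0, |rank| 1.0 / (rank as f64));
        if r.is_some() {
            found += 1;
        }
    }
    let n = ranks.len() as f64;
    // (mean reciprocal rank, ratio of tasks that found an answer)
    (precision_sum / n, (found as f64) / n)
}

fn main() {
    // Ranks 1, 2 and 4 plus one miss: MRR = (1 + 0.5 + 0 + 0.25) / 4.
    let (mrr, found) = summarize(&[Some(1), Some(2), None, Some(4)]);
    assert_eq!(mrr, 0.4375);
    assert_eq!(found, 0.75);
}
```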
/// The result of evaluating a single information need or "task."
#[derive(Debug, Deserialize, Serialize)]
pub struct TaskResult {
/// The name of the evaluation's spec. This name includes all of the
/// parameters that influence the evaluation, such as ngram size,
/// similarity function, etc.
pub name: String,
/// The freeform text query, which represents a specific manifestation of
/// this information need. Generally speaking, this corresponds to the
/// query that an end user will type.
pub query: String,
/// The IMDb identifier corresponding to a singular answer expected by an
/// end user.
pub answer: String,
/// If the answer appears in the search results, then this corresponds to
/// the rank of that search result. The rank is determined by the answer's
/// absolute position in the list of ranked search results.
///
/// Ties in the ranked list are handled by assigning the maximum possible
/// rank to each search result with the same score. For example, if we
/// request 30 results and the answer is incidentally 10th in the list but
/// every search result has the same score of 1.0, then the rank of our
/// answer is 30. (Indeed, the rank of every search result is 30 in this
/// example.)
pub rank: Option<u64>,
/// The time it took to execute this query, in seconds.
pub duration_seconds: f64,
}
/// An evaluation is an iterator over all of the results of evaluating every
/// information need in the truth data.
#[derive(Debug)]
pub struct Evaluation<'s> {
/// The evaluator, which turns an information need into a `TaskResult`.
evaluator: Evaluator<'s>,
/// All of the tasks to evaluate.
tasks: vec::IntoIter<Task>,
}
impl<'s> Iterator for Evaluation<'s> {
type Item = anyhow::Result<TaskResult>;
fn next(&mut self) -> Option<anyhow::Result<TaskResult>> {
self.tasks.next().map(|task| self.evaluator.run(&task))
}
}
/// An evaluator is responsible for executing a single search for a single
/// information need. It records the evaluation of that search result in a
/// `TaskResult`.
#[derive(Debug)]
struct Evaluator<'s> {
/// The evaluation specification.
spec: &'s Spec,
/// A handle to a searcher for an IMDb index.
searcher: Searcher,
}
impl<'s> Evaluator<'s> {
/// Run this evaluator on a single information need and return the
/// evaluation.
fn run(&mut self, task: &Task) -> anyhow::Result<TaskResult> {
let start = Instant::now();
let rank = self.rank(task)?;
let duration = Instant::now().duration_since(start);
Ok(TaskResult {
name: self.spec.to_string(),
query: task.query.clone(),
answer: task.answer.clone(),
rank,
duration_seconds: fractional_seconds(&duration),
})
}
/// Execute the search for the given information need and determine the
/// rank of the expected answer for the given information need. If the
/// expected answer didn't appear in the search results, then `None` is
/// returned.
///
/// The rank of the answer is determined in exactly the way you might
/// expect: if the answer appears as the Nth result in a search, then its
/// rank is N. There is one tricky part of this, and it is specifically in
/// how we break ties. Stated succinctly, we always take the maximum
/// possible rank of a result. For example, given the following results,
/// where the first column is the score, the second column is the
/// result name, and the third column is the *intuitive* rank:
///
/// 1.0 a 1
/// 1.0 b 1
/// 1.0 c 1
/// 0.9 d 4
/// 0.8 e 5
/// 0.8 f 5
/// 0.7 g 7
///
/// Namely, records that are tied all get assigned the same rank, and the
/// next result with a lower score is assigned a rank equivalent to its
/// absolute position in the result list.
///
/// The problem with this ranking strategy is that it biases toward rankers
/// that have a naive score. In particular, so long as a search returns the
/// answer in the results, it could assign a score of `1.0` to every
/// result and get a maximal RR (Reciprocal Rank) evaluation.
///
/// Instead, we invert how results are ranked. The above example is instead
/// ranked like so:
///
/// 1.0 a 3
/// 1.0 b 3
/// 1.0 c 3
/// 0.9 d 4
/// 0.8 e 6
/// 0.8 f 6
/// 0.7 g 7
///
/// In other words, we assign the maximal possible rank instead of the
/// minimal possible rank.
///
/// There are other strategies, but in general, we want to reward high
/// precision rankers.
fn rank(&mut self, task: &Task) -> anyhow::Result<Option<u64>> {
let results = self.searcher.search(&self.spec.query(&task))?;
let mut rank = results.len() as u64;
let mut prev_score = None;
let mut ranked: Vec<(u64, MediaEntity)> = vec![];
for (i, scored) in results.into_iter().enumerate().rev() {
let (score, entity) = scored.into_pair();
if prev_score.map_or(true, |s| !approx_eq(s, score)) {
rank = i as u64 + 1;
prev_score = Some(score);
}
ranked.push((rank, entity));
}
ranked.reverse();
for (rank, entity) in ranked {
if entity.title().id == task.answer {
return Ok(Some(rank));
}
}
Ok(None)
}
}
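The maximal-rank tie-breaking rule documented on `Evaluator::rank` can be sketched in isolation. `max_ranks` below is a hypothetical helper, not part of this crate; it assumes scores are already in descending order:

```rust
// Hypothetical standalone sketch of the tie-breaking strategy above: tied
// results all share the highest possible rank, so a scorer cannot inflate
// its reciprocal rank by assigning every result the same score.
fn max_ranks(scores: &[f64]) -> Vec<u64> {
    let mut rank = scores.len() as u64;
    let mut prev_score: Option<f64> = None;
    let mut ranks = vec![0u64; scores.len()];
    // Walk from the worst result to the best, carrying the rank forward
    // across runs of (approximately) equal scores.
    for (i, &score) in scores.iter().enumerate().rev() {
        if prev_score.map_or(true, |s| (s - score).abs() > 1e-10) {
            rank = i as u64 + 1;
            prev_score = Some(score);
        }
        ranks[i] = rank;
    }
    ranks
}

fn main() {
    // The example from the doc comment: a three-way tie at the top.
    assert_eq!(
        max_ranks(&[1.0, 1.0, 1.0, 0.9, 0.8, 0.8, 0.7]),
        vec![3, 3, 3, 4, 6, 6, 7],
    );
}
```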
/// Compares two floating point numbers for equality approximately for some
/// epsilon.
fn approx_eq(x1: f64, x2: f64) -> bool {
    // We use a fixed epsilon because it's good enough in practice.
(x1 - x2).abs() <= 0.0000000001
}
/// Returns the number of seconds in this duration as an `f64`, where the
/// fractional part carries sub-second (nanosecond) precision. This is
/// equivalent to `Duration::as_secs_f64`.
fn fractional_seconds(d: &Duration) -> f64 {
let fractional = (d.subsec_nanos() as f64) / 1_000_000_000.0;
d.as_secs() as f64 + fractional
}
#[cfg(test)]
mod tests {
use imdb_index::{NameScorer, NgramType, Similarity};
use super::Spec;
#[test]
fn spec_printer() {
let spec = Spec {
result_size: 30,
ngram_size: 3,
ngram_type: NgramType::Window,
sim: Similarity::None,
scorer: Some(NameScorer::OkapiBM25),
};
let expected =
"size-30_ngram-3_ngram-type-window_sim-none_scorer-okapibm25";
assert_eq!(spec.to_string(), expected);
let spec = Spec {
result_size: 1,
ngram_size: 2,
ngram_type: NgramType::Edge,
sim: Similarity::Jaro,
scorer: None,
};
let expected = "size-1_ngram-2_ngram-type-edge_sim-jaro_scorer-none";
assert_eq!(spec.to_string(), expected);
}
}
================================================
FILE: imdb-eval/src/logger.rs
================================================
// This module defines a super simple logger that works with the `log` crate.
// We don't need anything fancy; just basic log levels and the ability to
// print to stderr. We therefore avoid bringing in extra dependencies just
// for this functionality.
use log::Log;
use anyhow::Result;
/// Initialize a simple logger.
pub fn init() -> Result<()> {
Ok(Logger::init()?)
}
/// The simplest possible logger that logs to stderr.
///
/// This logger does no filtering. Instead, it relies on the `log` crate's
/// filtering via its global max_level setting.
#[derive(Debug)]
struct Logger(());
const LOGGER: &'static Logger = &Logger(());
impl Logger {
/// Create a new logger that logs to stderr and initialize it as the
/// global logger. If there was a problem setting the logger, then an
/// error is returned.
fn init() -> std::result::Result<(), log::SetLoggerError> {
log::set_logger(LOGGER)
}
}
impl Log for Logger {
fn enabled(&self, _: &log::Metadata) -> bool {
// We set the log level via log::set_max_level, so we don't need to
// implement filtering here.
true
}
fn log(&self, record: &log::Record) {
if !should_log(record) {
return;
}
eprintln!("{}: {}", record.level(), record.args());
}
fn flush(&self) {
// We use eprintln! which is flushed on every call.
}
}
fn should_log(record: &log::Record) -> bool {
let t = record.target();
t.starts_with("imdb_rename") || t.starts_with("imdb_index")
}
================================================
FILE: imdb-eval/src/main.rs
================================================
use std::env;
use std::io;
use std::path::{Path, PathBuf};
use std::process;
use std::result;
use std::str::FromStr;
use imdb_index::{NameScorer, NgramType, Similarity};
use lazy_static::lazy_static;
use crate::eval::Spec;
mod eval;
mod logger;
fn main() {
if let Err(err) = try_main() {
// A pipe error occurs when the consumer of this process's output has
// hung up. This is a normal event, and we should quit gracefully.
if is_pipe_error(&err) {
process::exit(0);
}
eprintln!("{:?}", err);
process::exit(1);
}
}
fn try_main() -> anyhow::Result<()> {
logger::init()?;
log::set_max_level(log::LevelFilter::Info);
let args = Args::from_matches(&app().get_matches())?;
if args.debug {
log::set_max_level(log::LevelFilter::Debug);
}
if let Some(ref summarize) = args.summarize {
return run_summarize(summarize);
} else if args.dry_run {
for spec in args.specs()? {
println!("{}", spec);
}
return Ok(());
}
run_eval(
&args.data_dir,
&args.eval_dir,
args.truth.as_ref().map(|p| p.as_path()),
args.specs()?,
)
}
/// Run an evaluation on the IMDb data in `data_dir`, and store any indexes
/// created for the evaluation in `eval_dir`. If a path to truth data is given,
/// then the information needs or "tasks" used for the evaluation are taken
/// from that file, otherwise, a built-in truth data set is used.
///
/// The specs given each describe the protocol for an evaluation. They each
/// represent a configuration for how an IMDb index is built and how queries
/// are constructed. The specification is fundamentally the thing we want to
/// evaluate. That is, we want to find the "best" specification.
fn run_eval(
data_dir: &Path,
eval_dir: &Path,
truth_path: Option<&Path>,
specs: Vec<Spec>,
) -> anyhow::Result<()> {
if !data_dir.exists() {
anyhow::bail!(
"data directory {} does not exist; please use \
imdb-rename to create it",
data_dir.display()
);
}
let mut wtr = csv::Writer::from_writer(io::stdout());
for spec in &specs {
let results = match truth_path {
None => spec.evaluate(data_dir, eval_dir)?,
Some(p) => spec.evaluate_with(data_dir, eval_dir, p)?,
};
for result in results {
wtr.serialize(result?)?;
wtr.flush()?;
}
}
Ok(())
}
/// Summarize the evaluation results at the given path.
fn run_summarize(summarize: &Path) -> anyhow::Result<()> {
let mut results: Vec<eval::TaskResult> = vec![];
let mut rdr = csv::Reader::from_path(summarize)?;
for result in rdr.deserialize() {
results.push(result?);
}
let mut wtr = csv::Writer::from_writer(io::stdout());
for summary in eval::Summary::from_task_results(&results) {
wtr.serialize(summary)?;
}
wtr.flush()?;
Ok(())
}
#[derive(Debug)]
struct Args {
data_dir: PathBuf,
debug: bool,
dry_run: bool,
eval_dir: PathBuf,
ngram_sizes: Vec<usize>,
ngram_types: Vec<NgramType>,
result_sizes: Vec<usize>,
scorers: Vec<Option<NameScorer>>,
similarities: Vec<Similarity>,
summarize: Option<PathBuf>,
truth: Option<PathBuf>,
}
impl Args {
/// Build a structured set of arguments from clap's matches.
fn from_matches(matches: &clap::ArgMatches) -> anyhow::Result<Args> {
let data_dir =
matches.value_of_os("data-dir").map(PathBuf::from).unwrap();
let eval_dir =
matches.value_of_os("eval-dir").map(PathBuf::from).unwrap();
let similarities = parse_many_lossy(
matches,
"sim",
vec![
Similarity::None,
Similarity::Levenshtein,
Similarity::Jaro,
Similarity::JaroWinkler,
],
)?;
let scorers = parse_many_lossy(
matches,
"scorer",
vec![
OptionalNameScorer::from(NameScorer::OkapiBM25),
OptionalNameScorer::from(NameScorer::TFIDF),
OptionalNameScorer::from(NameScorer::Jaccard),
OptionalNameScorer::from(NameScorer::QueryRatio),
],
)?
.into_iter()
.map(|s| s.0)
.collect();
let ngram_types =
parse_many_lossy(matches, "ngram-type", vec![NgramType::Window])?;
Ok(Args {
data_dir,
debug: matches.is_present("debug"),
dry_run: matches.is_present("dry-run"),
eval_dir,
ngram_sizes: parse_many_lossy(matches, "ngram-size", vec![3])?,
ngram_types,
result_sizes: parse_many_lossy(matches, "result-size", vec![30])?,
scorers,
similarities,
summarize: matches.value_of_os("summarize").map(PathBuf::from),
truth: matches.value_of_os("truth").map(PathBuf::from),
})
}
/// Build all evaluation specifications as indicated by command line
/// options.
fn specs(&self) -> anyhow::Result<Vec<Spec>> {
// We want to build all possible permutations. We do this by
// alternating between specs1 and specs2. Each additional parameter
// combinatorially explodes the previous set of specifications.
let (mut specs1, mut specs2) = (vec![], vec![]);
for &ngram_size in &self.ngram_sizes {
specs1.push(Spec::new().with_ngram_size(ngram_size)?);
}
for spec in specs1.drain(..) {
for &result_size in &self.result_sizes {
specs2.push(spec.clone().with_result_size(result_size)?);
}
}
for spec in specs2.drain(..) {
for sim in &self.similarities {
specs1.push(spec.clone().with_similarity(sim.clone()));
}
}
for spec in specs1.drain(..) {
for scorer in &self.scorers {
specs2.push(spec.clone().with_scorer(scorer.clone()));
}
}
for spec in specs2.drain(..) {
for ngram_type in &self.ngram_types {
specs1.push(spec.clone().with_ngram_type(ngram_type.clone()));
}
}
Ok(specs1)
}
}
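// The alternating-buffer pattern in `Args::specs` above amounts to a
// Cartesian product over the parameter lists. A hypothetical, simplified
// sketch of the same idea (not part of this crate):

```rust
// Each parameter list multiplies the existing set of configurations, so
// the final count is the product of the list lengths.
fn extend_specs(specs: Vec<String>, choices: &[&str]) -> Vec<String> {
    let mut out = Vec::with_capacity(specs.len() * choices.len());
    for spec in specs {
        for choice in choices {
            out.push(format!("{}_{}", spec, choice));
        }
    }
    out
}

fn main() {
    let mut specs = vec!["spec".to_string()];
    specs = extend_specs(specs, &["ngram-2", "ngram-3"]);
    specs = extend_specs(specs, &["size-10", "size-30"]);
    // 2 ngram sizes x 2 result sizes = 4 specifications.
    assert_eq!(specs.len(), 4);
    assert_eq!(specs[0], "spec_ngram-2_size-10");
}
```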
fn app() -> clap::App<'static, 'static> {
use clap::{App, AppSettings, Arg};
lazy_static! {
// clap wants all of its strings tied to a particular lifetime, but
// we'd really like to determine some default values dynamically. Using
// a lazy_static here is one way of safely giving a static lifetime to
// a value that is computed at runtime.
//
// An alternative approach would be to compute all of our default
// values in the caller, and pass them into this function. It's nicer
        // to define what we need here though. Locality of reference and all
// that.
static ref DEFAULT_DATA_DIR: PathBuf =
env::temp_dir().join("imdb-rename");
static ref DEFAULT_EVAL_DIR: PathBuf =
env::temp_dir().join("imdb-rename-eval");
static ref POSSIBLE_SCORER_NAMES: Vec<&'static str> = {
let mut names = NameScorer::possible_names().to_vec();
names.insert(0, "none");
names
};
}
    App::new("imdb-eval")
.author(clap::crate_authors!())
.version(clap::crate_version!())
.max_term_width(100)
.setting(AppSettings::UnifiedHelpMessage)
.arg(Arg::with_name("data-dir")
.long("data-dir")
.env("IMDB_RENAME_DATA_DIR")
.takes_value(true)
.default_value_os(DEFAULT_DATA_DIR.as_os_str())
.help("The location to store IMDb data files."))
.arg(Arg::with_name("debug")
.long("debug")
.help("Show debug messages. Use this when filing bugs."))
.arg(Arg::with_name("dry-run")
.long("dry-run")
.help("Show the evaluations that would be run and then exit \
without running them."))
.arg(Arg::with_name("eval-dir")
.long("eval-dir")
.env("IMDB_RENAME_EVAL_DIR")
.takes_value(true)
.default_value_os(DEFAULT_EVAL_DIR.as_os_str())
.help("The location to store evaluation index files."))
.arg(Arg::with_name("ngram-size")
.long("ngram-size")
.takes_value(true)
.multiple(true)
.number_of_values(1)
.help("Set the ngram size on which to perform an evaluation. \
An evaluation will be performed for each ngram size. \
If no ngram size is given, a default of 3 is used."))
.arg(Arg::with_name("ngram-type")
.long("ngram-type")
.takes_value(true)
.multiple(true)
.number_of_values(1)
.possible_values(NgramType::possible_names())
.help("Set the ngram type on which to perform an evaluation. \
An evaluation will be performed for each ngram type. \
If no ngram type is given, it defaults to 'window'."))
.arg(Arg::with_name("result-size")
.long("result-size")
.takes_value(true)
.multiple(true)
.number_of_values(1)
.help("Set the result size on which to perform an evaluation. \
An evaluation will be performed for each result size. \
If no result size is given, a default of 30 is used."))
.arg(Arg::with_name("scorer")
.long("scorer")
.takes_value(true)
.multiple(true)
.number_of_values(1)
.possible_values(&POSSIBLE_SCORER_NAMES)
.help("Set the name scorer function to use. An evaluation is \
performed for each name function given. By default, \
all name scorers are used, except for 'none'."))
.arg(Arg::with_name("sim")
.long("sim")
.takes_value(true)
.multiple(true)
.number_of_values(1)
.possible_values(Similarity::possible_names())
.help("Set the similarity ranker function to use. An evaluation \
is performed for each ranker function given. By default, \
all ranker functions are used, including 'none'."))
.arg(Arg::with_name("summarize")
.long("summarize")
.takes_value(true)
.number_of_values(1)
.help("Print summary statistics from an evaluation run."))
.arg(Arg::with_name("truth")
.long("truth")
.takes_value(true)
.help("A file path containing evaluation truth data. By default, \
an evaluation uses truth data embedded in imdb-rename."))
}
/// An optional name scorer is a `NameScorer` that may be absent.
///
/// We define a type for it to make parsing it easier.
#[derive(Debug)]
struct OptionalNameScorer(Option<NameScorer>);
impl FromStr for OptionalNameScorer {
type Err = imdb_index::Error;
fn from_str(
s: &str,
) -> result::Result<OptionalNameScorer, imdb_index::Error> {
let opt = if s == "none" { None } else { Some(s.parse()?) };
Ok(OptionalNameScorer(opt))
}
}
impl From<NameScorer> for OptionalNameScorer {
fn from(scorer: NameScorer) -> OptionalNameScorer {
OptionalNameScorer(Some(scorer))
}
}
/// Parse a sequence of values from clap.
fn parse_many_lossy<
E: std::error::Error + Send + Sync + 'static,
T: FromStr<Err = E>,
>(
matches: &clap::ArgMatches,
name: &str,
default: Vec<T>,
) -> anyhow::Result<Vec<T>> {
let strs = match matches.values_of_lossy(name) {
None => return Ok(default),
Some(strs) => strs,
};
let mut values = vec![];
for s in strs {
values.push(s.parse()?);
}
Ok(values)
}
/// Return true if and only if an I/O broken pipe error exists in the causal
/// chain of the given error.
fn is_pipe_error(err: &anyhow::Error) -> bool {
for cause in err.chain() {
if let Some(ioerr) = cause.downcast_ref::<io::Error>() {
if ioerr.kind() == io::ErrorKind::BrokenPipe {
return true;
}
}
}
false
}
================================================
FILE: imdb-index/COPYING
================================================
This project is dual-licensed under the Unlicense and MIT licenses.
You may use this code under the terms of either license.
================================================
FILE: imdb-index/Cargo.toml
================================================
[package]
name = "imdb-index"
version = "0.1.4" #:version
authors = ["Andrew Gallant <jamslam@gmail.com>"]
description = """
A library for indexing and searching IMDb using information retrieval.
"""
documentation = "https://github.com/BurntSushi/imdb-rename"
homepage = "https://github.com/BurntSushi/imdb-rename"
repository = "https://github.com/BurntSushi/imdb-rename"
readme = "README.md"
keywords = ["imdb", "movie", "index", "search"]
license = "Unlicense/MIT"
edition = "2021"
[dependencies]
csv = "1.3.0"
fnv = "1.0.7"
fst = "0.4.7"
lazy_static = "1.4.0"
log = { version = "0.4.20", features = ["std"] }
memmap = { package = "memmap2", version = "0.9.1" }
regex = "1.10.2"
serde = { version = "1.0.193", features = ["derive"] }
serde_json = "1.0.108"
strsim = "0.10.0"
================================================
FILE: imdb-index/LICENSE-MIT
================================================
The MIT License (MIT)
Copyright (c) 2015 Andrew Gallant
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
================================================
FILE: imdb-index/README.md
================================================
imdb-index
==========
A library for reading and writing an IMDb index, with a focus on IMDb titles.
In particular, this library can build a name index on all of IMDb's 6 million
names, which supports fast fuzzy searching and relevance ranking.
[Linux build status](https://travis-ci.org/BurntSushi/imdb-rename)
[Windows build status](https://ci.appveyor.com/project/BurntSushi/imdb-rename)
[crates.io](https://crates.io/crates/imdb-index)
Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).
### Documentation
https://docs.rs/imdb-index
================================================
FILE: imdb-index/UNLICENSE
================================================
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to <http://unlicense.org/>
================================================
FILE: imdb-index/src/error.rs
================================================
use std::fmt;
use std::path::{Path, PathBuf};
/// A type alias for handling errors throughout imdb-index.
pub type Result<T> = std::result::Result<T, Error>;
/// An error that can occur while interacting with an IMDb index.
#[derive(Debug)]
pub struct Error {
kind: ErrorKind,
}
impl Error {
/// Return a reference to the kind of this error.
pub fn kind(&self) -> &ErrorKind {
&self.kind
}
/// Transfer ownership of the kind of this error.
pub fn into_kind(self) -> ErrorKind {
self.kind
}
pub(crate) fn new(kind: ErrorKind) -> Error {
Error { kind }
}
pub(crate) fn unknown_title<T: AsRef<str>>(unk: T) -> Error {
Error { kind: ErrorKind::UnknownTitle(unk.as_ref().to_string()) }
}
pub(crate) fn unknown_scorer<T: AsRef<str>>(unk: T) -> Error {
Error { kind: ErrorKind::UnknownScorer(unk.as_ref().to_string()) }
}
pub(crate) fn unknown_ngram_type<T: AsRef<str>>(unk: T) -> Error {
Error { kind: ErrorKind::UnknownNgramType(unk.as_ref().to_string()) }
}
pub(crate) fn unknown_sim<T: AsRef<str>>(unk: T) -> Error {
Error { kind: ErrorKind::UnknownSimilarity(unk.as_ref().to_string()) }
}
pub(crate) fn unknown_directive<T: AsRef<str>>(unk: T) -> Error {
Error { kind: ErrorKind::UnknownDirective(unk.as_ref().to_string()) }
}
pub(crate) fn bug<T: AsRef<str>>(msg: T) -> Error {
Error { kind: ErrorKind::Bug(msg.as_ref().to_string()) }
}
pub(crate) fn config<T: AsRef<str>>(msg: T) -> Error {
Error { kind: ErrorKind::Config(msg.as_ref().to_string()) }
}
pub(crate) fn version(expected: u64, got: u64) -> Error {
Error { kind: ErrorKind::VersionMismatch { expected, got } }
}
pub(crate) fn csv(err: csv::Error) -> Error {
Error { kind: ErrorKind::Csv(err.to_string()) }
}
pub(crate) fn fst(err: fst::Error) -> Error {
Error { kind: ErrorKind::Fst(err.to_string()) }
}
pub(crate) fn io(err: std::io::Error) -> Error {
Error { kind: ErrorKind::Io { err, path: None } }
}
pub(crate) fn io_path<P: AsRef<Path>>(
err: std::io::Error,
path: P,
) -> Error {
Error {
kind: ErrorKind::Io {
err,
path: Some(path.as_ref().to_path_buf()),
},
}
}
pub(crate) fn number<E: std::error::Error + Send + Sync + 'static>(
err: E,
) -> Error {
Error { kind: ErrorKind::Number(Box::new(err)) }
}
}
impl std::error::Error for Error {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self.kind {
ErrorKind::Io { ref err, .. } => Some(err),
ErrorKind::Number(ref err) => Some(&**err),
_ => None,
}
}
}
impl fmt::Display for Error {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
self.kind.fmt(f)
}
}
/// The specific kind of error that can occur.
#[derive(Debug)]
pub enum ErrorKind {
/// An index version mismatch. This error occurs when the version of the
/// index is different from the version supported by this version of
/// imdb-index.
///
/// Generally speaking, the versions must be exactly equivalent, otherwise
/// this error is returned.
VersionMismatch {
/// The expected or supported index version.
expected: u64,
/// The actual version of the index on disk.
got: u64,
},
/// An error parsing the type of a title.
///
/// The data provided is the unrecognized title type.
UnknownTitle(String),
/// An error parsing the name of a scorer.
///
/// The data provided is the unrecognized name.
UnknownScorer(String),
/// An error parsing the name of an ngram type.
///
/// The data provided is the unrecognized name.
UnknownNgramType(String),
/// An error parsing the name of a similarity function.
///
/// The data provided is the unrecognized name.
UnknownSimilarity(String),
/// An error parsing the name of a directive from a free-form query.
///
/// The data provided is the unrecognized name.
UnknownDirective(String),
/// An unexpected error occurred while reading an index that should not
/// have occurred. Generally, these errors correspond to bugs in this
/// library.
Bug(String),
/// An error occurred while reading/writing the index config.
Config(String),
/// An error that occurred while writing or reading CSV data.
Csv(String),
/// An error that occurred while creating an FST index.
Fst(String),
/// An unexpected I/O error occurred.
Io {
/// The underlying I/O error.
err: std::io::Error,
/// A file path, if the I/O error occurred in the context of a named
/// file.
path: Option<PathBuf>,
},
/// An error occurred while parsing a number in a free-form query.
Number(Box<dyn std::error::Error + Send + Sync>),
/// Hints that destructuring should not be exhaustive.
///
/// This enum may grow additional variants, so this makes sure clients
/// don't count on exhaustive matching. (Otherwise, adding a new variant
/// could break existing code.)
#[doc(hidden)]
__Nonexhaustive,
}
impl fmt::Display for ErrorKind {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match *self {
ErrorKind::VersionMismatch { expected, got } => write!(
f,
"index version mismatch: expected version {} \
but got version {}. Please rebuild the index.",
expected, got
),
ErrorKind::UnknownTitle(ref unk) => {
write!(f, "unrecognized title type: '{}'", unk)
}
ErrorKind::UnknownScorer(ref unk) => {
write!(f, "unrecognized scorer name: '{}'", unk)
}
ErrorKind::UnknownNgramType(ref unk) => {
write!(f, "unrecognized ngram type: '{}'", unk)
}
ErrorKind::UnknownSimilarity(ref unk) => {
write!(f, "unrecognized similarity function: '{}'", unk)
}
ErrorKind::UnknownDirective(ref unk) => {
write!(f, "unrecognized search directive: '{}'", unk)
}
ErrorKind::Bug(ref msg) => {
let report = "Please report this bug with a backtrace at \
https://github.com/BurntSushi/imdb-rename";
write!(f, "BUG: {}\n{}", msg, report)
}
ErrorKind::Config(ref msg) => write!(f, "config error: {}", msg),
ErrorKind::Csv(ref msg) => write!(f, "{}", msg),
ErrorKind::Fst(ref msg) => write!(f, "fst error: {}", msg),
ErrorKind::Io { path: None, .. } => write!(f, "I/O error"),
ErrorKind::Io { path: Some(ref p), .. } => {
write!(f, "{}", p.display())
}
ErrorKind::Number(_) => write!(f, "error parsing number"),
ErrorKind::__Nonexhaustive => panic!("invalid error"),
}
}
}
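#[cfg(test)]
mod tests {
    use super::*;

    // An illustrative sanity check of the error surface defined above: a
    // version mismatch exposes both versions through `kind` and reports
    // them in its `Display` output.
    #[test]
    fn version_mismatch_reports_both_versions() {
        let err = Error::version(1, 2);
        assert!(matches!(
            err.kind(),
            ErrorKind::VersionMismatch { expected: 1, got: 2 }
        ));
        let msg = err.to_string();
        assert!(msg.contains("expected version 1"));
        assert!(msg.contains("got version 2"));
    }
}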
================================================
FILE: imdb-index/src/index/aka.rs
================================================
use std::io;
use std::iter;
use std::path::Path;
use memmap::Mmap;
use crate::error::{Error, Result};
use crate::index::{csv_file, csv_mmap, id};
use crate::record::AKA;
use crate::util::IMDB_AKAS;
/// The name of the AKA record index file.
///
/// This index represents a map from IMDb title id to a 64-bit integer. The
/// 64-bit integer encodes two pieces of information: the number of alternate
/// names for the title (high 16 bits) and the file offset at which the records
/// appear in title.akas.tsv (low 48 bits).
const AKAS: &str = "akas.fst";
/// A handle to the AKA name index.
///
/// The AKA index maps IMDb identifiers to a list of AKA records.
///
/// This index assumes that the underlying AKA CSV file is sorted by IMDb ID.
#[derive(Debug)]
pub struct Index {
akas: csv::Reader<io::Cursor<Mmap>>,
idx: id::IndexReader,
}
impl Index {
/// Open an AKA index using the corresponding data and index directories.
/// The data directory contains the IMDb data set while the index directory
/// contains the index data files.
pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
Ok(Index {
// We claim it is safe to open the following memory map because we
// don't mutate them and no other process (should) either.
akas: unsafe { csv_mmap(data_dir.as_ref().join(IMDB_AKAS))? },
idx: id::IndexReader::from_path(index_dir.as_ref().join(AKAS))?,
})
}
/// Create an AKA index by reading the AKA data from the given data
/// directory and writing the index to the corresponding index directory.
pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
let data_dir = data_dir.as_ref();
let index_dir = index_dir.as_ref();
let rdr = csv_file(data_dir.join(IMDB_AKAS))?;
let mut wtr = id::IndexSortedWriter::from_path(index_dir.join(AKAS))?;
let mut count = 0u64;
for result in AKAIndexRecords::new(rdr) {
let record = result?;
wtr.insert(&record.id, (record.count << 48) | record.offset)?;
count += record.count;
}
wtr.finish()?;
log::info!("{} alternate names indexed", count);
Index::open(data_dir, index_dir)
}
/// Return a (possibly empty) iterator over all AKA records for the given
/// IMDb ID.
pub fn find(&mut self, id: &[u8]) -> Result<AKARecordIter> {
match self.idx.get(id) {
None => Ok(AKARecordIter(None)),
Some(v) => {
let count = (v >> 48) as usize;
let offset = v & ((1 << 48) - 1);
let mut pos = csv::Position::new();
pos.set_byte(offset);
self.akas.seek(pos).map_err(Error::csv)?;
Ok(AKARecordIter(Some(self.akas.deserialize().take(count))))
}
}
}
}
/// An iterator over AKA records for a single IMDb title.
///
/// This iterator is constructed via the `aka::Index::find` method.
///
/// This iterator may yield no records.
///
/// The lifetime `'r` refers to the lifetime of the underlying AKA index
/// reader.
pub struct AKARecordIter<'r>(
Option<iter::Take<csv::DeserializeRecordsIter<'r, io::Cursor<Mmap>, AKA>>>,
);
impl<'r> Iterator for AKARecordIter<'r> {
type Item = Result<AKA>;
fn next(&mut self) -> Option<Result<AKA>> {
let next = match self.0.as_mut().and_then(|it| it.next()) {
None => return None,
Some(next) => next,
};
match next {
Ok(next) => Some(Ok(next)),
Err(err) => Some(Err(Error::csv(err))),
}
}
}
/// An indexable AKA record.
///
/// Each indexable record represents a group of alternative titles in the
/// title.akas.tsv file.
#[derive(Clone, Debug, Eq, PartialEq)]
struct AKAIndexRecord {
id: Vec<u8>,
offset: u64,
count: u64,
}
/// A streaming iterator over indexable AKA records.
///
/// Each indexable record is a triple, and consists of an IMDb title ID,
/// the number of alternate titles for that title, and the file offset in the
/// CSV file at which those records begin.
///
/// The `R` type parameter refers to the underlying `io::Read` type of the
/// CSV reader.
#[derive(Debug)]
struct AKAIndexRecords<R> {
/// The underlying CSV reader.
rdr: csv::Reader<R>,
/// Scratch space for storing the byte record.
record: csv::ByteRecord,
/// Set to true when the iterator has been exhausted.
done: bool,
}
impl<R: io::Read> AKAIndexRecords<R> {
/// Create a new streaming iterator over indexable AKA records.
fn new(rdr: csv::Reader<R>) -> AKAIndexRecords<R> {
AKAIndexRecords { rdr, record: csv::ByteRecord::new(), done: false }
}
}
impl<R: io::Read> Iterator for AKAIndexRecords<R> {
type Item = Result<AKAIndexRecord>;
/// Advance to the next indexable record and return it. If no more
/// records exist, return `None`.
///
/// If there was a problem parsing or reading from the underlying CSV
/// data, then an error is returned.
fn next(&mut self) -> Option<Result<AKAIndexRecord>> {
macro_rules! itry {
($e:expr) => {
match $e {
Err(err) => return Some(Err(Error::csv(err))),
Ok(v) => v,
}
};
}
if self.done {
return None;
}
// Only initialize the record if this is our first go at it.
// Otherwise, the previous call left the next record in `self.record`.
if self.record.is_empty() {
if !itry!(self.rdr.read_byte_record(&mut self.record)) {
return None;
}
}
let mut irecord = AKAIndexRecord {
id: self.record[0].to_vec(),
offset: self.record.position().expect("position on row").byte(),
count: 1,
};
while itry!(self.rdr.read_byte_record(&mut self.record)) {
if irecord.id != &self.record[0] {
break;
}
irecord.count += 1;
}
// If we've read the last record then we're done!
if self.rdr.is_done() {
self.done = true;
}
Some(Ok(irecord))
}
}
#[cfg(test)]
mod tests {
use super::*;
use crate::util::csv_reader_builder;
#[test]
fn aka_index_records1() {
let data = r"titleId ordering title region language types attributes isOriginalTitle
tt0117019 1 Hommes à l'huile FR \N \N \N 0
tt0117019 2 Männer in Öl DE \N \N \N 0
tt0117019 3 Men in Oil XEU en festival \N 0
tt0117019 4 Männer in Öl: Annäherungsversuche an die Malerin Susanne Hay \N \N original \N 1
tt0117019 5 Men in Oil XWW en \N \N 0
tt0117020 1 Mendigos sin fronteras ES \N \N \N 0
tt0117021 1 Menno's Mind US \N \N \N 0
tt0117021 2 Menno's Mind \N \N original \N 1
tt0117021 3 The Matrix 2 RU \N video \N 0
tt0117021 4 Virtuális elme HU \N imdbDisplay \N 0
tt0117021 5 Power.com US \N video \N 0
tt0117021 6 La mente de Menno ES \N \N \N 0
tt0117021 7 Power.com CA en video \N 0
tt0117021 8 Terror im Computer DE \N \N \N 0
tt0117022 1 Menopause Song CA \N \N \N 0
tt0117023 1 Les menteurs FR \N \N \N 0";
let rdr = csv_reader_builder().from_reader(data.as_bytes());
let records: Vec<AKAIndexRecord> =
AKAIndexRecords::new(rdr).collect::<Result<_>>().unwrap();
assert_eq!(records.len(), 5);
assert_eq!(records[0].id, b"tt0117019");
assert_eq!(records[0].count, 5);
assert_eq!(records[1].id, b"tt0117020");
assert_eq!(records[1].count, 1);
assert_eq!(records[2].id, b"tt0117021");
assert_eq!(records[2].count, 8);
assert_eq!(records[3].id, b"tt0117022");
assert_eq!(records[3].count, 1);
assert_eq!(records[4].id, b"tt0117023");
assert_eq!(records[4].count, 1);
}
#[test]
fn aka_index_records2() {
let data = r"titleId ordering title region language types attributes isOriginalTitle
tt0117019 1 Hommes à l'huile FR \N \N \N 0
tt0117019 2 Männer in Öl DE \N \N \N 0
tt0117019 3 Men in Oil XEU en festival \N 0
tt0117019 4 Männer in Öl: Annäherungsversuche an die Malerin Susanne Hay \N \N original \N 1
tt0117019 5 Men in Oil XWW en \N \N 0
tt0117020 1 Mendigos sin fronteras ES \N \N \N 0
tt0117021 1 Menno's Mind US \N \N \N 0
tt0117021 2 Menno's Mind \N \N original \N 1
tt0117021 3 The Matrix 2 RU \N video \N 0
tt0117021 4 Virtuális elme HU \N imdbDisplay \N 0
tt0117021 5 Power.com US \N video \N 0
tt0117021 6 La mente de Menno ES \N \N \N 0
tt0117021 7 Power.com CA en video \N 0
tt0117021 8 Terror im Computer DE \N \N \N 0";
let rdr = csv_reader_builder().from_reader(data.as_bytes());
let records: Vec<AKAIndexRecord> =
AKAIndexRecords::new(rdr).collect::<Result<_>>().unwrap();
assert_eq!(records.len(), 3);
assert_eq!(records[0].id, b"tt0117019");
assert_eq!(records[0].count, 5);
assert_eq!(records[1].id, b"tt0117020");
assert_eq!(records[1].count, 1);
assert_eq!(records[2].id, b"tt0117021");
assert_eq!(records[2].count, 8);
}
#[test]
fn aka_index_records3() {
let data = r"titleId ordering title region language types attributes isOriginalTitle
tt0117021 1 Menno's Mind US \N \N \N 0
tt0117021 2 Menno's Mind \N \N original \N 1
tt0117021 3 The Matrix 2 RU \N video \N 0
tt0117021 4 Virtuális elme HU \N imdbDisplay \N 0
tt0117021 5 Power.com US \N video \N 0
tt0117021 6 La mente de Menno ES \N \N \N 0
tt0117021 7 Power.com CA en video \N 0
tt0117021 8 Terror im Computer DE \N \N \N 0";
let rdr = csv_reader_builder().from_reader(data.as_bytes());
let records: Vec<AKAIndexRecord> =
AKAIndexRecords::new(rdr).collect::<Result<_>>().unwrap();
assert_eq!(records.len(), 1);
assert_eq!(records[0].id, b"tt0117021");
assert_eq!(records[0].count, 8);
}
#[test]
fn aka_index_records4() {
let data = r"titleId ordering title region language types attributes isOriginalTitle
tt0117021 1 Menno's Mind US \N \N \N 0";
let rdr = csv_reader_builder().from_reader(data.as_bytes());
let records: Vec<AKAIndexRecord> =
AKAIndexRecords::new(rdr).collect::<Result<_>>().unwrap();
assert_eq!(records.len(), 1);
assert_eq!(records[0].id, b"tt0117021");
assert_eq!(records[0].count, 1);
}
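    // An illustrative check (independent of any fixture data) of the
    // 16/48-bit packing written by `Index::create` and decoded by
    // `Index::find` above: the number of AKA records lives in the high 16
    // bits and the file offset into title.akas.tsv in the low 48 bits.
    #[test]
    fn count_offset_packing_roundtrip() {
        let (count, offset) = (5u64, 123_456u64);
        let packed = (count << 48) | offset;
        assert_eq!(packed >> 48, count);
        assert_eq!(packed & ((1 << 48) - 1), offset);
    }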
}
================================================
FILE: imdb-index/src/index/episode.rs
================================================
use std::cmp;
use std::path::Path;
use std::u32;
use fst::{IntoStreamer, Streamer};
use memmap::Mmap;
use crate::error::{Error, Result};
use crate::index::csv_file;
use crate::record::Episode;
use crate::util::{fst_set_builder_file, fst_set_file, IMDB_EPISODE};
/// The name of the episode index file.
///
/// The episode index maps TV show ids to episodes. The index is constructed
/// so that either of the following can be used as a lookup key:
///
/// tvshow IMDb title ID
/// (tvshow IMDb title ID, season number)
///
/// In particular, the index itself stores the entire episode record, and it
/// can be re-constituted without re-visiting the original episode data file.
const SEASONS: &str = "episode.seasons.fst";
/// The name of the TV show index file.
///
/// The TV show index maps episode IMDb title IDs to tvshow IMDb title IDs.
/// This allows us to quickly look up the TV show corresponding to an episode
/// in search results.
///
/// The format of this index is an FST set, where each key corresponds to the
/// episode ID joined with the TV show ID by a `NUL` byte. This lets us do
/// a range query on the set when given the episode ID to find the TV show ID.
const TVSHOWS: &str = "episode.tvshows.fst";
/// An episode index that supports retrieving season and episode information
/// quickly.
#[derive(Debug)]
pub struct Index {
seasons: fst::Set<Mmap>,
tvshows: fst::Set<Mmap>,
}
impl Index {
/// Open an episode index from the given index directory.
pub fn open<P: AsRef<Path>>(index_dir: P) -> Result<Index> {
let index_dir = index_dir.as_ref();
// We claim it is safe to open the following memory map because we
// don't mutate them and no other process (should) either.
let seasons = unsafe { fst_set_file(index_dir.join(SEASONS))? };
let tvshows = unsafe { fst_set_file(index_dir.join(TVSHOWS))? };
Ok(Index { seasons, tvshows })
}
/// Create an episode index from the given IMDb data directory and write
/// it to the given index directory. If an episode index already exists,
/// then it is overwritten.
pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
let data_dir = data_dir.as_ref();
let index_dir = index_dir.as_ref();
let mut buf = vec![];
let mut seasons = fst_set_builder_file(index_dir.join(SEASONS))?;
let mut tvshows = fst_set_builder_file(index_dir.join(TVSHOWS))?;
let mut episodes = read_sorted_episodes(data_dir)?;
for episode in &episodes {
buf.clear();
write_episode(episode, &mut buf)?;
seasons.insert(&buf).map_err(Error::fst)?;
}
episodes.sort_by(|e1, e2| {
(&e1.id, &e1.tvshow_id).cmp(&(&e2.id, &e2.tvshow_id))
});
for episode in &episodes {
buf.clear();
write_tvshow(&episode, &mut buf)?;
tvshows.insert(&buf).map_err(Error::fst)?;
}
seasons.finish().map_err(Error::fst)?;
tvshows.finish().map_err(Error::fst)?;
log::info!("{} episodes indexed", episodes.len());
Index::open(index_dir)
}
/// Return a sequence of episodes for the given TV show IMDb identifier.
///
/// The episodes are sorted in order of season number and episode number.
/// Episodes without season/episode numbers are sorted after episodes with
/// numbers.
pub fn seasons(&self, tvshow_id: &[u8]) -> Result<Vec<Episode>> {
let mut upper = tvshow_id.to_vec();
upper.push(0xFF);
let mut episodes = vec![];
let mut stream =
self.seasons.range().ge(tvshow_id).le(upper).into_stream();
while let Some(episode_bytes) = stream.next() {
episodes.push(read_episode(episode_bytes)?);
}
Ok(episodes)
}
/// Return a sequence of episodes for the given TV show IMDb identifier and
/// season number.
///
/// The episodes are sorted in order of episode number. Episodes without
/// episode numbers are sorted after episodes with numbers.
pub fn episodes(
&self,
tvshow_id: &[u8],
season: u32,
) -> Result<Vec<Episode>> {
let mut lower = tvshow_id.to_vec();
lower.push(0x00);
lower.extend_from_slice(&season.to_be_bytes());
lower.extend_from_slice(&0u32.to_be_bytes());
let mut upper = tvshow_id.to_vec();
upper.push(0x00);
upper.extend_from_slice(&season.to_be_bytes());
upper.extend_from_slice(&u32::MAX.to_be_bytes());
let mut episodes = vec![];
let mut stream =
self.seasons.range().ge(lower).le(upper).into_stream();
while let Some(episode_bytes) = stream.next() {
episodes.push(read_episode(episode_bytes)?);
}
Ok(episodes)
}
/// Return the episode information for the given episode IMDb identifier.
///
/// If no episode information for the given ID exists, then `None` is
/// returned.
pub fn episode(&self, episode_id: &[u8]) -> Result<Option<Episode>> {
let mut upper = episode_id.to_vec();
upper.push(0xFF);
let mut stream =
self.tvshows.range().ge(episode_id).le(upper).into_stream();
while let Some(tvshow_bytes) = stream.next() {
return Ok(Some(read_tvshow(tvshow_bytes)?));
}
Ok(None)
}
}
fn read_sorted_episodes(data_dir: &Path) -> Result<Vec<Episode>> {
// We claim it is safe to open the following memory map because we don't
// mutate them and no other process (should) either.
let mut rdr = csv_file(data_dir.join(IMDB_EPISODE))?;
let mut records = vec![];
for result in rdr.deserialize() {
let record: Episode = result.map_err(Error::csv)?;
records.push(record);
}
records.sort_by(cmp_episode);
Ok(records)
}
fn cmp_episode(ep1: &Episode, ep2: &Episode) -> cmp::Ordering {
let k1 = (
&ep1.tvshow_id,
ep1.season.unwrap_or(u32::MAX),
ep1.episode.unwrap_or(u32::MAX),
&ep1.id,
);
let k2 = (
&ep2.tvshow_id,
ep2.season.unwrap_or(u32::MAX),
ep2.episode.unwrap_or(u32::MAX),
&ep2.id,
);
k1.cmp(&k2)
}
fn read_episode(bytes: &[u8]) -> Result<Episode> {
let nul = match bytes.iter().position(|&b| b == 0) {
Some(nul) => nul,
None => bug!("could not find nul byte"),
};
let tvshow_id = match String::from_utf8(bytes[..nul].to_vec()) {
Err(err) => bug!("tvshow_id invalid UTF-8: {}", err),
Ok(tvshow_id) => tvshow_id,
};
let mut i = nul + 1;
let season = from_optional_u32("season", &bytes[i..])?;
i += 4;
let episode = from_optional_u32("episode number", &bytes[i..])?;
i += 4;
let id = match String::from_utf8(bytes[i..].to_vec()) {
Err(err) => bug!("episode id invalid UTF-8: {}", err),
Ok(id) => id,
};
Ok(Episode { id, tvshow_id, season, episode })
}
fn write_episode(ep: &Episode, buf: &mut Vec<u8>) -> Result<()> {
if ep.tvshow_id.as_bytes().iter().any(|&b| b == 0) {
bug!("unsupported tvshow id (with NUL byte) for {:?}", ep);
}
buf.extend_from_slice(ep.tvshow_id.as_bytes());
buf.push(0x00);
buf.extend_from_slice(&to_optional_season(ep)?.to_be_bytes());
buf.extend_from_slice(&to_optional_epnum(ep)?.to_be_bytes());
buf.extend_from_slice(ep.id.as_bytes());
Ok(())
}
fn read_tvshow(bytes: &[u8]) -> Result<Episode> {
let nul = match bytes.iter().position(|&b| b == 0) {
Some(nul) => nul,
None => bug!("could not find nul byte"),
};
let id = match String::from_utf8(bytes[..nul].to_vec()) {
Err(err) => bug!("episode id invalid UTF-8: {}", err),
Ok(tvshow_id) => tvshow_id,
};
let mut i = nul + 1;
let season = from_optional_u32("season", &bytes[i..])?;
i += 4;
let episode = from_optional_u32("episode number", &bytes[i..])?;
i += 4;
let tvshow_id = match String::from_utf8(bytes[i..].to_vec()) {
Err(err) => bug!("tvshow_id invalid UTF-8: {}", err),
Ok(tvshow_id) => tvshow_id,
};
Ok(Episode { id, tvshow_id, season, episode })
}
fn write_tvshow(ep: &Episode, buf: &mut Vec<u8>) -> Result<()> {
if ep.id.as_bytes().iter().any(|&b| b == 0) {
bug!("unsupported episode id (with NUL byte) for {:?}", ep);
}
buf.extend_from_slice(ep.id.as_bytes());
buf.push(0x00);
buf.extend_from_slice(&to_optional_season(ep)?.to_be_bytes());
buf.extend_from_slice(&to_optional_epnum(ep)?.to_be_bytes());
buf.extend_from_slice(ep.tvshow_id.as_bytes());
Ok(())
}
fn from_optional_u32(
label: &'static str,
bytes: &[u8],
) -> Result<Option<u32>> {
if bytes.len() < 4 {
bug!("not enough bytes to read optional {}", label);
}
Ok(match u32::from_be_bytes(bytes[..4].try_into().unwrap()) {
u32::MAX => None,
x => Some(x),
})
}
fn to_optional_season(ep: &Episode) -> Result<u32> {
match ep.season {
None => Ok(u32::MAX),
Some(x) => {
if x == u32::MAX {
bug!("unsupported season number {} for {:?}", x, ep);
}
Ok(x)
}
}
}
fn to_optional_epnum(ep: &Episode) -> Result<u32> {
match ep.episode {
None => Ok(u32::MAX),
Some(x) => {
if x == u32::MAX {
bug!("unsupported episode number {} for {:?}", x, ep);
}
Ok(x)
}
}
}
#[cfg(test)]
mod tests {
use super::Index;
use crate::index::tests::TestContext;
use std::collections::HashMap;
#[test]
fn basics() {
let ctx = TestContext::new("small");
let idx = Index::create(ctx.data_dir(), ctx.index_dir()).unwrap();
let eps = idx.seasons(b"tt0096697").unwrap();
let mut counts: HashMap<u32, u32> = HashMap::new();
for ep in eps {
*counts.entry(ep.season.unwrap()).or_insert(0) += 1;
}
assert_eq!(counts.len(), 3);
assert_eq!(counts[&1], 13);
assert_eq!(counts[&2], 22);
assert_eq!(counts[&3], 24);
}
#[test]
fn by_season() {
let ctx = TestContext::new("small");
let idx = Index::create(ctx.data_dir(), ctx.index_dir()).unwrap();
let eps = idx.episodes(b"tt0096697", 2).unwrap();
let mut counts: HashMap<u32, u32> = HashMap::new();
for ep in eps {
*counts.entry(ep.season.unwrap()).or_insert(0) += 1;
}
println!("{:?}", counts);
assert_eq!(counts.len(), 1);
assert_eq!(counts[&2], 22);
}
#[test]
fn tvshow() {
let ctx = TestContext::new("small");
let idx = Index::create(ctx.data_dir(), ctx.index_dir()).unwrap();
let ep = idx.episode(b"tt0701063").unwrap().unwrap();
assert_eq!(ep.tvshow_id, "tt0096697");
}
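    // An illustrative check of the key layout produced by `write_episode`:
    // season and episode numbers are encoded as big-endian u32s with
    // `u32::MAX` standing in for "no number", so plain lexicographic
    // comparison of the FST keys matches numeric order and numberless
    // episodes sort last.
    #[test]
    fn key_bytes_sort_numerically() {
        let key = |season: Option<u32>| {
            let mut buf = b"tt0096697\x00".to_vec();
            buf.extend_from_slice(&season.unwrap_or(u32::MAX).to_be_bytes());
            buf
        };
        // Big-endian bytes avoid the "10 sorts before 2" problem of
        // textual numbers.
        assert!(key(Some(1)) < key(Some(2)));
        assert!(key(Some(2)) < key(Some(10)));
        assert!(key(Some(10)) < key(None));
        // The sentinel decodes back to `None`, as in `from_optional_u32`.
        let none_key = key(None);
        let decoded = match u32::from_be_bytes(none_key[10..].try_into().unwrap()) {
            u32::MAX => None,
            x => Some(x),
        };
        assert_eq!(decoded, None);
    }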
}
================================================
FILE: imdb-index/src/index/id.rs
================================================
use std::fs::File;
use std::io;
use std::path::Path;
use memmap::Mmap;
use crate::error::{Error, Result};
use crate::util::{fst_map_builder_file, fst_map_file};
/// An index that maps arbitrary length identifiers to 64-bit integers.
///
/// An ID index is often useful for mapping human readable identifiers or
/// "natural keys" to other more convenient forms, such as file offsets.
#[derive(Debug)]
pub struct IndexReader {
idx: fst::Map<Mmap>,
}
impl IndexReader {
/// Opens an ID index reader from the given file path.
pub fn from_path<P: AsRef<Path>>(path: P) -> Result<IndexReader> {
// We claim it is safe to open the following memory map because we
// don't mutate them and no other process (should) either.
Ok(IndexReader { idx: unsafe { fst_map_file(path)? } })
}
/// Return the integer associated with the given ID, if it exists.
pub fn get(&self, key: &[u8]) -> Option<u64> {
self.idx.get(key)
}
}
/// An ID index writer that requires identifiers to be given in strictly
/// lexicographically ascending order.
pub struct IndexSortedWriter<W> {
wtr: fst::MapBuilder<W>,
}
impl IndexSortedWriter<io::BufWriter<File>> {
/// Create an index writer that writes the index to the given file path.
pub fn from_path<P: AsRef<Path>>(
path: P,
) -> Result<IndexSortedWriter<io::BufWriter<File>>> {
Ok(IndexSortedWriter { wtr: fst_map_builder_file(path)? })
}
}
impl<W: io::Write> IndexSortedWriter<W> {
/// Associate the given identifier with the given integer.
///
/// If the given key is not strictly lexicographically greater than the
/// previous key, then an error is returned.
pub fn insert(&mut self, key: &[u8], value: u64) -> Result<()> {
self.wtr.insert(key, value).map_err(Error::fst)?;
Ok(())
}
/// Finish writing the index.
///
/// This must be called, otherwise the index will likely be unreadable.
pub fn finish(self) -> Result<()> {
self.wtr.finish().map_err(Error::fst)?;
Ok(())
}
}
================================================
FILE: imdb-index/src/index/mod.rs
================================================
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::thread;
use std::time::Instant;
use memmap::Mmap;
use serde::{Deserialize, Serialize};
use crate::error::{Error, Result};
use crate::record::{Episode, Rating, Title, TitleKind};
use crate::scored::SearchResults;
use crate::util::{
create_file, csv_file, csv_mmap, open_file, NiceDuration, IMDB_BASICS,
};
pub use self::aka::AKARecordIter;
pub use self::names::{NameQuery, NameScorer, NgramType};
mod aka;
mod episode;
mod id;
mod names;
mod rating;
#[cfg(test)]
mod tests;
mod writer;
/// The version of the index format on disk.
///
/// Generally speaking, if the version of the index on disk doesn't exactly
/// match the version expected by this code, then the index won't be read.
/// The caller must then re-generate the index.
///
/// This version represents all indexing structures on disk in this module.
const VERSION: u64 = 1;
/// The name of the title file index.
///
/// This index represents a map from the IMDb title ID to the file offset
/// corresponding to that record in title.basics.tsv.
const TITLE: &str = "title.fst";
/// The name of the file containing the index configuration.
///
/// The index configuration is a JSON file with some meta data about this
/// index, such as its version.
const CONFIG: &str = "config.json";
/// A media entity is a title with optional episode and rating records.
///
/// A media entity makes it convenient to deal with the complete information
/// of an IMDb media record. This is the default value returned by search
/// routines such as those of the [`Searcher`](struct.Searcher.html), and
/// can also be cheaply constructed by an [`Index`](struct.Index.html) given a
/// [`Title`](struct.Title.html) or an IMDb ID.
#[derive(Clone, Debug)]
pub struct MediaEntity {
title: Title,
episode: Option<Episode>,
rating: Option<Rating>,
}
impl MediaEntity {
/// Return a reference to the underlying `Title`.
pub fn title(&self) -> &Title {
&self.title
}
/// Return a reference to the underlying `Episode`, if it exists.
pub fn episode(&self) -> Option<&Episode> {
self.episode.as_ref()
}
/// Return a reference to the underlying `Rating`, if it exists.
pub fn rating(&self) -> Option<&Rating> {
self.rating.as_ref()
}
}
/// An index into IMDb titles and their associated data.
///
/// This index consists of a set of on disk index data structures in addition
/// to the uncompressed IMDb `tsv` files. The on disk index structures are used
/// to provide access to the records in the `tsv` files efficiently.
///
/// With this index, one can do the following things:
///
/// * Return a ranked list of
/// [`Title`](struct.Title.html)
/// records matching a fuzzy name query.
/// * Access any `Title` record by ID in constant time.
/// * Access all
/// [`AKA`](struct.AKA.html)
/// records for any `Title` in constant time.
/// * Access the
/// [`Rating`](struct.Rating.html)
/// for any `Title` in constant time.
/// * Access the complete set of
/// [`Episode`](struct.Episode.html)
/// records for any TV show in constant time.
/// * Access the specific `Episode` given its ID in constant time.
#[derive(Debug)]
pub struct Index {
/// The directory containing the IMDb tsv files.
data_dir: PathBuf,
/// The directory containing this crate's index structures.
index_dir: PathBuf,
/// A seekable reader for `title.basics.tsv`. The index structures
/// typically return offsets that can be used to seek this reader to the
/// beginning of any `Title` record.
csv_basic: csv::Reader<io::Cursor<Mmap>>,
/// The name index. This is what provides fuzzy queries.
idx_names: names::IndexReader,
/// The AKA index.
idx_aka: aka::Index,
/// The episode index.
idx_episode: episode::Index,
/// The rating index.
idx_rating: rating::Index,
/// The title index.
idx_title: id::IndexReader,
}
#[derive(Debug, Deserialize, Serialize)]
struct Config {
version: u64,
}
impl Index {
/// Open an existing index using default settings. If the index does not
/// exist, or if there was a problem opening it, then this returns an
/// error.
///
/// Generally, this method is cheap to call. It opens some file
/// descriptors, but otherwise does no work.
///
/// `data_dir` should be the directory containing decompressed IMDb
/// `tsv` files. See: https://www.imdb.com/interfaces/
///
/// `index_dir` should be the directory containing a previously created
/// index using `Index::create`.
pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
IndexBuilder::new().open(data_dir, index_dir)
}
/// Create a new index using default settings.
///
/// Calling this method is expensive, and one should expect this to take
/// dozens of seconds or more to complete.
///
/// `data_dir` should be the directory containing decompressed IMDb
/// `tsv` files. See: https://www.imdb.com/interfaces/
///
/// `index_dir` should be the directory in which to write the index
/// files.
///
/// This will overwrite any previous index that may have existed in
/// `index_dir`.
pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
IndexBuilder::new().create(data_dir, index_dir)
}
/// Attempt to clone this index, returning a distinct `Index`.
///
/// This is as cheap to call as `Index::open` and returns an error if there
/// was a problem reading the underlying index.
///
/// This is useful when one wants to query the same `Index` on disk from
/// multiple threads.
pub fn try_clone(&self) -> Result<Index> {
Index::open(&self.data_dir, &self.index_dir)
}
/// Search this index for `Title` records whose name matches the given
/// query.
///
/// The query controls the following things:
///
/// * The name to search for.
/// * The maximum number of results returned.
/// * The scorer to use to rank results.
///
/// The name can be any string. It is normalized and broken down into
/// component pieces, which are then used to search all existing titles
/// quickly and fuzzily.
///
/// This returns an error if there was a problem reading the index or the
/// underlying CSV data.
pub fn search(
&mut self,
query: &names::NameQuery,
) -> Result<SearchResults<Title>> {
let mut results = SearchResults::new();
// The name index gives us back scores with offsets. The offset can be
// used to seek our `Title` CSV reader to the corresponding record and
// read it in constant time.
for result in self.idx_names.search(query) {
let title = match self.read_record(*result.value())? {
None => continue,
Some(title) => title,
};
results.push(result.map(|_| title));
}
Ok(results)
}
/// Returns the `MediaEntity` for the given IMDb ID.
///
/// An entity includes [`Episode`](struct.Episode.html) and
/// [`Rating`](struct.Rating.html) records if they exist for the title.
///
/// This returns an error if there was a problem reading the underlying
/// index. If no such title exists for the given ID, then `None` is
/// returned.
pub fn entity(&mut self, id: &str) -> Result<Option<MediaEntity>> {
match self.title(id)? {
None => Ok(None),
Some(title) => self.entity_from_title(title).map(Some),
}
}
/// Returns the `MediaEntity` for the given `Title`.
///
/// This is like the `entity` method, except it takes a `Title` record as
/// given.
pub fn entity_from_title(&mut self, title: Title) -> Result<MediaEntity> {
let episode = match title.kind {
TitleKind::TVEpisode => self.episode(&title.id)?,
_ => None,
};
let rating = self.rating(&title.id)?;
Ok(MediaEntity { title, episode, rating })
}
/// Returns the `Title` record for the given IMDb ID.
///
/// This returns an error if there was a problem reading the underlying
/// index. If no such title exists for the given ID, then `None` is
/// returned.
pub fn title(&mut self, id: &str) -> Result<Option<Title>> {
match self.idx_title.get(id.as_bytes()) {
None => Ok(None),
Some(offset) => self.read_record(offset),
}
}
/// Returns an iterator over all `AKA` records for the given IMDb ID.
///
/// If no AKA records exist for the given ID, then an empty iterator is
/// returned.
///
/// If there was a problem reading the index, then an error is returned.
pub fn aka_records(&mut self, id: &str) -> Result<AKARecordIter> {
self.idx_aka.find(id.as_bytes())
}
/// Returns the `Rating` associated with the given IMDb ID.
///
/// If no rating exists for the given ID, then this returns `None`.
///
/// If there was a problem reading the index, then an error is returned.
pub fn rating(&mut self, id: &str) -> Result<Option<Rating>> {
self.idx_rating.rating(id.as_bytes())
}
/// Returns all of the episodes for the given TV show. The TV show should
/// be identified by its IMDb ID.
///
/// If the given ID isn't a TV show or if the TV show doesn't have any
/// episodes, then an empty list is returned.
///
/// The episodes returned are sorted in order of their season and episode
/// numbers. Episodes without a season or episode number are sorted after
/// episodes with a season or episode number.
///
/// If there was a problem reading the index, then an error is returned.
pub fn seasons(&mut self, tvshow_id: &str) -> Result<Vec<Episode>> {
self.idx_episode.seasons(tvshow_id.as_bytes())
}
/// Returns all of the episodes for the given TV show and season number.
/// The TV show should be identified by its IMDb ID, and the season should
/// be identified by its number. (Season numbers generally start at `1`.)
///
/// If the given ID isn't a TV show or if the TV show doesn't have any
/// episodes for the given season, then an empty list is returned.
///
/// The episodes returned are sorted in order of their episode numbers.
/// Episodes without an episode number are sorted after episodes with an
/// episode number.
///
/// If there was a problem reading the index, then an error is returned.
pub fn episodes(
&mut self,
tvshow_id: &str,
season: u32,
) -> Result<Vec<Episode>> {
self.idx_episode.episodes(tvshow_id.as_bytes(), season)
}
/// Return the episode corresponding to the given IMDb ID.
///
/// If the ID doesn't correspond to an episode, then `None` is returned.
///
/// If there was a problem reading the index, then an error is returned.
pub fn episode(&mut self, episode_id: &str) -> Result<Option<Episode>> {
self.idx_episode.episode(episode_id.as_bytes())
}
/// Returns the data directory that this index returns results for.
pub fn data_dir(&self) -> &Path {
&self.data_dir
}
/// Returns the directory containing this index's files.
pub fn index_dir(&self) -> &Path {
&self.index_dir
}
/// Read the CSV `Title` record beginning at the given file offset.
///
/// If no such record exists, then this returns `None`.
///
/// If there was a problem reading the underlying CSV data, then an error
/// is returned.
///
/// If the given offset does not point to the start of a record in the CSV
/// data, then the behavior of this method is unspecified.
fn read_record(&mut self, offset: u64) -> Result<Option<Title>> {
let mut pos = csv::Position::new();
pos.set_byte(offset);
self.csv_basic.seek(pos).map_err(Error::csv)?;
let mut record = csv::StringRecord::new();
if !self.csv_basic.read_record(&mut record).map_err(Error::csv)? {
Ok(None)
} else {
let headers = self.csv_basic.headers().map_err(Error::csv)?;
Ok(record.deserialize(Some(headers)).map_err(Error::csv)?)
}
}
}
/// A builder for opening or creating an `Index`.
#[derive(Debug)]
pub struct IndexBuilder {
ngram_type: NgramType,
ngram_size: usize,
}
impl IndexBuilder {
/// Create a new builder with a default configuration.
pub fn new() -> IndexBuilder {
IndexBuilder { ngram_type: NgramType::default(), ngram_size: 3 }
}
/// Use the current configuration to open an existing index. If the index
/// does not exist, or if there was a problem opening it, then this returns
/// an error.
///
/// Generally, this method is cheap to call. It opens some file
/// descriptors, but otherwise does no work.
///
/// `data_dir` should be the directory containing decompressed IMDb
/// `tsv` files. See: https://www.imdb.com/interfaces/
///
/// `index_dir` should be the directory containing a previously created
/// index using `Index::create`.
///
/// Note that settings for index creation are ignored.
pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
&self,
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
let data_dir = data_dir.as_ref();
let index_dir = index_dir.as_ref();
log::debug!("opening index {}", index_dir.display());
let config_file = open_file(index_dir.join(CONFIG))?;
let config: Config = serde_json::from_reader(config_file)
.map_err(|e| Error::config(e.to_string()))?;
if config.version != VERSION {
return Err(Error::version(VERSION, config.version));
}
Ok(Index {
data_dir: data_dir.to_path_buf(),
index_dir: index_dir.to_path_buf(),
// We claim it is safe to open the following memory map because we
// don't mutate it and no other process (should) either.
csv_basic: unsafe { csv_mmap(data_dir.join(IMDB_BASICS))? },
idx_names: names::IndexReader::open(index_dir)?,
idx_aka: aka::Index::open(data_dir, index_dir)?,
idx_episode: episode::Index::open(index_dir)?,
idx_rating: rating::Index::open(index_dir)?,
idx_title: id::IndexReader::from_path(index_dir.join(TITLE))?,
})
}
/// Use the current configuration to create a new index.
///
/// Calling this method is expensive, and one should expect this to take
/// dozens of seconds or more to complete.
///
/// `data_dir` should be the directory containing decompressed IMDb
/// `tsv` files. See: https://www.imdb.com/interfaces/
///
/// `index_dir` should be the directory in which to write the index
/// files.
///
/// This will overwrite any previous index that may have existed in
/// `index_dir`.
pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
&self,
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
let data_dir = data_dir.as_ref();
let index_dir = index_dir.as_ref();
fs::create_dir_all(index_dir)
.map_err(|e| Error::io_path(e, index_dir))?;
log::info!("creating index at {}", index_dir.display());
// Creating the rating and episode indices is completely independent
// of the name/AKA indexes, so do it in a background thread. The
// episode index takes long enough to build to justify this.
let job = {
let data_dir = data_dir.to_path_buf();
let index_dir = index_dir.to_path_buf();
thread::spawn(move || -> Result<()> {
let start = Instant::now();
rating::Index::create(&data_dir, &index_dir)?;
log::info!(
"created rating index (took {})",
NiceDuration::since(start)
);
let start = Instant::now();
episode::Index::create(&data_dir, &index_dir)?;
log::info!(
"created episode index (took {})",
NiceDuration::since(start)
);
Ok(())
})
};
let start = Instant::now();
let mut aka_index = aka::Index::create(data_dir, index_dir)?;
log::info!("created AKA index (took {})", NiceDuration::since(start));
let start = Instant::now();
create_name_index(
&mut aka_index,
data_dir,
index_dir,
self.ngram_type,
self.ngram_size,
)?;
log::info!(
"created name index, ngram type: {}, ngram size: {} (took {})",
self.ngram_type,
self.ngram_size,
NiceDuration::since(start)
);
job.join().unwrap()?;
// Write out our config.
let config_file = create_file(index_dir.join(CONFIG))?;
serde_json::to_writer_pretty(
config_file,
&Config { version: VERSION },
)
.map_err(|e| Error::config(e.to_string()))?;
self.open(data_dir, index_dir)
}
/// Set the type of ngram generation to use.
///
/// The default type is `Window`.
pub fn ngram_type(&mut self, ngram_type: NgramType) -> &mut IndexBuilder {
self.ngram_type = ngram_type;
self
}
/// Set the ngram size on this index.
///
/// When creating an index, ngrams with this size will be used.
pub fn ngram_size(&mut self, ngram_size: usize) -> &mut IndexBuilder {
self.ngram_size = ngram_size;
self
}
}
impl Default for IndexBuilder {
fn default() -> IndexBuilder {
IndexBuilder::new()
}
}
/// Creates the name index from the title tsv data and an AKA index. The AKA
/// index is used to index additional names for each title record to improve
/// recall during search.
///
/// To avoid a second pass through the title records, this also creates the
/// title ID index, which provides an index for looking up a `Title` by its
/// ID in constant time.
fn create_name_index(
aka_index: &mut aka::Index,
data_dir: &Path,
index_dir: &Path,
ngram_type: NgramType,
ngram_size: usize,
) -> Result<()> {
// For logging.
let (mut count, mut title_count) = (0u64, 0u64);
let mut wtr = names::IndexWriter::open(index_dir, ngram_type, ngram_size)?;
let mut twtr = id::IndexSortedWriter::from_path(index_dir.join(TITLE))?;
let mut rdr = csv_file(data_dir.join(IMDB_BASICS))?;
let mut record = csv::StringRecord::new();
while rdr.read_record(&mut record).map_err(Error::csv)? {
let pos = record.position().expect("position on row");
let id = &record[0];
let title = &record[2];
let original_title = &record[3];
let is_adult = &record[4] == "1";
if is_adult {
// TODO: Expose an option to permit this.
continue;
}
count += 1;
title_count += 1;
twtr.insert(id.as_bytes(), pos.byte())?;
// Index the primary name.
wtr.insert(pos.byte(), title)?;
if title != original_title {
// Index the "original" name.
wtr.insert(pos.byte(), original_title)?;
count += 1;
}
// Now index all of the alternate names, if they exist.
for result in aka_index.find(id.as_bytes())? {
let akarecord = result?;
if title != akarecord.title {
wtr.insert(pos.byte(), &akarecord.title)?;
count += 1;
}
}
}
wtr.finish()?;
twtr.finish()?;
log::info!("{} titles indexed", title_count);
log::info!("{} total names indexed", count);
Ok(())
}
================================================
FILE: imdb-index/src/index/names.rs
================================================
use std::cmp;
use std::collections::{binary_heap, BinaryHeap};
use std::fmt;
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::str::{self, FromStr};
use std::time::Instant;
use fnv::FnvHashMap;
use memmap::Mmap;
use serde::{Deserialize, Serialize};
use crate::error::{Error, Result};
use crate::index::writer::CursorWriter;
use crate::scored::{Scored, SearchResults};
use crate::util::{
fst_map_builder_file, fst_map_file, mmap_file, open_file, NiceDuration,
};
/// The name of the file containing the index configuration.
///
/// The index configuration is a JSON file with some meta data about this
/// index, such as its version, ngram size and aggregate statistics about the
/// corpus that has been indexed.
const CONFIG: &str = "names.config.json";
/// The name of the ngram term index.
///
/// The ngram term index maps ngrams (fixed size sequences of Unicode
/// codepoints) to file offsets. Each file offset points to the postings for
/// the corresponding term.
const NGRAM: &str = "names.ngram.fst";
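// The "fixed size sequences of Unicode codepoints" mentioned above can be
// generated with a sliding window over a name's characters. The following
// is a standalone sketch; the crate's actual ngram generation lives behind
// `NgramType` and may differ in details such as padding and normalization.

```rust
/// Yield all contiguous windows of `size` codepoints from `name`.
/// Names shorter than `size` produce a single short ngram.
fn window_ngrams(name: &str, size: usize) -> Vec<String> {
    let chars: Vec<char> = name.chars().collect();
    if chars.len() < size {
        return vec![chars.into_iter().collect()];
    }
    chars.windows(size).map(|w| w.iter().collect()).collect()
}

fn main() {
    assert_eq!(window_ngrams("matrix", 3), vec!["mat", "atr", "tri", "rix"]);
}
```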
/// The name of the postings list index.
///
/// The postings list contains an entry for every term in the ngram index.
/// Each entry corresponds to a list of document/frequency pairs. Namely, each
/// entry is a DocID and a frequency count indicating how many times the
/// corresponding term appeared in that document. Each entry in the list is
/// encoded as a single 32-bit little-endian integer. The high 4 bits
/// represent the frequency (which is capped at 15, a reasonable limit for
/// indexing short name strings) while the low 28 bits represent the doc
/// id. The `MAX_DOC_ID` constant below ensures we never use a doc id that
/// won't fit this encoding scheme.
///
/// The last eight bytes in the postings index contain a 64-bit little-endian
/// encoded integer indicating the average length of all documents represented
/// by the ngram index. The length is recorded in units of terms, which
/// generally correspond to the total number of ngrams in a name.
const POSTINGS: &str = "names.postings.idx";
/// The name of the identifier map index.
///
/// This file maps `DocID`s to `NameID`s. It consists of a sequence of
/// 64-bit little-endian encoded integers, where the length of the sequence
/// corresponds to the total number of names in the index. Each entry in the
/// sequence encodes a `NameID`. In other words, the index to this sequence is
/// a `DocID` and the value at that index is a `NameID`.
///
/// The id map is used to map doc ids returned by the postings to name ids
/// which were provided by the caller. This also permits search to deduplicate
/// results. That is, we should never return multiple results for the same
/// NameID, even though we may have indexed multiple names for the same name
/// id.
const IDMAP: &str = "names.idmap.idx";
/// The name of the document length index.
///
/// This file consists of a sequence of 16-bit little-endian encoded
/// integers, where the length of the sequence corresponds to the total number
/// of names in the index. Each entry represents the length, in terms, of each
/// name.
///
/// The lengths are used during scoring to compute a normalization term. This
/// allows the scoring mechanism to take document length into account.
const NORMS: &str = "names.norms.idx";
/// The external identifier for every distinct record represented by this name
/// index. There are no restrictions on name ids, and multiple names may be
/// indexed that correspond to the same name id.
///
/// With respect to IMDb, there is a 1-to-1 correspondence between the records
/// in title.basics.tsv and the set of NameIDs, even though there may be
/// multiple names for each record.
///
/// For IMDb, this is represented by the byte offset of the corresponding
/// record in title.basics.tsv. This provides constant time lookup of the
/// full record. Note, though, that this module knows nothing about such things.
/// To this module, name ids are opaque identifiers.
pub type NameID = u64;
/// An internal surrogate identifier for every distinct name in the index. Note
/// that multiple distinct doc ids can map to the same name id. For example, if
/// a name has multiple distinct forms, then they each get their own docid, but
/// each of the docids will map to the same name id.
///
/// The reason why we need DocID in addition to NameID is two fold:
///
/// 1. Firstly, we'd like each name variant to have its own term frequency
/// count. If every variant shared the same internal id, then names with
/// multiple variants would behave as if they were one long name with each
/// variant concatenated together. Our ranking scheme takes document length
/// into account, so we don't want this.
/// 2. Secondly, using an internal ID gives us control over the structure of
/// those ids. For example, we can declare them to be a sorted sequence of
/// increasing integers. This lets us traverse our postings more efficiently
/// during search.
type DocID = u32;
/// The maximum docid allowed.
///
/// When writing postings, we pack docids and their term frequency counts into
/// a single u32. We give 4 bits for frequency and 28 bits for docid. That
/// means we can permit up to 268,435,455 = (1<<28)-1 names, which is plenty
/// for all unique names in IMDb.
const MAX_DOC_ID: DocID = (1 << 28) - 1;
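// The postings encoding described above (high 4 bits for frequency, low 28
// bits for doc ID) can be sketched in isolation. This is an illustration
// of the scheme, not the crate's actual writer code:

```rust
const MAX_DOC_ID: u32 = (1 << 28) - 1;

/// Pack a doc ID and a term frequency (capped at 15) into a single u32.
fn pack(docid: u32, freq: u32) -> u32 {
    assert!(docid <= MAX_DOC_ID);
    (freq.min(15) << 28) | docid
}

/// Recover the (doc ID, frequency) pair from a packed entry.
fn unpack(entry: u32) -> (u32, u32) {
    (entry & MAX_DOC_ID, entry >> 28)
}

fn main() {
    assert_eq!(unpack(pack(123_456, 7)), (123_456, 7));
    // Frequencies above 15 are capped to fit in 4 bits.
    assert_eq!(unpack(pack(1, 100)), (1, 15));
}
```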
/// A query for searching the name index.
///
/// A query provides the name query and defines the maximum number of results
/// returned by searching the name index.
#[derive(Clone, Debug)]
pub struct NameQuery {
name: String,
size: usize,
scorer: NameScorer,
stop_word_ratio: f64,
}
impl NameQuery {
/// Create a query that searches the given name.
pub fn new(name: &str) -> NameQuery {
NameQuery {
name: name.to_string(),
size: 30,
scorer: NameScorer::default(),
stop_word_ratio: 0.01,
}
}
/// Set this query's result set size. At most `size` results will be
/// returned when searching with this query.
pub fn with_size(self, size: usize) -> NameQuery {
NameQuery { size, ..self }
}
/// Set this query's scorer. By default, Okapi BM25 is used.
pub fn with_scorer(self, scorer: NameScorer) -> NameQuery {
NameQuery { scorer, ..self }
}
/// Set the ratio (in the range `0.0` to `1.0`, inclusive) at which a term
/// is determined to be a stop word. Set to `0.0` to disable. By default
/// this is set to a non-zero value.
///
/// This ratio is used at query time to partition all of the ngrams in the
/// query into two bins: one bin is for "low frequency" ngrams while the
/// other is for "high frequency" ngrams. The partitioning is determined
/// by this ratio. Namely, if an ngram occurs in fewer than `ratio` of the
/// documents in the entire corpus, then it is considered a low frequency
/// ngram.
///
/// Once these two partitions are created, both are used to create two
/// disjunction queries. The low frequency query drives search results,
/// while the high frequency query is only used to boost scores when it
/// matches a result yielded by the low frequency query. Otherwise, results
/// from the high frequency query aren't considered.
pub fn with_stop_word_ratio(self, ratio: f64) -> NameQuery {
NameQuery { stop_word_ratio: ratio, ..self }
}
}
/// A reader for the name index.
#[derive(Debug)]
pub struct IndexReader {
/// The configuration of this index. This is how we determine index-time
/// settings automatically, such as ngram size and type.
config: Config,
/// The ngram index, also known more generally as the "term index." It maps
/// terms (which are ngrams for this index) to offsets into the postings
/// file. The offset indicates the start of a list of document ids
/// containing that term.
ngram: fst::Map<Mmap>,
/// The postings. This corresponds to a sequence of lists, where each list
/// is a list of document ID/frequency pairs. Each list corresponds to the
/// document ids containing a particular term. The beginning of each list
/// is pointed to by an offset in the term index.
postings: Mmap,
/// A sequence of 64-bit little-endian encoded integers that provide a
/// map from document ID to name ID. The document ID is an internal
/// identifier assigned to each unique name indexed, while the name ID is
/// an external identifier provided by users of this index.
///
/// This map is used to return name IDs to callers. Namely, results are
/// natively represented by document IDs, but they are mapped to name IDs
/// during collection of results and subsequently deduped. In particular,
/// multiple document IDs can map to the same name ID.
///
/// The number of entries in this map is equivalent to the total number of
/// names indexed.
idmap: Mmap,
/// A sequence of 16-bit little-endian encoded integers indicating the
/// document length (in terms) of the corresponding document ID.
///
/// The number of entries in this map is equivalent to the total number of
/// names indexed.
norms: Mmap,
}
/// The configuration for this name index. It is JSON encoded to disk.
///
/// Note that we don't track the version here. Instead, it is tracked wholesale
/// as part of the parent index.
#[derive(Debug, Deserialize, Serialize)]
struct Config {
ngram_type: NgramType,
ngram_size: usize,
avg_document_len: f64,
num_documents: u64,
}
impl IndexReader {
/// Open a name index in the given directory.
pub fn open<P: AsRef<Path>>(dir: P) -> Result<IndexReader> {
let dir = dir.as_ref();
// All of the following open memory maps. We claim it is safe because
// we don't mutate them and no other process (should) either.
let ngram = unsafe { fst_map_file(dir.join(NGRAM))? };
let postings = unsafe { mmap_file(dir.join(POSTINGS))? };
let idmap = unsafe { mmap_file(dir.join(IDMAP))? };
let norms = unsafe { mmap_file(dir.join(NORMS))? };
let config_file = open_file(dir.join(CONFIG))?;
let config: Config = serde_json::from_reader(config_file)
.map_err(|e| Error::config(e.to_string()))?;
Ok(IndexReader { config, ngram, postings, idmap, norms })
}
/// Execute a search.
pub fn search(&self, query: &NameQuery) -> SearchResults<NameID> {
let start = Instant::now();
let mut searcher = Searcher::new(self, query);
let results = CollectTopK::new(query.size).collect(&mut searcher);
log::debug!(
"search for {:?} took {}",
query,
NiceDuration::since(start)
);
results
}
/// Return the name ID that was used to index the given document ID.
///
/// This panics if the given document id does not correspond to an indexed
/// document.
fn docid_to_nameid(&self, docid: DocID) -> NameID {
let start = 8 * (docid as usize);
let buf = self.idmap[start..start + 8].try_into().unwrap();
u64::from_le_bytes(buf)
}
/// Return the length, in terms, of the given document.
///
/// This panics if the given document id does not correspond to an indexed
/// document.
fn document_length(&self, docid: DocID) -> u64 {
let start = 2 * (docid as usize);
let buf = self.norms[start..start + 2].try_into().unwrap();
u16::from_le_bytes(buf) as u64
}
}
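// The fixed-width lookups above amount to indexing into a flat byte buffer
// and decoding a little-endian integer. A standalone sketch over a
// `Vec<u8>` instead of a memory map (`nameid_at` is a hypothetical helper
// mirroring `docid_to_nameid`):

```rust
/// Decode the 64-bit little-endian idmap entry for the given doc ID.
fn nameid_at(idmap: &[u8], docid: usize) -> u64 {
    let start = 8 * docid;
    u64::from_le_bytes(idmap[start..start + 8].try_into().unwrap())
}

fn main() {
    // Toy idmap: doc IDs 0 and 1 are two name variants of the same title,
    // so both map to name ID 10; doc ID 2 maps to name ID 42.
    let mut idmap = Vec::new();
    for nameid in [10u64, 10, 42] {
        idmap.extend_from_slice(&nameid.to_le_bytes());
    }
    assert_eq!(nameid_at(&idmap, 1), 10);
    assert_eq!(nameid_at(&idmap, 2), 42);
}
```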
/// A collector for gathering the top K results from a search.
///
/// This maintains a min-heap of search results. When a new result is
/// considered, it is compared against the worst result in the heap. If the
/// candidate is worse, then it is discarded. Otherwise, it is shuffled into
/// the heap.
struct CollectTopK {
/// The total number of hits to collect.
k: usize,
/// The min-heap, according to score. Note that since BinaryHeap is a
/// max-heap by default, we reverse the comparison to get a min-heap.
queue: BinaryHeap<cmp::Reverse<Scored<NameID>>>,
/// A set for deduplicating results. Namely, multiple doc IDs can map to
/// the same name ID. This set makes sure we only collect one name ID.
///
/// We map name IDs to scores. In this way, we always report the best
/// scoring match.
byid: FnvHashMap<NameID, f64>,
}
impl CollectTopK {
/// Build a new collector that collects at most `k` results.
fn new(k: usize) -> CollectTopK {
CollectTopK {
k,
queue: BinaryHeap::with_capacity(k),
byid: FnvHashMap::default(),
}
}
/// Collect the top K results from the given searcher. Return the
/// results with normalized scores sorted in order of best-to-worst.
fn collect(mut self, searcher: &mut Searcher) -> SearchResults<NameID> {
if self.k == 0 {
return SearchResults::new();
}
let index = searcher.index();
let (mut count, mut push_count) = (0, 0);
for scored_with_docid in searcher {
count += 1;
let scored = scored_with_docid.map(|v| index.docid_to_nameid(v));
// Since multiple names can correspond to a single IMDb title,
// we dedup our results here. That is, if our result set
// already contains this result, then update the score if need
// be, and then move on.
if let Some(&score) = self.byid.get(scored.value()) {
if scored.score() > score {
self.byid.insert(*scored.value(), scored.score());
}
continue;
}
let mut dopush = self.queue.len() < self.k;
if !dopush {
// This unwrap is OK because k > 0 and queue is non-empty.
let worst = self.queue.peek_mut().unwrap();
// If our queue is full, then we should only push if this
// doc id has a better score than the worst one in the queue.
if worst.0 < scored {
self.byid.remove(worst.0.value());
binary_heap::PeekMut::pop(worst);
dopush = true;
}
}
if dopush {
push_count += 1;
self.byid.insert(*scored.value(), scored.score());
self.queue.push(cmp::Reverse(scored));
}
}
log::debug!(
"collect count: {:?}, collect push count: {:?}",
count,
push_count
);
// Pull out the results from our heap and normalize the scores.
let mut results = SearchResults::from_min_heap(&mut self.queue);
results.normalize();
results
}
}
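// The bounded min-heap strategy used by `CollectTopK` can be shown with
// plain integers: keep at most k items, and replace the smallest only when
// a better candidate arrives. A standalone sketch (the real collector also
// dedupes doc IDs by name ID and normalizes scores):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Collect the k largest values from a stream, best-to-worst.
fn top_k(items: impl IntoIterator<Item = u64>, k: usize) -> Vec<u64> {
    let mut heap: BinaryHeap<Reverse<u64>> = BinaryHeap::with_capacity(k);
    for item in items {
        if heap.len() < k {
            heap.push(Reverse(item));
        } else if let Some(mut worst) = heap.peek_mut() {
            // Only displace the current minimum if the candidate beats it.
            if item > worst.0 {
                worst.0 = item;
            }
        }
    }
    let mut out: Vec<u64> = heap.into_iter().map(|r| r.0).collect();
    out.sort_unstable_by(|a, b| b.cmp(a));
    out
}

fn main() {
    assert_eq!(top_k([5, 1, 9, 3, 7, 2], 3), vec![9, 7, 5]);
}
```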
/// A searcher for resolving fulltext queries.
///
/// A searcher takes a fulltext query, usually typed by an end user, along with
/// a scoring function and produces a stream of matching results with scores
/// computed via the provided function. Results are always yielded in
/// ascending order with respect to document IDs, which are internal IDs
/// assigned to each name in the index.
///
/// This searcher combines a bit of smarts to handle stop words, usually
/// referred to as "dynamic stop word detection." Namely, after the searcher
/// splits the query into ngrams, it partitions the ngrams into infrequently
/// occurring ngrams and frequently occurring ngrams, according to some
/// hard-coded threshold. Each group is then turned into a `Disjunction`
/// query. The searcher then visits every doc ID that matches the infrequently
/// occurring disjunction. When a score is computed for a doc ID, then its
/// score is increased if the frequently occurring disjunction also contains
/// that same doc ID. Otherwise, the frequently occurring disjunction isn't
/// consulted at all, which permits skipping the score calculation for a
/// potentially large number of doc IDs.
///
/// When two partitions cannot be created (e.g., all of the terms are
/// infrequently occurring or all of the terms are frequently occurring), then
/// only one disjunction query is used and no skipping logic is employed. That
/// means that a query consisting of all high frequency terms could be quite
/// slow.
///
/// This does of course sacrifice recall for a performance benefit, but so do
/// all filtering strategies based on stop words. The benefit of this "dynamic"
/// approach is that stop word detection is tailored exactly to the corpus, and
/// that stop words can still influence scoring. That means queries like "the
/// matrix" will match "The Matrix" better than "Matrix" (which is a legitimate
/// example, try it).
struct Searcher<'i> {
/// A handle to the index.
index: &'i IndexReader,
/// The primary disjunction query that drives results. Typically, this
/// corresponds to the infrequent terms in the query.
primary: Disjunction<'i>,
/// A disjunction of only high frequency terms. When the query consists
/// of exclusively high frequency terms, then this is empty (which matches
/// nothing) and `primary` is set to the disjunction of terms.
high: Disjunction<'i>,
}
impl<'i> Searcher<'i> {
/// Create a new searcher.
fn new(idx: &'i IndexReader, query: &NameQuery) -> Searcher<'i> {
let num_docs = idx.config.num_documents as f64;
let (mut low, mut high) = (vec![], vec![]);
let (mut low_terms, mut high_terms) = (vec![], vec![]);
let name = normalize_query(&query.name);
let mut query_len = 0;
let mut multiset = FnvHashMap::default();
idx.config.ngram_type.iter(idx.config.ngram_size, &name, |term| {
*multiset.entry(term).or_insert(0) += 1;
query_len += 1;
});
for (term, &count) in multiset.iter() {
let postings = PostingIter::new(idx, query.scorer, count, term);
let ratio = (postings.len() as f64) / num_docs;
if ratio < query.stop_word_ratio {
low.push(postings);
low_terms.push(format!("{}:{}:{:0.6}", term, count, ratio));
} else {
high.push(postings);
high_terms.push(format!("{}:{}:{:0.6}", term, count, ratio));
}
}
log::debug!("starting search for: {:?}", name);
log::debug!("{:?} low frequency terms: {:?}", low.len(), low_terms);
log::debug!("{:?} high frequency terms: {:?}", high.len(), high_terms);
if low.is_empty() {
Searcher {
index: idx,
primary: Disjunction::new(idx, query_len, query.scorer, high),
high: Disjunction::empty(idx, query.scorer),
}
} else {
Searcher {
index: idx,
primary: Disjunction::new(idx, query_len, query.scorer, low),
high: Disjunction::new(idx, query_len, query.scorer, high),
}
}
}
/// Return a reference to the underlying index reader.
fn index(&self) -> &'i IndexReader {
self.index
}
}
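// The dynamic stop word partitioning performed in `Searcher::new` boils
// down to comparing each term's document frequency ratio against the
// query's threshold. A standalone sketch with illustrative names (the real
// code partitions `PostingIter`s, not strings):

```rust
use std::collections::HashMap;

/// Split terms into (low frequency, high frequency) groups by the ratio
/// of documents containing each term.
fn partition_terms<'a>(
    terms: &[&'a str],
    doc_freqs: &HashMap<&str, u64>,
    num_docs: u64,
    stop_word_ratio: f64,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let (mut low, mut high) = (vec![], vec![]);
    for &term in terms {
        let df = *doc_freqs.get(term).unwrap_or(&0) as f64;
        if df / (num_docs as f64) < stop_word_ratio {
            low.push(term);
        } else {
            high.push(term);
        }
    }
    (low, high)
}

fn main() {
    let doc_freqs: HashMap<&str, u64> =
        [("the", 90_000), ("mat", 500), ("atr", 300)].into_iter().collect();
    // At a 1% threshold over 100,000 documents, "the" is a stop word.
    let (low, high) =
        partition_terms(&["the", "mat", "atr"], &doc_freqs, 100_000, 0.01);
    assert_eq!(low, vec!["mat", "atr"]);
    assert_eq!(high, vec!["the"]);
}
```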
impl<'i> Iterator for Searcher<'i> {
type Item = Scored<DocID>;
fn next(&mut self) -> Option<Scored<DocID>> {
// This is pretty simple. We drive the iterator via the primary
// disjunction, which is usually a disjunction of infrequently
// occurring ngrams.
let mut scored = match self.primary.next() {
None => return None,
Some(scored) => scored,
};
// We then skip our frequently occurring disjunction to the doc ID
// yielded above. Any frequently occurring ngrams found then improve
// this score. This makes queries like 'the matrix' match 'The Matrix'
// better than 'Matrix'.
if let Some(other_scored) = self.high.skip_to(*scored.value()) {
scored = scored.map_score(|s| s + other_scored.score());
}
Some(scored)
}
}
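// The drive-and-boost pattern above (the primary list yields results, the
// high frequency list only adds to their scores) can be sketched over
// sorted (doc ID, score) slices. Standalone illustration; the real
// searcher skips the high frequency disjunction forward rather than binary
// searching from scratch:

```rust
/// Score primary hits, boosting any doc ID also present in `high`.
/// Both slices must be sorted by doc ID.
fn boosted(primary: &[(u32, f64)], high: &[(u32, f64)]) -> Vec<(u32, f64)> {
    primary
        .iter()
        .map(|&(docid, score)| {
            let boost = match high.binary_search_by_key(&docid, |e| e.0) {
                Ok(i) => high[i].1,
                Err(_) => 0.0,
            };
            (docid, score + boost)
        })
        .collect()
}

fn main() {
    let primary = [(2, 1.0), (5, 2.0), (9, 1.5)];
    let high = [(1, 0.5), (5, 0.25)];
    // Doc 5 appears in both lists, so its score gets the boost.
    assert_eq!(boosted(&primary, &high), vec![(2, 1.0), (5, 2.25), (9, 1.5)]);
}
```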
/// A disjunction over a collection of ngrams. A disjunction yields scored
/// document IDs for every document that contains any of the terms in this
/// disjunction. The more ngrams that match the document in the disjunction,
/// the better the score.
struct Disjunction<'i> {
/// A handle to the underlying index that we're searching.
index: &'i IndexReader,
/// The number of ngrams in the original query.
///
/// This is not necessarily equivalent to the number of ngrams in this
/// specific disjunction. Namely, this is used to compute scores, and it
/// is important that scores are computed using the total number of ngrams
/// and not the number of ngrams in a specific disjunction. For example,
/// if a query consisted of 8 infrequent ngrams and 1 frequent ngram, then
/// the disjunction containing the single frequent ngram would contribute a
/// disproportionately high score.
query_len: f64,
/// The scoring function to use.
scorer: NameScorer,
/// A min-heap of posting iterators. Each posting iterator corresponds to
/// an iterator over (doc ID, frequency) pairs for a single ngram, sorted
/// by doc ID in ascending order.
///
/// A min-heap is a classic way of optimally computing a disjunction over
/// an arbitrary number of ordered streams.
queue: BinaryHeap<PostingIter<'i>>,
/// Whether this disjunction has been exhausted or not.
is_done: bool,
}
impl<'i> Disjunction<'i> {
/// Create a new disjunction over the given posting iterators.
fn new(
index: &'i IndexReader,
query_len: usize,
scorer: NameScorer,
posting_iters: Vec<PostingIter<'i>>,
) -> Disjunction<'i> {
let mut queue = BinaryHeap::new();
for postings in posting_iters {
queue.push(postings);
}
let is_done = queue.is_empty();
let query_len = query_len as f64;
Disjunction { index, query_len, scorer, queue, is_done }
}
/// Create an empty disjunction that never matches anything.
fn empty(index: &'i IndexReader, scorer: NameScorer) -> Disjunction<'i> {
Disjunction {
index,
query_len: 0.0,
scorer,
queue: BinaryHeap::new(),
is_done: true,
}
}
/// Skip this disjunction such that every posting iterator is positioned
/// either at the given doc ID or at the smallest doc ID greater than it.
///
/// If any posting iterator contains the given doc ID, then it is scored
/// and returned. The score incorporates all posting iterators that contain
/// the given doc ID.
fn skip_to(&mut self, target_docid: DocID) -> Option<Scored<DocID>> {
if self.is_done {
return None;
}
let mut found = false;
// Loop invariant: loop until all posting iterators are either
// positioned directly at the target doc ID (in which case `found`
// is set to true) or beyond the target doc ID. If none of the
// iterators contain the target doc ID, then `found` remains `false`.
loop {
// This unwrap is OK because we're only here if we have a
// non-empty queue.
let mut postings = self.queue.peek_mut().unwrap();
if postings.docid().map_or(true, |x| x >= target_docid) {
found = found || Some(target_docid) == postings.docid();
// This is the smallest posting iterator, which means all
// iterators are now either at or beyond target_docid.
break;
}
// Skip through this iterator until we're at or beyond the target
// doc ID.
while postings.docid().map_or(false, |x| x < target_docid) {
postings.next();
}
found = found || Some(target_docid) == postings.docid();
}
if !found {
return None;
}
// We're here if we found our target doc ID, which means at least one
// posting iterator is pointing to the doc ID and it is necessarily
// the minimum doc ID of all the posting iterators in this disjunction.
// Therefore, advance such that all posting iterators are beyond the
// target doc ID.
//
// (If we didn't find the target doc ID, then the loop invariant above
// guarantees that we are already past the target doc ID.)
self.next()
}
}
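The min-heap union that drives `Disjunction` can be sketched in isolation. The following is a hypothetical, std-only analogue (the function name and signature are illustrative, not from this crate): each ascending stream contributes its head to a `BinaryHeap` wrapped in `cmp::Reverse`, the smallest head is popped, and that stream's next element is re-pushed.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge several ascending doc ID streams into one ascending,
/// de-duplicated stream, counting how many streams contained each ID.
/// This mirrors how a disjunction unions posting iterators.
fn merge_count(streams: Vec<Vec<u32>>) -> Vec<(u32, usize)> {
    // Heap entries are Reverse((next value, stream index, position)),
    // so the smallest head is always popped first.
    let mut heap = BinaryHeap::new();
    for (i, s) in streams.iter().enumerate() {
        if let Some(&v) = s.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out: Vec<(u32, usize)> = vec![];
    while let Some(Reverse((v, i, pos))) = heap.pop() {
        match out.last_mut() {
            Some(last) if last.0 == v => last.1 += 1,
            _ => out.push((v, 1)),
        }
        if let Some(&next) = streams[i].get(pos + 1) {
            heap.push(Reverse((next, i, pos + 1)));
        }
    }
    out
}
```

The real implementation avoids the re-push by mutating the head in place via `peek_mut`, which re-sifts the heap when the guard drops.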
impl<'i> Iterator for Disjunction<'i> {
type Item = Scored<DocID>;
fn next(&mut self) -> Option<Scored<DocID>> {
if self.is_done {
return None;
}
// Find our next matching ngram.
let mut scored1 = {
// This unwrap is OK because we're only here if we have a
// non-empty queue.
let mut postings = self.queue.peek_mut().unwrap();
match postings.score() {
None => {
self.is_done = true;
return None;
}
Some(scored) => {
postings.next();
scored
}
}
};
// Discover if any of the other posting iterators also match this
// ngram.
loop {
// This unwrap is OK because we're only here if we have a
// non-empty queue.
let mut postings = self.queue.peek_mut().unwrap();
match postings.score() {
None => break,
Some(scored2) => {
// If the smallest posting iterator isn't equivalent to
// the doc ID found above, then we've found all of the
// matching terms for this doc ID that we'll find.
if scored1.value() != scored2.value() {
break;
}
scored1 = scored1.map_score(|s| s + scored2.score());
postings.next();
}
}
}
// Some of our scorers are more convenient to compute at the
// disjunction level rather than at the term level.
if let NameScorer::Jaccard = self.scorer {
// When using Jaccard, the score returned by the posting
// iterator is always 1. Thus, `scored.score` represents the
// total number of terms that matched this document. In other
// words, it is the cardinality of the intersection of terms
// between the query and our candidate, `|A ∩ B|`.
//
// `query_len` represents the total number of terms in our query
// (not just the number of terms in this disjunction!), and
// `doc_len` represents the total number of terms in our candidate.
// Thus, since `|A u B| = |A| + |B| - |A ∩ B|`, we have that
// `|A u B| = query_len + doc_len - scored.score`. And finally, the
// Jaccard index is `|A ∩ B| / |A u B|`.
let doc_len = self.index.document_length(*scored1.value()) as f64;
let union = self.query_len + doc_len - scored1.score();
scored1 = scored1.map_score(|s| s / union);
} else if let NameScorer::QueryRatio = self.scorer {
// This is like Jaccard, but our score is computed purely as the
// ratio of query terms that matched this document.
scored1 = scored1.map_score(|s| s / self.query_len)
}
Some(scored1)
}
}
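The two set-based scores computed in the branches above reduce to simple arithmetic over ngram counts. A minimal sketch, assuming `matched` is the number of query ngrams found in the document (function names are illustrative, not from this crate):

```rust
/// Jaccard index: |A ∩ B| / |A ∪ B|, using |A ∪ B| = |A| + |B| - |A ∩ B|.
fn jaccard(query_len: f64, doc_len: f64, matched: f64) -> f64 {
    matched / (query_len + doc_len - matched)
}

/// QueryRatio: the fraction of query ngrams that matched the document.
fn query_ratio(query_len: f64, matched: f64) -> f64 {
    matched / query_len
}
```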
/// An iterator over a postings list for a specific ngram.
///
/// A postings list is a sequence of pairs, where each pair has a document
/// ID and a frequency. The document ID indicates that the ngram is in the
/// text indexed for that ID, and the frequency counts the number of times
/// that ngram occurs in the document.
///
/// To save space, each pair is encoded using 32 bits. Frequencies are capped
/// at a maximum of 15, which fits in the high 4 bits. The low 28 bits contain
/// the doc ID.
///
/// The postings list starts with a single 32-bit little endian
/// integer that represents the document frequency of the ngram. This in turn
/// determines how many pairs to read. In other words, a posting list is a
/// length prefixed array of 32 bit little endian integer values.
///
/// This type is intended to be used in a max-heap, and orients its Ord
/// definition such that the heap becomes a min-heap. The ordering criterion
/// is derived from the docid only.
#[derive(Clone)]
struct PostingIter<'i> {
/// A handle to the underlying index.
index: &'i IndexReader,
/// The scoring function to use.
scorer: NameScorer,
/// The number of times the term for these postings appeared in the
/// original query. This increases the score proportionally.
count: f64,
/// The raw bytes of the posting list. The number of bytes is
/// exactly equivalent to `4 * document-frequency(ngram)`, where
/// `document-frequency(ngram)` is the total number of documents in which
/// `ngram` occurs.
///
/// This does not include the length prefix.
postings: &'i [u8],
/// The document frequency of this term.
len: usize,
/// The current posting. This is `None` once this iterator is exhausted.
posting: Option<Posting>,
/// A docid used for sorting postings. When the iterator is exhausted,
/// this is greater than the maximum doc id. Otherwise, this is always
/// equivalent to posting.docid.
///
/// We do this for efficiency by avoiding going through the optional
/// Posting.
docid: DocID,
/// The OkapiBM25 IDF score. This is invariant across all items in a
/// posting list, so we compute it once at construction. This saves a
/// call to `log` for every doc ID visited.
okapi_idf: f64,
}
/// A single entry in a posting list.
#[derive(Clone, Copy, Debug)]
struct Posting {
/// The document id.
docid: DocID,
/// The frequency, i.e., the number of times the ngram occurred in the
/// document identified by the docid.
frequency: u32,
}
impl Posting {
/// Read the next posting pair (doc ID and frequency) from the given
/// postings list. If the list is empty, then return `None`.
fn read(slice: &[u8]) -> Option<Posting> {
if slice.is_empty() {
None
} else {
let v = read_le_u32(slice);
Some(Posting { docid: v & MAX_DOC_ID, frequency: v >> 28 })
}
}
}
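The 4-bit frequency / 28-bit doc ID packing described above can be sketched on its own. This sketch assumes `MAX_DOC_ID` is `2^28 - 1`, consistent with the layout documented for the postings list:

```rust
/// Doc IDs occupy the low 28 bits; this stands in for the crate's
/// `MAX_DOC_ID` constant.
const MAX_DOC_ID: u32 = (1 << 28) - 1;

/// Pack a (doc ID, frequency) pair into 32 bits, capping the frequency
/// at 15 so it fits in the high 4 bits.
fn pack(docid: u32, frequency: u32) -> u32 {
    (frequency.min(15) << 28) | (docid & MAX_DOC_ID)
}

/// Unpack a 32-bit posting back into (doc ID, frequency).
fn unpack(v: u32) -> (u32, u32) {
    (v & MAX_DOC_ID, v >> 28)
}
```

`unpack` performs the same masking and shifting as `Posting::read`, while `pack` mirrors the write side in `IndexWriter::finish`.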
impl<'i> PostingIter<'i> {
/// Create a new posting iterator for the given term in the given index.
/// Scores will be computed with the given scoring function.
///
/// `count` should be the number of times this term occurred in the
/// original query string.
fn new(
index: &'i IndexReader,
scorer: NameScorer,
count: usize,
term: &str,
) -> PostingIter<'i> {
let mut postings = &*index.postings;
let offset = match index.ngram.get(term.as_bytes()) {
Some(offset) => offset as usize,
None => {
// If the term isn't in the index, then return an exhausted
// iterator.
return PostingIter {
index,
scorer,
count: 0.0,
postings: &[],
len: 0,
posting: None,
docid: MAX_DOC_ID + 1,
okapi_idf: 0.0,
};
}
};
postings = &postings[offset..];
let len = read_le_u32(postings) as usize;
postings = &postings[4..];
let corpus_count = index.config.num_documents as f64;
let df = len as f64;
let okapi_idf = (1.0 + (corpus_count - df + 0.5) / (df + 0.5)).log2();
let mut it = PostingIter {
index,
scorer,
count: count as f64,
postings: &postings[..4 * len],
len,
posting: None,
docid: 0,
okapi_idf,
};
// Advance to the first posting.
it.next();
it
}
/// Return the current posting. If this iterator has been exhausted, then
/// this returns `None`.
fn posting(&self) -> Option<Posting> {
self.posting
}
/// Returns the document frequency for the term corresponding to these
/// postings.
fn len(&self) -> usize {
self.len
}
/// Return the current document ID. If this iterator has been exhausted,
/// then this returns `None`.
fn docid(&self) -> Option<DocID> {
self.posting().map(|p| p.docid)
}
/// Return the score with the current document ID. If this iterator has
/// been exhausted, then this returns `None`.
fn score(&self) -> Option<Scored<DocID>> {
match self.scorer {
NameScorer::OkapiBM25 => self.score_okapibm25(),
NameScorer::TFIDF => self.score_tfidf(),
NameScorer::Jaccard => self.score_jaccard(),
NameScorer::QueryRatio => self.score_query_ratio(),
}
.map(|scored| scored.map_score(|s| s * self.count))
}
/// Score the current doc ID using Okapi BM25. It's similarish to TF-IDF,
/// but uses a document length normalization term.
fn score_okapibm25(&self) -> Option<Scored<DocID>> {
let post = match self.posting() {
None => return None,
Some(post) => post,
};
let k1 = 1.2;
let b = 0.75;
let doc_len = self.index.document_length(post.docid);
let norm = (doc_len as f64) / self.index.config.avg_document_len;
let tf = post.frequency as f64;
let num = tf * (k1 + 1.0);
let den = tf + k1 * (1.0 - b + b * norm);
let score = (num / den) * self.okapi_idf;
let capped = if score < 0.0 { 0.0 } else { score };
Some(Scored::new(post.docid).with_score(capped))
}
/// Score the current doc ID using the traditional TF-IDF ranking function.
fn score_tfidf(&self) -> Option<Scored<DocID>> {
let post = match self.posting() {
None => return None,
Some(post) => post,
};
let corpus_docs = self.index.config.num_documents as f64;
let term_docs = self.len as f64;
let tf = post.frequency as f64;
let idf = (corpus_docs / (1.0 + term_docs)).log2();
let score = tf * idf;
Some(Scored::new(post.docid).with_score(score))
}
/// Score the current doc ID using the Jaccard index, which measures the
/// overlap between two sets.
///
/// Note that this always returns `1.0`. The Jaccard index itself must be
/// computed by the disjunction scorer.
fn score_jaccard(&self) -> Option<Scored<DocID>> {
self.posting().map(|p| Scored::new(p.docid).with_score(1.0))
}
/// Score the current doc ID using the ratio of terms in the query that
/// matched the terms in this doc ID.
///
/// Note that this always returns `1.0`. The query ratio itself must be
/// computed by the disjunction scorer.
fn score_query_ratio(&self) -> Option<Scored<DocID>> {
self.posting().map(|p| Scored::new(p.docid).with_score(1.0))
}
}
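For reference, the full Okapi BM25 term score assembled from `score_okapibm25` plus the IDF precomputed in `PostingIter::new` can be written as one free function. A sketch under those formulas (parameter names are illustrative):

```rust
/// A standalone sketch of the BM25 term score used above. `tf` is the
/// term frequency in the document, `df` the document frequency of the
/// term, `n` the corpus size, and `doc_len` / `avg_len` the document
/// length normalization inputs.
fn bm25(tf: f64, df: f64, n: f64, doc_len: f64, avg_len: f64) -> f64 {
    let (k1, b) = (1.2, 0.75);
    // IDF is invariant per term, which is why the real code hoists it
    // into the iterator's constructor.
    let idf = (1.0 + (n - df + 0.5) / (df + 0.5)).log2();
    let norm = doc_len / avg_len;
    let score = (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * norm)) * idf;
    // Clamp to zero, matching the capping in `score_okapibm25`.
    score.max(0.0)
}
```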
impl<'i> Iterator for PostingIter<'i> {
type Item = Posting;
fn next(&mut self) -> Option<Posting> {
self.posting = match Posting::read(self.postings) {
None => {
self.docid = MAX_DOC_ID + 1;
None
}
Some(p) => {
self.postings = &self.postings[4..];
self.docid = p.docid;
Some(p)
}
};
self.posting
}
}
impl<'i> Eq for PostingIter<'i> {}
impl<'i> PartialEq for PostingIter<'i> {
fn eq(&self, other: &PostingIter<'i>) -> bool {
self.docid == other.docid
}
}
impl<'i> Ord for PostingIter<'i> {
fn cmp(&self, other: &PostingIter<'i>) -> cmp::Ordering {
// std::collections::BinaryHeap is a max-heap and we need a
// min-heap, so write this as-if it were a max-heap, then reverse it.
// Note that exhausted searchers should always have the lowest
// priority, and therefore, be considered maximal.
self.docid.cmp(&other.docid).reverse()
}
}
impl<'i> PartialOrd for PostingIter<'i> {
fn partial_cmp(&self, other: &PostingIter<'i>) -> Option<cmp::Ordering> {
Some(self.cmp(other))
}
}
/// A writer for indexing names to disk.
///
/// A writer opens and writes to several files simultaneously, which keeps the
/// implementation simple.
///
/// The index writer cannot stream the postings or term index, since the term
/// index requires its ngrams to be inserted in sorted order. Postings lists
/// are written as length prefixed sequences, so we need to know the lengths
/// of all our postings lists before writing them.
pub struct IndexWriter {
/// A builder for the ngram term index.
///
/// This isn't used until the caller indicates that it is done indexing
/// names. At which point, we insert all ngrams into the FST in sorted
/// order. Each ngram is mapped to the beginning of its corresponding
/// postings list.
ngram: fst::MapBuilder<io::BufWriter<File>>,
/// The type of ngram extraction to use.
ngram_type: NgramType,
/// The size of ngrams to generate.
ngram_size: usize,
/// A writer for postings lists.
///
/// This isn't written to until the caller indicates that it is done
/// indexing names. At which point, every posting list is written as a
/// length prefixed array, in the same order that terms are written to the
/// term index.
postings: CursorWriter<io::BufWriter<File>>,
/// A map from document ID to name ID. This is written to in a streaming
/// fashion during indexing. The ID map consists of N 64-bit little
/// endian integers, where N is the total number of names indexed.
///
/// The document ID (the position in this map) is a unique internal
/// identifier assigned to each name, while the name ID is an identifier
/// provided by the caller. Multiple document IDs may map to the same
/// name ID (e.g., for indexing alternate names).
idmap: CursorWriter<io::BufWriter<File>>,
/// A map from document ID to document length, where the length corresponds
/// to the number of ngrams in the document. The map consists of N 16-bit
/// little endian integers, where N is the total number of names indexed.
///
/// The document lengths are used at query time as normalization
/// parameters. They are written in a streaming fashion during the indexing
/// process.
norms: CursorWriter<io::BufWriter<File>>,
/// A JSON formatted configuration file that includes some aggregate
/// statistics (such as the average document length, in ngrams) and the
/// ngram configuration. The ngram configuration in particular is used at
/// query time to make sure that query-time uses the same analysis as
/// index-time.
///
/// This is written at the end of the indexing process.
config: CursorWriter<io::BufWriter<File>>,
/// An in-memory map from ngram to its corresponding postings list. Once
/// indexing is done, this is written to disk via the FST term index and
/// postings list writers documented above.
terms: FnvHashMap<String, Postings>,
/// The next document ID, starting at 0. Each name added gets assigned its
/// own unique document ID. Queries read document IDs from the postings
/// list, but are mapped back to name IDs using the `idmap` before being
/// returned to the caller.
next_docid: DocID,
/// The average document length, in ngrams, for every name indexed. This is
/// used along with document lengths to compute normalization terms for
/// scoring at query time.
avg_document_len: f64,
}
/// A single postings list.
#[derive(Clone, Debug, Default)]
struct Postings {
/// A sorted list of postings, in order of ascending document IDs.
list: Vec<Posting>,
}
impl IndexWriter {
/// Open an index for writing to the given directory. Any previous name
/// index in the given directory is overwritten.
///
/// The given ngram configuration is used to transform all indexed names
/// into terms for the inverted index.
pub fn open<P: AsRef<Path>>(
dir: P,
ngram_type: NgramType,
ngram_size: usize,
) -> Result<IndexWriter> {
let dir = dir.as_ref();
let ngram = fst_map_builder_file(dir.join(NGRAM))?;
let postings = CursorWriter::from_path(dir.join(POSTINGS))?;
let idmap = CursorWriter::from_path(dir.join(IDMAP))?;
let norms = CursorWriter::from_path(dir.join(NORMS))?;
let config = CursorWriter::from_path(dir.join(CONFIG))?;
Ok(IndexWriter {
ngram,
ngram_type,
ngram_size,
postings,
idmap,
norms,
config,
terms: FnvHashMap::default(),
next_docid: 0,
avg_document_len: 0.0,
})
}
/// Finish writing names and serialize the index to disk.
pub fn finish(mut self) -> Result<()> {
let num_docs = self.num_docs();
let mut ngram_to_postings: Vec<(String, Postings)> =
self.terms.into_iter().collect();
// We could use a BTreeMap and get out our keys in sorted order, but
// the overhead of inserting into the BTreeMap dwarfs the savings we
// get from pre-sorted keys.
ngram_to_postings.sort_by(|&(ref t1, _), &(ref t2, _)| t1.cmp(t2));
for (term, postings) in ngram_to_postings {
let pos = self.postings.position() as u64;
self.ngram.insert(term.as_bytes(), pos).map_err(Error::fst)?;
self.postings
.write_u32(postings.list.len() as u32)
.map_err(Error::io)?;
for posting in postings.list {
let freq = cmp::min(15, posting.frequency);
let v = (freq << 28) | posting.docid;
self.postings.write_u32(v).map_err(Error::io)?;
}
}
serde_json::to_writer_pretty(
&mut self.config,
&Config {
ngram_type: self.ngram_type,
ngram_size: self.ngram_size,
avg_document_len: self.avg_document_len,
num_documents: num_docs as u64,
},
)
.map_err(|e| Error::config(e.to_string()))?;
self.ngram.finish().map_err(Error::fst)?;
self.idmap.flush().map_err(Error::io)?;
self.postings.flush().map_err(Error::io)?;
self.norms.flush().map_err(Error::io)?;
self.config.flush().map_err(Error::io)?;
Ok(())
}
/// Inserts the given name to this index, and associates it with the
/// provided `NameID`. Multiple names may be associated with the same
/// `NameID`.
pub fn insert(&mut self, name_id: NameID, name: &str) -> Result<()> {
let docid = self.next_docid(name_id)?;
let name = normalize_query(name);
let mut count = 0u16; // document length in number of ngrams
self.ngram_type.clone().iter(self.ngram_size, &name, |ngram| {
self.insert_term(docid, ngram);
// If a document length exceeds 2^16 - 1, then it is far too long for
// a name anyway, so we cap it there.
count = count.saturating_add(1);
});
// Update our mean document length (in ngrams).
self.avg_document_len +=
(count as f64 - self.avg_document_len) / (self.num_docs() as f64);
// Write the document length to disk, which is used as a normalization
// term for some scorers (like Okapi-BM25).
self.norms.write_u16(count).map_err(Error::io)?;
Ok(())
}
/// Add a single term that is part of a name identified by the given docid.
/// This updates the postings for this term, or creates a new posting if
/// this is the first time this term has been seen.
fn insert_term(&mut self, docid: DocID, term: &str) {
if let Some(posts) = self.terms.get_mut(term) {
posts.posting(docid).frequency += 1;
return;
}
let mut list = Postings::default();
list.posting(docid).frequency = 1;
self.terms.insert(term.to_string(), list);
}
/// Retrieve a fresh doc id, and associate it with the given name id.
fn next_docid(&mut self, name_id: NameID) -> Result<DocID> {
let docid = self.next_docid;
self.idmap.write_u64(name_id).map_err(Error::io)?;
self.next_docid = match self.next_docid.checked_add(1) {
None => bug!("exhausted doc ids"),
Some(next_docid) => next_docid,
};
if self.next_docid > MAX_DOC_ID {
let max = MAX_DOC_ID + 1; // docids are 0-indexed
bug!("exceeded maximum number of names ({})", max);
}
Ok(docid)
}
/// Return the total number of documents that have been assigned doc ids.
fn num_docs(&self) -> u32 {
self.next_docid
}
}
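The `avg_document_len` update in `insert` is the standard streaming-mean recurrence, `avg += (x - avg) / n`, which folds each new length into the average without storing all lengths. A minimal std-only sketch:

```rust
/// Streaming mean over document lengths, matching the update
/// performed incrementally by the index writer's `insert`.
fn streaming_mean(lengths: &[u16]) -> f64 {
    let mut avg = 0.0;
    for (i, &len) in lengths.iter().enumerate() {
        // After the i-th element, `avg` is the exact mean of the
        // first i+1 lengths.
        avg += (len as f64 - avg) / ((i + 1) as f64);
    }
    avg
}
```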
impl Postings {
/// Return a mutable reference to the posting for the given docid. If one
/// doesn't exist, then create one (with a zero frequency) and return it.
fn posting(&mut self, docid: DocID) -> &mut Posting {
if self.list.last().map_or(true, |x| x.docid != docid) {
self.list.push(Posting { docid, frequency: 0 });
}
// This unwrap is OK because if the list was empty when this method was
// called, then we added an element above, and is thus now non-empty.
self.list.last_mut().unwrap()
}
}
/// The type of scorer that the name index should use.
///
/// The default is OkapiBM25. If you aren't sure which scorer to use, then
/// stick with the default.
#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
pub enum NameScorer {
/// OkapiBM25 is a TF-IDF-like ranking function, which takes name length
/// into account.
OkapiBM25,
/// TFIDF is the traditional TF-IDF ranking function, which does not
/// incorporate document length.
TFIDF,
/// Jaccard is a ranking function determined by computing the similarity
/// of ngrams between the query and a name in the index. The similarity
/// is computed by dividing the number of ngrams in common by the total
/// number of distinct ngrams in both the query and the name combined.
Jaccard,
/// QueryRatio is a ranking function that represents the ratio of query
/// terms that matched a name. It is computed by dividing the number of
/// ngrams in common by the total number of ngrams in the query only.
QueryRatio,
}
impl NameScorer {
/// Returns a list of strings representing the possible scorer values.
pub fn possible_names() -> &'static [&'static str] {
&["okapibm25", "tfidf", "jaccard", "queryratio"]
}
/// Return a string representation of this scorer.
///
/// The string returned can be parsed back into a `NameScorer`.
pub fn as_str(&self) -> &'static str {
match *self {
NameScorer::OkapiBM25 => "okapibm25",
NameScorer::TFIDF => "tfidf",
NameScorer::Jaccard => "jaccard",
NameScorer::QueryRatio => "queryratio",
}
}
}
impl Default for NameScorer {
fn default() -> NameScorer {
NameScorer::OkapiBM25
}
}
impl fmt::Display for NameScorer {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.as_str())
}
}
impl FromStr for NameScorer {
type Err = Error;
fn from_str(s: &str) -> Result<NameScorer> {
match s {
"okapibm25" => Ok(NameScorer::OkapiBM25),
"tfidf" => Ok(NameScorer::TFIDF),
"jaccard" => Ok(NameScorer::Jaccard),
"queryratio" => Ok(NameScorer::QueryRatio),
unk => Err(Error::unknown_scorer(unk)),
}
}
}
/// The style of ngram extraction to use.
///
/// The same style of ngram extraction is always used at index time and at
/// query time.
///
/// Each ngram type uses the ngram size configuration differently.
///
/// All ngram styles use Unicode codepoints as the definition of a character.
/// For example, a 3-gram might contain up to 12 bytes, if it contains 3
/// Unicode codepoints that each require 4 UTF-8 code units.
#[derive(Clone, Copy, Debug, Deserialize, Eq, Hash, PartialEq, Serialize)]
pub enum NgramType {
/// A windowing ngram.
///
/// This is the traditional style of ngram, where a sliding window of size
/// `N` is moved across the entire content to be indexed. For example, the
/// 3-grams for the string `homer` are hom, ome and mer.
#[serde(rename = "window")]
Window,
/// An edge ngram.
///
/// This style of ngram produces ever longer ngrams, where each ngram is
/// anchored to the start of a word. Words are determined simply by
/// splitting whitespace.
///
/// For example, the edge ngrams of `homer simpson`, where the max ngram
/// size is 5, would be: hom, home, homer, sim, simp, simps. Generally,
/// for this ngram type, one wants to use a large maximum ngram size.
/// Perhaps somewhere close to the length, in characters, of the longest
/// word in the corpus.
///
/// Note that there is no way to set the minimum ngram size (which is 3).
#[serde(rename = "edge")]
Edge,
}
/// The minimum size of an ngram emitted by the edge ngram iterator.
const MIN_EDGE_NGRAM_SIZE: usize = 3;
impl NgramType {
/// Return all possible ngram types.
pub fn possible_names() -> &'static [&'static str] {
&["window", "edge"]
}
/// Return a string representation of this type.
pub fn as_str(&self) -> &'static str {
match *self {
NgramType::Window => "window",
NgramType::Edge => "edge",
}
}
/// Execute the given function over each ngram in the text provided using
/// the given size configuration.
///
/// We don't use normal Rust iterators here because an internal iterator
/// is much easier to implement.
fn iter<'t, F: FnMut(&'t str)>(&self, size: usize, text: &'t str, f: F) {
match *self {
NgramType::Window => NgramType::iter_window(size, text, f),
NgramType::Edge => NgramType::iter_edge(size, text, f),
}
}
fn iter_window<'t, F: FnMut(&'t str)>(
size: usize,
text: &'t str,
mut f: F,
) {
if size == 0 {
return;
}
let end_skip = text.chars().take(size).count().saturating_sub(1);
let start = text.char_indices();
let end = text.char_indices().skip(end_skip);
for ((s, _), (e, c)) in start.zip(end) {
f(&text[s..e + c.len_utf8()]);
}
}
fn iter_edge<'t, F: FnMut(&'t str)>(
max_size: usize,
text: &'t str,
mut f: F,
) {
if max_size == 0 {
return;
}
for word in text.split_whitespace() {
let end_skip = word
.chars()
.take(MIN_EDGE_NGRAM_SIZE)
.count()
.saturating_sub(1);
let mut size = end_skip + 1;
for (end, c) in word.char_indices().skip(end_skip) {
f(&word[..end + c.len_utf8()]);
size += 1;
if size > max_size {
break;
}
}
}
}
}
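The offset-zipping in `iter_window` is the subtle part: pairing each starting byte offset with the offset of the codepoint `size - 1` positions ahead keeps every slice on a UTF-8 boundary. A standalone sketch that collects the ngrams into a `Vec` (a hypothetical helper, not part of this crate, which uses an internal iterator instead):

```rust
/// Collect sliding-window ngrams of `size` codepoints, mirroring the
/// offset-zipping approach of the window ngram iterator.
fn window_ngrams(size: usize, text: &str) -> Vec<&str> {
    if size == 0 {
        return vec![];
    }
    // Skip the "end" iterator ahead so that each (start, end) pair
    // spans exactly `size` codepoints (or fewer, for short inputs).
    let end_skip = text.chars().take(size).count().saturating_sub(1);
    let start = text.char_indices();
    let end = text.char_indices().skip(end_skip);
    start
        .zip(end)
        .map(|((s, _), (e, c))| &text[s..e + c.len_utf8()])
        .collect()
}
```

Because both iterators advance by codepoints, slicing at `e + c.len_utf8()` never splits a multi-byte character, even for non-ASCII input.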
impl Default for NgramType {
fn default() -> NgramType {
NgramType::Window
}
}
impl fmt::Display for NgramType {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.as_str())
}
}
impl FromStr for NgramType {
type Err = Error;
fn from_str(s: &str) -> Result<NgramType> {
match s {
"window" => Ok(NgramType::Window),
"edge" => Ok(NgramType::Edge),
unk => Err(Error::unknown_ngram_type(unk)),
}
}
}
fn normalize_query(s: &str) -> String {
// We might consider doing Unicode normalization here, but it probably
// doesn't matter too much on a predominantly ASCII data set.
s.to_lowercase()
}
fn read_le_u32(slice: &[u8]) -> u32 {
u32::from_le_bytes(slice[..4].try_into().unwrap())
}
#[cfg(test)]
mod tests {
use super::*;
use crate::index::tests::TestContext;
// Test the actual name index.
/// Creates a name index, where each name provided is assigned its own
/// unique ID, starting at 0.
fn create_index(index_dir: &Path, names: &[&str]) -> IndexReader {
let mut wtr =
IndexWriter::open(index_dir, NgramType::Window, 3).unwrap();
for (i, name) in names.iter().enumerate() {
wtr.insert(i as u64, name).unwrap();
}
wtr.finish().unwrap();
IndexReader::open(index_dir).unwrap()
}
/// Build a name query, and disable the dynamic stop word detection.
///
/// It would be nice to test the stop word detection, but it makes writing
/// unit tests very difficult unfortunately.
fn name_query(name: &str) -> NameQuery {
NameQuery::new(name).with_stop_word_ratio(0.0)
}
fn ids(results: &[Scored<NameID>]) -> Vec<NameID> {
let mut ids: Vec<_> = results.iter().map(|r| *r.value()).collect();
ids.sort();
ids
}
/// Some names involving bruce.
const BRUCES: &'static [&'static str] = &[
"Bruce Springsteen", // 0
"Bruce Kulick", // 1
"Bruce Arians", // 2
"Bruce Smith", // 3
"Bruce Willis", // 4
"Bruce Wayne", // 5
"Bruce Banner", // 6
];
#[test]
fn names_bruces_1() {
let ctx = TestContext::new("small");
let idx = create_index(ctx.index_dir(), BRUCES);
let query = name_query("bruce");
let results = idx.search(&query).into_vec();
// This query matches everything.
assert_eq!(results.len(), 7);
// The top two hits are the shortest documents, because of Okapi-BM25's
// length normalization.
assert_eq!(results[0].score(), 1.0);
assert_eq!(results[1].score(), 1.0);
assert_eq!(ids(&results[0..2]), vec![3, 5]);
}
#[test]
fn names_bruces_2() {
let ctx = TestContext::new("small");
let idx = create_index(ctx.index_dir(), BRUCES);
let query = name_query("e w");
let results = idx.search(&query).into_vec();
// The 'e w' ngram is only in two documents: Bruce Willis and
// Bruce Wayne. Since Wayne is shorter than Willis, it should always
// be first.
assert_eq!(results.len(), 2);
assert_eq!(*results[0].value(), 5);
assert_eq!(*results[1].value(), 4);
}
#[test]
fn names_bruces_3() {
let ctx = TestContext::new("small");
let idx = create_index(ctx.index_dir(), BRUCES);
let query = name_query("Springsteen");
let results = idx.search(&query).into_vec();
assert_eq!(results.len(), 1);
assert_eq!(*results[0].value(), 0);
}
#[test]
fn names_bruces_4() {
let ctx = TestContext::new("small");
let idx = create_index(ctx.index_dir(), BRUCES);
let query =
name_query("Springsteen Kulick Arians Smith Willis Wayne Banner");
let results = idx.search(&query).into_vec();
// This query should hit everything.
assert_eq!(results.len(), 7);
}
// Test our various ngram strategies.
fn ngrams_window(n: usize, text: &str) -> Vec<&str> {
let mut grams = vec![];
NgramType::Window.iter(n, text, |gram| grams.push(gram));
grams
}
fn ngrams_edge(n: usize, text: &str) -> Vec<&str> {
let mut grams = vec![];
NgramType::Edge.iter(n, text, |gram| grams.push(gram));
grams
}
#[test]
#[should_panic]
fn ngrams_window_zero_banned() {
assert_eq!(ngrams_window(0, "abc"), vec!["abc"]);
}
#[test]
fn ngrams_window_weird_sizes() {
assert_eq!(
ngrams_window(2, "abcdef"),
vec!["ab", "bc", "cd", "de", "ef",]
);
assert_eq!(
ngrams_window(1, "abcdef"),
vec!["a", "b", "c", "d", "e", "f",]
);
assert_eq!(ngrams_window(2, "ab"), vec!["ab",]);
assert_eq!(ngrams_window(1, "ab"), vec!["a", "b",]);
assert_eq!(ngrams_window(1, "a"), vec!["a",]);
assert_eq!(ngrams_window(1, ""), Vec::<&str>::new());
}
#[test]
fn ngrams_window_ascii() {
assert_eq!(
ngrams_window(3, "abcdef"),
vec!["abc", "bcd", "cde", "def",]
);
assert_eq!(ngrams_window(3, "abcde"), vec!["abc", "bcd", "cde",]);
assert_eq!(ngrams_window(3, "abcd"), vec!["abc", "bcd",]);
assert_eq!(ngrams_window(3, "abc"), vec!["abc",]);
assert_eq!(ngrams_window(3, "ab"), vec!["ab",]);
assert_eq!(ngrams_window(3, "a"), vec!["a",]);
assert_eq!(ngrams_window(3, ""), Vec::<&str>::new());
}
#[test]
fn ngrams_window_non_ascii() {
assert_eq!(
ngrams_window(3, "αβγφδε"),
vec!["αβγ", "βγφ", "γφδ", "φδε",]
);
assert_eq!(ngrams_window(3, "αβγφδ"), vec!["αβγ", "βγφ", "γφδ",]);
assert_eq!(ngrams_window(3, "αβγφ"), vec!["αβγ", "βγφ",]);
assert_eq!(ngrams_window(3, "αβγ"), vec!["αβγ",]);
assert_eq!(ngrams_window(3, "αβ"), vec!["αβ",]);
assert_eq!(ngrams_window(3, "α"), vec!["α",]);
}
#[test]
fn ngrams_edge_ascii() {
assert_eq!(
ngrams_edge(5, "homer simpson"),
vec!["hom", "home", "homer", "sim", "simp", "simps",]
);
assert_eq!(ngrams_edge(5, "h"), vec!["h",]);
assert_eq!(ngrams_edge(5, "ho"), vec!["ho",]);
assert_eq!(ngrams_edge(5, "hom"), vec!["hom",]);
assert_eq!(ngrams_edge(5, "home"), vec!["hom", "home",]);
}
#[test]
fn ngrams_edge_non_ascii() {
assert_eq!(
ngrams_edge(5, "δεαβγφδε δε"),
vec!["δεα", "δεαβ", "δεαβγ", "δε",]
);
}
}
================================================
FILE: imdb-index/src/index/rating.rs
================================================
use std::path::Path;
use fst::{IntoStreamer, Streamer};
use memmap::Mmap;
use crate::error::{Error, Result};
use crate::record::Rating;
use crate::util::{
csv_file, fst_set_builder_file, fst_set_file, IMDB_RATINGS,
};
/// The name of the ratings index file.
///
/// The ratings index maps IMDb title IDs to their average rating and number of
/// votes. The index is itself an FST set, where all keys begin with the IMDb
/// title ID, and also contain the average rating and number of votes. Thus, a
/// lookup is accomplished via a range query on the title ID without needing
/// to consult the original CSV data.
const RATINGS: &str = "ratings.fst";
/// An index for ratings, which supports looking up ratings/votes for IMDb
/// titles efficiently.
#[derive(Debug)]
pub struct Index {
idx: fst::Set<Mmap>,
}
impl Index {
/// Open a rating index from the given index directory.
pub fn open<P: AsRef<Path>>(index_dir: P) -> Result<Index> {
Ok(Index {
// We claim it is safe to open the following memory map because we
// don't mutate it and no other process should either.
idx: unsafe { fst_set_file(index_dir.as_ref().join(RATINGS))? },
})
}
/// Create a rating index from the given IMDb data directory, and write it
/// to the given index directory. If a rating index already exists, then it
/// is overwritten.
pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
data_dir: P1,
index_dir: P2,
) -> Result<Index> {
let data_dir = data_dir.as_ref();
let index_dir = index_dir.as_ref();
let mut buf = vec![];
let mut count = 0u64;
let mut idx = fst_set_builder_file(index_dir.join(RATINGS))?;
let mut rdr = csv_file(data_dir.join(IMDB_RATINGS))?;
for result in rdr.deserialize() {
let record: Rating = result.map_err(Error::csv)?;
buf.clear();
write_rating(&record, &mut buf)?;
idx.insert(&buf).map_err(Error::fst)?;
count += 1;
}
idx.finish().map_err(Error::fst)?;
log::info!("{} ratings indexed", count);
Index::open(index_dir)
}
/// Return the rating information (which includes the actual rating and
/// the number of votes associated with that rating) for the given IMDb
/// identifier. If no rating information exists for the given ID, then
/// `None` is returned.
pub fn rating(&self, id: &[u8]) -> Result<Option<Rating>> {
let mut upper = id.to_vec();
upper.push(0xFF);
let mut stream = self.idx.range().ge(id).le(upper).into_stream();
if let Some(rating_bytes) = stream.next() {
return Ok(Some(read_rating(rating_bytes)?));
}
Ok(None)
}
}
fn read_rating(bytes: &[u8]) -> Result<Rating> {
let nul = match bytes.iter().position(|&b| b == 0) {
Some(nul) => nul,
None => bug!("could not find nul byte"),
};
let id = match String::from_utf8(bytes[..nul].to_vec()) {
Err(err) => bug!("rating id invalid UTF-8: {}", err),
Ok(tvshow_id) => tvshow_id,
};
let i = nul + 1;
Ok(Rating {
id,
rating: read_rating_value(&bytes[i..])?,
votes: read_votes_value(&bytes[i + 4..])?,
})
}
fn write_rating(rat: &Rating, buf: &mut Vec<u8>) -> Result<()> {
if rat.id.as_bytes().iter().any(|&b| b == 0) {
bug!("unsupported rating id (with NUL byte) for {:?}", rat);
}
buf.extend_from_slice(rat.id.as_bytes());
buf.push(0x00);
write_rating_value(rat.rating, buf);
write_votes_value(rat.votes, buf);
Ok(())
}
fn read_votes_value(slice: &[u8]) -> Result<u32> {
if slice.len() < 4 {
bug!("not enough bytes to read votes value");
}
Ok(u32::from_be_bytes(slice[..4].try_into().unwrap()))
}
fn write_votes_value(votes: u32, buf: &mut Vec<u8>) {
buf.extend_from_slice(&votes.to_be_bytes())
}
fn read_rating_value(slice: &[u8]) -> Result<f32> {
if slice.len() < 4 {
bug!("not enough bytes to read rating value");
}
Ok(f32::from_be_bytes(slice[..4].try_into().unwrap()))
}
fn write_rating_value(rating: f32, buf: &mut Vec<u8>) {
buf.extend_from_slice(&rating.to_be_bytes())
}
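The index stores each rating as a single FST key: the title ID, a NUL separator, then the big-endian rating and vote count, so a lexicographic range scan from `id` to `id ++ 0xFF` finds the entry without touching the CSV. A minimal sketch of that layout using a sorted `Vec<Vec<u8>>` in place of the FST (the `encode`/`lookup` names are illustrative, not part of this crate):

```rust
// Encode a rating as: id ++ 0x00 ++ rating (f32 BE) ++ votes (u32 BE).
fn encode(id: &str, rating: f32, votes: u32) -> Vec<u8> {
    let mut buf = id.as_bytes().to_vec();
    buf.push(0x00);
    buf.extend_from_slice(&rating.to_be_bytes());
    buf.extend_from_slice(&votes.to_be_bytes());
    buf
}

// Find the key whose prefix is `id` using the same range-query idea
// as `Index::rating`: id <= key <= id ++ 0xFF.
fn lookup(keys: &[Vec<u8>], id: &str) -> Option<(f32, u32)> {
    let lower = id.as_bytes();
    let mut upper = lower.to_vec();
    upper.push(0xFF);
    let key = keys
        .iter()
        .find(|k| k.as_slice() >= lower && k.as_slice() <= upper.as_slice())?;
    // Skip past the id and the NUL separator, then decode the two fields.
    let i = key.iter().position(|&b| b == 0)? + 1;
    let rating = f32::from_be_bytes(key[i..i + 4].try_into().unwrap());
    let votes = u32::from_be_bytes(key[i + 4..i + 8].try_into().unwrap());
    Some((rating, votes))
}

fn main() {
    let mut keys = vec![
        encode("tt0000001", 5.8, 1356),
        encode("tt0000002", 6.5, 150),
    ];
    keys.sort();
    assert_eq!(lookup(&keys, "tt0000001"), Some((5.8, 1356)));
    assert_eq!(lookup(&keys, "tt9999999"), None);
}
```

The `0xFF` upper bound works because no UTF-8 encoded ID byte is `0xFF`, so `id ++ 0xFF` sorts after every key beginning with `id` but before any key with a different ID prefix.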
#[cfg(test)]
mod tests {
use super::Index;
use crate::index::tests::TestContext;
#[test]
fn basics() {
let ctx = TestContext::new("small");
let idx = Index::create(ctx.data_dir(), ctx.index_dir()).unwrap();
let rat = idx.rating(b"tt0000001").unwrap().unwrap();
assert_eq!(rat.rating, 5.8);
assert_eq!(rat.votes, 1356);
assert!(idx.rating(b"tt9999999").unwrap().is_none());
}
}
================================================
FILE: imdb-index/src/index/tests.rs
================================================
use std::path::{Path, PathBuf};
/// Create an error from a format!-like syntax.
#[macro_export]
macro_rules! err {
($($tt:tt)*) => {
Box::<dyn std::error::Error>::from(format!($($tt)*))
}
}
/// A convenient result type alias.
pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;
/// A simple test context that makes it convenient to create an index.
///
/// Each test context has an IMDb data directory (which usually has only a
/// subset of the actual data) and an index directory (which starts empty by
/// default).
#[derive(Debug)]
pub struct TestContext {
_tmpdir: TempDir,
data_dir: PathBuf,
index_dir: PathBuf,
}
impl TestContext {
/// Create a new test context using the test data set name given.
///
/// Test data sets can be found in the `data/test` directory in this
/// repository's root. Data set names are the names of sub-directories of
/// `data`.
pub fn new(name: &str) -> TestContext {
let tmpdir = TempDir::new("imdb-rename-test-index").unwrap();
let data_dir = PathBuf::from("../data/test").join(name);
let index_dir = tmpdir.path().to_path_buf();
TestContext { _tmpdir: tmpdir, data_dir, index_dir }
}
/// Return the path to the data directory for this context.
pub fn data_dir(&self) -> &Path {
&self.data_dir
}
/// Return the path to the index directory for this context.
pub fn index_dir(&self) -> &Path {
&self.index_dir
}
}
/// A simple wrapper for creating a temporary directory that is automatically
/// deleted when it's dropped.
///
/// We use this in lieu of tempfile because tempfile brings in too many
/// dependencies.
#[derive(Debug)]
pub struct TempDir(PathBuf);
impl Drop for TempDir {
fn drop(&mut self) {
std::fs::remove_dir_all(&self.0).unwrap();
}
}
impl TempDir {
/// Create a new empty temporary directory under the system's configured
/// temporary directory.
pub fn new(prefix: &str) -> Result<TempDir> {
use std::sync::atomic::{AtomicUsize, Ordering};
static TRIES: usize = 100;
static COUNTER: AtomicUsize = AtomicUsize::new(0);
let tmpdir = std::env::temp_dir();
for _ in 0..TRIES {
let count = COUNTER.fetch_add(1, Ordering::SeqCst);
let path = tmpdir.join(prefix).join(count.to_string());
if path.is_dir() {
continue;
}
std::fs::create_dir_all(&path).map_err(|e| {
err!("failed to create {}: {}", path.display(), e)
})?;
return Ok(TempDir(path));
}
Err(err!("failed to create temp dir after {} tries", TRIES))
}
/// Return the underlying path to this temporary directory.
pub fn path(&self) -> &Path {
&self.0
}
}
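The `TempDir` above combines a process-wide atomic counter (to pick candidate names without races inside one process) with cleanup in `Drop`. A condensed, standalone sketch of the same pattern (`CountedTempDir`/`new_temp_dir` are illustrative names, not this crate's API):

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Minimal sketch of the TempDir pattern: a counter picks candidate
/// names, and Drop removes the whole directory tree.
struct CountedTempDir(PathBuf);

impl Drop for CountedTempDir {
    fn drop(&mut self) {
        // Best-effort cleanup; ignore errors during drop.
        let _ = std::fs::remove_dir_all(&self.0);
    }
}

fn new_temp_dir(prefix: &str) -> std::io::Result<CountedTempDir> {
    static COUNTER: AtomicUsize = AtomicUsize::new(0);
    loop {
        let count = COUNTER.fetch_add(1, Ordering::SeqCst);
        let path = std::env::temp_dir().join(prefix).join(count.to_string());
        if path.is_dir() {
            continue; // name already taken; try the next counter value
        }
        std::fs::create_dir_all(&path)?;
        return Ok(CountedTempDir(path));
    }
}

fn main() -> std::io::Result<()> {
    let created: PathBuf;
    {
        let tmp = new_temp_dir("temp-dir-sketch")?;
        created = tmp.0.clone();
        assert!(created.is_dir());
    } // `tmp` dropped here; the directory is removed
    assert!(!created.is_dir());
    Ok(())
}
```

Note the counter only avoids collisions within a single process; the `is_dir` check plus retry loop is what guards against leftovers from previous runs.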
================================================
FILE: imdb-index/src/index/writer.rs
================================================
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use crate::error::Result;
use crate::util::create_file;
/// Wraps any writer and records the current position in the writer.
///
/// The position recorded always corresponds to the position that the next
/// byte would be written to.
#[derive(Clone, Debug)]
pub struct CursorWriter<W> {
wtr: W,
pos: usize,
}
impl CursorWriter<io::BufWriter<File>> {
/// Create a new cursor writer that will write to a file at the given path.
/// The file is truncated before writing.
pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self> {
let file = create_file(path)?;
Ok(CursorWriter::new(io::BufWriter::new(file)))
}
}
impl<W: io::Write> CursorWriter<W> {
/// Wrap the given writer with a counter.
pub fn new(wtr: W) -> CursorWriter<W> {
CursorWriter { wtr, pos: 0 }
}
/// Return the current position of this writer.
pub fn position(&self) -> usize {
self.pos
}
/// Write a u16LE.
pub fn write_u16(&mut self, n: u16) -> io::Result<()> {
self.write_all(&n.to_le_bytes())
}
/// Write a u32LE.
pub fn write_u32(&mut self, n: u32) -> io::Result<()> {
self.write_all(&n.to_le_bytes())
}
/// Write a u64LE.
pub fn write_u64(&mut self, n: u64) -> io::Result<()> {
self.write_all(&n.to_le_bytes())
}
}
impl<W: io::Write> io::Write for CursorWriter<W> {
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
let n = self.wtr.write(buf)?;
self.pos += n;
Ok(n)
}
fn flush(&mut self) -> io::Result<()> {
self.wtr.flush()
}
}
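Because `CursorWriter` counts bytes as they pass through `write`, `position()` can be read between writes to record the offset at which the next field will land, which is how on-disk index structures remember where things live. A short standalone sketch of the same counting-writer pattern over a `Vec<u8>` (`CountingWriter` is an illustrative name):

```rust
use std::io::{self, Write};

/// A writer wrapper that tracks how many bytes have been written,
/// mirroring the CursorWriter pattern.
struct CountingWriter<W> {
    wtr: W,
    pos: usize,
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.wtr.write(buf)?;
        self.pos += n; // pos = offset of the next byte to be written
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.wtr.flush()
    }
}

fn main() -> io::Result<()> {
    let mut w = CountingWriter { wtr: Vec::new(), pos: 0 };
    w.write_all(&7u32.to_le_bytes())?; // a u32 LE field, as write_u32 does
    let offset = w.pos; // offset where the next field starts
    w.write_all(b"payload")?;
    assert_eq!(offset, 4);
    assert_eq!(w.pos, 11);
    assert_eq!(w.wtr.len(), 11);
    Ok(())
}
```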
================================================
FILE: imdb-index/src/lib.rs
================================================
/*!
This crate provides an on-disk indexing data structure for searching IMDb.
Searching is primarily done using information retrieval techniques, which
support fuzzy name queries and TF-IDF-like ranking functions.
*/
#![deny(missing_docs)]
pub use crate::error::{Error, ErrorKind, Result};
pub use crate::index::{
AKARecordIter, Index, IndexBuilder, MediaEntity, NameQuery, NameScorer,
NgramType,
};
pub use crate::record::{Episode, Rating, Title, TitleKind, AKA};
pub use crate::scored::{Scored, SearchResults};
pub use crate::search::{Query, Searcher, Similarity};
// A macro that creates an error that represents a bug.
//
// This is typically used when reading index structures from disk. Since the
// data on disk is generally outside our control, we return an error using this
// macro instead of panicking (or worse, silently misinterpreting data).
macro_rules! bug {
($($tt:tt)*) => {{
return Err($crate::error::Error::bug(format!($($tt)*)));
}}
}
mod error;
mod index;
mod record;
mod scored;
mod search;
mod util;
================================================
FILE: imdb-index/src/record.rs
================================================
use std::cmp;
use std::fmt;
use std::str::FromStr;
use serde::{Deserialize, Deserializer, Serialize};
use crate::error::Error;
/// An IMDb title record.
///
/// This is the primary type of an IMDb media entry. This record defines the
/// identifier of an IMDb title, which serves as a foreign key in other data
/// files (such as alternate names, episodes and ratings).
#[derive(Clone, Debug, Deserialize)]
pub struct Title {
/// An IMDb identifier.
///
/// Generally, this is a fixed width string beginning with the characters
/// `tt`.
#[serde(rename = "tconst")]
pub id: String,
/// The specific type of a title, e.g., movie, TV show, episode, etc.
#[serde(rename = "titleType")]
pub kind: TitleKind,
/// The primary name of this title.
#[serde(rename = "primaryTitle")]
pub title: String,
/// The "original" name of this title.
#[serde(rename = "originalTitle")]
pub original_title: String,
/// Whether this title is classified as "adult" material or not.
#[serde(rename = "isAdult", deserialize_with = "number_as_bool")]
pub is_adult: bool,
/// The start year of this title.
///
/// Generally, things like movies or TV episodes have a start year to
/// indicate their release year and no end year. TV shows also have a start
/// year. TV shows that are still airing lack an end year, but TV shows
/// that have stopped airing will typically have an end year indicating
/// when they stopped.
///
/// Note that not all titles have a start year.
#[serde(rename = "startYear", deserialize_with = "csv::invalid_option")]
pub start_year: Option<u32>,
/// The end year of this title.
///
/// This is typically used to indicate the ending year of a TV show that
/// has stopped production.
#[serde(rename = "endYear", deserialize_with = "csv::invalid_option")]
pub end_year: Option<u32>,
/// The runtime, in minutes, of this title.
#[serde(
rename = "runtimeMinutes",
deserialize_with = "csv::invalid_option"
)]
pub runtime_minutes: Option<u32>,
/// A comma separated string of genres.
#[serde(rename = "genres")]
pub genres: String,
}
/// The kind of a title. These form a partitioning of all titles, where every
/// title has exactly one kind.
///
/// This type has a `FromStr` implementation that permits parsing a string
/// containing a title kind into this type. Note that parsing a title kind
/// recognizes all forms present in the IMDb data, and also additional
/// common sense forms. For example, `tvshow` and `tvSeries` are both
/// accepted as terms for the `TVSeries` variant.
#[derive(Clone, Copy, Debug, Deserialize, Eq, Hash, PartialEq, Serialize)]
#[allow(missing_docs)]
pub enum TitleKind {
#[serde(rename = "movie")]
Movie,
#[serde(rename = "short")]
Short,
#[serde(rename = "tvEpisode")]
TVEpisode,
#[serde(rename = "tvMiniSeries")]
TVMiniSeries,
#[serde(rename = "tvMovie")]
TVMovie,
#[serde(rename = "tvSeries")]
TVSeries,
#[serde(rename = "tvShort")]
TVShort,
#[serde(rename = "tvSpecial")]
TVSpecial,
#[serde(rename = "video")]
Video,
#[serde(rename = "videoGame")]
VideoGame,
}
impl TitleKind {
/// Return a string representation of this title kind.
///
/// This string representation is intended to be the same string
/// representation used in the IMDb data files.
pub fn as_str(&self) -> &'static str {
use self::TitleKind::*;
match *self {
Movie => "movie",
Short => "short",
TVEpisode => "tvEpisode",
TVMiniSeries => "tvMiniSeries",
TVMovie => "tvMovie",
TVSeries => "tvSeries",
TVShort => "tvShort",
TVSpecial => "tvSpecial",
Video => "video",
VideoGame => "videoGame",
}
}
/// Returns true if and only if this kind represents a TV series.
pub fn is_tv_series(&self) -> bool {
use self::TitleKind::*;
match *self {
TVMiniSeries | TVSeries => true,
_ => false,
}
}
}
impl fmt::Display for TitleKind {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.as_str())
}
}
impl Ord for TitleKind {
fn cmp(&self, other: &TitleKind) -> cmp::Ordering {
self.as_str().cmp(other.as_str())
}
}
impl PartialOrd for TitleKind {
fn partial_cmp(&self, other: &TitleKind) -> Option<cmp::Ordering> {
Some(self.cmp(other))
}
}
impl FromStr for TitleKind {
type Err = Error;
fn from_str(ty: &str) -> Result<TitleKind, Error> {
use self::TitleKind::*;
match &*ty.to_lowercase() {
"movie" => Ok(Movie),
"short" => Ok(Short),
"tvepisode" | "episode" => Ok(TVEpisode),
"tvminiseries" | "miniseries" => Ok(TVMiniSeries),
"tvmovie" => Ok(TVMovie),
"tvseries" | "tvshow" | "show" => Ok(TVSeries),
"tvshort" => Ok(TVShort),
"tvspecial" | "special" => Ok(TVSpecial),
"video" => Ok(Video),
"videogame" | "game" => Ok(VideoGame),
unk => Err(Error::unknown_title(unk)),
}
}
}
/// A single alternate name.
///
/// Every title has one or more names, and zero or more alternate names. To
/// represent multiple names, AKA or "also known as" records are provided.
/// There may be many AKA records for a single title.
#[derive(Clone, Debug, Deserialize)]
pub struct AKA {
/// The IMDb identifier that these AKA records describe.
#[serde(rename = "titleId")]
pub id: String,
/// The order in which an AKA record should be preferred.
#[serde(rename = "ordering")]
pub order: i32,
/// The alternate name.
#[serde(rename = "title")]
pub title: String,
/// A geographic region in which this alternate name applies.
#[serde(rename = "region")]
pub region: String,
/// The language of this alternate name.
#[serde(rename = "language")]
pub language: String,
/// A comma separated list of types for this name.
#[serde(rename = "types")]
pub types: String,
/// A comma separated list of attributes for this name.
#[serde(rename = "attributes")]
pub attributes: String,
/// A flag indicating whether this corresponds to the original title or
/// not.
#[serde(
rename = "isOriginalTitle",
deserialize_with = "optional_number_as_bool"
)]
pub is_original_title: Option<bool>,
}
/// A single episode record.
///
/// An episode record is an entry that joins two title records together, and
/// provides episode specific information, such as the season and episode
/// number. The two title records joined correspond to the title record for the
/// TV show and the title record for the episode.
#[derive(Clone, Debug, Deserialize)]
pub struct Episode {
/// The IMDb title identifier for this episode.
#[serde(rename = "tconst")]
pub id: String,
/// The IMDb title identifier for the parent TV show of this episode.
#[serde(rename = "parentTconst")]
pub tvshow_id: String,
/// The season in which this episode is contained, if it exists.
#[serde(
rename = "seasonNumber",
deserialize_with = "csv::invalid_option"
)]
pub season: Option<u32>,
/// The episode number of the season in which this episode is contained, if
/// it exists.
#[serde(
rename = "episodeNumber",
deserialize_with = "csv::invalid_option"
)]
pub episode: Option<u32>,
}
/// A rating associated with a single title record.
#[derive(Clone, Debug, Deserialize)]
pub struct Rating {
/// The IMDb title identifier for this rating.
#[serde(rename = "tconst")]
pub id: String,
/// The rating, on a scale of 0 to 10, for this title.
#[serde(rename = "averageRating")]
pub rating: f32,
/// The number of votes involved in this rating.
#[serde(rename = "numVotes")]
pub votes: u32,
}
fn number_as_bool<'de, D>(de: D) -> Result<bool, D::Error>
where
D: Deserializer<'de>,
{
i32::deserialize(de).map(|n| n != 0)
}
================================================
SYMBOL INDEX (473 symbols across 21 files)
================================================
FILE: imdb-eval/src/eval.rs
constant TRUTH_DATA (line 18) | const TRUTH_DATA: &str = include_str!("../../data/eval/truth.toml");
type Truth (line 29) | struct Truth {
method from_path (line 45) | fn from_path<P: AsRef<Path>>(path: P) -> anyhow::Result<Truth> {
type Task (line 38) | struct Task {
type Spec (line 72) | pub struct Spec {
method new (line 82) | pub fn new() -> Spec {
method with_result_size (line 95) | pub fn with_result_size(
method with_ngram_size (line 112) | pub fn with_ngram_size(
method with_ngram_type (line 127) | pub fn with_ngram_type(mut self, ngram_type: NgramType) -> Spec {
method with_similarity (line 133) | pub fn with_similarity(mut self, sim: Similarity) -> Spec {
method with_scorer (line 143) | pub fn with_scorer(mut self, scorer: Option<NameScorer>) -> Spec {
method evaluate (line 149) | pub fn evaluate<P1: AsRef<Path>, P2: AsRef<Path>>(
method evaluate_with (line 163) | pub fn evaluate_with<P1: AsRef<Path>, P2: AsRef<Path>, P3: AsRef<Path>>(
method query (line 178) | fn query(&self, task: &Task) -> Query {
method index (line 190) | fn index<P1: AsRef<Path>, P2: AsRef<Path>>(
method index_dir (line 208) | fn index_dir<P: AsRef<Path>>(&self, eval_dir: P) -> PathBuf {
method index_name (line 217) | fn index_name(&self) -> String {
method fmt (line 229) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method default (line 223) | fn default() -> Spec {
type Summary (line 270) | pub struct Summary {
method from_task_results (line 284) | pub fn from_task_results(results: &[TaskResult]) -> Vec<Summary> {
method from_same_task_results (line 300) | fn from_same_task_results(results: &[&TaskResult]) -> Summary {
type TaskResult (line 322) | pub struct TaskResult {
type Evaluation (line 352) | pub struct Evaluation<'s> {
type Item (line 360) | type Item = anyhow::Result<TaskResult>;
method next (line 362) | fn next(&mut self) -> Option<anyhow::Result<TaskResult>> {
type Evaluator (line 371) | struct Evaluator<'s> {
function run (line 381) | fn run(&mut self, task: &Task) -> anyhow::Result<TaskResult> {
function rank (line 440) | fn rank(&mut self, task: &Task) -> anyhow::Result<Option<u64>> {
function approx_eq (line 467) | fn approx_eq(x1: f64, x2: f64) -> bool {
function fractional_seconds (line 475) | fn fractional_seconds(d: &Duration) -> f64 {
function spec_printer (line 487) | fn spec_printer() {
FILE: imdb-eval/src/logger.rs
function init (line 11) | pub fn init() -> Result<()> {
type Logger (line 20) | struct Logger(());
method init (line 28) | fn init() -> std::result::Result<(), log::SetLoggerError> {
constant LOGGER (line 22) | const LOGGER: &'static Logger = &Logger(());
method enabled (line 34) | fn enabled(&self, _: &log::Metadata) -> bool {
method log (line 40) | fn log(&self, record: &log::Record) {
method flush (line 47) | fn flush(&self) {
function should_log (line 52) | fn should_log(record: &log::Record) -> bool {
FILE: imdb-eval/src/main.rs
function main (line 16) | fn main() {
function try_main (line 28) | fn try_main() -> anyhow::Result<()> {
function run_eval (line 61) | fn run_eval(
function run_summarize (line 90) | fn run_summarize(summarize: &Path) -> anyhow::Result<()> {
type Args (line 106) | struct Args {
method from_matches (line 122) | fn from_matches(matches: &clap::ArgMatches) -> anyhow::Result<Args> {
method specs (line 169) | fn specs(&self) -> anyhow::Result<Vec<Spec>> {
function app (line 202) | fn app() -> clap::App<'static, 'static> {
type OptionalNameScorer (line 309) | struct OptionalNameScorer(Option<NameScorer>);
method from (line 323) | fn from(scorer: NameScorer) -> OptionalNameScorer {
type Err (line 312) | type Err = imdb_index::Error;
method from_str (line 314) | fn from_str(
function parse_many_lossy (line 329) | fn parse_many_lossy<
function is_pipe_error (line 350) | fn is_pipe_error(err: &anyhow::Error) -> bool {
FILE: imdb-index/src/error.rs
type Result (line 5) | pub type Result<T> = std::result::Result<T, Error>;
type Error (line 9) | pub struct Error {
method kind (line 15) | pub fn kind(&self) -> &ErrorKind {
method into_kind (line 20) | pub fn into_kind(self) -> ErrorKind {
method new (line 24) | pub(crate) fn new(kind: ErrorKind) -> Error {
method unknown_title (line 28) | pub(crate) fn unknown_title<T: AsRef<str>>(unk: T) -> Error {
method unknown_scorer (line 32) | pub(crate) fn unknown_scorer<T: AsRef<str>>(unk: T) -> Error {
method unknown_ngram_type (line 36) | pub(crate) fn unknown_ngram_type<T: AsRef<str>>(unk: T) -> Error {
method unknown_sim (line 40) | pub(crate) fn unknown_sim<T: AsRef<str>>(unk: T) -> Error {
method unknown_directive (line 44) | pub(crate) fn unknown_directive<T: AsRef<str>>(unk: T) -> Error {
method bug (line 48) | pub(crate) fn bug<T: AsRef<str>>(msg: T) -> Error {
method config (line 52) | pub(crate) fn config<T: AsRef<str>>(msg: T) -> Error {
method version (line 56) | pub(crate) fn version(expected: u64, got: u64) -> Error {
method csv (line 60) | pub(crate) fn csv(err: csv::Error) -> Error {
method fst (line 64) | pub(crate) fn fst(err: fst::Error) -> Error {
method io (line 68) | pub(crate) fn io(err: std::io::Error) -> Error {
method io_path (line 72) | pub(crate) fn io_path<P: AsRef<Path>>(
method number (line 84) | pub(crate) fn number<E: std::error::Error + Send + Sync + 'static>(
method source (line 92) | fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
method fmt (line 102) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
type ErrorKind (line 109) | pub enum ErrorKind {
method fmt (line 172) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
FILE: imdb-index/src/index/aka.rs
constant AKAS (line 18) | const AKAS: &str = "akas.fst";
type Index (line 26) | pub struct Index {
method open (line 35) | pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
method create (line 49) | pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
method find (line 72) | pub fn find(&mut self, id: &[u8]) -> Result<AKARecordIter> {
type AKARecordIter (line 97) | pub struct AKARecordIter<'r>(
type Item (line 102) | type Item = Result<AKA>;
method next (line 104) | fn next(&mut self) -> Option<Result<AKA>> {
type AKAIndexRecord (line 121) | struct AKAIndexRecord {
type AKAIndexRecords (line 136) | struct AKAIndexRecords<R> {
function new (line 147) | fn new(rdr: csv::Reader<R>) -> AKAIndexRecords<R> {
type Item (line 153) | type Item = Result<AKAIndexRecord>;
method next (line 160) | fn next(&mut self) -> Option<Result<AKAIndexRecord>> {
function aka_index_records1 (line 205) | fn aka_index_records1() {
function aka_index_records2 (line 245) | fn aka_index_records2() {
function aka_index_records3 (line 277) | fn aka_index_records3() {
function aka_index_records4 (line 297) | fn aka_index_records4() {
FILE: imdb-index/src/index/episode.rs
constant SEASONS (line 23) | const SEASONS: &str = "episode.seasons.fst";
constant TVSHOWS (line 34) | const TVSHOWS: &str = "episode.tvshows.fst";
type Index (line 39) | pub struct Index {
method open (line 46) | pub fn open<P: AsRef<Path>>(index_dir: P) -> Result<Index> {
method create (line 58) | pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
method seasons (line 97) | pub fn seasons(&self, tvshow_id: &[u8]) -> Result<Vec<Episode>> {
method episodes (line 115) | pub fn episodes(
method episode (line 143) | pub fn episode(&self, episode_id: &[u8]) -> Result<Option<Episode>> {
function read_sorted_episodes (line 156) | fn read_sorted_episodes(data_dir: &Path) -> Result<Vec<Episode>> {
function cmp_episode (line 169) | fn cmp_episode(ep1: &Episode, ep2: &Episode) -> cmp::Ordering {
function read_episode (line 185) | fn read_episode(bytes: &[u8]) -> Result<Episode> {
function write_episode (line 209) | fn write_episode(ep: &Episode, buf: &mut Vec<u8>) -> Result<()> {
function read_tvshow (line 221) | fn read_tvshow(bytes: &[u8]) -> Result<Episode> {
function write_tvshow (line 245) | fn write_tvshow(ep: &Episode, buf: &mut Vec<u8>) -> Result<()> {
function from_optional_u32 (line 258) | fn from_optional_u32(
function to_optional_season (line 271) | fn to_optional_season(ep: &Episode) -> Result<u32> {
function to_optional_epnum (line 283) | fn to_optional_epnum(ep: &Episode) -> Result<u32> {
function basics (line 302) | fn basics() {
function by_season (line 318) | fn by_season() {
function tvshow (line 333) | fn tvshow() {
FILE: imdb-index/src/index/id.rs
type IndexReader (line 15) | pub struct IndexReader {
method from_path (line 21) | pub fn from_path<P: AsRef<Path>>(path: P) -> Result<IndexReader> {
method get (line 28) | pub fn get(&self, key: &[u8]) -> Option<u64> {
type IndexSortedWriter (line 35) | pub struct IndexSortedWriter<W> {
function from_path (line 41) | pub fn from_path<P: AsRef<Path>>(
function insert (line 53) | pub fn insert(&mut self, key: &[u8], value: u64) -> Result<()> {
function finish (line 61) | pub fn finish(self) -> Result<()> {
FILE: imdb-index/src/index/mod.rs
constant VERSION (line 36) | const VERSION: u64 = 1;
constant TITLE (line 42) | const TITLE: &str = "title.fst";
constant CONFIG (line 48) | const CONFIG: &str = "config.json";
type MediaEntity (line 58) | pub struct MediaEntity {
method title (line 66) | pub fn title(&self) -> &Title {
method episode (line 71) | pub fn episode(&self) -> Option<&Episode> {
method rating (line 76) | pub fn rating(&self) -> Option<&Rating> {
type Index (line 104) | pub struct Index {
method open (line 143) | pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
method create (line 163) | pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
method try_clone (line 177) | pub fn try_clone(&self) -> Result<Index> {
method search (line 196) | pub fn search(
method entity (line 222) | pub fn entity(&mut self, id: &str) -> Result<Option<MediaEntity>> {
method entity_from_title (line 233) | pub fn entity_from_title(&mut self, title: Title) -> Result<MediaEntit...
method title (line 247) | pub fn title(&mut self, id: &str) -> Result<Option<Title>> {
method aka_records (line 260) | pub fn aka_records(&mut self, id: &str) -> Result<AKARecordIter> {
method rating (line 269) | pub fn rating(&mut self, id: &str) -> Result<Option<Rating>> {
method seasons (line 284) | pub fn seasons(&mut self, tvshow_id: &str) -> Result<Vec<Episode>> {
method episodes (line 300) | pub fn episodes(
method episode (line 313) | pub fn episode(&mut self, episode_id: &str) -> Result<Option<Episode>> {
method data_dir (line 318) | pub fn data_dir(&self) -> &Path {
method index_dir (line 323) | pub fn index_dir(&self) -> &Path {
method read_record (line 336) | fn read_record(&mut self, offset: u64) -> Result<Option<Title>> {
type Config (line 126) | struct Config {
type IndexBuilder (line 353) | pub struct IndexBuilder {
method new (line 360) | pub fn new() -> IndexBuilder {
method open (line 378) | pub fn open<P1: AsRef<Path>, P2: AsRef<Path>>(
method create (line 421) | pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
method ngram_type (line 491) | pub fn ngram_type(&mut self, ngram_type: NgramType) -> &mut IndexBuild...
method ngram_size (line 499) | pub fn ngram_size(&mut self, ngram_size: usize) -> &mut IndexBuilder {
method default (line 506) | fn default() -> IndexBuilder {
function create_name_index (line 518) | fn create_name_index(
FILE: imdb-index/src/index/names.rs
constant CONFIG (line 26) | const CONFIG: &str = "names.config.json";
constant NGRAM (line 33) | const NGRAM: &str = "names.ngram.fst";
constant POSTINGS (line 51) | const POSTINGS: &str = "names.postings.idx";
constant IDMAP (line 66) | const IDMAP: &str = "names.idmap.idx";
constant NORMS (line 77) | const NORMS: &str = "names.norms.idx";
type NameID (line 91) | pub type NameID = u64;
type DocID (line 109) | type DocID = u32;
constant MAX_DOC_ID (line 117) | const MAX_DOC_ID: DocID = (1 << 28) - 1;
type NameQuery (line 124) | pub struct NameQuery {
method new (line 133) | pub fn new(name: &str) -> NameQuery {
method with_size (line 144) | pub fn with_size(self, size: usize) -> NameQuery {
method with_scorer (line 149) | pub fn with_scorer(self, scorer: NameScorer) -> NameQuery {
method with_stop_word_ratio (line 169) | pub fn with_stop_word_ratio(self, ratio: f64) -> NameQuery {
type IndexReader (line 176) | pub struct IndexReader {
method open (line 225) | pub fn open<P: AsRef<Path>>(dir: P) -> Result<IndexReader> {
method search (line 242) | pub fn search(&self, query: &NameQuery) -> SearchResults<NameID> {
method docid_to_nameid (line 258) | fn docid_to_nameid(&self, docid: DocID) -> NameID {
method document_length (line 268) | fn document_length(&self, docid: DocID) -> u64 {
type Config (line 216) | struct Config {
type CollectTopK (line 281) | struct CollectTopK {
method new (line 297) | fn new(k: usize) -> CollectTopK {
method collect (line 308) | fn collect(mut self, searcher: &mut Searcher) -> SearchResults<NameID> {
type Searcher (line 391) | struct Searcher<'i> {
function new (line 405) | fn new(idx: &'i IndexReader, query: &NameQuery) -> Searcher<'i> {
function index (line 448) | fn index(&self) -> &'i IndexReader {
type Item (line 454) | type Item = Scored<DocID>;
method next (line 456) | fn next(&mut self) -> Option<Scored<DocID>> {
type Disjunction (line 479) | struct Disjunction<'i> {
function new (line 507) | fn new(
function empty (line 523) | fn empty(index: &'i IndexReader, scorer: NameScorer) -> Disjunction<'i> {
function skip_to (line 539) | fn skip_to(&mut self, target_docid: DocID) -> Option<Scored<DocID>> {
type Item (line 581) | type Item = Scored<DocID>;
method next (line 583) | fn next(&mut self) -> Option<Scored<DocID>> {
type PostingIter (line 670) | struct PostingIter<'i> {
type Posting (line 704) | struct Posting {
method read (line 715) | fn read(slice: &[u8]) -> Option<Posting> {
function new (line 731) | fn new(
function posting (line 779) | fn posting(&self) -> Option<Posting> {
function len (line 785) | fn len(&self) -> usize {
function docid (line 791) | fn docid(&self) -> Option<DocID> {
function score (line 797) | fn score(&self) -> Option<Scored<DocID>> {
function score_okapibm25 (line 809) | fn score_okapibm25(&self) -> Option<Scored<DocID>> {
function score_tfidf (line 829) | fn score_tfidf(&self) -> Option<Scored<DocID>> {
function score_jaccard (line 848) | fn score_jaccard(&self) -> Option<Scored<DocID>> {
function score_query_ratio (line 857) | fn score_query_ratio(&self) -> Option<Scored<DocID>> {
type Item (line 863) | type Item = Posting;
method next (line 865) | fn next(&mut self) -> Option<Posting> {
method eq (line 884) | fn eq(&self, other: &PostingIter<'i>) -> bool {
method cmp (line 890) | fn cmp(&self, other: &PostingIter<'i>) -> cmp::Ordering {
method partial_cmp (line 900) | fn partial_cmp(&self, other: &PostingIter<'i>) -> Option<cmp::Ordering> {
type IndexWriter (line 914) | pub struct IndexWriter {
method open (line 986) | pub fn open<P: AsRef<Path>>(
method finish (line 1013) | pub fn finish(mut self) -> Result<()> {
method insert (line 1056) | pub fn insert(&mut self, name_id: NameID, name: &str) -> Result<()> {
method insert_term (line 1078) | fn insert_term(&mut self, docid: DocID, term: &str) {
method next_docid (line 1089) | fn next_docid(&mut self, name_id: NameID) -> Result<DocID> {
method num_docs (line 1104) | fn num_docs(&self) -> u32 {
type Postings (line 975) | struct Postings {
method posting (line 1112) | fn posting(&mut self, docid: DocID) -> &mut Posting {
type NameScorer (line 1127) | pub enum NameScorer {
method possible_names (line 1147) | pub fn possible_names() -> &'static [&'static str] {
method as_str (line 1154) | pub fn as_str(&self) -> &'static str {
method fmt (line 1171) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method default (line 1165) | fn default() -> NameScorer {
type Err (line 1177) | type Err = Error;
method from_str (line 1179) | fn from_str(s: &str) -> Result<NameScorer> {
type NgramType (line 1201) | pub enum NgramType {
method possible_names (line 1231) | pub fn possible_names() -> &'static [&'static str] {
method as_str (line 1236) | pub fn as_str(&self) -> &'static str {
method iter (line 1248) | fn iter<'t, F: FnMut(&'t str)>(&self, size: usize, text: &'t str, f: F) {
method iter_window (line 1255) | fn iter_window<'t, F: FnMut(&'t str)>(
method iter_edge (line 1271) | fn iter_edge<'t, F: FnMut(&'t str)>(
method fmt (line 1304) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
constant MIN_EDGE_NGRAM_SIZE (line 1227) | const MIN_EDGE_NGRAM_SIZE: usize = 3;
method default (line 1298) | fn default() -> NgramType {
type Err (line 1310) | type Err = Error;
method from_str (line 1312) | fn from_str(s: &str) -> Result<NgramType> {
function normalize_query (line 1321) | fn normalize_query(s: &str) -> String {
function read_le_u32 (line 1327) | fn read_le_u32(slice: &[u8]) -> u32 {
function create_index (line 1340) | fn create_index(index_dir: &Path, names: &[&str]) -> IndexReader {
function name_query (line 1355) | fn name_query(name: &str) -> NameQuery {
function ids (line 1359) | fn ids(results: &[Scored<NameID>]) -> Vec<NameID> {
constant BRUCES (line 1366) | const BRUCES: &'static [&'static str] = &[
function names_bruces_1 (line 1377) | fn names_bruces_1() {
function names_bruces_2 (line 1393) | fn names_bruces_2() {
function names_bruces_3 (line 1408) | fn names_bruces_3() {
function names_bruces_4 (line 1419) | fn names_bruces_4() {
function ngrams_window (line 1432) | fn ngrams_window(n: usize, text: &str) -> Vec<&str> {
function ngrams_edge (line 1438) | fn ngrams_edge(n: usize, text: &str) -> Vec<&str> {
function ngrams_window_zero_banned (line 1446) | fn ngrams_window_zero_banned() {
function ngrams_window_weird_sizes (line 1451) | fn ngrams_window_weird_sizes() {
function ngrams_window_ascii (line 1467) | fn ngrams_window_ascii() {
function ngrams_window_non_ascii (line 1481) | fn ngrams_window_non_ascii() {
function ngrams_edge_ascii (line 1494) | fn ngrams_edge_ascii() {
function ngrams_edge_non_ascii (line 1506) | fn ngrams_edge_non_ascii() {
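The names index above tokenizes queries into ngrams, and `NgramType` supports two flavors: sliding-window ngrams (`iter_window`) and edge (prefix) ngrams (`iter_edge`, with `MIN_EDGE_NGRAM_SIZE = 3`). The following is a minimal sketch of the two extraction strategies, not the crate's actual implementation; character-boundary handling and the exact minimum-size behavior are assumptions.

```rust
// Sketch of window vs. edge ngram extraction, approximating what
// NgramType::iter_window / iter_edge in names.rs do.

/// All contiguous character windows of length `n` (sliding-window ngrams).
fn ngrams_window(n: usize, text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return vec![];
    }
    chars.windows(n).map(|w| w.iter().collect()).collect()
}

/// Edge ngrams: every prefix from length `min` up to the full text.
fn ngrams_edge(min: usize, text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    (min..=chars.len()).map(|len| chars[..len].iter().collect()).collect()
}

fn main() {
    // "matrix" -> ["mat", "atr", "tri", "rix"]
    println!("{:?}", ngrams_window(3, "matrix"));
    // "matrix" -> ["mat", "matr", "matri", "matrix"]
    println!("{:?}", ngrams_edge(3, "matrix"));
}
```

Collecting into `Vec<char>` first keeps the windows on character boundaries, which matters for the non-ASCII test cases the index lists (`ngrams_window_non_ascii`, `ngrams_edge_non_ascii`).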
FILE: imdb-index/src/index/rating.rs
constant RATINGS (line 19) | const RATINGS: &str = "ratings.fst";
type Index (line 24) | pub struct Index {
method open (line 30) | pub fn open<P: AsRef<Path>>(index_dir: P) -> Result<Index> {
method create (line 41) | pub fn create<P1: AsRef<Path>, P2: AsRef<Path>>(
method rating (line 70) | pub fn rating(&self, id: &[u8]) -> Result<Option<Rating>> {
function read_rating (line 82) | fn read_rating(bytes: &[u8]) -> Result<Rating> {
function write_rating (line 100) | fn write_rating(rat: &Rating, buf: &mut Vec<u8>) -> Result<()> {
function read_votes_value (line 112) | fn read_votes_value(slice: &[u8]) -> Result<u32> {
function write_votes_value (line 119) | fn write_votes_value(votes: u32, buf: &mut Vec<u8>) {
function read_rating_value (line 123) | fn read_rating_value(slice: &[u8]) -> Result<f32> {
function write_rating_value (line 130) | fn write_rating_value(rating: f32, buf: &mut Vec<u8>) {
function basics (line 140) | fn basics() {
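The rating index stores each `Rating` as fixed-width values, with paired read/write helpers for the vote count (`u32`) and the average rating (`f32`). Here is a hedged sketch of round-tripping those two fields through little-endian bytes; the actual on-disk layout used by rating.rs is an assumption.

```rust
// Round-trip a (votes: u32, rating: f32) pair through little-endian
// bytes, mirroring the shape of read/write_votes_value and
// read/write_rating_value in rating.rs. The exact on-disk layout used
// by the index is an assumption.
use std::convert::TryInto;

fn write_votes_value(votes: u32, buf: &mut Vec<u8>) {
    buf.extend_from_slice(&votes.to_le_bytes());
}

fn read_votes_value(slice: &[u8]) -> u32 {
    let bytes: [u8; 4] = slice[..4].try_into().unwrap();
    u32::from_le_bytes(bytes)
}

fn write_rating_value(rating: f32, buf: &mut Vec<u8>) {
    buf.extend_from_slice(&rating.to_le_bytes());
}

fn read_rating_value(slice: &[u8]) -> f32 {
    let bytes: [u8; 4] = slice[..4].try_into().unwrap();
    f32::from_le_bytes(bytes)
}

fn main() {
    // Values taken from the test data in title.ratings.tsv.
    let mut buf = Vec::new();
    write_votes_value(1356, &mut buf);
    write_rating_value(5.8, &mut buf);
    assert_eq!(read_votes_value(&buf), 1356);
    assert_eq!(read_rating_value(&buf[4..]), 5.8);
}
```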
FILE: imdb-index/src/index/tests.rs
type Result (line 12) | pub type Result<T> = std::result::Result<T, Box<dyn std::error::Error>>;
type TestContext (line 20) | pub struct TestContext {
method new (line 32) | pub fn new(name: &str) -> TestContext {
method data_dir (line 40) | pub fn data_dir(&self) -> &Path {
method index_dir (line 45) | pub fn index_dir(&self) -> &Path {
type TempDir (line 56) | pub struct TempDir(PathBuf);
method new (line 67) | pub fn new(prefix: &str) -> Result<TempDir> {
method path (line 89) | pub fn path(&self) -> &Path {
method drop (line 59) | fn drop(&mut self) {
FILE: imdb-index/src/index/writer.rs
type CursorWriter (line 13) | pub struct CursorWriter<W> {
function from_path (line 21) | pub fn from_path<P: AsRef<Path>>(path: P) -> Result<Self> {
function new (line 29) | pub fn new(wtr: W) -> CursorWriter<W> {
function position (line 34) | pub fn position(&self) -> usize {
function write_u16 (line 39) | pub fn write_u16(&mut self, n: u16) -> io::Result<()> {
function write_u32 (line 44) | pub fn write_u32(&mut self, n: u32) -> io::Result<()> {
function write_u64 (line 49) | pub fn write_u64(&mut self, n: u64) -> io::Result<()> {
function write (line 55) | fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
function flush (line 61) | fn flush(&mut self) -> io::Result<()> {
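`CursorWriter` wraps any `io::Write` and tracks the current byte offset, so the index writer can record section offsets without seeking. A minimal sketch of that pattern, under the assumption that the real type simply counts bytes written:

```rust
use std::io::{self, Write};

// Minimal position-tracking writer in the spirit of CursorWriter in
// writer.rs: forward writes to the inner writer, count bytes, and let
// callers ask "where am I?" via position().
struct CursorWriter<W> {
    wtr: W,
    pos: usize,
}

impl<W: Write> CursorWriter<W> {
    fn new(wtr: W) -> CursorWriter<W> {
        CursorWriter { wtr, pos: 0 }
    }

    fn position(&self) -> usize {
        self.pos
    }

    fn write_u32(&mut self, n: u32) -> io::Result<()> {
        self.write_all(&n.to_le_bytes())
    }
}

impl<W: Write> Write for CursorWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = self.wtr.write(buf)?;
        self.pos += n;
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.wtr.flush()
    }
}

fn main() -> io::Result<()> {
    let mut wtr = CursorWriter::new(Vec::new());
    wtr.write_u32(0xdead_beef)?;
    assert_eq!(wtr.position(), 4);
    Ok(())
}
```

Implementing `Write` for the wrapper means helpers like `write_u16`/`write_u32`/`write_u64` in the real type can all funnel through one byte-counting `write`.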
FILE: imdb-index/src/record.rs
type Title (line 15) | pub struct Title {
type TitleKind (line 72) | pub enum TitleKind {
method as_str (line 100) | pub fn as_str(&self) -> &'static str {
method is_tv_series (line 117) | pub fn is_tv_series(&self) -> bool {
method fmt (line 128) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method cmp (line 134) | fn cmp(&self, other: &TitleKind) -> cmp::Ordering {
method partial_cmp (line 140) | fn partial_cmp(&self, other: &TitleKind) -> Option<cmp::Ordering> {
type Err (line 146) | type Err = Error;
method from_str (line 148) | fn from_str(ty: &str) -> Result<TitleKind, Error> {
type AKA (line 173) | pub struct AKA {
type Episode (line 211) | pub struct Episode {
type Rating (line 235) | pub struct Rating {
function number_as_bool (line 247) | fn number_as_bool<'de, D>(de: D) -> Result<bool, D::Error>
function optional_number_as_bool (line 254) | fn optional_number_as_bool<'de, D>(de: D) -> Result<Option<bool>, D::Error>
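`TitleKind` maps between the strings in IMDb's `titleType` column and an enum, via `as_str` and a `FromStr` impl, plus predicates like `is_tv_series`. A sketch of that round-trip with a reduced variant set; the real enum covers more kinds and uses the crate's `Error` type rather than `String`:

```rust
use std::str::FromStr;

// Sketch of a TitleKind-style enum (record.rs) round-tripping between
// IMDb titleType strings and variants. The variant list here is a
// subset, and using String as the error type is a simplification.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TitleKind {
    Movie,
    TVSeries,
    TVEpisode,
}

impl TitleKind {
    fn as_str(&self) -> &'static str {
        match *self {
            TitleKind::Movie => "movie",
            TitleKind::TVSeries => "tvSeries",
            TitleKind::TVEpisode => "tvEpisode",
        }
    }

    fn is_tv_series(&self) -> bool {
        matches!(*self, TitleKind::TVSeries)
    }
}

impl FromStr for TitleKind {
    type Err = String;

    fn from_str(ty: &str) -> Result<TitleKind, String> {
        match ty {
            "movie" => Ok(TitleKind::Movie),
            "tvSeries" => Ok(TitleKind::TVSeries),
            "tvEpisode" => Ok(TitleKind::TVEpisode),
            unk => Err(format!("unrecognized title type: {}", unk)),
        }
    }
}

fn main() {
    let kind: TitleKind = "tvEpisode".parse().unwrap();
    assert_eq!(kind.as_str(), "tvEpisode");
    assert!(!kind.is_tv_series());
}
```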
FILE: imdb-index/src/scored.rs
type SearchResults (line 8) | pub struct SearchResults<T>(Vec<Scored<T>>);
function new (line 12) | pub fn new() -> SearchResults<T> {
function from_min_heap (line 17) | pub fn from_min_heap(
function push (line 32) | pub fn push(&mut self, scored: Scored<T>) {
function normalize (line 42) | pub fn normalize(&mut self) {
function rescore (line 60) | pub fn rescore<F: FnMut(&T) -> f64>(&mut self, mut rescore: F) {
function trim (line 70) | pub fn trim(&mut self, size: usize) {
function len (line 77) | pub fn len(&self) -> usize {
function is_empty (line 82) | pub fn is_empty(&self) -> bool {
function as_slice (line 87) | pub fn as_slice(&self) -> &[Scored<T>] {
function into_vec (line 93) | pub fn into_vec(self) -> Vec<Scored<T>> {
type IntoIter (line 99) | type IntoIter = vec::IntoIter<Scored<T>>;
type Item (line 100) | type Item = Scored<T>;
method into_iter (line 102) | fn into_iter(self) -> vec::IntoIter<Scored<T>> {
type Scored (line 113) | pub struct Scored<T> {
function new (line 120) | pub fn new(value: T) -> Scored<T> {
function score (line 130) | pub fn score(&self) -> f64 {
function set_score (line 137) | pub fn set_score(&mut self, score: f64) {
function with_score (line 146) | pub fn with_score(mut self, score: f64) -> Scored<T> {
function map (line 154) | pub fn map<U, F: FnOnce(T) -> U>(self, f: F) -> Scored<U> {
function map_score (line 162) | pub fn map_score<F: FnOnce(f64) -> f64>(self, f: F) -> Scored<T> {
function value (line 168) | pub fn value(&self) -> &T {
function into_value (line 174) | pub fn into_value(self) -> T {
function into_pair (line 180) | pub fn into_pair(self) -> (f64, T) {
method default (line 186) | fn default() -> Scored<T> {
method eq (line 194) | fn eq(&self, other: &Scored<T>) -> bool {
method cmp (line 201) | fn cmp(&self, other: &Scored<T>) -> cmp::Ordering {
method partial_cmp (line 207) | fn partial_cmp(&self, other: &Scored<T>) -> Option<cmp::Ordering> {
function never_nan_1 (line 219) | fn never_nan_1() {
function never_nan_2 (line 225) | fn never_nan_2() {
function never_nan_3 (line 231) | fn never_nan_3() {
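`Scored<T>` pairs a value with an `f64` score and implements a total `Ord`; the `never_nan_*` tests suggest the invariant is that a stored score can never be NaN, which makes `partial_cmp` infallible. A sketch of one plausible way to maintain that invariant (normalizing NaN to 0.0 in `with_score` is an assumption, not necessarily what scored.rs does):

```rust
use std::cmp;

// Sketch of a NaN-safe scored wrapper in the spirit of Scored<T> in
// scored.rs. Because with_score never stores NaN, score comparison
// below always yields Some(_) and a total order is sound.
#[derive(Debug, Clone)]
struct Scored<T> {
    score: f64,
    value: T,
}

impl<T> Scored<T> {
    fn new(value: T) -> Scored<T> {
        Scored { score: 1.0, value }
    }

    fn with_score(mut self, score: f64) -> Scored<T> {
        // Assumption: normalize NaN so comparisons can never fail.
        self.score = if score.is_nan() { 0.0 } else { score };
        self
    }

    fn score(&self) -> f64 {
        self.score
    }
}

// Equality and ordering consider only the score, matching the eq/cmp
// methods listed for Scored<T> above.
impl<T> PartialEq for Scored<T> {
    fn eq(&self, other: &Scored<T>) -> bool {
        self.score == other.score
    }
}

impl<T> Eq for Scored<T> {}

impl<T> PartialOrd for Scored<T> {
    fn partial_cmp(&self, other: &Scored<T>) -> Option<cmp::Ordering> {
        self.score.partial_cmp(&other.score)
    }
}

impl<T> Ord for Scored<T> {
    fn cmp(&self, other: &Scored<T>) -> cmp::Ordering {
        // Safe: score is never NaN, so partial_cmp is always Some.
        self.partial_cmp(other).unwrap()
    }
}

fn main() {
    let mut xs = vec![
        Scored::new("b").with_score(0.5),
        Scored::new("a").with_score(f64::NAN), // normalized to 0.0
        Scored::new("c").with_score(0.9),
    ];
    xs.sort(); // would panic on NaN without the normalization
    assert_eq!(xs[0].score(), 0.0);
    assert_eq!(xs[2].score(), 0.9);
}
```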
FILE: imdb-index/src/search.rs
type Searcher (line 28) | pub struct Searcher {
method new (line 39) | pub fn new(idx: Index) -> Searcher {
method search (line 67) | pub fn search(
method index (line 84) | pub fn index(&mut self) -> &mut Index {
method search_with_name (line 88) | fn search_with_name(
method search_exhaustive (line 110) | fn search_exhaustive(
method search_with_tvshow (line 169) | fn search_with_tvshow(
method similarity (line 190) | fn similarity(&self, query: &Query, name: &str) -> f64 {
type Query (line 213) | pub struct Query {
method new (line 234) | pub fn new() -> Query {
method is_empty (line 252) | pub fn is_empty(&self) -> bool {
method name (line 270) | pub fn name(mut self, name: &str) -> Query {
method name_scorer (line 288) | pub fn name_scorer(mut self, scorer: Option<NameScorer>) -> Query {
method similarity (line 304) | pub fn similarity(mut self, sim: Similarity) -> Query {
method size (line 315) | pub fn size(mut self, size: usize) -> Query {
method kind (line 327) | pub fn kind(mut self, kind: TitleKind) -> Query {
method year_ge (line 337) | pub fn year_ge(mut self, year: u32) -> Query {
method year_le (line 345) | pub fn year_le(mut self, year: u32) -> Query {
method votes_ge (line 351) | pub fn votes_ge(mut self, votes: u32) -> Query {
method votes_le (line 357) | pub fn votes_le(mut self, votes: u32) -> Query {
method season_ge (line 365) | pub fn season_ge(mut self, season: u32) -> Query {
method season_le (line 373) | pub fn season_le(mut self, season: u32) -> Query {
method episode_ge (line 381) | pub fn episode_ge(mut self, episode: u32) -> Query {
method episode_le (line 389) | pub fn episode_le(mut self, episode: u32) -> Query {
method tvshow_id (line 398) | pub fn tvshow_id(mut self, tvshow_id: &str) -> Query {
method matches (line 407) | fn matches(&self, ent: &MediaEntity) -> bool {
method matches_title (line 416) | fn matches_title(&self, title: &Title) -> bool {
method matches_rating (line 434) | fn matches_rating(&self, rating: Option<&Rating>) -> bool {
method matches_episode (line 447) | fn matches_episode(&self, ep: Option<&Episode>) -> bool {
method name_query (line 466) | fn name_query(&self) -> Option<NameQuery> {
method has_filters (line 490) | fn has_filters(&self) -> bool {
method needs_only_title (line 501) | fn needs_only_title(&self) -> bool {
method needs_rating (line 506) | fn needs_rating(&self) -> bool {
method needs_episode (line 511) | fn needs_episode(&self) -> bool {
method deserialize (line 528) | fn deserialize<D>(d: D) -> result::Result<Query, D::Error>
method fmt (line 616) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method default (line 227) | fn default() -> Query {
method serialize (line 519) | fn serialize<S>(&self, s: S) -> result::Result<S::Ok, S::Error>
type Err (line 542) | type Err = Error;
method from_str (line 544) | fn from_str(qstr: &str) -> Result<Query> {
type Similarity (line 665) | pub enum Similarity {
method possible_names (line 682) | pub fn possible_names() -> &'static [&'static str] {
method is_none (line 687) | pub fn is_none(&self) -> bool {
method similarity (line 696) | pub fn similarity(&self, q1: &str, q2: &str) -> f64 {
method fmt (line 730) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method default (line 724) | fn default() -> Similarity {
type Err (line 741) | type Err = Error;
method from_str (line 743) | fn from_str(s: &str) -> Result<Similarity> {
type Range (line 758) | struct Range<T> {
function none (line 764) | pub fn none() -> Range<T> {
function is_none (line 768) | pub fn is_none(&self) -> bool {
function contains (line 774) | pub fn contains(&self, t: Option<&T>) -> bool {
function fmt (line 789) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
type Err (line 803) | type Err = Error;
method from_str (line 805) | fn from_str(range: &str) -> Result<Range<T>> {
function ranges (line 845) | fn ranges() {
function query_parser (line 863) | fn query_parser() {
function query_parser_error (line 911) | fn query_parser_error() {
function query_parser_weird (line 918) | fn query_parser_weird() {
function query_display (line 927) | fn query_display() {
function query_serialize (line 942) | fn query_serialize() {
function query_deserialize (line 960) | fn query_deserialize() {
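`Query` exposes a builder-style API: each method (`name`, `year_ge`, `votes_ge`, `kind`, ...) takes `self` by value and returns it, so calls chain, and `matches` later applies the accumulated filters to a candidate entity. A minimal stand-in demonstrating the pattern; the field set and the `Title` shape here are simplified assumptions, not the real types from search.rs:

```rust
// Minimal stand-in for the builder-style Query in search.rs, showing how
// chained methods accumulate optional filters and how matches() applies
// them. Fields and the Title shape are simplified assumptions.
#[derive(Debug, Default)]
struct Query {
    name: Option<String>,
    year_ge: Option<u32>,
    year_le: Option<u32>,
    votes_ge: Option<u32>,
}

#[derive(Debug)]
struct Title {
    name: String,
    year: u32,
    votes: u32,
}

impl Query {
    fn new() -> Query {
        Query::default()
    }

    // Builder methods take self by value and return it, so calls chain.
    fn name(mut self, name: &str) -> Query {
        self.name = Some(name.to_string());
        self
    }

    fn year_ge(mut self, year: u32) -> Query {
        self.year_ge = Some(year);
        self
    }

    fn year_le(mut self, year: u32) -> Query {
        self.year_le = Some(year);
        self
    }

    fn votes_ge(mut self, votes: u32) -> Query {
        self.votes_ge = Some(votes);
        self
    }

    // An unset filter (None) matches everything, like the Range filters
    // in search.rs.
    fn matches(&self, t: &Title) -> bool {
        self.year_ge.map_or(true, |y| t.year >= y)
            && self.year_le.map_or(true, |y| t.year <= y)
            && self.votes_ge.map_or(true, |v| t.votes >= v)
    }
}

fn main() {
    let q = Query::new().name("the matrix").year_ge(1999).year_le(1999);
    let t = Title { name: "The Matrix".to_string(), year: 1999, votes: 500_000 };
    assert!(q.matches(&t));
}
```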
FILE: imdb-index/src/util.rs
constant IMDB_BASICS (line 15) | pub const IMDB_BASICS: &str = "title.basics.tsv";
constant IMDB_AKAS (line 21) | pub const IMDB_AKAS: &str = "title.akas.tsv";
constant IMDB_EPISODE (line 29) | pub const IMDB_EPISODE: &str = "title.episode.tsv";
constant IMDB_RATINGS (line 35) | pub const IMDB_RATINGS: &str = "title.ratings.tsv";
type NiceDuration (line 39) | pub struct NiceDuration(pub time::Duration);
method fmt (line 42) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method since (line 50) | pub fn since(t: time::Instant) -> NiceDuration {
method fractional_seconds (line 57) | pub fn fractional_seconds(&self) -> f64 {
function csv_reader_builder (line 65) | pub fn csv_reader_builder() -> csv::ReaderBuilder {
function csv_mmap (line 77) | pub unsafe fn csv_mmap<P: AsRef<Path>>(
function csv_file (line 88) | pub fn csv_file<P: AsRef<Path>>(path: P) -> Result<csv::Reader<File>> {
function mmap_file (line 97) | pub unsafe fn mmap_file<P: AsRef<Path>>(path: P) -> Result<Mmap> {
function create_file (line 105) | pub fn create_file<P: AsRef<Path>>(path: P) -> Result<File> {
function open_file (line 112) | pub fn open_file<P: AsRef<Path>>(path: P) -> Result<File> {
function fst_set_builder_file (line 119) | pub fn fst_set_builder_file<P: AsRef<Path>>(
function fst_set_file (line 131) | pub unsafe fn fst_set_file<P: AsRef<Path>>(path: P) -> Result<fst::Set<M...
function fst_map_builder_file (line 142) | pub fn fst_map_builder_file<P: AsRef<Path>>(
function fst_map_file (line 154) | pub unsafe fn fst_map_file<P: AsRef<Path>>(path: P) -> Result<fst::Map<M...
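`NiceDuration` wraps `time::Duration` for human-friendly display, with `fractional_seconds` converting to an `f64`. A sketch of that wrapper; the four-decimal display format is an assumption about how util.rs renders it:

```rust
use std::fmt;
use std::time::Duration;

// Sketch of a NiceDuration-style wrapper (util.rs) that exposes a
// Duration as fractional seconds and formats it for display. The
// four-decimal format is an assumption.
struct NiceDuration(Duration);

impl NiceDuration {
    fn fractional_seconds(&self) -> f64 {
        let secs = self.0.as_secs() as f64;
        let nanos = self.0.subsec_nanos() as f64;
        secs + nanos / 1_000_000_000.0
    }
}

impl fmt::Display for NiceDuration {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        write!(f, "{:.4}s", self.fractional_seconds())
    }
}

fn main() {
    let d = NiceDuration(Duration::from_millis(1234));
    assert_eq!(format!("{}", d), "1.2340s");
}
```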
FILE: src/download.rs
constant IMDB_BASE_URL (line 11) | const IMDB_BASE_URL: &'static str = "https://datasets.imdbws.com";
constant DATA_SETS (line 16) | const DATA_SETS: &'static [&'static str] = &[
function download_all (line 28) | pub fn download_all<P: AsRef<Path>>(dir: P) -> anyhow::Result<bool> {
function update_all (line 41) | pub fn update_all<P: AsRef<Path>>(dir: P) -> anyhow::Result<()> {
function download_one (line 53) | fn download_one(outdir: &Path, dataset: &'static str) -> anyhow::Result<...
function non_existent_data_sets (line 70) | fn non_existent_data_sets(dir: &Path) -> anyhow::Result<Vec<&'static str...
function dataset_path (line 83) | fn dataset_path(dir: &Path, name: &'static str) -> PathBuf {
function write_sorted_csv_records (line 96) | fn write_sorted_csv_records<R: io::Read, W: io::Write>(
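`write_sorted_csv_records` sorts the decompressed records before writing them out, presumably so the index build sees id-sorted input. A stdlib sketch of the idea, keeping the header line first and sorting data rows by their first TSV field; using plain line splitting instead of the `csv` crate is a simplification:

```rust
// Sketch of write_sorted_csv_records (download.rs): keep the header line
// first and sort the remaining TSV records by their first field (the
// IMDb id). Plain line splitting instead of the csv crate is a
// simplifying assumption.
fn write_sorted_tsv(input: &str) -> String {
    let mut lines = input.lines();
    let header = lines.next().unwrap_or("");
    let mut records: Vec<&str> = lines.collect();
    records.sort_by_key(|line| line.split('\t').next().unwrap_or(""));

    let mut out = String::new();
    out.push_str(header);
    for rec in records {
        out.push('\n');
        out.push_str(rec);
    }
    out
}

fn main() {
    // Rows out of id order, as a downloaded dataset might be.
    let raw = "tconst\taverageRating\ntt0000003\t6.6\ntt0000001\t5.8";
    let sorted = write_sorted_tsv(raw);
    assert_eq!(sorted, "tconst\taverageRating\ntt0000001\t5.8\ntt0000003\t6.6");
}
```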
FILE: src/logger.rs
function init (line 9) | pub fn init() -> anyhow::Result<()> {
type Logger (line 18) | struct Logger(());
method init (line 26) | fn init() -> std::result::Result<(), log::SetLoggerError> {
constant LOGGER (line 20) | const LOGGER: &'static Logger = &Logger(());
method enabled (line 32) | fn enabled(&self, _: &log::Metadata) -> bool {
method log (line 38) | fn log(&self, record: &log::Record) {
method flush (line 45) | fn flush(&self) {
function should_log (line 50) | fn should_log(record: &log::Record) -> bool {
FILE: src/main.rs
function main (line 20) | fn main() {
function try_main (line 32) | fn try_main() -> anyhow::Result<()> {
type Args (line 113) | struct Args {
method from_matches (line 132) | fn from_matches(matches: &clap::ArgMatches) -> anyhow::Result<Args> {
method create_index (line 195) | fn create_index(&self) -> anyhow::Result<Index> {
method open_index (line 202) | fn open_index(&self) -> anyhow::Result<Index> {
method searcher (line 206) | fn searcher(&self) -> anyhow::Result<Searcher> {
method download_all (line 210) | fn download_all(&self) -> anyhow::Result<bool> {
method download_all_update (line 214) | fn download_all_update(&self) -> anyhow::Result<()> {
function app (line 219) | fn app() -> clap::App<'static, 'static> {
function collect_paths (line 341) | fn collect_paths(paths: Vec<&OsStr>, follow: bool) -> Vec<PathBuf> {
function is_pipe_error (line 361) | fn is_pipe_error(err: &anyhow::Error) -> bool {
FILE: src/rename.rs
type RenameProposal (line 15) | pub struct RenameProposal {
method new (line 59) | fn new(
method rename (line 75) | pub fn rename(&self) -> anyhow::Result<()> {
method src (line 119) | pub fn src(&self) -> &Path {
method dst (line 128) | pub fn dst(&self) -> &Path {
type RenameAction (line 23) | pub enum RenameAction {
method fmt (line 33) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method is_link (line 44) | fn is_link(&self) -> bool {
type Renamer (line 143) | pub struct Renamer {
method propose (line 175) | pub fn propose(
method propose_one (line 237) | fn propose_one(
method find_any (line 322) | fn find_any(
method find_episode (line 361) | fn find_episode(
method find_tvshow_for_episode (line 396) | fn find_tvshow_for_episode(
method find_unknown (line 434) | fn find_unknown(&self) -> anyhow::Result<MediaEntity> {
method candidate (line 453) | fn candidate(&self, path: &Path) -> anyhow::Result<Candidate> {
method episode_parts (line 490) | fn episode_parts(
method name_query (line 526) | fn name_query(&self, name: &str) -> Query {
method choose_one (line 539) | fn choose_one(
method search (line 558) | fn search(
type Candidate (line 577) | struct Candidate {
type CandidatePath (line 598) | struct CandidatePath {
method from_path (line 656) | fn from_path(path: &Path) -> anyhow::Result<CandidatePath> {
method imdb_name (line 691) | fn imdb_name(&self, ent: &MediaEntity) -> String {
type CandidateKind (line 611) | enum CandidateKind {
type CandidateAny (line 632) | struct CandidateAny {
type CandidateEpisode (line 644) | struct CandidateEpisode {
type RenamerBuilder (line 713) | pub struct RenamerBuilder {
method new (line 724) | pub fn new() -> RenamerBuilder {
method build (line 736) | pub fn build(&self) -> anyhow::Result<Renamer> {
method force (line 758) | pub fn force(&mut self, entity: MediaEntity) -> &mut RenamerBuilder {
method min_votes (line 769) | pub fn min_votes(&mut self, min_votes: u32) -> &mut RenamerBuilder {
method good_threshold (line 782) | pub fn good_threshold(&mut self, threshold: f64) -> &mut RenamerBuilder {
method regex_episode (line 791) | pub fn regex_episode(&mut self, pattern: &str) -> &mut RenamerBuilder {
method regex_season (line 800) | pub fn regex_season(&mut self, pattern: &str) -> &mut RenamerBuilder {
method regex_year (line 809) | pub fn regex_year(&mut self, pattern: &str) -> &mut RenamerBuilder {
method default (line 816) | fn default() -> RenamerBuilder {
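`Renamer::episode_parts` extracts season and episode numbers from a file name using configurable regexes (`regex_season`, `regex_episode`). The real implementation is regex-driven; the following stdlib-only sketch recognizes just the common `SxxEyy` form, which is a simplifying assumption about one supported pattern:

```rust
// Sketch of what episode_parts in rename.rs extracts: a (season, episode)
// pair from a file name. The real Renamer uses configurable regexes;
// this stdlib version only recognizes the common "SxxEyy" form.
fn episode_parts(name: &str) -> Option<(u32, u32)> {
    let lower = name.to_lowercase();
    for (i, &b) in lower.as_bytes().iter().enumerate() {
        if b != b's' {
            continue;
        }
        // Expect digits after 's', then 'e', then digits after 'e'.
        let rest = &lower[i + 1..];
        let season: String =
            rest.chars().take_while(|c| c.is_ascii_digit()).collect();
        if season.is_empty() {
            continue;
        }
        let after = &rest[season.len()..];
        if !after.starts_with('e') {
            continue;
        }
        let episode: String =
            after[1..].chars().take_while(|c| c.is_ascii_digit()).collect();
        if episode.is_empty() {
            continue;
        }
        return Some((season.parse().ok()?, episode.parse().ok()?));
    }
    None
}

fn main() {
    assert_eq!(episode_parts("The.Simpsons.S03E04.mkv"), Some((3, 4)));
    assert_eq!(episode_parts("some.movie.1999.mkv"), None);
}
```

When no season/episode pattern is found, the real `Renamer` falls back to `find_any`/`find_unknown`, which is why `episode_parts` returning nothing is a normal, non-error outcome here too.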
FILE: src/util.rs
function choose (line 16) | pub fn choose(
function read_number (line 35) | pub fn read_number(start: usize, end: usize) -> anyhow::Result<usize> {
function read_yesno (line 57) | pub fn read_yesno(msg: &str) -> anyhow::Result<bool> {
function write_tsv (line 73) | pub fn write_tsv<W: io::Write>(
function write_tsv_title (line 104) | fn write_tsv_title<W: io::Write>(
function write_tsv_episode (line 127) | fn write_tsv_episode<W: io::Write>(
Condensed preview — 46 files, each showing path, character count, and a truncated content snippet.
[
{
"path": ".github/FUNDING.yml",
"chars": 21,
"preview": "github: [BurntSushi]\n"
},
{
"path": ".github/workflows/ci.yml",
"chars": 2405,
"preview": "name: ci\non:\n pull_request:\n push:\n branches:\n - master\n schedule:\n - cron: '00 01 * * *'\n\n# The section is "
},
{
"path": ".gitignore",
"chars": 66,
"preview": "/target\n/imdb-eval/target\n/imdb-index/target\n**/*.rs.bk\ntags\n/tmp\n"
},
{
"path": "COPYING",
"chars": 126,
"preview": "This project is dual-licensed under the Unlicense and MIT licenses.\n\nYou may use this code under the terms of either lic"
},
{
"path": "Cargo.toml",
"chars": 1019,
"preview": "[package]\nname = \"imdb-rename\"\nversion = \"0.1.6\" #:version\nauthors = [\"Andrew Gallant <jamslam@gmail.com>\"]\ndescription"
},
{
"path": "LICENSE-MIT",
"chars": 1081,
"preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Andrew Gallant\n\nPermission is hereby granted, free of charge, to any person ob"
},
{
"path": "README.md",
"chars": 8874,
"preview": "imdb-rename\n===========\nA command line tool to rename media files based on titles from IMDb.\nimdb-rename downloads the o"
},
{
"path": "UNLICENSE",
"chars": 1211,
"preview": "This is free and unencumbered software released into the public domain.\n\nAnyone is free to copy, modify, publish, use, c"
},
{
"path": "data/eval/truth.toml",
"chars": 2879,
"preview": "[[task]]\nquery = \"the matrix\"\nanswer = \"tt0133093\"\n\n[[task]]\nquery = \"homey the clown\"\nanswer = \"tt0701128\"\n\n[[task]]\nqu"
},
{
"path": "data/test/small/title.akas.tsv",
"chars": 1701,
"preview": "titleId\tordering\ttitle\tregion\tlanguage\ttypes\tattributes\tisOriginalTitle\ntt0096697\t10\tSimpsonovi\tSI\t\\N\timdbDisplay\t\\N\t0\nt"
},
{
"path": "data/test/small/title.basics.tsv",
"chars": 5492,
"preview": "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\ntt0348034\ttvEpisode\tSimpsons"
},
{
"path": "data/test/small/title.episode.tsv",
"chars": 1495,
"preview": "tconst\tparentTconst\tseasonNumber\tepisodeNumber\ntt0348034\ttt0096697\t1\t1\ntt0701059\ttt0096697\t1\t5\ntt0701060\ttt0096697\t3\t4\nt"
},
{
"path": "data/test/small/title.ratings.tsv",
"chars": 482,
"preview": "tconst\taverageRating\tnumVotes\ntt0000001\t5.8\t1356\ntt0000002\t6.5\t157\ntt0000003\t6.6\t939\ntt0000004\t6.4\t93\ntt0000005\t6.2\t1630"
},
{
"path": "imdb-eval/COPYING",
"chars": 126,
"preview": "This project is dual-licensed under the Unlicense and MIT licenses.\n\nYou may use this code under the terms of either lic"
},
{
"path": "imdb-eval/Cargo.toml",
"chars": 778,
"preview": "[package]\nname = \"imdb-eval\"\nversion = \"0.1.2\"\nauthors = [\"Andrew Gallant <jamslam@gmail.com>\"]\ndescription = \"\"\"\nA comm"
},
{
"path": "imdb-eval/LICENSE-MIT",
"chars": 1081,
"preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Andrew Gallant\n\nPermission is hereby granted, free of charge, to any person ob"
},
{
"path": "imdb-eval/README.md",
"chars": 807,
"preview": "imdb-eval\n=========\nA command line tool for evaluating imdb-rename's search functionality.\n\n["
},
{
"path": "imdb-index/README.md",
"chars": 738,
"preview": "imdb-index\n==========\nA library for reading and writing an IMDb index, with a focus on IMDb titles.\nIn particular, this "
},
{
"path": "imdb-index/UNLICENSE",
"chars": 1211,
"preview": "This is free and unencumbered software released into the public domain.\n\nAnyone is free to copy, modify, publish, use, c"
},
{
"path": "imdb-index/src/error.rs",
"chars": 7212,
"preview": "use std::fmt;\nuse std::path::{Path, PathBuf};\n\n/// A type alias for handling errors throughout imdb-index.\npub type Resu"
},
{
"path": "imdb-index/src/index/aka.rs",
"chars": 10613,
"preview": "use std::io;\nuse std::iter;\nuse std::path::Path;\n\nuse memmap::Mmap;\n\nuse crate::error::{Error, Result};\nuse crate::index"
},
{
"path": "imdb-index/src/index/episode.rs",
"chars": 11040,
"preview": "use std::cmp;\nuse std::path::Path;\nuse std::u32;\n\nuse fst::{IntoStreamer, Streamer};\nuse memmap::Mmap;\n\nuse crate::error"
},
{
"path": "imdb-index/src/index/id.rs",
"chars": 2077,
"preview": "use std::fs::File;\nuse std::io;\nuse std::path::Path;\n\nuse memmap::Mmap;\n\nuse crate::error::{Error, Result};\nuse crate::u"
},
{
"path": "imdb-index/src/index/mod.rs",
"chars": 20242,
"preview": "use std::fs;\nuse std::io;\nuse std::path::{Path, PathBuf};\nuse std::thread;\nuse std::time::Instant;\n\nuse memmap::Mmap;\nus"
},
{
"path": "imdb-index/src/index/names.rs",
"chars": 58682,
"preview": "use std::cmp;\nuse std::collections::{binary_heap, BinaryHeap};\nuse std::fmt;\nuse std::fs::File;\nuse std::io::{self, Writ"
},
{
"path": "imdb-index/src/index/rating.rs",
"chars": 4728,
"preview": "use std::path::Path;\n\nuse fst::{IntoStreamer, Streamer};\nuse memmap::Mmap;\n\nuse crate::error::{Error, Result};\nuse crate"
},
{
"path": "imdb-index/src/index/tests.rs",
"chars": 2849,
"preview": "use std::path::{Path, PathBuf};\n\n/// Create an error from a format!-like syntax.\n#[macro_export]\nmacro_rules! err {\n "
},
{
"path": "imdb-index/src/index/writer.rs",
"chars": 1672,
"preview": "use std::fs::File;\nuse std::io::{self, Write};\nuse std::path::Path;\n\nuse crate::error::Result;\nuse crate::util::create_f"
},
{
"path": "imdb-index/src/lib.rs",
"chars": 1059,
"preview": "/*!\nThis crate provides an on-disk indexing data structure for searching IMDb.\nSearching is primarily done using informa"
},
{
"path": "imdb-index/src/record.rs",
"chars": 8426,
"preview": "use std::cmp;\nuse std::fmt;\nuse std::str::FromStr;\n\nuse serde::{Deserialize, Deserializer, Serialize};\n\nuse crate::error"
},
{
"path": "imdb-index/src/scored.rs",
"chars": 6737,
"preview": "use std::cmp;\nuse std::collections::BinaryHeap;\nuse std::num::FpCategory;\nuse std::vec;\n\n/// A collection of scored valu"
},
{
"path": "imdb-index/src/search.rs",
"chars": 33556,
"preview": "use std::cmp;\nuse std::f64;\nuse std::fmt;\nuse std::result;\nuse std::str::FromStr;\n\nuse lazy_static::lazy_static;\nuse reg"
},
{
"path": "imdb-index/src/util.rs",
"chars": 6334,
"preview": "use std::fmt;\nuse std::fs::File;\nuse std::io;\nuse std::path::Path;\nuse std::time;\n\nuse memmap::Mmap;\n\nuse crate::error::"
},
{
"path": "rustfmt.toml",
"chars": 44,
"preview": "max_width = 79\nuse_small_heuristics = \"max\"\n"
},
{
"path": "src/download.rs",
"chars": 4663,
"preview": "use std::fs::{self, File};\nuse std::io;\nuse std::path::{Path, PathBuf};\n\nuse {anyhow::Context, flate2::read::GzDecoder};"
},
{
"path": "src/logger.rs",
"chars": 1533,
"preview": "// This module defines a super simple logger that works with the `log` crate.\n// We don't need anything fancy; just basi"
},
{
"path": "src/main.rs",
"chars": 13091,
"preview": "use std::env;\nuse std::ffi::OsStr;\nuse std::io::{self, Write};\nuse std::path::PathBuf;\nuse std::process;\n\nuse imdb_index"
},
{
"path": "src/rename.rs",
"chars": 30170,
"preview": "use std::collections::{HashMap, HashSet};\nuse std::fmt;\nuse std::fs;\nuse std::path::{Path, PathBuf};\nuse std::sync::Mute"
},
{
"path": "src/util.rs",
"chars": 4665,
"preview": "use std::io::{self, Write};\n\nuse imdb_index::{Episode, MediaEntity, Scored, Searcher, Title};\nuse tabwriter::TabWriter;\n"
}
]
About this extraction
This page contains the full source code of the BurntSushi/imdb-rename GitHub repository, extracted and formatted as plain text: 46 files (289.3 KB, approximately 75.3k tokens) plus a symbol index of 473 extracted functions, types, methods, and constants. Extracted by GitExtract, built by Nikandr Surkov.