rga: ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc.

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. It is a wrapper around the awesome ripgrep that enables it to search in pdf, docx, pptx, movie subtitles (mkv, mp4), sqlite, etc.

Future Work

  • I wanted to add a photograph adapter (based on object classification / detection) for fun. It worked with YOLO, but something more useful and state-of-the-art proved very hard to integrate.
  • 7z adapter (couldn't find a nice-to-use Rust library)
  • allow per-adapter configuration options (probably via env (RGA_ADAPTER_CONF=json))

Examples

Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned LSTMs. With rga, you can just run this:

rga "LSTM|GRU" collection/
[results]

and it will recursively search for the regex in PDFs and pptx slides, even if some of them are zipped up.

You can do mostly the same thing with pdfgrep -r, but it will be much slower and you will miss content in other file types.

title: Searching in 20 pdfs with 100 slides each
subtitle: lower is better
data:
   - pdfgrep: 123s
   - rga (first run): 10.3s
   - rga (subsequent runs): 0.1s

On the first run rga is faster mainly because of multithreading. On subsequent runs (on the same files, with any query) it is faster still because rga caches the extracted text, so the slow PDF parsing is skipped.
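The caching idea can be illustrated with a minimal sketch. It is not rga's actual cache: the in-memory map, the `ExtractionCache` name, and the choice of key (file path plus modification time) are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// In-memory stand-in for an extraction cache (hypothetical type).
/// Keying by (path, mtime) is an assumption of this sketch, not
/// something the README specifies.
struct ExtractionCache {
    entries: HashMap<(String, u64), String>,
}

impl ExtractionCache {
    fn new() -> Self {
        ExtractionCache { entries: HashMap::new() }
    }

    /// Return the cached text if present; otherwise run `extract`
    /// exactly once and store its result under the key.
    fn get_or_extract<F>(&mut self, path: &str, mtime: u64, extract: F) -> &str
    where
        F: FnOnce() -> String,
    {
        self.entries
            .entry((path.to_string(), mtime))
            .or_insert_with(extract)
            .as_str()
    }
}

fn main() {
    let mut cache = ExtractionCache::new();
    let mut slow_calls = 0;
    // Three "searches" over the same file: only the first one should
    // pay for the (simulated) slow extraction.
    for _ in 0..3 {
        let text = cache.get_or_extract("paper.pdf", 1, || {
            slow_calls += 1;
            "text about LSTM".to_string()
        });
        assert!(text.contains("LSTM"));
    }
    println!("slow extractions: {}", slow_calls);
}
```

Because the query is not part of the key, any later search over the same file can reuse the stored text, which matches the "with any query" behaviour described above.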

Setup

rga should compile with stable Rust. To install it, simply run (your OS's equivalent of)

apt install build-essential pandoc poppler-utils
cargo install ripgrep_all

rga --help # works! :)

Technical details

rga simply runs ripgrep (rg) with some options set, especially --pre=rga-preproc and --pre-glob.
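As a rough sketch of what such a wrapper could look like, the function below assembles an argument list for rg. `--pre` and `--pre-glob` are ripgrep flags named above; the function name `build_rg_args` and the example glob are assumptions for illustration, and rga's real option handling is likely more involved.

```rust
/// Build the argument list for `rg` so that files matching `pre_glob`
/// are piped through the preprocessor before being searched.
/// (Hypothetical helper; the glob passed in is only an example.)
fn build_rg_args(pre_glob: &str, user_args: &[&str]) -> Vec<String> {
    let mut args = vec![
        "--pre=rga-preproc".to_string(),
        format!("--pre-glob={}", pre_glob),
    ];
    // Forward the user's pattern and paths unchanged.
    args.extend(user_args.iter().map(|a| a.to_string()));
    args
}

fn main() {
    let args = build_rg_args("*.{pdf,docx,pptx}", &["LSTM|GRU", "collection/"]);
    // Print the command line instead of spawning it, so the sketch
    // runs even when rg is not installed.
    println!("rg {}", args.join(" "));
}
```

To actually run the search, the same vector could be passed to `std::process::Command::new("rg").args(&args)`.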

rga-preproc [fname] will match an adapter to the given file based on either its filename or its mime type (if --accurate is given).
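The filename-based half of that matching might be sketched as follows. The `Adapter` struct, the adapter names, and the extension table are hypothetical and do not reflect rga's real adapter list; mime-type detection (the `--accurate` path) is deliberately left out.

```rust
use std::path::Path;

/// A hypothetical adapter description: a name plus the file
/// extensions it claims.
struct Adapter {
    name: &'static str,
    extensions: &'static [&'static str],
}

/// Illustrative table only; not rga's actual adapters.
const ADAPTERS: &[Adapter] = &[
    Adapter { name: "poppler", extensions: &["pdf"] },
    Adapter { name: "pandoc", extensions: &["docx", "epub", "odt"] },
    Adapter { name: "zip", extensions: &["zip"] },
];

/// Pick the first adapter whose extension list contains the file's
/// extension, compared case-insensitively. Returns None when the file
/// has no extension or no adapter claims it.
fn match_by_filename(fname: &str) -> Option<&'static Adapter> {
    let ext = Path::new(fname).extension()?.to_str()?.to_ascii_lowercase();
    ADAPTERS
        .iter()
        .find(|a| a.extensions.iter().any(|e| *e == ext))
}

fn main() {
    for fname in ["slides.PDF", "thesis.docx", "notes.txt"] {
        match match_by_filename(fname) {
            Some(adapter) => println!("{} -> {}", fname, adapter.name),
            None => println!("{} -> no adapter, search as plain text", fname),
        }
    }
}
```

Matching on the extension is cheap but can be fooled by misnamed files, which is presumably why a mime-type check is offered as a slower, more accurate alternative.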

Some rga adapters run external binaries

Development

To enable debug logging:

export RUST_LOG=debug
export RUST_BACKTRACE=1

Also remember to disable caching with --rga-no-cache or clear the cache in ~/.cache/rga to debug the adapters.

Similar tools