
# rga - ripgrep, but also search in PDFs, E-Books, Office documents, zip, tar.gz, etc

rga is a line-oriented search tool that allows you to look for a regex in a multitude of file types. It is a wrapper around the awesome [ripgrep] that enables it to search in pdf, docx, pptx, movie subtitles (mkv, mp4), sqlite, etc.

[![Linux build status](https://api.travis-ci.org/phiresky/ripgrep_all.svg)](https://travis-ci.org/phiresky/ripgrep_all)
[![Crates.io](https://img.shields.io/crates/v/ripgrep_all.svg)](https://crates.io/crates/ripgrep_all)

## Future Work

- I wanted to add a photograph adapter (based on object classification / detection) for fun. It worked with [YOLO](https://pjreddie.com/darknet/yolo/), but something more useful and state-of-the-art [like this](https://github.com/aimagelab/show-control-and-tell) proved very hard to integrate.
- 7z adapter (couldn't find a nice-to-use Rust library)
- allow per-adapter configuration options (probably via env (RGA_ADAPTER_CONF=json))
- there's some more (mostly technical) todos in the code

## Examples

Say you have a large folder of papers or lecture slides, and you can't remember which one of them mentioned `LSTM`s. With rga, you can just run this:

```
rga "LSTM|GRU" collection/
[results]
```

and it will recursively search for the regex in PDFs and pptx slides, even if some of them are zipped up.

You can do mostly the same thing with [`pdfgrep -r`][pdfgrep], but it will be much slower and you will miss content in other file types.

```barchart
title: Searching in 20 pdfs with 100 slides each
subtitle: lower is better
data:
   - pdfgrep: 123s
   - rga (first run): 10.3s
   - rga (subsequent runs): 0.1s
```

On the first run, rga is faster mostly because of multithreading. PDF parsing is slow, so rga caches the extracted text; subsequent runs on the same files (with any query) reuse it instead of parsing again.
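To make the caching idea concrete, here is a minimal in-memory sketch. It is not rga's implementation: the `ExtractionCache` type, the `get_or_extract` method, and the choice of (path, modification time) as the cache key are all assumptions made for illustration, and rga's real cache lives on disk.

```rust
use std::collections::HashMap;

/// In-memory stand-in for an extraction cache (hypothetical; not rga's code).
struct ExtractionCache {
    // Assumed key: file path plus modification time, so edited files are re-extracted.
    entries: HashMap<(String, u64), String>,
}

impl ExtractionCache {
    fn new() -> Self {
        ExtractionCache { entries: HashMap::new() }
    }

    /// Return the cached text for (path, mtime), running `extract` only on a miss.
    fn get_or_extract<F: FnOnce() -> String>(
        &mut self,
        path: &str,
        mtime: u64,
        extract: F,
    ) -> &str {
        self.entries
            .entry((path.to_string(), mtime))
            .or_insert_with(extract)
    }
}
```

Because the key does not include the search pattern, a second search with a different regex would still hit the cache, which matches the "with any query" behaviour described above.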

## Setup

rga should compile with stable Rust. To install it, simply run (your OS's equivalent of)

```bash
apt install build-essential pandoc poppler-utils ffmpeg
cargo install ripgrep_all

rga --help # works! :)
```

The dependencies are optional, but if one is missing you will see an error when rga tries to read the corresponding file type (e.g. poppler-utils for pdf).

## Technical details

`rga` simply runs ripgrep (`rg`) with some options set, especially `--pre=rga-preproc` and `--pre-glob`.
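As a rough illustration of this wrapping, the sketch below assembles such an `rg` invocation in Rust. `--pre` and `--pre-glob` are real ripgrep flags; the helper names `rg_args` and `rg_command` and the glob list passed in are assumptions for this example, not rga's actual code or its real set of options.

```rust
use std::process::Command;

/// Build the argument list for an `rg` call that routes matching files
/// through a preprocessor. The globs are supplied by the caller and are
/// illustrative only.
fn rg_args(pattern: &str, dir: &str, pre_globs: &[&str]) -> Vec<String> {
    let mut args = vec!["--pre=rga-preproc".to_string()];
    for glob in pre_globs {
        // Only files matching a --pre-glob are sent to the preprocessor.
        args.push(format!("--pre-glob={}", glob));
    }
    args.push(pattern.to_string());
    args.push(dir.to_string());
    args
}

/// Wrap the arguments in a `Command`; nothing is spawned until the caller runs it.
fn rg_command(pattern: &str, dir: &str, pre_globs: &[&str]) -> Command {
    let mut cmd = Command::new("rg");
    cmd.args(rg_args(pattern, dir, pre_globs));
    cmd
}
```

Calling `rg_command("LSTM|GRU", "collection/", &["*.pdf"]).status()` would then run ripgrep with the preprocessor enabled for PDF files, assuming both `rg` and `rga-preproc` are on the `PATH`.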

`rga-preproc [fname]` will match an "adapter" to the given file based on either its filename or its mime type (if `--accurate` is given). You can see all adapters currently included in [src/adapters](src/adapters).
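The filename-based path can be pictured as a lookup on the file extension. The following is a simplified sketch, not the code in `src/adapters`: the `Adapter` enum, the `adapter_for` function, and the extension-to-adapter mapping are hypothetical, and mime-type detection (the `--accurate` case) is left out.

```rust
use std::path::Path;

/// Hypothetical adapter identifiers; the real set lives in src/adapters.
#[derive(Debug, PartialEq)]
enum Adapter {
    Poppler,
    Pandoc,
    Ffmpeg,
    Zip,
}

/// Pick an adapter from the file extension alone.
/// Returns `None` when the file has no extension or no adapter claims it.
fn adapter_for(fname: &str) -> Option<Adapter> {
    // Lowercase the extension so that `paper.PDF` and `paper.pdf` match alike.
    let ext = Path::new(fname).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(Adapter::Poppler),
        "docx" | "epub" => Some(Adapter::Pandoc),
        "mkv" | "mp4" => Some(Adapter::Ffmpeg),
        "zip" => Some(Adapter::Zip),
        _ => None,
    }
}
```

Matching on the name is cheap because the file never has to be opened, which is presumably why it is the default; inspecting the content to determine the mime type is more reliable for misnamed files but costs a read.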

Some rga adapters run external binaries to do the actual work (such as pandoc or ffmpeg), usually by writing to stdin and reading from stdout.
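A minimal sketch of that stdin/stdout pattern, using only the Rust standard library, might look like the following. The function name `pipe_through` is made up for this example, and the real adapters may stream data rather than buffer it in memory as done here.

```rust
use std::io::{self, Write};
use std::process::{Command, Stdio};
use std::thread;

/// Feed `input` to an external program's stdin and collect everything it
/// prints on stdout. Illustrative only; buffers the whole output in memory.
fn pipe_through(program: &str, args: &[&str], input: Vec<u8>) -> io::Result<Vec<u8>> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write on a separate thread so that a full stdout pipe cannot deadlock us.
    // Dropping `stdin` at the end of the closure signals EOF to the child.
    let mut stdin = child.stdin.take().expect("stdin was requested as piped");
    let writer = thread::spawn(move || stdin.write_all(&input));

    let output = child.wait_with_output()?;
    writer.join().expect("writer thread panicked")?;
    if !output.status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "external program failed"));
    }
    Ok(output.stdout)
}
```

For instance, `pipe_through("cat", &[], b"hello".to_vec())` should simply return the input bytes on a system where `cat` is available.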

Most adapters read the files from a [Read](https://doc.rust-lang.org/std/io/trait.Read.html), so they work completely on streamed data (that can come from anywhere including within nested archives). rga-preproc writes
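To show why working against `Read` matters, here is a small sketch of an adapter-like function that is generic over its input and output. The name `adapt_lines` and the line-prefixing behaviour are invented for illustration and are not taken from rga's source.

```rust
use std::io::{self, BufRead, BufReader, Read, Write};

/// Copy text from any reader to any writer, prefixing each line with `label`.
/// Only `Read` is required, so the input can be a file, a pipe, or an entry
/// inside an archive. Input that is not valid UTF-8 yields an error.
fn adapt_lines<R: Read, W: Write>(input: R, mut out: W, label: &str) -> io::Result<()> {
    for line in BufReader::new(input).lines() {
        writeln!(out, "{}: {}", label, line?)?;
    }
    Ok(())
}
```

Because the function never asks for a path or a seekable file, the same code could be handed the decompressed stream of a file nested inside another archive without writing it to disk first.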

## Development

To enable debug logging:

```bash
export RUST_LOG=debug
export RUST_BACKTRACE=1
```

Also remember to disable caching with `--rga-no-cache` or clear the cache in `~/.cache/rga` to debug the adapters.

## Similar tools

- [pdfgrep][pdfgrep]
- [this gist](https://gist.github.com/phiresky/5025490526ba70663ab3b8af6c40a8db) has my proof of concept version of a caching extractor to use ripgrep as a replacement for pdfgrep.
- [this gist](https://gist.github.com/ColonolBuendia/314826e37ec35c616d70506c38dc65aa) is a more extensive preprocessing script by [@ColonolBuendia](https://github.com/ColonolBuendia)

[pdfgrep]: https://pdfgrep.org/
[ripgrep]: https://github.com/BurntSushi/ripgrep