10 changes: 8 additions & 2 deletions README.md
@@ -53,7 +53,7 @@ pip install semantra

## Usage

-Semantra operates on collections of documents — text or PDF files — stored on your local computer.
+Semantra operates on collections of documents — text or PDF files, or directories containing documents — stored on your local computer.

At its simplest, you can run Semantra over a single document by running:

@@ -67,6 +67,12 @@ You can run Semantra over multiple documents, too:
semantra report.pdf book.txt
```

+Directories are expanded recursively, so you can also run Semantra over a folder of documents:
+
+```sh
+semantra notes/
+```
+
Semantra will take some time to process the input documents. This is a one-time operation per document (subsequent runs over the same document collection will be near instantaneous).

Once processing is complete, Semantra will launch a local webserver, by default at [localhost:8080](http://localhost:8080). On this web page, you can interactively query the passed in documents semantically.
@@ -116,7 +122,7 @@ Another difference is that Semantra will not necessarily find exact text matches
## Command-line reference

```sh
-semantra [OPTIONS] [FILENAME(S)]...
+semantra [OPTIONS] [FILENAME(S) OR DIRECTORY(S)]...
```

## Options
19 changes: 19 additions & 0 deletions src/semantra/semantra.py
@@ -54,6 +54,21 @@ def get_text_content(md5, filename, semantra_dir, force, silent, encoding):
return Content(rawtext, filename)


+def expand_input_filenames(filenames):
+    expanded_filenames = []
+    for filename in filenames:
+        if os.path.isdir(filename):
+            for root, dirnames, child_filenames in os.walk(filename):
+                dirnames.sort()
+                for child_filename in sorted(child_filenames):
+                    child_path = os.path.join(root, child_filename)
+                    if os.path.isfile(child_path):
+                        expanded_filenames.append(child_path)
+        else:
+            expanded_filenames.append(filename)
+    return tuple(expanded_filenames)
+
+
TRANSFORMER_POOL_DEFAULT = 15000


@@ -580,6 +595,10 @@ def main(
     if filename is None or len(filename) == 0:
         raise click.UsageError("Must provide a filename to process/query")

+    filename = expand_input_filenames(filename)
+    if len(filename) == 0:
+        raise click.UsageError("No files found to process/query")
+
     processed_windows = list(process_windows(windows))

     if transformer_model is not None:
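
Reviewer note: the behavior of the new helper can be checked in isolation with a small sketch like the one below. The function body is copied from the diff above. The `demo` function, the temporary directory layout (`notes/`, `notes/sub/`, `empty/`), and the file names are hypothetical and exist only for illustration; they are not part of the PR.

```python
import os
import tempfile


def expand_input_filenames(filenames):
    # Copied from the diff: directories are walked recursively,
    # every other argument is passed through unchanged.
    expanded_filenames = []
    for filename in filenames:
        if os.path.isdir(filename):
            for root, dirnames, child_filenames in os.walk(filename):
                # Sorting in place makes os.walk descend in a stable order.
                dirnames.sort()
                for child_filename in sorted(child_filenames):
                    child_path = os.path.join(root, child_filename)
                    if os.path.isfile(child_path):
                        expanded_filenames.append(child_path)
        else:
            expanded_filenames.append(filename)
    return tuple(expanded_filenames)


def demo():
    # Hypothetical layout, built in a temp dir purely for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        notes = os.path.join(tmp, "notes")
        os.makedirs(os.path.join(notes, "sub"))
        empty = os.path.join(tmp, "empty")
        os.makedirs(empty)
        for rel in ("b.txt", "a.txt", os.path.join("sub", "c.txt")):
            with open(os.path.join(notes, rel), "w") as f:
                f.write("x")

        # Mix a directory with a plain (here nonexistent) file argument.
        expanded = expand_input_filenames((notes, "report.pdf"))
        # Strip the temp prefix so the result is easy to read and compare.
        relative = [
            os.path.relpath(p, tmp) if os.path.isabs(p) else p for p in expanded
        ]
        return relative, expand_input_filenames((empty,))


if __name__ == "__main__":
    relative, empty_result = demo()
    print(relative)
    print(empty_result)
```

Two things stand out in this sketch. Non-directory arguments are appended without an existence check, so a missing file such as `report.pdf` still passes through and any error would surface later in processing. A directory with no files expands to an empty tuple, which is the case the new `No files found to process/query` check in `main` is there to catch.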