Travdata - getting data out of the PDFs

huin80

Announcing: Travdata. Its primary feature extracts tables from MgT2 PDF files for your own usage ONLY.

Latest release: 0.6.2.

NOTE: It requires a Java runtime to be installed on your computer (required by the library that it uses to pull tabular data out of PDF files).

Further documentation is available both in the program ZIP files and at the project's README on GitHub.

To set expectations:
  • Scope: so far I've only configured the program to extract some of the tables from the Core Rulebook Update 2022 PDF. More can be added over time - most of my effort so far has been proving the concept and setting up a releasable build.
  • Compatibility/stability: it's possible that I may change some details of the CSV data output format, the output directory structure, and the configuration data itself. So any utilities using this data may break until I see what works and stabilise things.
  • Portability: I've only run the release executables on my own computers (Linux and Windows 10), and so far only for 64-bit AMD/Intel architectures. I've also released executables for macOS, but have no ability to test them myself.

Expected usage​

You own a legal copy of a MgT2 book in PDF format. You want to use the table data for your own direct purposes, but copying it out by hand is laborious. It would breach copyright to extract the data yourself and send it to others with the same need, so we would otherwise be stuck choosing between independently repeating the work of extracting data by hand, or just making do with lots of by-hand usage of the data.

With this tool, you can extract the data for your own purposes, write utilities, spreadsheets, etc.

In this way, the means to extract data for your fair usage are distributed, but not the data itself.

I hope to see utilities grow into this space to make referee and player lives easier by consuming this CSV data, as provided by their direct user. (Auto-Jimmy? VTTs?)
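To give a feel for what "your own purposes" might look like, here's a minimal sketch of a utility consuming a single extracted table using only the Python standard library. The output path and file name are hypothetical; they depend on where you point the extraction and on the book's configured directory structure.

```python
# Minimal sketch of consuming one extracted table CSV.
# The path below is hypothetical - it depends on your chosen output
# directory and the book's configured directory structure.
import csv
from pathlib import Path

table_path = Path("output/core-rulebook-2022/equipment/armour.csv")

with table_path.open(newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(row)
```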

Reporting issues​

Report any problems you encounter or feature requests to https://github.com/huin/travdata/issues.
Please include:
  • information about which operating system you are using the program on,
  • steps to reproduce the problem,
  • what you expected to happen,
  • what actually happened.
Ideally include text output of any error messages from the program, and/or screenshots to demonstrate the problem if text output is not relevant.
 
This is very cool ... and might be just what I am looking for to complement the Campaign Manager/Tool I've been playing around with. My struggle has been how to get the data from the books into the tool. This might be the ticket. Going to play around with it a bit and may reach out to you with some questions.

Nice work!
 
Thanks for the reply! I'm glad that I am [attempting to] solve a genuine problem shared by others :)

In the meantime, I'm working on newer releases that cover more tables for extraction (configuration is currently pending for 61 more tables from the core rules, which more than doubles the current 41). I'm also pondering including metadata in the output files.
 
Correction, 118 more tables - taking the total to 159 so far from the core rulebook.

Yikes, this has a lot more tables than I expected (edit: which kinda speaks to the point of all this - MgT2 is very data-heavy and crunchy, and really benefits from being able to streamline things for the referee).
 
To sketch out some more of my thoughts for how I expect this to work, user-journey-wise:

End-user​

An end user wants to use some MgT2 utility program that needs MgT2 data.

They install the utility, which gets the data directly from the end user's legally owned PDFs by either:
  1. Requesting extraction of the data using Travdata, or
  2. Using Travdata as a library and extracting the required data itself. (Speculative)

Utility developer​

A utility developer writes a program (or an element of one, like a VTT plugin) that can consume Travdata-extracted data. When they distribute the utility to end users, the program does not contain the data itself (see the "end-user" user journey above).

The utility probably needs metadata about each table, which could be either:
  1. Embedded in the table data itself.
  2. Present in Travdata's own configuration (which mirrors the directory structure of the extracted data).
The most useful aspect of the metadata will be tags, such as "type/career-progress" or "rank/law-enforcement" (to use two examples added in v0.3.2), so that the utility knows what the table relates to.

There is some other structure: books containing recursive groups of tables, where the top level is _loosely_ structured around book chapters. However, this is an informal structure: it is more arbitrary, less suitable for programmatic usage, and likely unstable and subject to the whims of filesystem reorganisation.
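As a rough illustration of the current state of things (before any richer metadata exists), a consuming utility can only discover tables by walking that directory structure. A sketch, with a hypothetical output root:

```python
# Sketch: discover extracted tables by walking the output directory.
# The root path is hypothetical; the directory layout mirrors the
# extraction configuration (book -> nested groups -> table CSVs).
from pathlib import Path

output_root = Path("output/core-rulebook-2022")

for csv_path in sorted(output_root.rglob("*.csv")):
    # Until tags/metadata are surfaced in the output, the relative path is
    # the only identity a table has.
    print(csv_path.relative_to(output_root))
```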
 
This works for me as I am loading the tables it generates into my own data structures. The database stays local to the user's device and will act as a repository for the Traveller data as well as their own. For instance, I have all of the character generation methods built into a rules engine that allows the GM to pick and choose how they want to do character generation. You can even mix & match the methods (die rolling, boon dice, re-rolling, point-buy, package base, etc.). What I was lacking was a way to ensure the PDF exists, and that's where the Utility side will come in.

Perfect match for what I am looking for.
 
Update on upcoming feature work on this:
  • Core Rulebook 2022:
    • Cover tables in the remaining chapters.
    • Cover tables in previous chapters, possibly including table-like data. This has been blocked on implementing more configurable table transforms.
  • Speculative but likely: include metadata in the output data. Specifically including tags, which would allow consuming programs to know what the table contains, without guessing based on the arbitrary file name and path within the output directory.
    • I'm interested in feedback on this, and in particular how the metadata might be presented. For CSV files, I've got two options in mind:
      • Include metadata at the end of the file after a blank row and a row with a magic value meaning that metadata rows follow.
      • Include metadata as an external YAML file at the top level of the output subdirectory for the book. This would be similar to, or a subset of, the information in the extraction configuration - primarily representing the tags and directory/file structure.
  • Very speculative, but interested in feedback:
    • Output in other formats, e.g. YAML, JSON, something else?
    • Output data in an archive file (ZIP file, with metadata inside). This might be useful for presenting to a consuming tool as a single file rather than a directory, especially something like a Web-based VTT that might take the data as an upload. Probably one ZIP per extracted book?
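To make the first CSV metadata option above a bit more concrete, here's a purely illustrative sketch of how a consumer might split table rows from trailing metadata rows. The magic marker value is invented for the example; nothing like this is implemented yet.

```python
# Illustrative only: split a CSV into table rows and trailing metadata rows,
# where metadata follows a blank row and a magic marker row.
# MAGIC_MARKER is a made-up value for this sketch.
import csv

MAGIC_MARKER = "__travdata-metadata__"

def split_table_and_metadata(path):
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    for i, row in enumerate(rows):
        is_blank = not any(cell.strip() for cell in row)
        if is_blank and i + 1 < len(rows) and rows[i + 1][:1] == [MAGIC_MARKER]:
            return rows[:i], rows[i + 2:]
    return rows, []
```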
 
Just released version 0.4.0.
  • 257 tables from Core Rulebook Update 2022 (pretty much everything table-like up to page 187).

Highlights from latest release:​

Extraction configuration:​

  • Multiple new chapters configured for extraction from Core Rules Update 2022.
    • Equipment.
    • Vehicles.
    • Spacecraft operations.
    • Space combat.
    • Bringing the total extracted tables to 257.
  • More tagging of new and existing tables (still to do: surface this tagging information more in the output and GUI).
  • Fixes and improvements for existing tables.
    • This does change the extracted structure for some tables, but they should be more consistent moving forwards.
  • Configuration is now released as a separate download from the program downloads.
    • Configuration remains bundled with the binaries; however, if you wish to use a later configuration, you may only need to download the configuration.
    • Note that configurations may use new features from later versions of the program releases, so you may still need to download an updated version of the program you're using if the configuration does not load correctly.

General:​

  • Can now load configuration in ZIP files.
    • No need to unzip a configuration file (although that will still work); the program can read configuration directly from a ZIP file.
    • Fewer files and directories cluttering your filesystem.

GUI:​

  • More detailed progress reporting.
  • Error handling should surface more problems, and more visibly, as errors.
 
Released version 0.4.1.

Highlights​

  • GUI can now output to ZIP files.
  • Output directory/ZIP file now contains index.csv, containing metadata about extracted tables, including:
    • Path to CSV within directory/ZIP file.
    • Page number(s) from which the table data was sourced (separated by semicolons ;).
    • Tag(s) identifying/grouping tables.
No extraction configuration changes.
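For anyone wanting to build on this right away, here's a sketch of pulling index.csv out of an output ZIP (assuming it sits at the top level of the archive) and filtering tables by tag. Treat the column names and the tag separator as illustrative rather than definitive, and check index.csv itself for the exact layout.

```python
# Sketch: read index.csv from an output ZIP and find tables by tag.
# Column names ("path", "pages", "tags") and the use of ";" between tags are
# illustrative assumptions - check index.csv itself for the exact layout.
import csv
import io
import zipfile

def tables_with_tag(zip_path, wanted_tag):
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open("index.csv") as raw:
            reader = csv.DictReader(io.TextIOWrapper(raw, encoding="utf-8"))
            for row in reader:
                tags = (row.get("tags") or "").split(";")
                if wanted_tag in tags:
                    yield row.get("path"), row.get("pages")

# Hypothetical usage (ZIP name invented for the example):
# for path, pages in tables_with_tag("core-rulebook-2022.zip", "type/career-progress"):
#     print(path, pages)
```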
 
I'd like to check in and ask:
  1. Is the program working?
    I don't have telemetry in it, so I have no data on whether it crashes on startup for people, or otherwise fails to function correctly.
  2. What's the most important thing that's missing from it?
    1. Particular output formats?
    2. Particular tables/books that would be most worthwhile to extract data from?
    3. Unsupported platform(s)?
    4. Something else?
 
Released version 0.4.2.

Many thanks to a user for reporting 2 bugs affecting the Windows build.

Full Changelog: v0.4.1...v0.4.2

Highlights​

  • Fix #14 (attribute error when running CLI on Windows).
  • Fix #15 (reading configuration when running CLI or GUI on Windows).
  • Fix #16 (prevent bugs like #14 and #15 from slipping into releases, where they would have been caught by existing checks that were not run on release).
 
Released version 0.5.0.

This was quite delayed because I spent a lot of time speculatively porting the codebase to Rust; that work is incomplete (more work remains around the GUI and the release/distribution process). I decided not to hold off any longer on completing configuration of the extraction of the Core Rulebook 2022 - so here it is! It tops out at 441 extracted tables or table-like things.

There's still potentially more data that could be pulled from the Core Rulebook that is table-like. However, I figure that I'll do that on an on-request basis, rather than speculatively add things that might go unused.

Full Changelog: v0.4.1...v0.5.0

Highlights​

  • Completed extraction for the remaining tables in Core Rulebook 2022, specifically by completing the final chapters:
    • Common Spacecraft
    • Psionics
    • Trade
    • World Creation
  • Minor addition to extraction code to support another table transformation.
 
@Anstett thanks, this is pretty cool and could be useful. I see @huin80 is porting this to Rust, which is more in my skillset (we see Python sadly as mostly for school children, so I've skipped py3); Rust is more my bag. Great work though, really good work.

I'm building a new tool to build the content for FG, but also for agnostic use with other VTTs that we may move into. It's proving slow as FG is XML-based and thus chunky. I shall read the Python code though and see how things work, as my own extraction tool (in NodeJS) does almost everything, but not everything: some tables are mis-aligned when converted to text.

Cheers,
MBM
 
The Rust port (on the `rust-experiment` branch) is approximately complete in terms of CLI-based extraction, if that helps. It's less complete in terms of the GUI and release for distribution.
 
I don't suppose the list of skills lends itself to being extracted? It would be nice to just run a macro in Foundry to pull in names and descriptions of the MgT2e-specific skill list instead of having to start from the Cepheus ones.
 