Travdata - getting data out of the PDFs

huin80

Mongoose
Announcing: Travdata. Its primary feature of note is extracting tables from MgT2 PDF files for your own usage ONLY.

Latest release: 0.4.0.

NOTE: It requires a Java runtime to be installed on your computer (required by the library that it uses to pull tabular data out of PDF files).
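
A utility that wraps the extraction could check for this prerequisite up front. The following is a minimal sketch, not part of Travdata itself: it assumes only that the Java runtime is reachable as `java` on the PATH, and the function name `java_available` is made up for illustration.

```python
import shutil


def java_available():
    """Return True if a `java` executable can be found on the PATH."""
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("Java runtime found.")
    else:
        print("No `java` on PATH; install a Java runtime before extracting tables.")
```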

Further documentation is available both in the program ZIP files and in the project's README on GitHub.

To set expectations:
  • Scope: so far I've only configured the program to extract some of the tables from the Core Rulebook Update 2022 PDF. More can be added over time - most of my effort so far has been proving the concept, and setting up a releasable build.
  • Compatibility/stability: it's possible that I may change some details of the CSV data output format, as well as the output directory structure, and the configuration data itself. So any utilities using this data may break until I stabilise things to see what works.
  • Portability: I've only run the release executables on my own computers (Linux and Windows 10), and so far only for 64-bit AMD/Intel architectures. I've also released executables for MacOS, but have no ability to test them myself.

Expected usage

You own a legal copy of a MgT2 book in PDF format. You want to be able to use the table data for your own direct purposes, but copying the data out by hand is laborious. It would be breaking copyright to extract the data yourself and send it to others with the same need, so without a tool we'd each be stuck choosing between independently repeating the work of extracting the data by hand, or just making do with using the data by hand.

With this tool, you can extract the data for your own purposes, write utilities, spreadsheets, etc.

In this way, the means to extract data for your fair usage are distributed, but not the data itself.

I hope to see utilities grow into this space to make referee and player lives easier by consuming this CSV data, as provided by their direct user. (Auto-Jimmy? VTTs?)

Reporting issues

Report any problems you encounter or feature requests to https://github.com/huin/travdata/issues.
Please include:
  • information about which operating system you are using the program on,
  • steps to reproduce the problem,
  • what you expected to happen,
  • what actually happened.
Ideally include text output of any error messages from the program, and/or screenshots to demonstrate the problem if text output is not relevant.
 
This is very cool ... and might be just what I am looking for to complement the Campaign Manager/Tool I've been playing around with. My struggle has been how to get the data from the books into the tool. This might be the ticket. Going to play around with it a bit and may reach out to you with some questions.

Nice work!
 
Thanks for the reply! I'm glad that I am [attempting to] solve a genuine problem shared by others :)

In the meantime, I'm working on newer releases that cover more tables for extraction (currently pending adding configuration for 61 more tables from the core rules, which more than doubles the current 41). Also pondering including metadata in the output files.
 
Correction, 118 more tables - taking the total to 159 so far from the core rulebook.

Yikes, this has a lot more tables than I expected (edit: which kinda speaks to the point of all this - MgT2 is very data-heavy and crunchy, and really benefits from being able to streamline things for the referee).
 
To sketch out some more of my thoughts for how I expect this to work, user-journey-wise:

End-user

An end user wants to use some MgT2 utility program that needs MgT2 data.

They install the utility, which gets the data directly from the end user's legally owned PDFs in one of two ways:
  1. It asks Travdata to extract the data.
  2. It uses Travdata as a library and extracts the required data itself. (Speculative)

Utility developer

Writes a program (or an element of one, like a VTT plugin) that can consume Travdata extracted data. When they distribute the utility to end-users, the program does not contain the data itself (see the "end-user" user journey above).
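
As a rough illustration of the consuming side, a utility might load whatever CSV files the end user has extracted into memory. This is only a sketch under assumptions: the output is taken to be a directory tree of `.csv` files, the function name `load_tables` is hypothetical, and the layout and CSV details may change as noted earlier in the thread.

```python
import csv
from pathlib import Path


def load_tables(output_dir):
    """Map each CSV file's path (relative to output_dir, without suffix) to its rows."""
    root = Path(output_dir)
    tables = {}
    for csv_path in sorted(root.rglob("*.csv")):
        # newline="" lets the csv module handle line endings itself.
        with csv_path.open(newline="", encoding="utf-8") as f:
            rows = list(csv.reader(f))
        key = csv_path.relative_to(root).with_suffix("").as_posix()
        tables[key] = rows
    return tables
```

The utility ships only this loading code; the rows themselves come from the user's own extraction run.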

Probably needs metadata about each table that is either:
  1. Embedded in the table data itself.
  2. Present in Travdata's own configuration (which mirrors the directory structure of the extracted data).
The most useful aspect of the metadata will be tags, such as "type/career-progress" or "rank/law-enforcement" (to use two examples added in v0.3.2), so that a consuming program knows what each table relates to.
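
To make the idea of tags concrete, here is a small sketch of how a consuming program might select tables by tag. How tags will actually be exposed is not settled in this thread, so the mapping from table path to a set of tags, the table paths, the "type/equipment" tag, and both function names are assumptions for illustration only.

```python
# Hypothetical data: table path -> set of tags. Only "type/career-progress"
# and "rank/law-enforcement" are tag names mentioned in the thread.
EXAMPLE_TAGS = {
    "careers/agent/ranks": {"type/career-progress", "rank/law-enforcement"},
    "careers/army/ranks": {"type/career-progress"},
    "equipment/armour": {"type/equipment"},
}


def tables_with_tag(tags_by_table, wanted):
    """Return the sorted paths of tables carrying exactly the tag `wanted`."""
    return sorted(path for path, tags in tags_by_table.items() if wanted in tags)


def tables_in_namespace(tags_by_table, namespace):
    """Return the sorted paths of tables with any tag under `namespace/`."""
    prefix = namespace.rstrip("/") + "/"
    return sorted(
        path
        for path, tags in tags_by_table.items()
        if any(tag.startswith(prefix) for tag in tags)
    )
```

With tags like these, a consumer can find the tables it needs without depending on file names or directory layout.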

There is also some other structure: books contain recursively nested groups of tables, the top level of which is _loosely_ organised around book chapters. However, this structure is informal: it is more arbitrary, less suited to programmatic usage, and likely to be unstable and subject to the whims of filesystem reorganisation.
 
This works for me as I am loading the tables it generates into my own data structures. The database stays local to the user's device and will act as a repository for the Traveller data as well as their own. For instance, I have all of the character generation methods built into a rules engine that allows the GM to pick and choose how they want to do character generation. You can even mix & match the methods (die rolling, boon dice, re-rolling, point-buy, package base, etc.). What I was lacking was a way to ensure the PDF exists, and that's where the Utility side will come in.

Perfect match for what I am looking for.
 
Update on upcoming feature work on this:
  • Core Rulebook 2022:
    • Cover tables in the remaining chapters.
    • Cover tables in previous chapters, possibly including table-like data. This has been blocked on implementing more configurable table transforms.
  • Speculative but likely: include metadata in the output data. Specifically including tags, which would allow consuming programs to know what the table contains, without guessing based on the arbitrary file name and path within the output directory.
    • I'm interested in feedback on this, and in particular how the metadata might be presented. For CSV files, I've got two options in mind:
      • Include metadata at the end of the file after a blank row and a row with a magic value meaning that metadata rows follow.
      • Include metadata as an external YAML file at the top-level of the output subdirectory for the book. This would be similar to, or a subset of, the information in the extraction configuration - primarily representing the tags and directory/file structure.
  • Very speculative, but interested in feedback:
    • Output in other formats, e.g. YAML, JSON, something else?
    • Output data in an archive file (ZIP file, with metadata inside). This might be useful to be able to present to a consuming tool as a single file rather than a directory. Especially something like a Web-based VTT that might take the data as an upload. Probably one ZIP per extracted book?
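
For feedback purposes, the first CSV option above (metadata after a blank row and a magic-value row) could be consumed roughly as follows. This is a sketch of a proposal that has not been implemented: the marker string `#metadata`, the shape of the metadata rows, and the function name are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical marker; the thread does not say what the magic value would be.
MAGIC = "#metadata"


def split_table_and_metadata(text, magic=MAGIC):
    """Split CSV text into (table_rows, metadata_rows).

    Metadata is assumed to follow a blank row and a row whose first cell
    is `magic`. If no such pair is found, every row is treated as table data.
    """
    rows = list(csv.reader(io.StringIO(text, newline="")))
    for i in range(len(rows) - 1):
        blank = not any(rows[i])  # [] or a row of empty cells
        if blank and rows[i + 1][:1] == [magic]:
            return rows[:i], rows[i + 2:]
    return rows, []
```

One trade-off this makes visible: any consumer that reads the file with a plain CSV reader and does not know about the marker would see the metadata rows as table data, which the external YAML option avoids.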
 
Just released version 0.4.0.
  • 257 tables from Core Rulebook Update 2022 (pretty much everything table-like up to page 187).

Highlights from latest release:

Extraction configuration:

  • Multiple new chapters configured for extraction from Core Rules Update 2022.
    • Equipment.
    • Vehicles.
    • Spacecraft operations.
    • Space combat.
    • Bringing the total extracted tables to 257.
  • More tagging of new and existing tables (still to do: surface this tagging information more in the output and GUI).
  • Fixes and improvements for existing tables.
    • This does change the extracted structure for some tables, but they should be more consistent moving forwards.
  • Configuration is now released as a separate download from the program downloads.
    • Configuration remains bundled with the binaries. However, if you wish to use a later configuration, you may only need to download the configuration.
    • Note that configurations may use new features from later versions of the program releases, so you may still need to download an updated version of the program you're using if the configuration does not load correctly.

General:

  • Can now load configuration in ZIP files.
    • No need to unzip a configuration file (although that will still work); the program can read configuration directly from a ZIP file.
    • Fewer files and directories cluttering your filesystem.
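
The general technique of reading the same file from either an unpacked directory or a ZIP archive can be sketched with the standard library. This is not Travdata's actual implementation, and the function name `read_text` and any member name passed to it are hypothetical.

```python
import zipfile
from pathlib import Path


def read_text(source, member):
    """Read `member` as UTF-8 text from `source`, a directory or a ZIP file."""
    source = Path(source)
    if source.is_dir():
        return (source / member).read_text(encoding="utf-8")
    # Otherwise treat the source as a ZIP archive and read the member in place.
    with zipfile.ZipFile(source) as zf:
        return zf.read(member).decode("utf-8")
```

A caller passes the same relative member name in both cases, so the rest of the program does not need to know which form the user downloaded.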

GUI:

  • More detailed progress.
  • Error handling should surface more things, and more visibly as errors.
 