Travdata - getting data out of the PDFs

I don't suppose the list of skills lends itself to being extracted? Would be nice to just run a macro in Foundry to pull in names and descriptions of the MgT2e-specific skill list instead of having to start from the Cepheus ones.
I just took a quick look.

The short version: yes, this looks feasible. I've created more ways to transform data as I went through the core rulebook, and started extracting things like bulleted lists or regular paragraphs, although that can get a bit fiddlier (the current transformation system is functional, but at times restrictive).

In terms of extracting the data, I think the most feasible approach is to extract two types of table:
  1. A table containing all top-level skills, with columns:
    • skill name,
    • skill description.
  2. One table per top-level skill that has specialities, with columns:
    • speciality name,
    • description and examples of a skill check combined.
The "description" columns would, for the time being, lose newlines that might be regarded as significant parts of the typesetting. I might need to revisit how table transformations are performed more generally.

If the above sounds useful, I can try to take a crack at that to see how well it goes, free time permitting.
 
Sorry for the silence on this, it's been a busy week. I've created https://github.com/huin/travdata/issues/18 to track this.
 
Worked up a proof-of-concept macro to map equipment from this into items in Foundry. A finished product will require almost per-table logic, but nothing too complex. For instance, in a few cases like the armor table I tested on, it needs to know how to break up where the same item is listed across multiple tech levels, as well as knowing how to handle "(+x vs lasers only)" entries on the same line as normal protection... but it's do-able.
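
As a rough illustration of the "(+x vs lasers only)" handling, here is a minimal sketch. The cell format (a base value such as "+8", optionally followed by a parenthesised bonus) and the function name `parseProtection` are assumptions made for the example, not the actual macro or the exact text Travdata emits.

```javascript
// Sketch only: assumes a protection cell looks like "+8" or
// "+8 (+10 vs lasers only)". Real cells may differ per table.
function parseProtection(cell) {
  const pattern = /^\s*\+?(\d+)\s*(?:\(\+(\d+)\s+vs\s+([^)]+?)\s+only\))?\s*$/i;
  const match = pattern.exec(cell);
  if (!match) {
    // Unrecognised cells are left for per-table logic to deal with.
    return null;
  }
  const result = { base: Number(match[1]), conditional: null };
  if (match[2] !== undefined) {
    // The parenthesised part is kept separate from the base protection.
    result.conditional = { bonus: Number(match[2]), versus: match[3].trim() };
  }
  return result;
}
```

Returning `null` for anything unexpected keeps the per-table code in charge of odd entries rather than guessing at them.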

 
That's really cool, and exactly the sort of thing I wanted to see happen :)

(I use Foundry myself, but want to be VTT-agnostic with Travdata itself)

As for my work on Travdata, I'll be a little quiet on it for the next week or two, as I've got some backlogged life admin to work through (which typically means that I hack on bits of it during my work commutes only). I've got the skills extraction at the top of my mind, but as it turns out, I'm going to need to expand some capabilities of the program itself to make that work well.
 
As for the multiple-things-on-one-line bit, that's a bit of a pain - I think that's actually a mistake on my end. In hindsight I would probably have arranged for the armour table to have 3 separate rows for the Vacc suit. However, this is somewhat complicated by how the "Hostile Environment Vacc Suit" name is split over two lines but has separate line items for each of its 4 TLs, so it might need more consideration.

The work that I'm doing to make things work for the skills extraction will likely be a bit of a sea change in how things can be extracted, and hopefully allow extraction configurations to be released without needing to upgrade the Travdata program for every new configuration version.

Would it be useful to retain some newlines within the extracted CSV table cells? It's something that the CSV format supports, but some parsers might not support it well. I've also pondered if JSON would be a more appropriate output format (maybe even allowing for more structured output), but for now have stuck with the flat CSV output.
 
If by newlines you mean using them in place of the spaces where there are multiple inline values, it doesn't really matter for my uses, as long as it's something I can do a clean string.split() on.

JSON would be great for folks working with JavaScript; probably not so much for those who want to load this info into a spreadsheet.
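
To make the string.split() point concrete, here is a small sketch that expands one row with several inline values into one row per value. The function name `expandRow`, the column indices, and the sample values are hypothetical. Splitting on any whitespace means it behaves the same whether the separator is a space or a newline.

```javascript
// Sketch only: expands a row whose cells hold several whitespace-separated
// values (for example one value per tech level) into one row per value.
// multiValueColumns lists the (hypothetical) column indices to split.
function expandRow(row, multiValueColumns) {
  const parts = row.map((cell, i) =>
    multiValueColumns.includes(i) ? cell.trim().split(/\s+/) : [cell]
  );
  const count = Math.max(...parts.map((p) => p.length));
  const rows = [];
  for (let n = 0; n < count; n++) {
    // Single-valued cells (such as the item name) repeat on every row;
    // missing values in shorter cells become empty strings.
    rows.push(parts.map((p) => (p.length === 1 ? p[0] : p[n] ?? "")));
  }
  return rows;
}
```

For example, `expandRow(["Example Suit", "8 10 12", "4\n10\n14"], [1, 2])` yields three rows, each starting with "Example Suit".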
 
Will download the latest update and work the changes into my Traveller tool. This is exactly what I need to pull data from the PDFs to grant access to application functionality.

Anything in JSON is preferred as it is easily incorporated into the tool.
 
Agreed on supporting those wanting to load data into a spreadsheet; that's an aim alongside supporting more general programmatic access, and I'd like to support both. I vaguely have in mind that a "default" JSON format could have the structure `string[][]` (in TypeScript notation; Python: `list[list[str]]`, Rust: `Vec<Vec<String>>`), which exactly mirrors the CSV structure, just with a different file format. But there are occasions where JSON might lend itself to something more deeply structured. However, for now this is just speculative.
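
A minimal sketch of what that could look like on the consuming side, assuming the table has already been parsed into an array of rows with the header as the first row. Both function names are hypothetical and nothing here is an existing Travdata feature.

```javascript
// Sketch only: "rows" is the string[][] shape described above,
// with the header as the first row.

// The "default" JSON form is just the rows serialised as-is.
function rowsToJson(rows) {
  return JSON.stringify(rows);
}

// A consumer wanting keyed records can derive them from the header row.
function rowsToRecords(rows) {
  const [header, ...body] = rows;
  return body.map((row) =>
    Object.fromEntries(header.map((name, i) => [name, row[i] ?? ""]))
  );
}
```

The flat form round-trips to CSV without loss, so spreadsheet users are no worse off, while anything more structured can be layered on top.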
 
CSV works and likely is the best bang for the buck. JSON would be nice, but if I can get it into a CSV, I can put it in any other format.

Thanks for the continued work on this!
 
Okay, some good news. I managed to make some headway on life admin, and have some code on an experimental branch that has some success in extracting skills and specialities into a single table, with headings:
  • Skill (name of the skill)
  • Speciality (name of the speciality)
  • Description (description of the skill or speciality)
I've omitted examples of a skill or speciality, as it was fiddly to reliably distinguish an example for the skill from an example for its last speciality.

The "Skill" and "Speciality" columns are mutually exclusive (one or other is populated, but not both on the same row). I've coded this in such a way that a JSON version could also be exported in future with a structure:

{"name": "Skill Name", "desc": "Skill description...", "specs": [{"name": "First Speciality Name", "desc": "First speciality description..."}, ...]}
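
For anyone consuming the flat table in the meantime, the nesting can be rebuilt on the reader's side. This is a sketch under two assumptions: the header row has already been removed, and each speciality row follows the row of the skill it belongs to. The function name `nestSkills` is made up for the example.

```javascript
// Sketch only: rows are [skill, speciality, description] with the header
// removed. Assumes speciality rows come directly after their parent skill.
function nestSkills(rows) {
  const skills = [];
  for (const [skill, speciality, desc] of rows) {
    if (skill) {
      // A populated "Skill" cell starts a new top-level entry.
      skills.push({ name: skill, desc, specs: [] });
    } else if (speciality && skills.length > 0) {
      // A populated "Speciality" cell attaches to the most recent skill.
      skills[skills.length - 1].specs.push({ name: speciality, desc });
    }
  }
  return skills;
}
```

Because the two name columns are mutually exclusive, checking which one is populated is enough to tell the row types apart.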

This has required me to use some new technology (I embedded JavaScript into the extraction configuration), and it feels potentially fragile if the input text changes in structure (although this is always a risk with this project). However, it seems to be working pretty well for now. Will need a few more days to polish things up and eyeball the data to check if it looks sane.

If it looks good, then this will likely be a v0.6.0 release.
 
And some annoying news... Ran into a crash with the new build under the GUI. That'll take a bit of investigation to find possible cause and fix.
 
Released version 0.6.0.

Full Changelog: https://github.com/huin/travdata/compare/v0.5.0...v0.6.0

Highlights

  • Significantly breaks compatibility with old configurations.
  • Support ECMAScript (JavaScript) based table transforms, which should have multiple benefits in flexibility moving forwards (does increase the program download size by approximately 10MiB).
  • New extractions from core rulebook 2022:
    • Support extraction of skills and specialities.
    • Support extraction of skill packages.
  • Caching of previous extraction data from Tabula (which is one of the slowest interactions currently). This probably won't affect most users unless they are repeatedly performing extractions to experiment with adjusting the extraction configuration.
 