Travdata - getting data out of the PDFs

huin80 · Jun 25, 2024

drl2 said:
I don't suppose the list of skills lends itself to being extracted? Would be nice to just run a macro in Foundry to pull in names and descriptions of the MgT2e-specific skill list instead of having to start from the Cepheus ones.

I just took a quick look.

The short version: yes, this looks feasible. I've created more ways to transform data as I went through the core rulebook, and started extracting things like bulleted lists or regular paragraphs, although that can get a bit fiddlier (the current transformation system is functional, but at times restrictive).

In terms of extracting the data I can see it being most feasible to extract two types of table:

A table containing all top-level skills, with columns:
- skill name,
- skill description.
One table per top-level skill that has specialities, with columns:
- speciality name,
- description and examples of a skill check combined.

The "description" columns would, for the time being, lose newlines that might be regarded as significant parts of the typesetting. I might need to revisit how table transformations are performed more generally.

If the above sounds useful, I can try to take a crack at that to see how well it goes, free time permitting.

huin80 · Jun 28, 2024

huin80 said:
I just took a quick look.

The short version: yes, this looks feasible. I've created more ways to transform data as I went through the core rulebook, and started extracting things like bulleted lists or regular paragraphs, although that can get a bit fiddlier (the current transformation system is functional, but at times restrictive).

In terms of extracting the data I can see it being most feasible to extract two types of table:

A table containing all top-level skills, with columns:
- skill name,
- skill description.
One table per top-level skill that has specialities, with columns:
- speciality name,
- description and examples of a skill check combined.

The "description" columns would, for the time being, lose newlines that might be regarded as significant parts of the typesetting. I might need to revisit how table transformations are performed more generally.

If the above sounds useful, I can try to take a crack at that to see how well it goes, free time permitting.

Sorry for the silence on this, it's been a busy week. I've created https://github.com/huin/travdata/issues/18 to track this.

drl2 · Jul 5, 2024

Worked up a proof-of-concept macro to map equipment from this into items in Foundry. A finished product will require almost per-table logic, but nothing too complex. For instance, in a few cases like the armor table I tested on, it needs to know how to break up where the same item is listed across multiple tech levels, as well as knowing how to handle "(+x vs lasers only)" entries on the same line as normal protection... but it's do-able.
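The two armour-table cases mentioned above (one item listed across several tech levels, and a "(+x vs lasers only)" note sharing a cell with the normal protection value) could be handled roughly as follows. This is only a sketch in Python rather than a Foundry macro: the function names, the whitespace-separated cell layout, and every value in the usage note are assumptions for illustration, not the actual Travdata output or the macro described in the post.

```python
import re

# Hypothetical pattern for a laser-only bonus sharing a cell with normal protection.
LASER_RE = re.compile(r"\(\+(\d+) vs lasers only\)")


def parse_protection(cell: str) -> dict:
    """Split one protection cell into a base value and an optional laser-only bonus."""
    laser = LASER_RE.search(cell)
    base = LASER_RE.sub("", cell).strip()
    return {
        "protection": int(base),  # int() accepts a leading "+"
        "laser_bonus": int(laser.group(1)) if laser else 0,
    }


def expand_row(name: str, tl_cell: str, protection_cell: str) -> list[dict]:
    """Turn one row listing several tech levels into one record per tech level.

    Assumes both cells hold the same number of whitespace-separated values.
    """
    tls = tl_cell.split()
    protections = protection_cell.split()
    if len(tls) != len(protections):
        raise ValueError(f"mismatched cell lengths for {name!r}")
    return [
        {"name": name, "tl": int(tl), "protection": int(prot)}
        for tl, prot in zip(tls, protections)
    ]
```

For example, `expand_row("Example Suit", "8 10 12", "+4 +8 +10")` yields three records, one per tech level, and `parse_protection("+2 (+6 vs lasers only)")` separates the two numbers. All of these inputs are made up.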

huin80 · Jul 5, 2024

drl2 said:
Worked up a proof-of-concept macro to map equipment from this into items in Foundry. A finished product will require almost per-table logic, but nothing too complex. For instance, in a few cases like the armor table I tested on, it needs to know how to break up where the same item is listed across multiple tech levels, as well as knowing how to handle "(+x vs lasers only)" entries on the same line as normal protection... but it's do-able.

View attachment 1984

That's really cool, and exactly the sort of thing I wanted to see happen.

(I use Foundry myself, but want to be VTT-agnostic with Travdata itself)

As for my work on Travdata, I'll be a little quiet on it for the next week or two, as I've got some backlogged life admin to work through (which typically means that I hack on bits of it during my work commutes only). I've got the skills extraction at the top of my mind, but as it turns out, I'm going to need to expand some capabilities of the program itself to make that work well.

huin80 · Jul 5, 2024

drl2 said:
Worked up a proof-of-concept macro to map equipment from this into items in Foundry. A finished product will require almost per-table logic, but nothing too complex. For instance, in a few cases like the armor table I tested on, it needs to know how to break up where the same item is listed across multiple tech levels, as well as knowing how to handle "(+x vs lasers only)" entries on the same line as normal protection... but it's do-able.

View attachment 1984

As for the multiple-things-on-one-line bit, that's a bit of a pain, and I think it's actually a mistake on my end. In hindsight I would probably have arranged for the armour table to have 3 separate rows for the Vacc suit. However, this is somewhat complicated by how the "Hostile Environment Vacc Suit" name is split over two lines, yet has separate line items for each of its 4 TLs, so it might need more consideration.

The work that I'm doing to make things work for the skills extraction will likely be a bit of a sea change in how things can be extracted, and hopefully allow extraction configurations to be released without needing to upgrade the Travdata program for every new configuration version.

Would it be useful to retain some newlines within the extracted CSV table cells? It's something that the CSV format supports, but some parsers might not support it well. I've also pondered if JSON would be a more appropriate output format (maybe even allowing for more structured output), but for now have stuck with the flat CSV output.
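To illustrate the question about newlines inside CSV cells, here is a minimal sketch using Python's standard `csv` module. It shows a cell with embedded newlines surviving a write and read round trip, because the writer quotes it. The table contents are made up for the example and are not real extracted data.

```python
import csv
import io

rows = [
    ["Name", "TL"],
    ["Example Suit", "8\n10\n12"],  # made-up cell containing embedded newlines
]

# newline="" stops the text layer from translating line endings,
# which is what the csv module documentation recommends.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Read the same text back; the quoted cell is returned as a single field.
buffer.seek(0)
parsed = list(csv.reader(buffer))

# A consumer can then split the multi-value cell cleanly.
tech_levels = parsed[1][1].split("\n")
```

Parsers that read a CSV file line by line without honouring quotes would break such a cell into several rows, which is the compatibility risk mentioned above.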

drl2 · Jul 6, 2024

huin80 said:
Would it be useful to retain some newlines within the extracted CSV table cells? It's something that the CSV format supports, but some parsers might not support it well. I've also pondered if JSON would be a more appropriate output format (maybe even allowing for more structured output), but for now have stuck with the flat CSV output.

If with the newlines you're referring to using them in place of the spaces where there are multiple inline values, it doesn't really matter for my uses as long as it's something I can do a clean string.split() on.

JSON would be great for folks working with JavaScript; probably not so much for those who want to load this info into a spreadsheet.

Talon Greyfeather · Jul 6, 2024

Will download the latest update and work the changes into my Traveller tool. This is exactly what I need to pull data from the PDFs to grant access to application functionality.

Anything in JSON is preferred as it is easily incorporated into the tool.

huin80 · Jul 6, 2024

Agreed on supporting those wanting to load data into a spreadsheet, as that's an aim, not just supporting more general programmatic access. I'd like to support both. I vaguely have in mind that a "default" JSON format could have the structure `string[][]` (in JS typing) (Python: `list[list[str]]`, Rust: `Vec<Vec<String>>`) - which exactly mirrors the CSV structure, just with a different file format. But there are occasions where JSON might lend itself to something more deeply structured. However, for now this is just speculative.
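As a rough illustration of the "default" JSON idea, the sketch below converts CSV text into a JSON array of arrays of strings (`list[list[str]]`), so that the JSON mirrors the CSV exactly. The function name `csv_to_json_rows` is hypothetical and this is not part of Travdata.

```python
import csv
import io
import json


def csv_to_json_rows(csv_text: str) -> str:
    """Convert CSV text to JSON with the shape list[list[str]]."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    rows: list[list[str]] = list(reader)
    return json.dumps(rows, indent=2)
```

Because every cell stays a string and every row stays a list, a spreadsheet-oriented CSV and a script-oriented JSON file could be produced from the same extracted table without any extra schema.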

Talon Greyfeather · Jul 6, 2024

CSV works and likely is the best bang for the buck. JSON would be nice, but if I can get it into a CSV, I can put it in any other format.

Thanks for the continued work on this!

huin80 · Jul 8, 2024

Okay, some good news. Managed to get some headway on life-admin, and have got some code on an experimental branch that has some success in extracting skills and specialities into a single table, with headings:

Skill (name of the skill)
Speciality (name of the speciality)
Description (description of the skill or speciality)

I've omitted examples of a skill or speciality, as it was fiddly to reliably distinguish an example for the skill itself from an example for its last speciality.

The "Skill" and "Speciality" columns are mutually exclusive (one or other is populated, but not both on the same row). I've coded this in such a way that a JSON version could also be exported in future with a structure:

{"name": "Skill Name", "desc": "Skill description...", "specs": [{"name": "First Speciality Name", "desc": "First speciality description..."}, ...]}

This has required me to use some new technology (I embedded JavaScript into the extraction configuration), and it feels potentially fragile if the input text changes in structure (although this is always a risk with this project). However, it seems to be working pretty well for now. Will need a few more days to polish things up and eyeball the data to check if it looks sane.

If it looks good, then this will likely be a v0.6.0 release.
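For anyone consuming the flat table, the relationship between its Skill / Speciality / Description rows and the nested JSON structure described above can be sketched as follows. This assumes that each speciality row follows the row of the skill it belongs to, which is a guess about row ordering and not something confirmed here. The function name `nest_skills` is hypothetical.

```python
def nest_skills(rows: list[dict[str, str]]) -> list[dict]:
    """Group flat Skill/Speciality/Description rows into nested skill records.

    Assumes speciality rows appear directly after their parent skill row.
    """
    skills: list[dict] = []
    for row in rows:
        if row["Skill"]:
            # A skill row starts a new record with an empty speciality list.
            skills.append(
                {"name": row["Skill"], "desc": row["Description"], "specs": []}
            )
        elif row["Speciality"]:
            if not skills:
                raise ValueError("speciality row found before any skill row")
            # A speciality row attaches to the most recent skill.
            skills[-1]["specs"].append(
                {"name": row["Speciality"], "desc": row["Description"]}
            )
    return skills
```

The rows could come from `csv.DictReader` over the extracted CSV, and passing the result to `json.dumps` would give output in the shape shown above.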

huin80 · Jul 9, 2024

And some annoying news... Ran into a crash with the new build under the GUI. That'll take a bit of investigation to find possible cause and fix.

huin80 · Jul 14, 2024

Released version 0.6.0.

Full Changelog: https://github.com/huin/travdata/compare/v0.5.0...v0.6.0

Highlights

Significantly breaks compatibility with old configurations.
Support ECMAScript (JavaScript) based table transforms, which should have multiple benefits in flexibility moving forwards (does increase the program download size by approximately 10MiB).
New extractions from core rulebook 2022:
- Support extraction of skills and specialities.
- Support extraction of skill packages.
Caching of previous extraction data from Tabula (which is one of the slowest interactions currently). This probably won't affect most users unless they are repeatedly performing extractions to experiment with adjusting the extraction configuration.

huin80 · Jul 16, 2024

Released version 0.6.1.

Full Changelog: v0.6.0...v0.6.1
Recommended update if you are using any previous version.

Highlights

Security fix: update dependency setuptools to 70.3.0, resolving CVE-2024-6345 for this project.
Update many other dependencies.

huin80 · Aug 11, 2024

I've held a release back for a while in case something else needed to fold into it, but figured that I'd release it. It's a pretty minor change, but does fix how armour equipment is extracted. Otherwise I've been slowly working on the Rust port.

If anyone would benefit from additional or tweaked configuration, let me know. I can set up configuration to extract from any of the following books, for which I own a copy of the PDF:

Central Supply Catalogue Update 2023
Field Catalogue
High Guard Update 2022
Referee's Briefing 1-3
Sector Construction Guide
Spinward Marches 1: The Bowman Arm
Spinward Marches 2: The Lunion Shield Worlds
Starship Operator's Manual
The Marches Adventures 1-5
Traveller Companion
Traveller Companion Update 2024
Vehicle Handbook
World Builder's Handbook
(some others as well, but the above are most likely of interest)

If any table would be of use from that, please let me know which book and which table (including page number).

Released version 0.6.2.

Full Changelog: v0.6.1...v0.6.2

This only changes the configuration, so anyone with a copy of 0.6.x can download the configuration only (recommended 0.6.1 or later).

Highlights:

fix: equipment/armour table

drl2 · Dec 14, 2024

I haven't tried it but... how badly will the new core rules update PDF break this?

huin80 · Dec 14, 2024

drl2 said:
I haven't tried it but... how badly will the new core rules update PDF break this?

I tested this a couple of days ago. It's broken multiple tables. I haven't done more than a quick check, so it's mostly a case of trying it and seeing if the tables you need are working or not.

Or use an older version if you have one available.

I'm mostly focused on the Rust port at the moment. If there are specific requests to fix specific tables, that's feasible. But a large scale fix might have to wait for me to improve my tooling.

drl2 · Dec 14, 2024

huin80 said:
I tested this a couple of days ago. It's broken multiple tables. I haven't done more than a quick check, so it's mostly a case of trying it and seeing if the tables you need are working or not.

Or use an older version if you have one available.

I'm mostly focused on the Rust port at the moment. If there are specific requests to fix specific tables, that's feasible. But a large scale fix might have to wait for me to improve my tooling.

I've kept the previous PDF version around so I'm good for the moment. Hoping to get some time over the next few weeks to make some progress on my macros, but I can use what I already have for that.

huin80 · Dec 15, 2024

drl2 said:
I've kept the previous PDF version around so I'm good for the moment. Hoping to get some time over the next few weeks to make some progress on my macros, but I can use what I already have for that.

Out of curiosity, what's the coverage of the macros, and what do they do? Is this data importing, or macros for gameplay in Foundry? Also, are they something you're planning to release to the community?

huin80 · Dec 15, 2024

To update on the status of this project:

I've mostly been focused on rewriting in Rust.
- This has progressed pretty well - now at the point where it can extract nearly identically to the Python version (the one tiny difference is arguably better in the Rust version).
- The code likely needs a tidy-up, but it's functioning, at least on my Linux development environment.
I need to work out how I'm going to distribute the Rust binary.
Largely spurred on by the new core rulebook 2022 release, I need to improve my tooling that allows me to configure table extraction from the GUI itself.
- It's currently a tedious process that involves copying data from one 3rd party tool into JSON files, and then editing a bunch of files, and re-testing the extraction.
- As it stands, the current process is a pain for me, and too much to ask other people to contribute fixes, improvements, and new extraction configurations.
Longer term, I'd like to get rid of the Java dependency, but this is likely a bigger task, as it would mean reimplementing parts of the Tabula project in Rust.

My development time is largely on my commute to work on a train, so not super productive overall. Slow and steady.

drl2 · Dec 24, 2024

huin80 said:
Out of curiosity, what's the coverage of the macros, and what do they do? Is this data importing, or macros for gameplay in Foundry? Also, are they something you're planning to release to the community?

Data import to create compendiums of skills and items (maybe more eventually) in Foundry from the core rulebook extract. I do plan to release them when they’re complete enough to be useful, but they’re one of 4 simultaneous programming projects I’m currently procrastinating on.
