Monday, February 15, 2010

The agonizing grind........

Now that 2.0 is in public testing, I switched back to looking at the parser. There have been a number of posts about random pieces of compendium data which do not scrape correctly. I decided to investigate these by attempting to scrape every published book and seeing what breaks and why.

Since the compendium feature is a new component of the parser, I never went back and ran serious testing against all the modules I had completed. In fact, I've only been using it on material published after Divine Power. Any publication that predates that I parsed using either a purchased PDF or OCR from my purchased printed copy. This means several of the "big" books (PHB, MM, etc.) never got the full testing.

That's not to say I ran no testing on them, but I typically ran a sampling, because extracting and processing something like the entire Adventurer's Vault is a pain. It's even more of a pain when your extraction process is still a bit suspect and you know you're going to have to do it more than once.

Anyhow, I decided to bite the bullet and go through each and every book and note what's broken, what I can fix, and what's beyond my control. I've been going in alphabetical order (for lack of a better system) and I am currently on the Player's Handbook and Player's Handbook 2. Both had issues with certain rituals which I think are fixed. I still need to re-scrape/parse them both and make sure the issue is taken care of and I didn't break anything else along the way.

I'm hopeful that I'll be able to complete the process this week.

11 comments:

Olodrin said...

Fantastic!

Unknown said...

Excellent! This is really great news. Thank you very much. I started to play via FG II because of your excellent parser. Best regards, ShadoWWW

Unknown said...

Just out of curiosity, how do you go about cleaning up the scraped information?

I messed around with doing my own version some time ago, but never went beyond one type of book.

Lowlander said...

Heya. I might have missed it somewhere, but would you consider adding the "Output for 1.5.1" and "Root version 2.2" switches to either the command line or the settings file?

Lowlander said...

Hey Todd,

I spent some time doing this with PDFs as well, but RL demands cut my time short. What I used for cleaning up text were regular expressions. They are very powerful. I had an XML file where I listed pairs of find-replace regular expressions for things I encountered. I could copy/paste the MM and automatically clean it up pretty much to the point that it was ready for parsing.

If you added some sort of compendium entry identifier to the find/replace pairs, you could make a flexible file for fixing specific errors in the compendium while scraping, and if the entry got fixed you'd just remove the corresponding XML element.
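
A minimal sketch of the rules file described above might look like the following. It is only an illustration: the `<fixes>`/`<fix>` element names, the `entry`, `find`, and `replace` attributes, and the `power-123` identifier are all hypothetical, not anything the parser actually uses.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules file. Each <fix> holds one find/replace regex pair.
# A rule with an "entry" attribute applies only to that compendium entry;
# a rule without one applies to every entry.
RULES_XML = r"""
<fixes>
  <fix find="\s+" replace=" "/>
  <fix entry="power-123" find="Close brust" replace="Close burst"/>
</fixes>
"""


def load_rules(xml_text):
    """Parse the rules XML into (entry_id, compiled_pattern, replacement) tuples."""
    rules = []
    for fix in ET.fromstring(xml_text.strip()).iter("fix"):
        rules.append((fix.get("entry"), re.compile(fix.get("find")), fix.get("replace")))
    return rules


def clean(entry_id, text, rules):
    """Apply every global rule, plus the rules targeted at this entry, in file order."""
    for rule_entry, pattern, replacement in rules:
        if rule_entry is None or rule_entry == entry_id:
            text = pattern.sub(replacement, text)
    return text


rules = load_rules(RULES_XML)
print(clean("power-123", "Close  brust 5", rules))
```

Once an entry is corrected at the source, deleting its `<fix>` element is enough to retire the workaround.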

J said...

Cleaning scrape output is mostly figuring out what new and strange HTML they decided to add to the compendium. If you actually look at the powers in Arcane Power, for example, you'll find the compendium uses 2 different formats of HTML to render powers that look the same.

Why? I don't know. Weak input standards?

Here's another one: some items have "Added", "Deleted", or "Revision" followed by a date to indicate the entry was changed. The scrape just ignores these (because I really don't want to see the edit history in FGII). This was all working fine.....until they decided to add "Updated" out of the blue. How is it different than "Revision"? Only WOTC knows.

There are other puzzling and random things, like the NPC stat blocks for mounts missing a tag that is present in all other NPC blocks.

Cleaning the output was basically a matter of going through these cases and correcting them whenever possible. It's mostly a frustrating, time-consuming process that has me staring at 2 blocks of HTML trying to figure out what is different.
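
As an illustration of the change-marker case above, here is a rough sketch of stripping those markers with one regular expression. The exact marker layout is an assumption (keyword followed by a date in parentheses such as `(11/17/2009)`); the real compendium markup may differ, and this is not the parser's actual code.

```python
import re

# Keywords named in the comment above. Keeping them in one list means a new
# keyword like "Updated" is a one-line change.
CHANGE_KEYWORDS = ["Added", "Deleted", "Revision", "Updated"]

# Assumed marker shape: keyword, optional whitespace, then a date in
# parentheses, e.g. "Revision (11/17/2009)". This format is a guess.
CHANGE_MARKER = re.compile(
    r"\s*\b(?:" + "|".join(CHANGE_KEYWORDS) + r")\s*\(\d{1,2}/\d{1,2}/\d{2,4}\)"
)


def strip_change_markers(text):
    """Remove edit-history markers so they never reach the parsed output."""
    return CHANGE_MARKER.sub("", text)


print(strip_change_markers("Cleave Revision (11/17/2009)"))
```

Requiring the date after the keyword keeps ordinary rules text that happens to contain a word like "Added" from being removed.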

Ablefish said...

Just a quick thought for those 'comparison' sessions - I use Notepad++ for my scripting at work here and it has a comparison plugin that does a nice job of visually displaying what's different between two different blocks of code.
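
The same kind of comparison can be scripted. The sketch below uses Python's standard `difflib` module as a stand-in for the Notepad++ plugin; the two HTML snippets are invented examples, not real compendium output.

```python
import difflib


def html_diff(block_a, block_b):
    """Return unified-diff lines showing where two blocks of HTML differ."""
    return list(
        difflib.unified_diff(
            block_a.splitlines(),
            block_b.splitlines(),
            fromfile="block_a",
            tofile="block_b",
            lineterm="",
        )
    )


# Invented snippets: the second one is missing the <b> tag.
normal = "<div>\n<b>Speed</b> 6\n</div>"
mount = "<div>\nSpeed 6\n</div>"

for line in html_diff(normal, mount):
    print(line)
```

Lines prefixed with `-` appear only in the first block and lines prefixed with `+` only in the second, so a missing tag stands out immediately.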

Hugh Walters said...

Thanks for this, your effort is appreciated. I just managed a full scrape and Parse of PHB1. It now works great! I can drag and drop powers into my character sheet.

Though I did have to find a free XML-checking editor to go through and correct the "\" errors in the names. It's an excellent program; you can get it from

www.philo.de/xmledit/

I gave up on PHB2. There are about 100 errors relating to expecting an "=" sign.

I may post an overview of using the app on the forums. There's no general overview in one place of how it works, why it does it the way it does, and where the files are. I had to pick up the info from about 10 different posts. I was totally confused at first, since it's really a scraper and parser in one app, and it's probably not obvious to most people that it needs to read in the text files it generated itself from the "scrape" phase.

I was expecting it to Parse the DDI info directly into a module at first.

I can appreciate how maddening it is when WOTC "adjust" something, only to then break the app. WOTC need to think about some sort of API key and some standards like EVE Online uses.

3rd parties there have really made some amazing apps, but the EVE devs are happy to help. WOTC really need to work out some decent strategy regarding their IP, or some roadmap. At the moment it appears they are making it up as they go along.

Thanks to people like yourself (iplay4e also) and the devs at FG2 we have a fighting chance at a decent VTT experience.

Robaticus said...

This is awesome and has saved me countless hours manually transcribing from the compendium.

One suggestion: give an option to turn off the verbose reporting. I've found this can increase the amount of time that it takes to process data by 2-3 times. Either that, or throw the updates on a different thread to free up the main thread to process at full speed.

J said...

Actually, the scrape routine has an intentionally created delay. I did this so the program wouldn't pound the WOTC webservers with thousands of requests.

This delay is the main slow down in working with the compendium.
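
A deliberate pause between requests can be sketched as below. This is an illustration of the idea only, not the scraper's code: the `fetch_all` function, the injectable `fetch` and `sleep` callables, and the one-second default are all assumptions.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests to spare the server.

    `fetch` is any callable that takes a URL and returns its content.
    `delay_seconds` is an assumed value; the real delay is not stated.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Usage with a stand-in fetcher and a short delay so the demo runs quickly.
pages = fetch_all(["page-1", "page-2"], fetch=lambda url: url.upper(), delay_seconds=0.1)
print(pages)
```

With thousands of entries, the total wait grows linearly with the number of requests, which is why the delay dominates the scrape time rather than the reporting.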

Robaticus said...

Ah, makes sense. Thanks!