The Problem: 50 States, 50 Formats
Every year, each state publishes an Intended Use Plan (IUP) for its Drinking Water and Clean Water State Revolving Fund programs. These documents list the specific water infrastructure projects a state plans to fund: project names, utility names, funding amounts, project descriptions, priority scores, and timelines. For anyone trying to understand where water infrastructure dollars are going, IUPs are the single most valuable primary source.
There is just one problem: there is no federal standard for how these documents should be formatted. The EPA requires states to submit IUPs, but the format is entirely at each state's discretion. The result is a landscape of 100+ documents per year (DWSRF and CWSRF for each state, plus amendments) in wildly different formats.
Some states publish clean, digitally-native PDFs with well-structured tables. Others publish scanned images of printed spreadsheets. A few publish what appear to be Word documents converted to PDF with inconsistent column alignment. At least one state we encountered publishes its project list as a series of paragraphs rather than a table.
When we set out to build United Current's project tracking database, we knew that parsing these documents reliably and at scale would be one of our hardest engineering challenges. Here is how we approached it.
Stage 1: Document Acquisition and Classification
The first step is simply getting the documents. Not all states publish their IUPs in the same location. Some post them on dedicated SRF program pages; others bury them in board meeting agendas; a few require you to navigate through a document management system with session-based URLs that break if you try to bookmark them.
We built a document acquisition pipeline that checks known URLs on a scheduled basis, detects when new documents or amendments have been posted, and downloads them into our processing queue. For states with particularly unreliable web infrastructure, we supplement automated checks with manual monitoring.
Once a document is acquired, we classify it into one of several format categories:
- Digitally-native PDF with tagged tables: the best case. These have actual table structures embedded in the PDF that can be extracted programmatically.
- Digitally-native PDF with visual tables: the text is selectable, but there are no tagged table structures. The columns are aligned visually using spaces or tab characters.
- Scanned image PDF: the document is a raster image, typically from a scanner. No text is selectable. Requires OCR.
- Mixed format: some pages are digital, others are scanned. Or the narrative sections are digital but the project tables are scanned appendices.
- Non-standard layout: project data is embedded in paragraphs, footnotes, or other non-tabular formats.
Classification is done using a combination of PDF metadata inspection, text extraction attempts, and image analysis. If a page yields no extractable text, we flag it as a scanned image. If text extraction succeeds but produces no coherent table structure, we flag it for visual table processing.
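The decision logic above can be sketched as a small classifier over per-page extraction signals. This is an illustrative simplification, not our production classifier; the `PageSignals` fields and category names are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class PageSignals:
    """Extraction signals gathered for one PDF page (field names are illustrative)."""
    char_count: int      # characters returned by a text-extraction attempt
    tagged_tables: int   # tables found via embedded (tagged) table structure
    looks_tabular: bool  # heuristic: whitespace-aligned columns of text

def classify_document(pages: list[PageSignals]) -> str:
    """Map per-page signals to one of the format categories described above."""
    scanned = [p for p in pages if p.char_count == 0]
    digital = [p for p in pages if p.char_count > 0]
    if scanned and digital:
        return "mixed"
    if scanned:
        return "scanned-image"
    if any(p.tagged_tables > 0 for p in digital):
        return "digital-tagged-tables"
    if any(p.looks_tabular for p in digital):
        return "digital-visual-tables"
    return "non-standard-layout"
```

A document with one extractable page and one image-only page lands in the "mixed" bucket, which routes it through both the digital and OCR pathways.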
Stage 2: Text and Table Extraction
For digitally-native PDFs, we use a combination of pdfplumber and custom heuristics to extract table data. The library works well when tables have clear cell boundaries, but many state IUPs use borderless tables in which columns are defined only by whitespace alignment. For these, we built a column-detection algorithm that analyzes the horizontal position of text elements across multiple rows to infer column boundaries.
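The core idea can be sketched in a few lines: cluster the x-positions where words start across many rows, keep clusters that recur in enough rows, and treat those as column boundaries. This is a minimal sketch under assumed inputs (word-level `(x_start, text)` tuples, as a PDF word extractor would produce); the real algorithm handles wrapped lines and merged cells, which this version does not.

```python
def infer_column_starts(rows, tolerance=4):
    """Infer column boundary x-positions from word start positions.

    `rows` is a list of rows; each row is a list of (x_start, text)
    tuples. Nearby x-positions (within `tolerance` points) are merged,
    and clusters that recur across at least half the rows are kept.
    """
    xs = sorted(x for row in rows for x, _ in row)
    clusters = []
    for x in xs:
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    threshold = max(2, len(rows) // 2)
    return [min(c) for c in clusters if len(c) >= threshold]

def assign_columns(row, starts, tolerance=4):
    """Place each word in the rightmost column whose start it reaches."""
    cells = [[] for _ in starts]
    for x, text in row:
        idx = max((i for i, s in enumerate(starts) if x >= s - tolerance), default=0)
        cells[idx].append(text)
    return [" ".join(c) for c in cells]
```

Note the failure mode described above: a wrapped description line starts at an x-position that matches no cluster, so this naive version would dump it into the first column.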
This sounds straightforward, but the edge cases are extensive. A single project description that wraps across two lines can break column detection if the second line's starting position does not match the expected column boundary. Merged cells, footnote markers, and inconsistent use of abbreviations all create noise.
For scanned image PDFs, we run Tesseract OCR with custom preprocessing. Our preprocessing pipeline includes deskewing (many scanned documents are slightly rotated), adaptive thresholding to handle uneven lighting or faded print, and resolution upscaling for low-quality scans. After OCR, we apply the same column-detection heuristics as for visual tables, but with additional error correction to handle OCR artifacts.
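One concrete piece of that post-OCR error correction: numeric fields, where OCR reliably confuses a handful of characters. The substitution table below is illustrative, not our full correction model, and `repair_amount` is a hypothetical helper name.

```python
import re

# Common single-character OCR confusions seen in numeric columns
# (safe to apply here because the field is known to be an amount).
OCR_DIGIT_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"})

def repair_amount(raw: str):
    """Repair an OCR'd funding amount like '$l,2OO,000'.

    Returns an int, or None when the field is unrecoverable and should
    be routed to human review instead of silently guessed.
    """
    cleaned = raw.strip().translate(OCR_DIGIT_FIXES)
    cleaned = re.sub(r"[$,\s]", "", cleaned)
    return int(cleaned) if cleaned.isdigit() else None
```

The key design choice is context-dependence: "S" to "5" is a safe substitution inside a dollar amount and a destructive one inside a utility name, which is why corrections are applied per-column, after column detection.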
The OCR stage is where we see the most quality variation. A cleanly-scanned document from a modern multifunction printer might yield 98%+ character accuracy. A faded photocopy of a dot-matrix printout from a state that has not updated its process since the 1990s might yield 85% accuracy, requiring significant post-processing.
Stage 3: Entity Resolution
Extracting text from a table is only half the battle. The next challenge is entity resolution: mapping the extracted text to known entities in our database.
Consider utility names. The same utility might appear as:
- "City of Springfield Water Department"
- "Springfield, City of"
- "Springfield Municipal Water"
- "Springfield WD"
- "SPRINGFIELD W.D."
These all refer to the same entity, but a naive text match would treat them as five different utilities. We need to resolve them to a single canonical record.
Our entity resolution pipeline uses a multi-signal approach. We start with fuzzy string matching using Jaro-Winkler distance and token-based similarity. We then layer in geographic signals: if the IUP is from Illinois and the utility name contains "Springfield," we can narrow the candidate set significantly. We cross-reference against EPA's Safe Drinking Water Information System (SDWIS) identifiers when available, and against our own curated database of utility names and aliases.
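The token-based similarity signal can be sketched as normalization plus set overlap. The stopword and abbreviation tables below are illustrative fragments, not our production lists, and the real pipeline combines this score with Jaro-Winkler distance and the geographic and SDWIS signals described above.

```python
import re

# Illustrative fragments of the normalization tables.
STOPWORDS = {"city", "of", "the", "municipal", "department", "dept"}
ABBREVIATIONS = {"wd": "water", "w.d.": "water"}

def normalize_utility(name: str) -> frozenset:
    """Reduce a utility name to a canonical set of meaningful tokens."""
    tokens = re.findall(r"[a-z.]+", name.lower())
    tokens = [ABBREVIATIONS.get(t, t.strip(".")) for t in tokens]
    return frozenset(t for t in tokens if t and t not in STOPWORDS)

def token_similarity(a: str, b: str) -> float:
    """Jaccard similarity over normalized token sets."""
    ta, tb = normalize_utility(a), normalize_utility(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

Run against the five Springfield variants above, all of them normalize to the same token set {"springfield", "water"} (or a subset of it), which is what lets them resolve to one canonical record.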
For project descriptions, we use a lightweight NLP classification model to extract structured attributes from free-text descriptions. A project described as "Replace 2,400 LF of 8-inch cast iron water main on Oak Street" needs to be tagged with attributes like project type (pipe replacement), material (cast iron), pipe diameter (8 inches), and length (2,400 linear feet). These attributes feed into our search and filtering interfaces.
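To make the target schema concrete, here is a regex-based sketch that extracts those attributes from the example description. The production system uses a classification model rather than regexes, and the field names and pattern list are assumptions for illustration.

```python
import re

# Patterns covering the example description; a real extractor needs
# far broader coverage (units, materials, project-type vocabulary).
PATTERNS = {
    "length_lf": r"([\d,]+)\s*LF",
    "diameter_in": r"(\d+)-inch",
    "material": r"(cast iron|ductile iron|PVC|HDPE|asbestos cement)",
}

def extract_attributes(description: str) -> dict:
    attrs = {}
    for field, pattern in PATTERNS.items():
        m = re.search(pattern, description, re.IGNORECASE)
        if m:
            attrs[field] = m.group(1)
    if re.search(r"\breplace", description, re.IGNORECASE) and "main" in description.lower():
        attrs["project_type"] = "pipe replacement"
    return attrs
```

The output for the Oak Street example is exactly the structured record that feeds the search and filtering interfaces: length, diameter, material, and project type as queryable fields rather than free text.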
Stage 4: Validation and Quality Assurance
Every parsed document goes through an automated validation step before entering our production database. Validation checks include:
- Row count verification: does the number of extracted projects match the total count stated in the IUP narrative?
- Financial reconciliation: does the sum of individual project amounts match the stated total funding amount?
- Entity coverage: what percentage of extracted utility names resolved to known entities versus flagged as new or ambiguous?
- Field completeness: are there rows with missing critical fields (utility name, funding amount, project description)?
Documents that fail validation are routed to a human review queue where a team member inspects the extraction results against the source PDF, corrects errors, and feeds corrections back into our parsing models as training data.
We track validation metrics per state over time. This lets us detect when a state changes its IUP format (which happens more often than you might expect) and prioritize parser updates accordingly.
Stage 5: Change Detection and Amendment Tracking
IUPs are not static documents. States publish amendments throughout the year as projects are added, removed, or modified. A project that appeared on the initial priority list might be removed in an amendment because the utility withdrew its application, or a new project might be added because emergency funding was approved.
We run diff analysis between successive versions of each state's IUP to detect additions, removals, and modifications. This is non-trivial because amendments sometimes change the table format, reorder rows, or renumber priority rankings. Our diff algorithm operates at the entity level rather than the text level, comparing resolved utility-project pairs rather than raw strings.
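At its core, an entity-level diff is set arithmetic over resolved keys. The sketch below assumes each IUP version has already been reduced to a mapping from a resolved `(utility_id, project_id)` pair to that project's attributes; the key structure and field names are illustrative.

```python
def diff_iup(old: dict, new: dict) -> dict:
    """Entity-level diff between two versions of a state's IUP.

    Because keys are resolved entities rather than raw table rows,
    reordered rows and renumbered priority rankings do not register
    as spurious additions or removals.
    """
    old_keys, new_keys = set(old), set(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
        "modified": sorted(k for k in old_keys & new_keys if old[k] != new[k]),
    }
```

A project whose funding amount changed in an amendment shows up under "modified" even if the amendment also reformatted the entire table, which is the whole point of diffing after entity resolution instead of before.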
The Payoff
All of this engineering work serves a single purpose: giving our users a clean, searchable, continuously-updated database of funded water infrastructure projects across every state. When a vendor searches for "membrane filtration projects in the Southeast with funding over $1 million," they get results drawn from dozens of state IUP documents that were published in different formats, parsed through different extraction pathways, and unified into a single data model.
It is not glamorous work. Parsing PDFs rarely is. But it is the foundation that makes actionable water infrastructure intelligence possible. Every state that publishes its IUP as a structured, machine-readable dataset makes our job easier and makes the water sector more transparent. We would love to see more states move in that direction, but until they do, we will keep parsing.
Related Resources
- Data Coverage → — See the full scope of our 30+ data sources, including every state IUP we parse.
- How Much BIL Funding Has Actually Reached Water Utilities? — IUP data is how we track BIL disbursements at the project level.
- United Current Now Covers 5 States with Deep Data — Our coverage expansion story, now with deep data in PA, NJ, NY, and CA plus 7 expanding states.
- Careers at United Current → — We're building the data engineering team behind this pipeline. Come help us parse the unparseable.
- Product Overview → — See how parsed IUP data flows through our three-layer intelligence pipeline.
