Austin Campaign Contributions Excel or CSV Files Now Available

Background – Election Documents, Austin City Clerk, PDF, OCR, and more

Though Austin mayoral and council candidate campaign finance reports (CFRs) are available on the City Clerk’s website, the files are not easy to work with. Candidates bring printed copies to the Austin City Clerk, who scans the printouts and puts the resulting PDF files online for the public to view. The PDF files can be large, slow-scrolling, and difficult to read.

Unfortunately, there is not yet an electronic entry and submission process for Austin campaign data, though it has been discussed for years. One can find articles on that topic, going back well over a decade, from election transparency advocate and local Austin attorney Fred Lewis. Such a process would allow for the creation of CSV or Excel files for the public to use, or an online database searchable by field name. It would eliminate the huge problems that come with relying on OCR’d documents.

In 2012, the Austin City Council passed a resolution to create an online searchable and downloadable database for election documents and data by 2013 – a requirement which is still incomplete. Check out this article for more information on that effort. Ultimately, the resolution was repealed and replaced by an alternative idea to get the State to open up their software, though that idea stalled without any progress.

Though there is no election database that allows searching by individual fields, there is a way to search all election data – campaign finance reports, treasurer appointments, and so on – using the city website. I keep reading articles saying you cannot search Austin’s election data, so I suspect many people are unaware of this capability. Later in this article I will explain how to do it.

Contribution Data Now Available In CSV Format

Even with the searchable interface the city provides, there is still a need for Excel files (or CSV files) that an individual can download and manipulate – sort, search, and so on. While I have not been able to completely solve that problem, I have been able to create a partial solution that works for contribution data only, and only for some CFRs.

Below are links to .txt files that I have created for the 2014 election. Each file contains the contribution data (and only actual contribution data – no pledges, no expenditures) for all the readable (i.e., not handwritten) CFRs for all candidates during the given time period. (Note that in this article, I refer to these files as “the .txt files,” “the CSV files,” “the CSV data,” or some variation thereof.)

Here are the links to the .txt files:

Austin July 15th 2014 data

Austin 30 Days 2014 data

Austin 8 Days 2014 data

Update: I have produced an .xls of this data, saving people the headache of importing into Excel: all 3 contribution data files in a single .xls file (each data file as its own sheet)

In almost all cases, if a candidate starts off with a printed/typed report (which usually means he’s using the Ethics Commission software to produce his report), then he stays with that and does not switch over to handwritten reports at a later point. One notable exception is Jay Wiley, who is included in the July 15th .txt file but then switched to handwritten reports and is not included in the later files. Perhaps he did that to obscure his data some, perhaps he just got lazy, or perhaps his computer broke – who knows.

CSV Data File Format (.txt file format)

Each of the three .txt files I have provided is actually a CSV file (but using the vertical bar “|” as the separator rather than a comma). Most of the fields should be self-explanatory. The DOCID field is the document ID# that the Clerk assigns; you can find this number by hovering your mouse over a document link on the Clerk’s CFR page. The PAGE field is the page number of the CFR PDF file that the data came from. CFRs contain more than just Schedule A contribution data, so there will be pages in the PDF file that are not referenced by my .txt files (for example, the Schedule A data might be pages 4 through 29 of a 40-page PDF).
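
If you would rather pull one of these files into a script than into Excel, here is a minimal sketch in Python using pandas. The file name is a placeholder, and I am assuming the first row of the file is a header; check the actual column names before relying on them.

```python
# Minimal sketch: load one of the pipe-delimited .txt files with pandas.
# The file name below is a placeholder; point it at whichever file you downloaded.
import pandas as pd

contributions = pd.read_csv(
    "austin_2014_july15.txt",  # placeholder path to one of the .txt files
    sep="|",                   # these files use "|" as the separator, not a comma
    dtype=str,                 # keep everything as text; OCR noise breaks numeric parsing
    keep_default_na=False,     # leave blank fields as empty strings, not NaN
)

print(contributions.columns.tolist())  # inspect the actual field names (DOCID, PAGE, ...)
print(contributions.head())
```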

I attempt to grab all the data fields for each contribution, except for the two checkboxes, which are too small for the OCR to capture reliably (one is about out-of-state PACs and the other is about travel).

Errors, Stray Characters, and Gibberish In The Data

In order to get the data out of the Clerk’s PDFs and into a more usable format, I reprocessed the PDF files. First, I re-OCR’d every CFR PDF file in order to improve the quality of the Contributor Name data, applying a number of filters to correct skewed documents and other issues. Even with those adjustments, there are many, many errors in some parts of the data, and periodic errors in all of the data. Sometimes the separator lines and boxes from the form itself get OCR’d and mixed in with the candidate data. When this happens you will often see characters like these at the beginning or end of a field: l 1 I i . , – ~ ‘
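
If you want to clean up those edge characters in bulk rather than by hand, a rough Python sketch follows. The character set is taken from the list above and is only a starting point; note that aggressive stripping can also eat legitimate characters, so spot-check the results against the PDFs.

```python
# Rough cleanup sketch: trim OCR artifacts (form lines and box edges read in as
# characters) from the start and end of a field. The character set below comes
# from the list above; extend it as you find more noise in your copy of the data.
# Caution: stripping can also eat legitimate leading/trailing characters
# (e.g. a trailing "1" on an amount), so review what it changes.
STRAY_EDGE_CHARS = "lI1i.,-–~'‘ "

def clean_field(value: str) -> str:
    """Strip stray edge characters, leaving the middle of the field alone."""
    return value.strip(STRAY_EDGE_CHARS)

print(clean_field("‘l Smith, John ~"))  # -> "Smith, John"
```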

Despite the various data errors, most of the data is good enough to be useful. Or, with a few minutes of cleanup, you can significantly improve the data for any given report.

In order to help you deal with fields that have stray characters or are otherwise indecipherable, I have made available all of the CFR PDFs as individual pages. In other words, there is a small PDF file for every page in the original PDF. I have placed these PDF files in a directory hierarchy that you can browse and drill down into:

CFR PDF files – individual pages

Thoughts On Using The Data Files

You need to be a little thoughtful and careful in how you evaluate the data in a given .txt file. For example, you cannot just total up all the contribution amounts for a given candidate and expect the result to be roughly accurate, because some of the submitted CFR documents are corrected versions of previously submitted reports, so you would have duplicates of many contributions. Note that sometimes candidates submit corrected versions that contain all the contributions from the prior report, while other times candidates submit only the corrected one or two items (the delta). There is no consistency, so pay attention.
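
Once you have looked at the Clerk’s page and decided which reports fully supersede earlier ones, one simple way to handle this is to drop the superseded DOCIDs before totaling anything. A sketch, assuming the data was loaded as in the earlier example and using placeholder DOCID values:

```python
# Sketch: after checking the Clerk's site, list the DOCIDs of reports that were
# fully replaced by a later corrected report, then drop those rows before
# computing any totals. The DOCID values here are placeholders, not real ones.
superseded_docids = {"123456", "123789"}

deduped = contributions[~contributions["DOCID"].isin(superseded_docids)]

# For "delta" corrections (only the changed page was refiled), you still have to
# compare rows by hand and remove the superseded ones yourself.
```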

Here are a few examples of what you could easily do now using the .txt files, that might not have been so easy before:

  • Just by quickly paging through the July 15th file and eyeing the amounts, I caught a couple of $49 contributions. $49 is an extremely odd contribution amount, so I stopped and looked closer and saw that Melissa Zone had listed a couple of “anonymous” contributions of $49 — apparently trying to get under a $50 reporting requirement that she has mistakenly interpreted to mean anonymous contributions are allowed as long as they’re under $50.
  • While paging quickly through the data, I noticed Sheryl Cole’s latest 8-day report has many contributions over $200 that are missing the required occupation and employer info. This is something that almost everyone is guilty of once or twice, but nothing like the number of times Cole has violated that rule in that particular report — better fix it quick, Sheryl.
  • Sorting by contributor name (and perhaps combining all 3 .txt files) may help you find violations of the $350/individual contribution limit; a rough sketch of that check appears below this list. For example, you can easily see that Jim Arnold contributed $150 on the 0715 report and $227 on the 8-day report, for a total of $377, to Katrina Daniel. Or, in the 8-day report there is a $700 contribution from Mr. & Mrs. William Reagan, followed by a $350 contribution from William Reagan at the same address, which looks like it might be an over-limit violation. (Note that it is difficult to be 100% certain on over-limit violations – there is always the possibility that a parent and a child share the same first and last names, have the same address, and each contributed to a candidate, which is legal but would look like an over-limit violation in the report.)
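
Here is that rough sketch in Python, continuing from the earlier loading example (july15, thirty_day, and eight_day are assumed to be the three loaded and de-duplicated DataFrames). The column names CONTRIBUTOR, CANDIDATE, and AMOUNT are assumptions for illustration – check them against the real headers – and OCR noise in the amounts means the output is a starting point for manual review, not a verdict.

```python
# Sketch: combine the three files (after removing superseded reports) and flag
# contributor/candidate pairs whose summed totals exceed the $350 limit.
# CONTRIBUTOR, CANDIDATE, and AMOUNT are assumed column names for illustration.
import pandas as pd

combined = pd.concat([july15, thirty_day, eight_day], ignore_index=True)

# Amounts are OCR'd text; strip non-numeric characters and coerce what parses.
combined["AMOUNT_NUM"] = pd.to_numeric(
    combined["AMOUNT"].str.replace(r"[^0-9.]", "", regex=True), errors="coerce"
)

totals = combined.groupby(["CONTRIBUTOR", "CANDIDATE"])["AMOUNT_NUM"].sum()
print(totals[totals > 350].sort_values(ascending=False))
```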

Election Data In The Future and Other/Alternative Efforts

I will continue to make these CSV files for each election until they are no longer needed. However, I think a better option is to incentivize campaigns to export and publish their own campaign data, which is easily done from the Ethics Commission software.

A group called Open Austin questioned several 2014 candidates about exporting and self-publishing, so the idea is out there; we just need candidate agreement and compliance. As an incentive, I propose creating an “Election Transparency” badge that campaigns can place on their websites, much like the “my website is secure” badges seen on some sites.

How to Search Election Documents on the City Website

Earlier, I mentioned that you can actually search Austin Mayoral and City Council candidate and election documents on the city website. First, go to the search page (if that link doesn’t work, try this one). Scroll to the bottom and select the Municipal Election Documents radio button. Next, click Start Advanced Search. The resulting screen is an advanced search form that lets you search filed election documents.

On the advanced search form, you can narrow your search by form type and/or by candidate. Click the Help button to the right of each field to learn the syntax. For example, in the keyword field you can put something like Kath* <Near/3> Tov* which gets you variations of words within 3 words of each other, like Kathy Tovo, or Katherine Beth Tovo.

It is important to understand that when you are searching, you are searching the OCR’d text of the PDF documents, but when you view a PDF, you are seeing the scanned image (the OCR’d text is not visible without copying/pasting or searching). So you need to get creative with your searching and use the advanced word-stemming/wildcard features. For example, “Hillco Partners” is actually read in as “Hilico Partners” by the OCR process.

Till the Next Post

This is a long post and I still have a bunch more I want to say – particularly about this idiotic pledge that the more conservative/Republican/Tea Party Council candidates have been signing. That pledge is straight out of Grover Norquist’s playbook. If you want Austin to be run like the federal government, or one of the Republican Texas cities like Houston or Dallas, then vote for someone who signed that stupid pledge.  But, I will save that rant, and a discussion about the Effective Rate, for another post…

Dylan Tynan

Consultant, software developer, political wonk, etc.

Many years ago I worked for the Lee Cooke for Mayor campaign, but since then I have not worked for, nor given any money to, any political campaign.

Austin TX

11 Comments

  1. In b4 paranoid ATXSLUTH.

  2. Cool! I have been playing with the data. I just made a chart showing local/Texas/out-of-state donations for the mayoral candidates: http://i.imgur.com/9MR4FfB.png

    Surprised to see Adler has 2x the donations of Martinez.

    • Awesome! I’m so glad to hear that someone is using it.

      Bold choice going with the mayoral candidates first – that’s a ton of data, and I remember Adler’s having a good number of stray characters, so that’s huge that you were able to make a chart out of it. Yeah, that’s good, much easier to digest that way .. nice work!

      • I geocoded addresses with “Austin” in them and used COA’s single_member_district gis data to sort the contributions by district. This also helps find “Austin” addresses that are actually in places like West Lake Hills or unincorporated areas, which are a significant chunk of the donations. http://i.imgur.com/y11NzZj.png http://www.reddit.com/r/Austin/comments/2kztun/2014_mayoral_campaign_contributions_per_district/

        (also, that graph in my first comment is wrong: I swapped Cole’s in-state/out-of-state money by mistake)

        • ilikeyourdata, somehow I missed your Nov 2 2014 comment last week and just noticed it. Wow. I’m impressed. Very nice use of the data.

          And, the commentary on reddit sounds accurate to me .. there will be some office addresses in there, particularly attorneys & lobbyists, that tend to use their office addresses & tend to be downtown. A good number may be PO boxes also. I think your methodology is sound & a great way to go about it. Certainly the results aren’t perfect (and never will be really), but they’re good enough for debate & discussion in my opinion.

          Excellent work.

    • Maybe you could tweet it to #atxcouncil and/or #atxelections ? Apparently I am too twitter challenged to get any of my tweets to appear out there and Twitter support doesn’t respond to my questions/pleas for help…

    • I assume you looked for Austin in the address field to determine whether or not it counted as in-Austin vs. other-Texas? Or, did you try to look for zips starting with “787” ?

      I tried the “787” path the other day on a couple reports and I noticed a lot of mistakes in the zip code data due to it reading the zips as “767” rather than “787”. Pretty easy fix, but something to watch out for…

      • Yep I just looked for Austin/TX/Texas in the address field to determine location. I did notice lots of bad numbers with Adler especially. I edited several dollar values by hand (ones that were easy to guess, like “35000” instead of “$350.00”) but also excluded some that were really messed up, for me to fix at some point in the future.

        • I just noticed there are several duplicates caused by the “correction” reports. For example in the July 15 data, Robert Thomas has triples of most entries because he has three CFRs filed. Can you just include the latest CFR in those cases, or is that incomplete?

          I guess I could edit out duplicates in your text reports by looking for exact matches and only keeping the latest CFR, but that’s not ideal since the OCR results don’t always match.

        • Yeah. It’s unfortunate, but I couldn’t find a good way to deal with that issue. I tried to explain it some in the 1st paragraph under the “Thoughts on Using the Data Files” section.

          Essentially I punt on that issue and leave it to the user of the file to “fix” things by removing the right rows from the file so that nothing is lost & nothing is dup’d.

          There are too many variations by candidates to just pick a way and do it every time (like only using the latest file) – at least not without some significant detection and logic – and even then I’m not sure it’s doable.

          Let me give you an example of what I mean.

          Go to the July 15th reports (http://tinyurl.com/l5nzmt4) and scroll down to Katrina Daniel. She files a CFR on 07/15 and a corrected one on 07/23. Both are 42 pages. Obviously you would throw away the 07/15 and use the newest one, because it’s a straight replacement.

          Then, scroll down to the 30-day reports and find Katrina Daniel again. You’ll see a 10/06 report that’s 43 pages and a 10/13 correction report that’s only 2 pages. If you look inside it, you’ll see that she submitted only the one page that had a change on it. So, in this case I can’t throw out the first report. Even if the program were smart enough to figure that out, it would then need to extract that new data & replace just the right rows from the prior report somehow. That would be damn tricky, probably not even possible, since the two files probably won’t OCR the same way and I can’t depend on something like page number. … So, I don’t see a way to do it.

          Essentially, anyone using the data and doing anything that involves more than 1 report (more than one DOCID) must first go look and see what kind of files were submitted by the candidate & what’s in them, and then make appropriate adjustments to their data file. That’s not TOO bad if you’re just dealing with a single candidate, but if you’re dealing with an aggregate – like all 30-day reports, or all District 1 reports – then you’re gonna need a few minutes to figure things out. Make sense?
