Government Stuff #1
A savage journey to the heart of methods of benefits program effectiveness evaluation
Something I deeply appreciate about Matt Levine’s writing is that it contains two elements: a certain sense of intellectual curiosity paired with (hrm, how should I describe it) well, a bit of a non-judgmental appreciation of the complexity of the really existing world as it is.
I recently heard him say this on a podcast:
A lot of my readers work in tech and are like, "I don't care anything about finance, I have no background in finance, I’m not that interested in it, but I like when you talk about complicated things." It's like the aesthetic appreciation of systems and complexity and the moving parts of the economic drivers and tools.
My impression is that there are a lot of people in the world who want to read about structures and there is not a lot of writing. So they're like, "Oh, great! I get to read about a derivative! Fabulous!"
…If I get to explain a complicated thing, this is going to be fun, this is going to be good.
- Matt Levine
As one of those tech folks who has an aesthetic appreciation of systems, I agree!
So let me try something.
Government Stuff: payments, quality, and payment quality
Let’s say you administer some public benefit program. A common form it will take is that it will be means-tested and it will be progressive. What do we mean by that?
Means-tested means what benefit (if any) someone gets is based on some calculation of income (or wealth, or some other means definition.)
Progressive means that the less you have in income, the more you get in benefit. Someone who has absolutely no income will get a lot, and someone riiiiight near the income limit will get a little.
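If you'd rather see that as a formula than prose, here's a toy sketch of a hypothetical linear phase-out. The $500 maximum and $2,000 income limit are numbers I made up for illustration, not any real program's rules:

```python
def monthly_benefit(income, max_benefit=500, income_limit=2000):
    """Toy progressive, means-tested benefit: full benefit at zero income,
    phasing out linearly to $0 at the (made-up) income limit."""
    if income >= income_limit:
        return 0.0  # over the means test: no benefit
    return max_benefit * (1 - income / income_limit)

monthly_benefit(0)     # 500.0 -- no income at all, the most help
monthly_benefit(1950)  # 12.5  -- riiiiight near the limit, a little help
```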
A lot of these programs are fundamentally about subsidizing demand. “Well, huh. Folks making $20,000 can’t afford enough widgets. Let’s give them some money for widgets. This could be a coupon or a discount voucher or semi-liquid funds. Having taken Econ 101, we are so very content! We’re doing our proper role as government: stepping in and fixing a market failure by providing purchasing power to those not getting enough of it in the labor market.”
One of the first questions that will pop up in designing and administering this benefit will likely be — how do we know it’s functioning?
Functioning here could mean lots of things, but for a moment let’s say we’re not talking about some sort of external outcomes (“early widget access radically improves childhood education outcomes!”)
Rather, functioning means we’ve got some machine to process cases of people who have income under a certain level, and we give them money for widgets, and how much we give them is based on how much income they make.
What we want to know is how effective our machine is at taking people with income who ask for widget help and generating payments based on that income.
You can almost hear one obvious metric in there: payment accuracy. We made payments. How accurate were they? Boom, done! But you need more than a metric, you need a methodology for calculating it, specifically.
So we design a methodology. Again, a reasonable way to do it would be:
Step 1: Take a handful of payments we made. Each one has an amount we actually paid, $X1.
Step 2: Now, for each case, look at it a second time, really closely, and recalculate what the payment should have been. Now we have our extra careful, extra diligent payment amount, $X2.
Step 3: Take all the X2’s, subtract out X1s.
Sometimes, we’ll see X2 < X1. That means we gave someone more than we should have. So let’s call that an overpayment.
Other times, we’ll see X2 > X1. That means in that situation we didn’t give that person enough. Let’s call that an underpayment.
Step 4: Add it all up. Because it’s not like one person getting $50 too much cancels out another person getting $50 too little (both are not great!), we don’t simply net them against each other. Instead we calculate separate underpayment and overpayment rates based on the total amount given out. Then add those two rates together, and we have (tada!) a payment error rate.
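For the code-inclined, here's a minimal sketch of that methodology in Python. The function name and the (paid, correct) representation are mine, and real calculations layer on sampling, weighting, and "close enough" thresholds (see footnote 1):

```python
def payment_error_rates(payments):
    """payments: list of (paid, correct) dollar amounts, where `paid` is what
    we actually paid ($X1) and `correct` is the extra-careful, recalculated
    amount ($X2). Rates are expressed as fractions of total dollars paid."""
    total_paid = sum(paid for paid, _ in payments)
    overpaid = sum(paid - correct for paid, correct in payments if paid > correct)
    underpaid = sum(correct - paid for paid, correct in payments if paid < correct)
    over_rate = overpaid / total_paid
    under_rate = underpaid / total_paid
    return over_rate, under_rate, over_rate + under_rate
```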
Let’s do an example!
We looked at 10 cases, in total paying out $1000 to help folks buy widgets.
One person was underpaid $50.
One person was overpaid $100.
So now we have an underpayment rate of 5%, and an overpayment rate of 10%.
Add ‘em up, and our total payment error rate is 15%.1
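Plugging a hypothetical breakdown of that sample into the sketch above (the per-case dollar amounts are invented, but they add up to the same totals) gives the same answer:

```python
# Ten cases, $1,000 paid in total: eight paid exactly right,
# one underpaid by $50, one overpaid by $100.
sample = [(100, 100)] * 7 + [(50, 50), (100, 150), (150, 50)]

over, under, total = payment_error_rates(sample)
print(f"overpayment rate:  {over:.0%}")   # 10%
print(f"underpayment rate: {under:.0%}")  # 5%
print(f"total error rate:  {total:.0%}")  # 15%
```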
Look at what a great system we’ve got!
Now we can do this once a year, take a look at why the payments were off, and do some quality improvement work.
Maybe that guy (let’s call him Dave) who we overpaid got too much because he did report how much he made at work, but he also drove for Uber and said he made about $200 a month. But when we dug in and asked him for all his Uber receipts, it turned out he made more like $400 a month for that specific month we looked at!
So maybe what we want to do is have a new procedure: if someone drives for Uber, let’s not just take their word for how much they made, let’s make sure folks like Dave (damnit Dave!) give us all their Uber receipts.2
There’s another aspect to it: what if a case we look at should not have gotten paid at all? Really bad!
Let’s say another guy named Dave (what is it with Daves!) got $500 for widgets. But he never should have gotten ANY because it turns out we have a specific rule which is that if you’re a Mets fan you are categorically ineligible for benefits. Our staff asked him, but Dave heard “Jets” and is a Giants fan, not a Jets fan, so accidentally answered no.
In this case, that particular $500 is 100% error. So if it were out of a total sample of, say, $1,000, it would be a 50% overpayment rate! These “complete errors” are quite bad!
But let’s also look at another situation: someone (Steve) who was eligible for the maximum benefit, but who wasn’t given benefits at all. So, maybe staff processing it misread $1,000 monthly income as being weekly income.
This is not all that different a situation: it’s a case where if we’re evaluating the machine we would want to look at what went wrong, look at acute causes, and figure out some ways to improve our machine based on that. That’s the point of why we’re doing this! Improving the machine.
And at its base, Steve’s situation is quite analytically similar to our Dave #2’s “complete” overpayment error — this time, Steve should have gotten $500 but in fact got $0.
But here’s a question: how is Steve’s situation accounted for in our methodology above? What effect does it have on our payment error rate?
None. It doesn’t count at all.
This is not some dastardly conspiracy — it’s simply an accounting identity of the (again, quite reasonable!) methodology we put together to assess our program. Why? Because we’re looking at payments as the fundamental unit of how we evaluate the administrative machine.
Again, let’s illustrate it with an example for clarity. Instead of just payments, let’s say we looked at all cases and found the following 4 instances in our close analysis of them:
A. Complete overpayment (given $500 but should have gotten $0)
B. Small overpayment (given $500 but should have gotten $400 — overpaid by $100)
C. Small underpayment (given $400 but should have gotten $500 — underpaid by $100)
D. Complete underpayment (given $0 but should have gotten $500)
In our payment error methodology, A, B, and C count — but D doesn’t.
The total payments we made across these four cases come to $1,400 ($500 + $500 + $400 + $0), so:
Our overpayment rate would be about 43% (A and B: $600 in overpayment)
But our underpayment rate would only be about 7%! (C’s $100 only)
The biggest underpayment (D) was not counted — not because it’s not useful, but because a payment error methodology starts with a universe of payments made, not the total universe of what the machine does.
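Running cases A through D through the same sketch makes the accounting identity visible: because a payment error review draws its sample from payments made, case D (paid $0) never enters the calculation at all. (The case labels and the filtering step are mine, for illustration.)

```python
# Cases A-D as (paid, correct) dollar amounts.
cases = {
    "A": (500, 0),    # complete overpayment
    "B": (500, 400),  # small overpayment
    "C": (400, 500),  # small underpayment
    "D": (0, 500),    # complete underpayment
}

# The sample universe is payments made, so D (a $0 payment) is never drawn.
payment_sample = [pc for pc in cases.values() if pc[0] > 0]

over, under, _ = payment_error_rates(payment_sample)
print(f"overpayment rate:  {over:.0%}")   # ~43% -- A and B
print(f"underpayment rate: {under:.0%}")  # ~7%  -- C only; D's missing $500 never shows up
```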
This is interesting! Accounting identities and methodological decisions make reality in a very material way!
But this is also theoretical. “Dave,” you might ask, “does it play out this way in reality?”
Well, let me grab some data! Here’s data from a benefit program in 2019:
[Chart: overpayment rate (%), underpayment rate (%), and total payment error rate (%)]
The overpayment rate looks much higher! And, again, from our accounting identities, we can see that while there is useful signal here, there is also an analytic gap that requires contextualization in looking at these numbers: “underpayments” here don’t count the potentially biggest underpayments — situations where someone got 0, but should have gotten a lot more.
Again, this is not a conspiracy! It’s a deeply banal fact of how we are doing our accounting!
What takeaways might we get from this?
One might be: payment errors are one measure, but by their very definition they also omit useful signal about how well our machine (benefit program) is operating. We also need complementary metrics other than payment errors that cover more surface area of how our benefit machines are working. (And maybe in a future Government Stuff™3 I’ll go into the weeds of some of those.4)
Another might be: payment errors create directional pressure. You noticed that in our example of the guy who drove for Uber sometimes, the way we improved our machine after finding we got something wrong was by making people submit documentation of their Uber payments.
If we assume this kind of thing does affect people in a time- and effort-scarce reality, one question might be:
When does asking for more documentation actually lead to eligible people not getting benefits due to that extra burden?
This would be a different flavor of our “paid $0 but should have gotten $X” underpayments — the ones we are not counting in this particular methodology.
I go into all this because payment accuracy is really the thing people look at most for benefit programs. And so it’s a very strong systematic force over the long run. I mean heck there’s a website called PaymentAccuracy.gov. This is not satire!
But as we’ve seen (and could see more) this one particular lens on our machine shows us some of what we care about, but it also misses a fair bit of what we might want to know about our machine to assess it and improve it.
Next Time
I’ve been down with Covid and then (quite extreme!) fatigue since recovering from the main symptoms. But I hope to use this medium much more, and other more, shall we say, erratic media a bit less.
Some things I plan (hope) to write a bit about in the near future:
The AI executive order: lots to unpack, but I think not many have unpacked the bits on government usage of AI, and it’s interesting!
What specifically goes wrong in large government IT projects, rather than the lossy summaries we’re used to that read as morality plays rather than useful diagnoses (“I prefer to ask myself: who made what specific decision, and how did it lead to this? Rarely is there some villain somewhere making rogue edits in Jira.")
Themes I am seeing from helping people navigate food stamps on Reddit and Facebook
Potpourri
The edit is what makes this.
(As always, if this sparked a thought or tickled you, I never mind a reply.)
All of this is a stylized example. In actual payment error calculations there are many more details. For example, often there is an “ehhh you got close enough” threshold and if the error is under that, it doesn’t count. There are also lots of methodological details for exactly how one samples and weights samples and even state samples vs. federal sample and projections based on the gap between those. I have a ~400 page manual I can recommend.
(These will probably be emails printed out? Or lots of screenshots? Hrm, kind of a pain…)
Trademark exclusively for comedic effect.
Have you heard of CAPER? Delicious on a sandwich and useful for assessing the procedural accuracy of negative actions!