What might better accountability systems for government technology (and customer experience) look like?
Process? The "what"? Measures? The "who"?
Hello, and happy Friday! It’s sunny in San Francisco.
I’d like to think out loud for a bit. Take these thoughts as sparks and seeds, not dogma or posture. If you mischaracterize them as anything else, that’s on you!
Dual, parallel, and asymmetric systems of accountability
I’ve been walking around recently thinking about systems of accountability and oversight for government technology.1
A partial impetus for this was recently reading through Jen Pahlka’s new book, Recoding America (it’s officially out June 13th), and wrestling with some of its ideas overlaid with my own direct first-hand experiences.
(Aside: despite being a default-cynic of any writing in the “civic tech” area, I really enjoyed the book. It does a better job at conveying some of the rudiments and patterns of my world than I’ve seen written down before.)
One of those ideas is on the dual systems of accountability that public servants face — outcomes vs. process. A brief excerpt:
When there are big, visible delivery failures like healthcare.gov or the unemployment insurance crisis, public servants are trapped between two distinct systems of accountability.
In the first, politicians will hold the public servants accountable for outcomes: whether the website works to enroll people or whether benefits are actually getting to claimants. In this system there will be [legislative/political] hearings…
In the second system of accountability, various parts of the administrative state—the agency itself, the inspector general, the Government Accountability Office—will hold these same public servants accountable to process. Procurement and planning documents will be reviewed for any gaps, any skipped or partially skipped steps, any deviance from standard protocol, even if that deviance is legal, just nonstandard.
Jen goes on to describe that for a career public servant, while the outcomes accountability (hearings) can be painful, it is the process accountability that has more direct impacts on individual careers and lives, as not following the codified process can lead to ineligibility for promotions/raises, firing, demotion, etc.
So there are effectively two distinct accountability regimes; they may and do contradict at times, and in general the tie would seem to go to procedure (over outcomes).
Accountability and oversight regimes are one level of feedback loop
Now Jen’s book makes the point that simply adding more accountability and oversight doesn’t necessarily fix things. (I think I agree.)
But I think there’s a more useful question in this area, which is not about the degree of accountability and oversight, but about the quality of the feedback loops that these provide.
I say that because at the end of the day, things like GAO reports or audit findings or legislative hearings — these are all more generally one of the many levels of feedback loops that shape the behavior of agents in the system of delivering government technologies.
And I think the question of “what do good (or, at least better) feedback loops look like?” is an important one that is, perhaps, under-theorized. “Good” feedback loops here are ones that, over the long run, credibly incentivize the behavior and outcomes we want — while disincentivizing those we don’t want.
One more observation: the question of good feedback loop design (for technology) is relevant at many levels (system, organization, service/product, team, feature), not just this top-level of public governance and accountability.
A legislative body might impose feedback loops focused on the highest-level outcomes accountability
An executive entity might impose feedback loops for a similar level, but with more direct (clearer) outcomes dictation and more direct intervention
A (non-technical) leadership level needs and wants feedback loops on its teams so it can ensure “good delivery”
A manager of a service or team might want feedback loops that create a strong (aka “kind”) learning environment, where the team’s various activities form an operational machine that effectively gets to some outcomes (often these are metrics or KPIs)
A designer or researcher needs feedback loops like, say, user observation for sub-goals like “the usability of the thing”
A developer/engineer needs feedback loops like monitoring (uptime, application errors) or automated tests that provide situational awareness of “the thing working (technically)”
External stakeholders like advocates need feedback loops that they can push on to orient in the right direction (while also often having less direct access to ground/internal information)
This is all to say that the question of what we should hold the various agents around “the technology” accountable to is a meaningful question at many different levels.
And—critically!—if higher-level accountability feedback loops (longer timelines, more consistently recurring) drive the system in the wrong direction, that can wash away good feedback loops lower down. If a team is focused on the usability or functionality users need, discovered in direct research with those users, but leadership is getting whacked over cost, procedural compliance, or hard-to-verify anecdotes, then, well… you can guess how the rock gets carved by the water over the long run.
A (non-exhaustive) option set for accountability feedback loops for technology
So, if we recognize that there will be accountability feedback loops at many levels, and we want to design better ones, what are the options we have?
This is far from complete, but let me document some options I think exist.
Practices or processes as “good”
This is a common answer I see. For example, “good” is if it is:
“Agile”
“User-centered”
“Grounded in research”
“Improving customer experience”
These are labels, but they distill down to descriptions of process or practices.
If you want the feedback loop to be about “is X following these practices,” then you do need to operationalize that. There are two approaches:
(1) Monitor the words used: This is not flippant! I see this as a common form of management, which is to say: are they reporting that they are doing “agile,” doing “human-centered design”? Absent some deeper operationalization, the words are the accountability feedback loop.
(2) Monitor for the practices themselves: For example, you might look at things like:
Is user research or observation or usability testing occurring? How frequently? A deeper level might be some sort of explicit monitoring of separation: (a) are you observing users, (b) what are you finding, (c) does your work reflect addressing findings, or other goals?
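As a purely illustrative sketch of what that deeper operationalization could look like as data rather than vocabulary: the record shapes and field names below are my own hypothetical assumptions, not an existing oversight instrument.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical records for monitoring the practice itself, not the words:
# (a) are you observing users, (b) what are you finding, (c) is the work addressing it?

@dataclass
class ResearchSession:
    session_id: str
    held_on: date
    participants: int

@dataclass
class Finding:
    finding_id: str
    session_id: str      # which observation produced it
    summary: str
    addressed_by: list = field(default_factory=list)  # ticket/release IDs, if any

def practice_report(sessions, findings, today):
    """Roll the three questions up into something a reviewer can actually inspect."""
    last = max((s.held_on for s in sessions), default=None)
    return {
        "sessions_last_90_days": sum(1 for s in sessions if (today - s.held_on).days <= 90),
        "days_since_last_session": (today - last).days if last else None,
        "findings_total": len(findings),
        "findings_addressed": sum(1 for f in findings if f.addressed_by),
    }
```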
There are also a few challenges with this approach. Practices themselves do not per se generate desired outcomes. As referenced, when process monitoring is operationalized as looking for words and incentives drive an org towards other goals, the dominant strategy can be semantic arbitrage: using the words, while not doing the thing.
And also… a focus on process is in fact (duh-duh) a form of procedural accountability! For those who bring a judgmental (rather than curious) lens to the status quo, a certain required process of security testing is not structurally that different from a requirement of user observation. They just point towards different aims. So it’s worth wrestling with the really-existing forms of this we already have.
Observable and objective criteria (dba… better functional requirements?)
Another way to look at technology might be to say… does it do or have this thing that we can objectively observe?
Some common examples I’ve seen:
Is it mobile-responsive
Does it support document upload
Does it cover X transaction we want users to be able to do online
This is in some ways a simple approach, and not an unhelpful one. But it’s also kind of just saying… functional requirements? Which is the main form of internal (e.g. team- or vendor-level) accountability and feedback loop in the status quo of how technology is built.
Put another way: it’s useful for spotting missing functionality that would be better to have, but the intervention it implies is really at the level of “make sure the requirements are better.”
As a feedback loop, it gives us a binary variable, and therefore under-operationalizes the design details (the “how it works for users”). You can play an iterative game (okay, we have doc upload, but it should now also support the .HEIC image format) with objective criteria, but you’re kind of always caught in the dilemma of we’d better be really sure this is the right thing. For some stuff (mobile responsiveness) yes, it’s pretty clear, we’re confident, there’s little room for how it’s implemented2 to affect the outcome, so let’s just measure that.
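To make “objectively observable” concrete, here’s a minimal sketch of treating one such criterion as an automated check rather than an assertion; the viewport-meta-tag heuristic and the example URL are assumptions on my part, and a real assessment of mobile responsiveness would look at actual layout behavior.

```python
import urllib.request

def has_viewport_meta(url: str) -> bool:
    """Crude binary proxy for "is it mobile-responsive": does the page declare a
    viewport meta tag? This is only the cheapest observable signal, not a real test."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace").lower()
    return '<meta name="viewport"' in html

# Hypothetical usage:
# print(has_viewport_meta("https://benefits.example.gov/apply"))
```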
Measurable outcomes, usually flowing downstream from a higher-level systems model
This is a pretty different approach from focusing on process, because the how is allowed to vary. You define goals, and then define outcomes that are reasonable-but-measurable proxies of those goals. Waterfall but users love it? Great, user love wins.
In this form, we might have a hunch that user observation and agile and all those (“good”) things are valuable means, but what we want reporting and monitoring of is a measure rather than assurance of a process being followed.
One vertical axis (depth) in Dave-as-T-shaped-individual is SNAP, the food stamp program: specifically access to benefits, applying, friction in maintaining benefits, etc.
So let me offer some illustrations of how you might do this using SNAP. Let’s say you have a goal that the online services should be accessible to people in need of the program. That’s a very high-level goal! You can’t really have a fruitful argument if you have two sides yelling “we made it accessible!” and “no you didn’t, it’s not accessible at all!” back and forth at one another.3
That could get operationalized in plenty of ways. But here are a few measures you might use (a minimal sketch of computing a couple of them follows the list):
Conversion rate (of those who start to apply for SNAP online, what % actually succeed at submitting)
Time to get through the application flow (a potential implicit measure of difficulty or burden, and we might say well if it’s really hard it’s not accessible)
User satisfaction as rated at the end of an application flow
Passes WCAG at a certain level (one test framework for accessibility for users with varied disabilities or access challenges)
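And the sketch promised above: computing the first two measures from an application-flow event log. The event names and the log’s shape are assumptions made for illustration, not any real system’s schema.

```python
from datetime import datetime
from statistics import median

# Hypothetical event log rows: (user_id, event, timestamp),
# where event is "started" or "submitted".
def flow_measures(events):
    started = {u: ts for u, e, ts in events if e == "started"}
    submitted = {u: ts for u, e, ts in events if e == "submitted"}
    completed = [u for u in started if u in submitted]
    minutes = [(submitted[u] - started[u]).total_seconds() / 60 for u in completed]
    return {
        "conversion_rate": len(completed) / len(started) if started else None,
        "median_minutes_in_flow": median(minutes) if minutes else None,
    }

# Hypothetical usage:
# events = [("u1", "started", datetime(2023, 6, 1, 9, 0)),
#           ("u1", "submitted", datetime(2023, 6, 1, 9, 25)),
#           ("u2", "started", datetime(2023, 6, 1, 10, 0))]
# flow_measures(events)  # {'conversion_rate': 0.5, 'median_minutes_in_flow': 25.0}
```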
If you go do this, you’ll quickly find you get lots of objections. But the objections represent precisely the dialogue you should be having up front. Why? Because the objections represent information about the higher-level system that its operators have knowledge about. You might hear things like:
Conversion rate is problematic; people might decide they want to put it down and pick it back up later. (Response: okay, great, so let’s see if we can measure “total success” by looking at whether people log back in and finish within a week; there’s a sketch of that after this list.)
User satisfaction is not good — people are applying for food stamps! Of course they’re not happy or in a good place! (Response: Totally. So maybe it’s more of a gut check — how difficult was this for you 1-10? And an open text field of what was hard.)
The time-it-takes is actually better to be a little long — that way we know we’re getting a really complete application, which means we can process it quickly and get people benefits ASAP. (Response: That is a very reasonable point expressing an invaluable systems model point — that the ultimate user goal is not merely to apply but to get approved for money for food. But now we have a much more fruitful discussion! Maybe we need to now add “approval rate of online apps” [maybe vs. approval rate of in-person or community-partner-assisted applications] as a paired metric to help us balance these goals — easy enough to apply, but not so easy that we’re just pushing burden downstream to a later longer interview or document requests. [Though… maybe it doesn’t actually make the interview longer?] These are much meatier and better, more-good-faith discussions!)
These things are all less useful than just directly observing users and fixing their issues. (Response: Quite plausibly so! But… how do we know that that machine is working in service of our higher-level goal?)
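Following the first objection’s response, here’s a sketch of the “total success” variant: count someone as successful if they submit within some window of first starting, even across sessions. The seven-day window and the data shapes are illustrative choices, not a standard.

```python
from datetime import timedelta

def total_success_rate(first_starts, submits, window=timedelta(days=7)):
    """Share of people who started an application and submitted within the window,
    even if they put it down and came back later.
    first_starts / submits: hypothetical dicts of user_id -> datetime."""
    if not first_starts:
        return None
    finished = sum(1 for u, t0 in first_starts.items()
                   if u in submits and submits[u] - t0 <= window)
    return finished / len(first_starts)
```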
This is an approach I really like. It seems to give a kind feedback loop: it’s measurable, relatively quick to see the effect of changes, it correlates with something we care about, and it enables us to have the argument about the goals up front rather than at the end.
(No one likes to work on a project for 2 years, launch, and then be criticized that there was an entirely different unstated goal not taken into account. And you know what! That makes people more likely to dismiss the criticism than take it in good faith and act on it.)
Aside #1: in a fight between you and legibility, someone will end up bloodied — and it won’t be legibility
An obvious and ubiquitously common objection to this, to which I must respond:
“Any of these outcomes/measures will necessarily reduce a messy human problem to narrower dimensions,” you say, fists clenched, so committed to the user, the person.
I know. I agree. But, well… Tough cookies. Welcome to The State.
You cannot escape the legibility trap in use of the state apparatus — you can only reduce its lossiness.4
As we saw above with process — and we shall see again! — whether it’s good is going to get evaluated one way or another. By abdicating the choice, you just let a really-existing evaluation be what’s used. Are you frustrated with the focus on IT cost? It’s a very clear measure! That’s a big part of why it gets talked about so much! So fight it with a better alternative.5
There will be accountability, management, reporting, etc. Please don’t resolve that by defining it as out of scope.6 You can’t beat it, but you can make it less noisy, higher signal, and incentivize good things over the long run.
Aside #2: virtue ethics/trust/”on my team” as feedback loop
It’s probably worth noting that, for better or worse, there is also a really-existing form of defining good which is “built by a team I trust.” “It is good because they are good.” (Implicitly I think of virtue ethics here a little bit.)
I’m really not making a moralistic argument here — I just want to note this exists. It is not necessarily a bad thing, depending on the context and information availability. And it’s likely more effective when the trust arises from something more systemic or material: for example, a structural incentive difference from other actors.
But/and… that strikes me as fairly insufficient as a long term systems intervention. And even if we have trust in an entity, we (broadly) should be thinking about the feedback loops that create more of that-shaped entity. Trust to bootstrap; verify to endure?
There’s also a very obvious problem, which is in the same way that semantic arbitrage can be used to misdirect, “trust us” can as well. It’s very game-able, and is in fact actively gamed. My thought: trust may be useful in the short run, and potentially quite harmful in the long run. Maybe it’s more like a form of compression, and what we want is to create systemic feedback loops that prevent drift of trust away from the bigger goal we care about.
More legible representations of what “good” is as a “government moat”
Just a quick note: a big problem I’ve seen is, I think, best described as legibility asymmetry.
“Yes, we care about customer experience, or UX, or accessibility. But it’s messy and hard to reduce.”
I want to note: in a fight against “harder” and clearer — more legible data and better-operationalized goals — messy, ambiguous things often lose out. Quality control and error rate numbers are more legible. Cost is more legible. Big failure headlines are more legible. Head-to-head, the more legible phenomenon usually wins in government. What’s more, much of the work of government is making the mess legible.
So I have to proffer — by better operationalizing “good technology” (or customer experience, or whatever) into more legible phenomena, do we not create more of an enduring “moat” for such work?
When leadership comes and says “well this thing we already measure and is reported and everyone looks at is way down in a bad way,” does it not give you a better response if you have something at least somewhat close in shape?
Related questions to wrestle with
Institutional independence / external vs. internal: What is the best institutional arrangement for these kinds of things? Should advocates be independently user testing? Should measures be set from above, or be up to the implementing actor to set, with a requirement that they have some measures (in alignment with some agreed goals)? Why doesn’t the current tech practice of “independent verification and validation” capture customer experience issues? (Could it? I’m thinking about this, and you should contact me if you want such external assessment.)
Code for America’s recent “Benefits Field Guide” is a great really existing form of this, and done institutionally independently—outside of government
The knowledge problem / information asymmetries: Those who may have the best access to the ground truth may be most disincentivized from sharing it. Those with the most incentive and capacity to propose change may be very far from the ground truth. How might we narrow that? Maybe stakeholders should share bug reports with user IDs and screenshots instead of higher-level abstract feedback (“it’s not user friendly” is not very useful)? A sketch of what that could look like follows this list.
Combating Goodhart’s Law: Make a measure a target and you’ve made it a worse measure. Sure. And we can’t escape legibility. (How much do I have to say this?) There are also ways to mitigate Goodhart’s Law!
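On the bug-reports point above, here’s a sketch of what a more actionable piece of stakeholder feedback might look like as a structured record; the fields are hypothetical, and the point is only that it names a person, a place in the flow, and a concrete behavior.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class FieldReport:
    """One concrete, checkable observation (the opposite of "it's not user friendly")."""
    observed_at: datetime
    case_or_user_id: str             # lets the agency find the actual record
    page_or_step: str                # e.g. the income-entry step of the application
    what_happened: str               # observed behavior, not a judgment
    screenshot_path: Optional[str] = None
```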
Don’t mistake my words here for strong confidence or conviction. Really!
To reiterate, this is all me thinking out loud. Some might be right, some might be more nuanced, some might be dead-wrong. But maybe it’s helpful for you. (It’s certainly helpful for me to actually write it down!) Or maybe it creates more of a dialogue. Maybe Hegel was right after all.
I try to have more conviction in things I’ve seen than in normative beliefs I hold. You can see a few instances of that implicit in here. But if all arguments are composed of a combo of axioms and reasoning, I’m open to divergence on both. (BUT specifying the axiom/assumption, the reasoning fallacy, or both is far more productive!)
Boy that was a lot! Have I sufficiently hedged here?
Moment of zen
Never forget all government work is instrumental — in service of creating better human lives. Don’t forego a good human life while pursuing these means.
I do a lot of walking and thinking. You should too. Though I would earnestly suggest for the sake of a pleasant walk that you ponder less wonky topics than accountability regimes for institutionally-developed IT systems.
Though you’d be surprised at how frequently I’ve seen “it should work for users on mobile” implemented as “we have created a separate web application for only mobile users.”
While this argument will not be fruitful in moving things forward, it certainly is an argument that varied parties can (and very frequently currently do) have. I might try to name this with something like “getting stuck at magic words.”
This might be a fun argument to have with Jen Pahlka actually.
Aside: the legibility trap is a challenge “we” (for any reasonable definition of “we”) probably should confront more head-on. I hope these meandering thoughts help nudge that conversation along.
I’ll confess the most dangerous lines of thought I’ve seen people accidentally stumble into on this front approaches something like democracy itself is the problem. Needless to say, I… do not endorse such a normative position.