Sunday, June 15, 2025

Federal News Network: Coming to an algorithm near you: A big, federally-focused training data set

Contractors trying to develop artificial intelligence applications for the government face a challenge: finding a good data set for training the algorithms. Now a big new federally oriented data set is coming from an unlikely source. The Federal Drive with Tom Temin got the details from Bloomberg Government reporter Josh Axelrod…

Josh Axelrod: So there’s an influential group at Stanford that has created a new data set that they call “Pile of Law.” It’s especially oriented towards legal and governmental contexts. Let me back up, though, and tell you about foundation models, which they’re trying to make better. A foundation model is basically, if you take, like, a row of Encyclopedia Britannica, and you try to ingest it all into a machine, and then use that machine to make decisions, to learn something from the information you’ve provided it. A lot of foundation models, which is a term that comes from this Stanford group and is now widely used across the industry, ingest information from the public internet. So the internet, you know, I’m as much a fan of the internet as the next guy, but there’s a lot of crap on there. And foundation models have historically been trained using social media, Wikipedia, Reddit. So for example, if you’re trying to teach a model how humans speak, Facebook is littered with hate speech. And that could encode bad grammar as well, which you can’t have in a model. So these researchers attempted to use a different type of data. They turned to casebooks, legal code, regulatory documents, again, things that are more suited to these legal and governmental contexts, to do the work of natural language processing more effectively, which is really in use across the whole of government right now.

Tom Temin: Interesting. So in some ways, they are taking the approach that IBM did a number of years ago with a project they called Watson, where they would take all of the vetted or peer-reviewed material from a given domain and put that into a database. They had trouble selling it. I’m not sure how well it worked, though it won Jeopardy. But that’s kind of what it sounds like here.

Josh Axelrod: That’s exactly what they’ve done: assemble this corpus of data. It’s actually 250-plus gigabytes. And it’s all open source. So programmers can come and look at that data, tinker with it, use it to build their own models. And again, it’s going to be better suited towards some of these regulatory contexts. There are really five key areas where the government’s using AI and natural language processing, which this could really augment… Read the full article here.



Jackie Gilbert
Jackie Gilbert is a Content Analyst for FedHealthIT and Author of 'Anything but COVID-19' on the Daily Take Newsletter for G2Xchange Health and FedCiv.
