The Department of Veterans Affairs is leveraging data and AIOps to better respond to and address unanticipated abnormalities in the agency’s IT network.
“The Operations Triage Group … [was] challenged with how we could better coordinate,” Jay Paluch, the group’s director, said during a Business of VA interview Friday. “We had a couple [system problems] where they were large reaching across our enterprise, and it affected many systems. … How might we better be able to not only recognize those, but alert ourselves that there’s a problem, and start putting tools in place that let us know when there was a system problem and what it might be? All of those together then allows us to be able to find problems quicker or find them before they start to impact our users.”…
Paluch noted that every minute of downtime on a single system costs 20,000 minutes of lost productivity. These enhanced monitoring techniques support and improve many of VA’s programs and services. Site reliability engineers (SREs) work closely with system owners to build telemetry from performance logs and determine where abnormalities exist.
“Our [Site Reliability Engineers (SREs)] have been involved in many of VA’s most critical systems. That would include those supporting veteran benefits, emergency room management, user authentication, to name a few. In many cases we transcend the [Office of Information and Technology] organization to support where we can as best we can,” Paluch told GovCIO Media & Research.