Responsible AI Beyond Research and Regulations

Pedro Saleiro

  Opnova

AI is moving at an incredible pace, and it’s accelerating, backed by trillions of dollars in investment. It’s reshaping industries and even the way we work and live our lives. However, as AI becomes more influential, the importance of Responsible AI becomes ever more obvious.

As researchers, engineers, and industry leaders, we need to ensure our AI systems aren’t just theoretically good—they need to be proven trustworthy when used in everyday situations. There’s no shortage of research on fairness, safety, robustness, explainability, and privacy in AI, but there’s still a big gap between research work and real-world practice. With the EU AI Act on the horizon and ongoing research in this field, we must ask ourselves: are we truly prepared to implement these principles effectively?

The EU AI Act, set to become a landmark regulation in AI governance, aims to create standards for developing and using AI systems in Europe. However, there’s a potential risk that this regulation, if not implemented effectively, will focus more on legalistic compliance—producing mountains of paperwork rather than ensuring that AI systems are thoroughly and comprehensively tested.

The issue here is not the intent of the regulation but how it’s implemented. We’ve seen similar challenges in other industries where innovation was stifled by excessive regulation that focused more on process than outcomes.

Testing is the only way to ensure that AI behaves reliably in a range of environments, including edge cases that could present significant risks. A good analogy is to look at mission-critical industries like aerospace or nuclear energy, where failure is not an option. In these sectors, thorough testing is built into every stage of the development process, from initial design to final implementation. AI should be no different.

Traditional AI testing methods, which focus on specific datasets and controlled environments, won’t be sufficient for modern AI systems, particularly those we classify as "agentic" – systems capable of perceiving their environment, making decisions, delegating, and taking actions to achieve specific goals.

Take, for example, a fraud detection agent in a banking system. This agent could request a fraud score from a tabular fraud detection model, visually scan transaction histories, cross-reference credit card data, search for patterns across different devices, and even contact the account holder. Based on all this information, it might decide whether or not to block the account. This type of system is incredibly powerful but also incredibly risky if not properly tested.
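
To make this concrete, here is a minimal sketch of what such an agent’s decision loop might look like. Everything in it is hypothetical (the tool stubs, the thresholds, the FraudDetectionAgent class): a production agent would call real services and actual models rather than these placeholders.

```python
from dataclasses import dataclass

# Hypothetical tool stubs; a real agent would call production services here.
def tabular_fraud_score(transaction: dict) -> float:
    """Request a fraud score from a (stubbed) tabular fraud model."""
    return 0.9 if transaction["amount"] > 5_000 else 0.1

def device_pattern_match(account_id: str) -> bool:
    """Check whether the account shares a device fingerprint with known fraud cases (stub)."""
    return account_id.startswith("risky")

def contact_account_holder(account_id: str) -> bool:
    """Ask the account holder to confirm the transaction (stub: always unconfirmed)."""
    return False

@dataclass
class Decision:
    block_account: bool
    rationale: str

class FraudDetectionAgent:
    """Illustrative agent that combines several signals before acting."""

    def __init__(self, block_threshold: float = 0.8):
        self.block_threshold = block_threshold

    def decide(self, account_id: str, transaction: dict) -> Decision:
        score = tabular_fraud_score(transaction)
        shared_device = device_pattern_match(account_id)

        # Escalate to the account holder only when the cheaper signals look suspicious.
        if score >= self.block_threshold or shared_device:
            confirmed = contact_account_holder(account_id)
            if not confirmed:
                return Decision(True, f"score={score:.2f}, shared_device={shared_device}, unconfirmed")
        return Decision(False, f"score={score:.2f}, shared_device={shared_device}")

if __name__ == "__main__":
    agent = FraudDetectionAgent()
    print(agent.decide("risky-042", {"amount": 7_200, "currency": "EUR"}))
```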

In the case of agentic AI, this means testing not just individual tasks but the entire decision-making pipeline. For instance, we need to simulate complex, real-world fraud scenarios, such as coordinated attacks across multiple accounts, to ensure the AI behaves as expected under these high-stakes conditions.
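
As a rough illustration of what “simulating a scenario” could mean in practice, the snippet below generates a synthetic coordinated attack: many accounts sharing one device, issuing small transfers within a short window. The field names and parameters are invented for illustration; real scenarios would be derived from historical fraud cases.

```python
import random

def coordinated_attack_scenario(n_accounts: int = 25, seed: int = 7) -> list[dict]:
    """Generate a synthetic coordinated attack: many accounts, one device, rapid small transfers.

    Purely illustrative; real scenarios would be derived from observed fraud patterns.
    """
    rng = random.Random(seed)
    shared_device = "device-colluding-001"
    return [
        {
            "account_id": f"acct-{i:04d}",
            "device_id": shared_device,
            "amount": round(rng.uniform(50, 300), 2),   # small amounts to slip under naive thresholds
            "timestamp_offset_s": rng.randint(0, 600),  # all within a ten-minute window
        }
        for i in range(n_accounts)
    ]

if __name__ == "__main__":
    for txn in coordinated_attack_scenario()[:3]:
        print(txn)
```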

These AI agents, built on large multimodal models, present several unique challenges:

  1. Non-determinism: Unlike traditional software, agentic AI systems may produce different outputs for the same input, making reproducibility and bug identification more complex.
  2. Non-stationarity: These systems can learn and adapt over time, potentially changing their behavior in ways that may not be immediately apparent or predictable.
  3. Complexity: The intricate, multi-component nature of these systems makes it difficult to isolate and test individual parts without considering the whole.
  4. Contextual performance: The performance of agentic AI can vary significantly based on the context in which it operates, requiring testing across a wide range of scenarios.
  5. Ethical considerations: As these systems make increasingly consequential decisions, we must test not just for functionality, but also for alignment with human values and ethical principles.

To address these challenges, we need a concerted effort from both academia and industry to develop comprehensive, standardized, open-source AI testing frameworks that enable:

  1. Statistical testing: Given the non-deterministic nature of these systems, we need to move beyond simple input-output testing to statistical approaches that can quantify behavior across distributions of outcomes (a minimal sketch of this idea follows the list).
  2. Continuous testing: As AI agents learn and evolve, testing must be an ongoing process throughout the AI lifecycle, from development to deployment and beyond.
  3. Multi-component testing: Frameworks should allow for testing of individual components as well as the system as a whole, helping to isolate issues and understand complex interactions.
  4. Ethical evaluation: Beyond functional testing, we need methodologies to assess the ethical implications of AI decisions and behaviors.
  5. Scenario-based testing: Tools should support the creation and execution of diverse, realistic scenarios to evaluate AI performance across different contexts.
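
To ground the first and fifth points, the sketch below shows one way a statistical, scenario-based test could look: instead of asserting on a single run, it replays the same suspicious transaction many times against a stubbed non-deterministic agent and checks that the observed block rate clears a target threshold. The agent stub, the scenario, and the 90% target are placeholders chosen purely for illustration.

```python
import random
import statistics

def noisy_agent_blocks(transaction: dict, rng: random.Random) -> bool:
    """Stand-in for a non-deterministic agent decision (e.g. an LLM-backed pipeline).

    Here the 'agent' blocks suspicious transactions roughly 95% of the time;
    a real test would call the actual agent under evaluation.
    """
    suspicious = transaction["amount"] > 1_000 or transaction.get("shared_device", False)
    return suspicious and rng.random() < 0.95

def block_rate_over_runs(transaction: dict, n_runs: int = 200, seed: int = 0) -> float:
    """Replay the same scenario many times and return the observed block rate."""
    rng = random.Random(seed)
    outcomes = [noisy_agent_blocks(transaction, rng) for _ in range(n_runs)]
    return statistics.mean(outcomes)

def test_blocks_coordinated_transfer():
    """Statistical assertion: the agent should block this scenario at least 90% of the time."""
    txn = {"amount": 2_500, "shared_device": True}  # hypothetical coordinated-attack transfer
    assert block_rate_over_runs(txn) >= 0.90

if __name__ == "__main__":
    test_blocks_coordinated_transfer()
    print("block rate:", block_rate_over_runs({"amount": 2_500, "shared_device": True}))
```

A continuous testing pipeline would run assertions like this on every model or prompt change, turning “the agent still behaves acceptably” into a measurable, repeatable claim rather than a one-off observation.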

For the EU AI Act to really work, we have to go beyond merely meeting the basic legal requirements and create a culture of continuous, rigorous testing that keeps up with AI’s fast pace. Such testing frameworks will not only help us build more reliable and trustworthy AI systems, but will also accelerate innovation by giving developers the confidence to push the boundaries of what’s possible.

It’s time for European researchers, engineers, and industry leaders to walk the talk. Responsible AI requires more than regulation—it requires a commitment to continuous testing and improvement, so that we can build a future where AI enhances human capabilities while safeguarding human rights and EU societal values.