Your AI Has a Passport Problem
A few weeks ago, I was on a demo call with a solution provider, and they walked through the variations of their solution available for different geographies and markets (e.g., certain functionalities enabled in one geography but not in another).
I thought this was a brilliant move because: 1) it didn’t restrict the tool’s capabilities to the lowest common regulatory denominator, and 2) they prioritized geographic differences in the design. It also sent me down a rabbit hole about the role geography plays in HR technology design and deployment, which led me straight to the question of cultural and regional bias.
When all was said and done, I arrived at the conclusion that AI has a passport problem, and that problem is creeping into HR technology without most of us even noticing.
Why Your AI Speaks With an American Accent
All this stems from the fact that Large Language Models, the backbone of Generative AI and of most AI-enabled HR tools, are trained on vast text corpora that are not globally representative.
Research has shown that LLM training data are skewed towards Western, English-speaking sources, and the outputs from the LLMs reflect cultural values resembling those of English-speaking and Protestant societies. In other words, the responses from LLMs we use daily often align with Western norms (e.g., individualism, secularism, liberal social attitudes, etc.) due to the internet content used as their training data.
This skew can subtly manifest in the model’s tone, examples/use cases, and assumptions, even if the prompt, user, or intended audience is from a different culture.
The Research Bombshells
Here’s a sample of research findings on cultural bias in LLMs. I’ve included the links to the published findings as well, in case you’re interested in exploring further:
OpenAI models = Protestant Europe values: All GPT models from OpenAI provided default answers that are aligned with the values of English-speaking and Protestant European countries (PNAS Nexus, 2024)
Bias in interview write-ups: In an evaluation of candidate interview reports generated by Claude 3.5, GPT-4, Google Gemini, and Llama 3.1, certain models produced more favorable evaluations for some demographic groups (arXiv, 2024)
English prompt = American answer: ChatGPT’s answers in English were aligned with American values, even when asked about the values of other countries. Put simply, if you interact with the model in English, you get an Americanized perspective by default (ACL, 2023)
AI suggestions flatten culture: AI-assisted outputs homogenize written content towards Western styles and diminish cultural nuance. In other words, AI-produced content makes you sound American and leans towards Western cultural references such as pizza parties or Christmas holidays (Cornell, 2025)
If you only serve a single-country workforce, you can breathe easier…for now. Everyone else, read on.
Here’s why the AI-passport problem could very soon become YOUR HR tech problem:
Just about every AI-enabled HR tool uses one of the big LLMs as its base model. It makes sense because: 1) LLMs are insanely expensive to build and train, and 2) why reinvent the wheel when you don’t have to?
When you build HR technology on top of a core LLM, the tool inherits the assumptions and characteristics of that LLM. One can argue that enough fine-tuning and training will avoid the biases, but how sure are we about that, especially when research shows that even tools that pass surface-level bias audits are still not bias-free in more complex tasks?
LLM providers release new versions and update their knowledge bases quite often. Every update can introduce new data, and new biases, into the core model. Depending on how an AI tool is built, those biases can flow straight into its outputs with each release of the core LLM (a simple check is sketched below).
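To make that concrete, here is a minimal sketch, in Python, of how a team might re-run a fixed set of “golden prompts” whenever the underlying model version changes and route any drift to a human reviewer. The generate() function is a placeholder for whatever LLM endpoint your tool or vendor exposes, and the prompts are purely illustrative.

```python
# Minimal sketch: re-run fixed "golden prompts" whenever the base model version
# changes, and flag any drift for human review. generate() is a placeholder for
# whatever LLM call your tool or vendor exposes; the prompts are illustrative.

GOLDEN_PROMPTS = [
    "Draft a short job posting for a warehouse supervisor in Jakarta.",
    "Summarize this peer feedback for a manager in Nairobi: <paste feedback>",
]

def generate(prompt: str, model_version: str) -> str:
    """Stand-in for your actual LLM endpoint."""
    raise NotImplementedError("Wire this to your provider or vendor API.")

def compare_model_versions(old_version: str, new_version: str) -> list[str]:
    needs_review = []
    for prompt in GOLDEN_PROMPTS:
        before = generate(prompt, old_version)
        after = generate(prompt, new_version)
        if before.strip() != after.strip():
            # Any change goes to a human reviewer before it reaches employees.
            needs_review.append(prompt)
    return needs_review

# Example (once generate() is wired up):
# flagged = compare_model_versions("base-model-v1", "base-model-v2")
```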
Where AI Bias Shows Up in HR
So this is not a “we trained it once and passed an audit that one time, so don’t worry about it” type of problem, especially when we start to look at how these biases can manifest in daily HR operations:
HR Content Creation
Be it a job description, an employee handbook, or an email communication, content creation is one of the most cited use cases for AI in HR. That means when an LLM generates HR content, there is a risk that the content reflects the AI’s ingrained cultural assumptions rather than the intended audience’s context.
I’m sure we can all see the problem when you are trying to generate content for a multi-national or multi-regional audience. But remember the push a few years ago to make job descriptions more gender-neutral and less reliant on masculine-coded terms? Your LLM could quietly reverse all that effort because of the massive volume of JD data it was trained on from the internet. Now take this a step further and assume that the AI-generated JD, with its inherited values and assumptions, is fed into an AI-enabled resume screening system that is also built on an LLM skewed towards a particular set of values and biases. I think we can all see how this could compound into some unintended consequences.
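As a small illustration of how you might catch this, here is a sketch that scans an AI-drafted job description against a short, hypothetical list of masculine-coded terms. A real audit would use a fuller, research-based lexicon and keep a human reviewer in the loop.

```python
# Illustrative check for masculine-coded wording in an AI-drafted job description.
# The wordlist is a small, hypothetical excerpt; real audits use fuller,
# research-based lexicons and a human makes the final call.

MASCULINE_CODED = {
    "competitive", "dominant", "decisive", "aggressive",
    "rockstar", "ninja", "fearless", "driven",
}

def flag_coded_terms(job_description: str) -> list[str]:
    words = {w.strip(".,;:!?()").lower() for w in job_description.split()}
    return sorted(words & MASCULINE_CODED)

draft = "We want a competitive, fearless rockstar who is driven to win."
print(flag_coded_terms(draft))  # ['competitive', 'driven', 'fearless', 'rockstar']
```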
Talent Acquisition
This is definitely one of the most talked-about areas when it comes to AI bias in tech. We know LLMs can inadvertently learn and reproduce biases present in training data, hiring data, and/or societal stereotypes. In a study done in 2024 (cited above), researchers found that LLM-generated interview evaluation reports showed biases linked to candidates’ gender, race, and age. For example, the LLM might produce more positive language in the report for a male candidate and more doubt-markers for a female candidate (e.g., “she might be a good fit”), which mirrors historical biases.
Without explicit instructions, AI models can also give an unfair edge to backgrounds they are familiar with. For example, if a region or a school dominated the training data for a certain role, candidates from that region or school could get an unfair edge simply because the AI has seen more text about them and is more familiar with them. In-group bias can still very much be a real thing in the AI world.
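One hedged way to probe for this kind of in-group bias is a counterfactual swap test: score the same resume twice, changing only the school or region, and see whether the score moves. In the sketch below, score_resume() is a placeholder for whatever scoring call your screening tool exposes, and the two-point tolerance is an illustrative assumption.

```python
# Minimal counterfactual check: score the same resume twice, changing only the
# school or region signal, and flag any gap. score_resume() is a placeholder
# for your screening tool's scoring call; the 2-point tolerance is illustrative.

def score_resume(resume_text: str) -> float:
    """Stand-in for your screening system's score (e.g., 0-100)."""
    raise NotImplementedError("Wire this to your actual screening tool.")

def swap_signal_test(resume_text: str, original: str,
                     counterfactual: str, tolerance: float = 2.0) -> bool:
    baseline = score_resume(resume_text)
    swapped = score_resume(resume_text.replace(original, counterfactual))
    gap = abs(baseline - swapped)
    if gap > tolerance:
        print(f"Possible in-group bias: score shifted {gap:.1f} points when "
              f"'{original}' was replaced with '{counterfactual}'.")
    return gap <= tolerance

# Example (once score_resume() is wired up):
# swap_signal_test(resume_text, "University of Michigan", "University of Lagos")
```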
Performance Management
This one gets interesting for me. It’s not just the potential for biased outputs based on skewed training data, but how convincing AI outputs can be. Someone once told me that the power of Generative AI’s influence comes down to language. As humans, we have an unconscious bias towards technology that can communicate with us in a language we understand. So when an LLM presents a summarized performance review, 360 feedback, or a set of recommendations, it does so authoritatively, even when the content is biased. When a manager sees a well-worded AI-generated evaluation, they might accept its framing instead of questioning it.
Predictive Analytics
I know this is one of the favorite use cases for people analytics practitioners when it comes to AI. While I am fully on board with getting smart about your data and using technology to expedite predictive analytics work, we need to remember that AI predictions are primarily based on historical data.
I think of it like a robo-advisor that can pick stocks for you. The advisor buys and sells stocks on your behalf based on historical patterns and trends and what it knows of your risk tolerance. While it works well most of the time, it doesn’t do well during Black Swan events. Given the unprecedented amount of change in the economy, labor market, and across the macro-environment, your predictive model might not be as accurate as you’d like it to be.
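One simple way analytics teams sanity-check whether “historical” still resembles “today” is a distribution drift check. The sketch below uses a Population Stability Index on a single illustrative feature (tenure bands); the example numbers and the 0.2 threshold are a common rule of thumb, not a guarantee.

```python
# Minimal drift check: compare the distribution a predictive model was trained
# on against what it sees today. The tenure-band data and the 0.2 threshold are
# illustrative assumptions, not a standard.

import math
from collections import Counter

def psi(expected: list[str], actual: list[str]) -> float:
    """Population Stability Index over categorical buckets."""
    categories = set(expected) | set(actual)
    e_counts, a_counts = Counter(expected), Counter(actual)
    score = 0.0
    for c in categories:
        e = max(e_counts[c] / len(expected), 1e-6)  # avoid log(0)
        a = max(a_counts[c] / len(actual), 1e-6)
        score += (a - e) * math.log(a / e)
    return score

# Tenure mix the attrition model was trained on vs. the workforce today.
training_mix = ["0-2y"] * 50 + ["2-5y"] * 35 + ["5y+"] * 15
current_mix  = ["0-2y"] * 70 + ["2-5y"] * 20 + ["5y+"] * 10

drift = psi(training_mix, current_mix)
print(f"PSI = {drift:.2f} (values above ~0.2 usually warrant a model review)")
```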
Sentiment Analysis
“Our AI-enabled solution will help you get a constant pulse on your workforce” is a value proposition that People Analytics practitioners may be all too familiar with. Using AI for sentiment analysis, especially for open-ended survey responses, seems like a great idea on the surface. But expressions of sentiment are culturally coded. For example, in some cultures, employees use milder language even for serious concerns, whereas others might use strong words for minor issues. An AI not attuned to this could misclassify satisfaction levels.
There is also an issue of language coverage here. If a model is trained primarily on English, responses in other languages may get less accurate sentiment scores, creating a bias where English-speaking employees’ opinions are measured more reliably than everyone else’s.
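If you want to test whether your sentiment tool handles non-English responses as reliably as English ones, one rough approach is to score each comment as-is and again after translation, then flag divergences for a bilingual reviewer. In the sketch below, sentiment_score() and translate_to_english() are placeholders for your survey tool and translation step, and the 0.3 threshold is illustrative.

```python
# Minimal sketch for auditing language coverage in sentiment scoring: score each
# comment as-is and again after translating it to English, then flag comments
# where the two scores diverge. Both helpers are placeholders; the 0.3 threshold
# is an illustrative assumption.

def sentiment_score(text: str) -> float:
    """Stand-in: your tool's score from -1 (negative) to +1 (positive)."""
    raise NotImplementedError("Wire this to your sentiment analysis tool.")

def translate_to_english(text: str) -> str:
    """Stand-in for machine translation or a bilingual colleague."""
    raise NotImplementedError

def flag_language_gaps(comments: list[str], threshold: float = 0.3) -> list[str]:
    flagged = []
    for comment in comments:
        native = sentiment_score(comment)
        translated = sentiment_score(translate_to_english(comment))
        if abs(native - translated) > threshold:
            flagged.append(comment)  # route to a bilingual human reviewer
    return flagged
```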
Five Fixes HR Can Start Tomorrow
So, with all this in mind, what can you do about it? The good news is that we don’t need to toss AI out the window and pretend it never happened, or spend another few hundred thousand dollars on a new solution (phew!). As HR practitioners, we need to be smart about how we use AI and vigilant about reviewing its outcomes.
A few things to start you off:
Run a Multilingual Mirror Test: If you work across cultures or languages, input your core prompts in two or three languages (even via Google Translate) and compare the outputs. If the recommendations swing wildly, you may have caught a bias in the model, and you should spend more time with those outputs and tweak them manually if needed (see the sketch after this list).
Ask the Awkward Question (to your vendors): When you are evaluating tools, ask, “What’s the geographic distribution of your training data tokens, and when did you last audit it?”, and watch the body language. They might need to pull in someone from their Engineering or Product team to answer, but stay with it; if you’re looking for a reliable partner, they need to be able to answer this question.
Test Globally, Fine-tune Locally: When you deploy an AI solution, allocate the time, budget, and resources to bring region-specific data and input into the training process and nudge the model towards balance.
Human in the Loop: Keep people involved at multiple stages of your AI-enabled processes by defining checkpoints rather than a single sign-off point. By pairing representative reviewers with your AI outputs, you can catch potential skew and bias before it perpetuates further.
Amplify the Under-represented: Encourage employees to contribute glossaries, slang guides, or culturally specific examples to your prompt library so their perspectives can be included in your training dataset and in model outputs going forward.
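Here is the mirror-test sketch referenced above. generate() and translate() are placeholders for the AI tool you are evaluating and whatever translation step you use (even a manual Google Translate pass), and the prompt is illustrative.

```python
# Minimal sketch of the multilingual mirror test: run the same core prompt in a
# few languages and lay the outputs side by side for human comparison.
# generate() and translate() are placeholders for the AI tool under evaluation
# and any translation step; the prompt below is illustrative.

CORE_PROMPT_EN = "Draft a policy on employee recognition and rewards."

def generate(prompt: str) -> str:
    raise NotImplementedError("Wire this to the AI tool you are evaluating.")

def translate(text: str, target_language: str) -> str:
    raise NotImplementedError("Machine translation or a bilingual colleague.")

def mirror_test(languages: list[str]) -> dict[str, str]:
    outputs = {"en": generate(CORE_PROMPT_EN)}
    for lang in languages:
        outputs[lang] = generate(translate(CORE_PROMPT_EN, lang))
    return outputs  # review side by side: do the recommendations swing wildly?

# Example (once the helpers are wired up):
# results = mirror_test(["de", "ja", "pt-BR"])
```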
The next time your “neutral” AI hands you a one-size-fits-all policy, remember that algorithms don’t just crunch data for outputs; they amplify voices. The question we need to reflect on is: whose voices get the mic? Check the passport of your favorite AI model before you stamp its advice into policy.