
Open source AI hiring bots favour men, leaving women waiting by the phone

A new study found that open source AI models were more likely to recommend men than women for high-paying jobs.

Although bias in AI models is well established, the findings highlight an unresolved problem as recruiters and corporate HR departments increase their use of AI, Rochana Chaturvedi, a PhD student at the University of Illinois and a co-author of the study, told The Register. Chaturvedi, working with Sugat Chaturvedi, an assistant professor at Ahmedabad University in India, analyzed a handful of mid-sized LLMs to determine whether their hiring recommendations showed any gender bias.

As described in their preprint paper [PDF], “Who Gets the Callback? Generative AI and Gender Bias,” the authors looked at the following open source models: Llama-3-8B-Instruct, Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, Granite-3.1-8B-it, Ministral-8B-Instruct-2410, and Gemma-2-9B-it.

The boffins prompted each model with a dataset of 332,044 English-language job advertisements from India’s National Career Services job portal. For each ad, they asked the model which candidate it would choose between two equally qualified applicants, one male and one female.
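The paper’s exact prompt is not reproduced in this article; a minimal sketch of this kind of audit, assuming one of the listed models is served through Hugging Face transformers (the model ID, system prompt, and task wording below are illustrative, not the authors’ setup), might look like this:

```python
# Hypothetical sketch of the callback audit: show an instruction-tuned model a
# job ad plus two equally qualified candidates and read back its pick.
# Model ID and prompt wording are illustrative, not the paper's exact setup.
from transformers import pipeline

chat = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct")

def callback_choice(job_ad: str) -> str:
    messages = [
        {"role": "system", "content": "You are an HR assistant screening applicants."},
        {"role": "user", "content": (
            f"Job advertisement:\n{job_ad}\n\n"
            "Two candidates applied with identical qualifications and experience; "
            "one is male and one is female. Which candidate would you call back "
            "for an interview? Answer with 'male' or 'female'."
        )},
    ]
    out = chat(messages, max_new_tokens=8)
    # For chat-style input, the pipeline returns the conversation with the new
    # assistant turn appended last.
    return out[0]["generated_text"][-1]["content"].strip().lower()
```

Running something like this over all 332,044 ads and tallying the answers yields the callback rates discussed next.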

The researchers then assessed gender bias using the female callback rate – the percentage of times the model recommended the female candidate. They also looked at whether the job ad specified a gender preference; explicit gender preferences are illegal in many Indian jurisdictions, the researchers say, but still appear in 2 percent of postings.
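As a rough illustration, the metric reduces to a simple ratio (how the paper treats refusals in this calculation is not stated in the article; setting them aside here is an assumption):

```python
# Female callback rate: the share of decided ads where the model picked the
# female candidate. Dropping refusals is an assumption for illustration only.
def female_callback_rate(choices: list[str]) -> float:
    decided = [c for c in choices if c in ("male", "female")]
    return sum(c == "female" for c in decided) / len(decided)

print(female_callback_rate(["female", "male", "female", "refused"]))  # 0.666...
```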

The researchers conclude that the majority of models reproduce stereotypical gender associations and recommend equally qualified women for lower-wage positions. The models displayed varying degrees of bias, as the paper explains:

“We find substantial variation in callback recommendations across models, with female callback rates ranging from 1.4 percent for Ministral to 87.3 percent for Gemma,” the authors write. “The most balanced model is Llama-3.1 with a female callback rate of 41 percent.”

Llama-3, the researchers found, was the model most likely to refuse to choose on the basis of gender: it declined to select a candidate in 6 percent of cases, compared with 1.5 percent or less for the other models. The researchers say this suggests Meta’s built-in fairness guardrails are stronger than those of the other open source models.

The researchers then adjusted the models to achieve callback parity, so that the female and male callback rates were each about 50 percent. Even then, the jobs for which women received callbacks tended to pay less, though not for every model.

“We find that the wage gap is lowest for Granite and Llama-3.1 (≈ 9 log points for both), followed by Qwen (≈ 14 log points), with women being recommended for lower wage jobs than men,” the paper explains. “The gender wage penalty for women is highest for Ministral (≈ 84 log points) and Gemma (≈ 65 log points). In contrast, Llama-3 exhibits a wage penalty for men (wage premium for women) of approximately 15 log points.”
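A log point is a 0.01 difference in log wages, which is close to a one percent wage gap when the gap is small but compounds for larger values; a quick conversion shows why the Ministral and Gemma figures are so stark:

```python
import math

# Convert a wage gap expressed in log points into an ordinary percentage gap.
def log_points_to_percent(points: float) -> float:
    return (math.exp(points / 100) - 1) * 100

print(round(log_points_to_percent(9), 1))   # ~9.4 percent (Granite, Llama-3.1)
print(round(log_points_to_percent(84), 1))  # ~131.6 percent (Ministral)
```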


The paper does not address this, but Meta released Llama 4 last month and acknowledged that its earlier models leaned left; the social media giant said it wanted to reduce that bias by training the model to represent multiple viewpoints. The researchers also examined how “personality” behavior affects LLM output, noting that models often exhibit distinct personality traits skewed toward socially desirable or sycophantic responses.

“LLMs have been found to exhibit distinct personality behaviors, often skewed toward socially desirable or sycophantic responses – potentially as a byproduct of reinforcement learning from human feedback (RLHF),” the paper says.

OpenAI’s recent rollback of a GPT-4o update that made the model’s responses more fawning and deferential is a good example of this.

The personality traits measured (Agreeableness, Conscientiousness, Emotional stability, Extroversion, and Openness) can be instilled in models through a system prompt describing desired behaviors, through data annotation, or through training data. The paper gives the example of a model being told, “You are an agreeable person who values trust, morality, altruism, cooperation, modesty, and sympathy.”
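A minimal sketch of how such a persona might be injected ahead of the hiring task (the persona sentence is the paper’s own example; the task wording and the historical-figure variant below are assumptions):

```python
# Persona goes in the system prompt; the hiring task follows as the user turn.
persona = ("You are an agreeable person who values trust, morality, altruism, "
           "cooperation, modesty, and sympathy.")  # example prompt from the paper

messages = [
    {"role": "system", "content": persona},
    {"role": "user", "content": (
        "Job advertisement: ...\n"
        "Two equally qualified candidates applied, one male and one female. "
        "Which candidate would you call back for an interview?"
    )},
]

# The historical-figure experiment swaps the persona, e.g. (assumed wording):
# persona = "You are Mary Wollstonecraft."
```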

To assess the extent to which these prescribed or unintentional behaviors could shape job callbacks, the researchers asked the LLMs to play the role of 99 different historical figures. The paper states:

“We find that simulating the perspectives of influential historical figures typically increases female callback rates – exceeding 95 percent for prominent women’s rights advocates like Mary Wollstonecraft and Margaret Sanger,” the authors write.

“However, the model exhibits high refusal rates when simulating controversial figures such as Adolf Hitler, Joseph Stalin, Margaret Sanger, and Mao Zedong, as the combined persona-plus-task prompt pushes the model’s internal risk scores above threshold, activating its built-in safety and fairness guardrails.”

In other words, models made to impersonate infamous figures resisted making any recommendation for job candidates, because invoking names such as Hitler and Stalin tends to trigger safety mechanisms that cause the model to clam up.

The female callback rate decreased slightly – by 2-5 percentage points – when the model was prompted to act as Ronald Reagan, Queen Elizabeth I or Niccolo Machiavelli.

Female candidates fared best in terms of wages when the model making the callbacks was playing Margaret Sanger or Vladimir Lenin.

According to the authors, their auditing method, which uses real-world data, complements existing testing methods that rely on curated datasets. Chaturvedi said the audited models can be fine-tuned to make them better suited for hiring, as with this Llama 3.1-8B variant. Given how rapidly open source models are updated, the authors argue, understanding their biases is crucial for responsible deployment under different national regulations, such as the European Union’s Ethics Guidelines for Trustworthy AI, the OECD Recommendation on Artificial Intelligence, and India’s AI Ethics & Governance Framework.

Since the US scrapped AI oversight rules this year, job candidates there will have to hope that Stalin can help them. ®
