Abstract:
The adoption of Artificial Intelligence (AI) and Machine Learning (ML) based decision
systems in daily human life has increased significantly. Recent studies have exposed
bias in ML outcomes against individuals and groups characterized by protected attributes
such as race and gender. These decisions have a direct and long-lasting impact on the
people involved.
involved. Fairness has gained considerable attention from the research community when
data labels are available for prediction modelling, i.e., supervised learning. However, in
real-life scenarios, data may lack labels and providing manual labels will require proper
incentivization or expertise. Consequently, researchers have started exploring fairness
issues in unsupervised learning, which forms the focus of this thesis. In particular, the
primary focus of this thesis is to address both theoretical underpinnings and practical
implications of fair algorithms for unsupervised learning in the context of clustering and
recommender systems. The contributions of the thesis include:
1. Group Fair Notions and Algorithms in Offline Clustering: The thesis
first theoretically establishes relationships between existing discrete group
fairness notions and then proposes a generalized notion of group fairness for
multi-valued protected attributes. It proposes two simple and efficient
round-robin-based algorithms that satisfy group fairness guarantees (a minimal
sketch of the round-robin idea follows this item). It then proves that the proposed
algorithms achieve a two-approximation to the optimal clustering cost and shows that
the bounds are tight. The efficacy of the proposed algorithms is also demonstrated
through extensive simulations.
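As a minimal illustration of the round-robin idea (a hedged sketch with hypothetical names; the thesis's actual algorithms and guarantees may differ in detail), clusters can take turns claiming the nearest unassigned point within each protected group, so that every cluster receives a near-equal share of each group:

```python
import numpy as np

def round_robin_fair_assign(points, groups, centers):
    """Sketch: within each protected group, clusters take turns
    (round-robin) claiming their nearest unassigned point, so the
    per-cluster counts of any group differ by at most one."""
    points, groups = np.asarray(points), np.asarray(groups)
    k = len(centers)
    assignment = np.full(len(points), -1)
    for g in np.unique(groups):
        remaining = np.where(groups == g)[0].tolist()
        turn = 0
        while remaining:
            c = turn % k  # cluster whose turn it is to pick
            dists = [np.linalg.norm(points[i] - centers[c]) for i in remaining]
            best = remaining.pop(int(np.argmin(dists)))
            assignment[best] = c
            turn += 1
    return assignment
```

Because clusters pick in strict rotation within every group, no cluster can over-represent any group; this balance is the kind of guarantee the proposed fairness notions formalize.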
2. Nash Social Welfare for Facility Location: To investigate the problem of
satisfying multiple fairness notions simultaneously, the thesis extends the fair clustering
problem to the facility location problem. It proposes a first-of-its-kind
application of Nash Social Welfare modelling for facility location that targets multiple
fairness notions while minimizing the distances between individuals and facilities
(the objective is formalized below). The proposed polynomial-time algorithm works in
any h-dimensional metric space and allows facilities to be opened at a specified set
of locations rather than solely at the individuals' own locations, as in most prior
work. The algorithm provides a solution that satisfies group fairness constraints and
achieves a good approximation for individual fairness. The proposed method is tested
on the real-world United States (US) census dataset, with road maps providing the
actual driving distances between individuals and facilities.
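For context, the Nash Social Welfare objective over a set F of opened facilities is the geometric mean (equivalently, the product) of individual utilities. The specific distance-to-utility transform u_i shown here is an illustrative assumption, not necessarily the one used in the thesis:

\[
\mathrm{NSW}(F) \;=\; \Bigl(\prod_{i=1}^{n} u_i(F)\Bigr)^{1/n},
\qquad
u_i(F) \;=\; \frac{1}{1 + \min_{f \in F} d(i, f)}.
\]

Because a product collapses to zero if any single utility does, maximizing NSW penalizes leaving any individual very far from every facility, which is how the objective blends group-level and individual-level fairness.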
3. Group Fairness in Online Clustering: To tackle the challenge of meeting group
fairness requirements in the online setting, the thesis proposes a randomized algorithm
that prevents the over-representation of any protected group by placing capacity
constraints on the number of data points from each group that can be assigned to a
particular cluster. The proposed method achieves a constant-factor approximation to the
optimal offline clustering cost and handles the challenge of an a priori unknown total
number of data points using a doubling trick (sketched after this item). Empirical
results demonstrate the
proposed algorithms’ efficacy against baseline methods on synthetic and real-world
datasets.
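The following is a hedged, deterministic simplification of such a capacity-constrained online scheme (the thesis's algorithm is randomized, and its parameterization may differ; alpha and n_hat are illustrative names):

```python
import numpy as np

def online_fair_assign(stream, centers, num_groups, alpha=0.6):
    """Sketch: assign each arriving (point, group) pair to the nearest
    cluster that still has room for that group; the per-(cluster, group)
    capacity is derived from an estimate n_hat of the stream length,
    which is doubled whenever it is exceeded (the doubling trick)."""
    k = len(centers)
    n_hat = 8                                # initial stream-length guess
    cap = max(1, int(alpha * n_hat / k))     # per-(cluster, group) capacity
    counts = np.zeros((k, num_groups), dtype=int)
    assignment = []
    for t, (x, g) in enumerate(stream, start=1):
        if t > n_hat:                        # doubling trick: revise estimate
            n_hat *= 2
            cap = max(1, int(alpha * n_hat / k))
        order = np.argsort([np.linalg.norm(x - c) for c in centers])
        # nearest cluster with remaining capacity; fall back to the nearest
        c = next((j for j in order if counts[j, g] < cap), order[0])
        counts[c, g] += 1
        assignment.append(int(c))
    return assignment
```

The cap prevents any single group from dominating a cluster, while doubling n_hat only logarithmically often keeps the scheme cheap even though the total number of arrivals is unknown in advance.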
4. Fairness in Federated Data Clustering: To address fairness in distributed
settings, the thesis studies federated data clustering, which enables privacy-preserving
clustering without pooling clients' data. The proposed method yields cluster
centers with lower cost deviation across clients (one way to formalize this deviation
is given below), leading to a fairer and more personalized solution. The method is
validated on various synthetic and real-world datasets, with results demonstrating
effective performance against state-of-the-art methods.
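One natural way to formalize "cost deviation across clients" (an assumption for illustration; the exact measure used in the thesis may differ) is the spread of per-client clustering costs under the shared centers C, where X_j denotes client j's local data:

\[
\operatorname{cost}_j(C) \;=\; \sum_{x \in X_j} \min_{c \in C} \lVert x - c \rVert^2,
\qquad
\Delta(C) \;=\; \max_{j} \operatorname{cost}_j(C) \;-\; \min_{j} \operatorname{cost}_j(C).
\]

A small \Delta(C) means no client pays a disproportionate clustering cost for participating in the federation.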
5. Popularity Bias in Recommender Systems: While the first four contributions
focus on clustering, this contribution analyzes the fairness aspects of recommender
systems. The thesis proposes a novel metric that measures popularity bias as the
difference in Mean Squared Error (MSE) between popular and non-popular items, and a
novel technique that solves the optimization problem of reducing the overall loss
with a penalty on popularity bias (formalized below). The method requires no heavy
pre-training, and extensive experiments on real-world datasets show that it
outperforms competing methods on recommendation accuracy, quality, and fairness.
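In symbols, with \mathcal{I}_{\mathrm{pop}} and \mathcal{I}_{\mathrm{unpop}} denoting the popular and non-popular item sets, the metric and the penalized training objective take the following form (the penalty weight \lambda and the absolute-value penalty are illustrative choices, not necessarily the thesis's exact formulation):

\[
\mathrm{PB} \;=\; \mathrm{MSE}(\mathcal{I}_{\mathrm{pop}}) \;-\; \mathrm{MSE}(\mathcal{I}_{\mathrm{unpop}}),
\qquad
\mathcal{L} \;=\; \mathrm{MSE}(\mathcal{I}_{\mathrm{pop}} \cup \mathcal{I}_{\mathrm{unpop}}) \;+\; \lambda\,\lvert \mathrm{PB} \rvert.
\]

Driving |PB| toward zero equalizes predictive error across popular and non-popular items, so long-tail items are not systematically mis-ranked.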