In the United States, there is a long history of fairness issues in lending.
For example, redlining:
‘In 1935, the Federal Home Loan Bank Board asked the Home Owners’ Loan Corporation to look at 239 cities and create “residential security maps” to indicate the level of security for real-estate investments in each surveyed city. On the maps, “Type D” neighborhoods were outlined in red and were considered the most risky for mortgage support..
‘In the 1960s, sociologist John McKnight coined the term “redlining” to describe the discriminatory practice of fencing off areas where banks would avoid investments based on community demographics. During the heyday of redlining, the areas most frequently discriminated against were black inner city neighborhoods…’
Redlining is clearly unfair, since the decision to invest was not based on an individual homeowner’s ability to repay the loan, but rather on location; and that basis systematically denied loans to one racial group, black people. In fact, part 1 of a Pulitzer Prize-winning series in the Atlanta Journal-Constitution in 1988 suggests that location was more important than income: “Among stable neighborhoods of the same income [in metro Atlanta], white neighborhoods always received the most bank loans per 1,000 single-family homes. Integrated neighborhoods always received fewer. Black neighborhoods — including the mayor’s neighborhood — always received the fewest.”
More recently, in 2018, WUNC reported that blacks and latinos in some cities in North Carolina were denied mortgages at higher rates than whites:
“Lenders and their trade organizations do not dispute the fact that they turn away people of color at rates far greater than whites. But they maintain that the disparity can be explained by two factors the industry has fought to keep hidden: the prospective borrowers’ credit history and overall debt-to-income ratio. They singled out the three-digit credit score — which banks use to determine whether a borrower is likely to repay a loan — as especially important in lending decisions.”
The WUNC example raises an interesting point: it is possible to look unfair via one measure (loan rates by demographic), but not by another (ability to pay as judged by credit history and debt-to-income ratio). Measuring fairness is complicated. In this case, we can’t tell if the lending practices are fair because the data on credit history and debt-to-income ratio for these particular groups are not available to us to evaluate lenders’ explanations of the disparity.
In 2007, the federal reserve board (FRB) reported on credit scoring and its effects on the availability and affordability of credit. They concluded that the credit characteristics included in credit history scoring models are not a proxy for race, although different demographic groups have substantially different credit scores on average, and “for given credit scores, credit outcomes — including measures of loan performance, availability, and affordability — differ for different demographic groups.” This FRB study supports the lenders’ claims that credit score might explain disparity in mortgage denial rates (since demographic groups have different credit scores), while also pointing out that credit outcomes are different for different groups.
Is this fair or not?
Some researchers say that fairness is not a statistical concept, and no statistic will fully capture it. There are many statistical definitions that people try to relate to (if not define) fairness.
First, here are two legal concepts that come up in many discussions on fairness:
[Lipt2017] points out that these are legal concepts of disparity, and creates corresponding terms for technical concepts of parity applied to machine learning classifiers:
There is a large body of literature on algorithmic fairness. From [Corb2018], two more definitions:
There is lack of consensus in the research community on an ideal statistical definition of fairness. In fact, there are impossibility results on achieving multiple fairness notions simultaneously ([Klei2016] [Chou2017]). As we noted previously, some researchers say that fairness is not a statistical concept.
Each statistical definition described above has counterarguments.
Treatment parity unfairly ignores real differences. [Corb2018] describes the case of the COMPAS score used to predict recidivism (whether someone will commit a crime if released from jail). After controlling for COMPAS score and other factors, women are less likely to recidivate. Thus, ignoring sex in this prediction might unfairly punish women. Note that the Equal Credit Opportunity Act legally mandates treatment parity: “Creditors may ask you for [protected class information like race] in certain situations, but they may not use it when deciding whether to give you credit or when setting the terms of your credit.” Thus, [Corb2018] implies that this sort of unfairness is enshrined in law.
Impact parity doesn’t ensure fairness (people argue against quotas), and can cripple a model’s accuracy, harming the model’s utility to society. [Hard2016] discusses this issue (using the term “demographic parity”) in its introduction.
Corbett et al. [Corb2018] argue at length that classification parity is naturally violated: “when base rates of violent recidivism differ across groups, the true risk distributions will necessarily differ as well — and this difference persists regardless of which features are used in the prediction.”
They also argue that calibration is not sufficient to prevent unfairness. Their hypothetical example is a bank that gives loans based solely on the default rate within a zip code, ignoring other attributes like income. Suppose that (1) within zip code, white and black applicants have similar default rates; and (2) black applicants live in zip codes with relatively high default rates. Then the bank’s plan would unfairly punish creditworthy black applicants, but still be calibrated.
In summary, likely fairness in lending has no single measure. We took a whirlwind tour of four statistical definitions, two motivated by history and two more recently motivated by machine learning, and summarized the counterarguments to each.
This also means it is challenging to automatically decide if an algorithm is fair. Open-source fairness-measuring packages reflect this by offering many different measures.
However, this doesn’t mean we should ignore statistical measures. They can give us an idea of whether we should look more carefully. Food for thought. We should feed our brain well, it being the most likely to make the final call.
(A note: this subject is rightfully contentious. Our intention is to add to the conversation in a productive, respectful way. We welcome feedback of any kind.)