Genetic risk factors associated with NAFLD

Non-alcoholic fatty liver disease (NAFLD) is estimated to affect 25% of the worldwide population, and is the leading cause of chronic liver disease in developed countries. Genetic research on NAFLD has included heritability studies, candidate gene studies, familial aggregation studies, and genome-wide association studies (GWAS). Next-generation sequencing approaches, such as whole-genome sequencing and whole-exome sequencing, are emerging as post-GWAS tools in genetic research. However, GWAS remains more practical for elucidating the genetic factors related to NAFLD, which is affected by thousands of common genetic variants and does not follow Mendelian inheritance. In the present review, we summarize the current knowledge regarding five GWAS-identified genetic loci that are associated with NAFLD. We also discuss the relationships between NAFLD-predisposing polymorphisms and cardiovascular disease, and potential applications for these identified genetic loci.


INTRODUCTION
Non-alcoholic fatty liver disease (NAFLD) is estimated to affect a quarter of the global population, and is the leading cause of chronic liver disease in developed countries [1]. NAFLD etiologies are complex and the factors driving NAFLD progression are not completely understood, although they likely include environmental factors (e.g., diet), insulin resistance, increased visceral adiposity, and genetics. Genetic research on NAFLD has included heritability studies [1,2], candidate gene studies [3], familial aggregation studies [4,5], and genome-wide association studies (GWAS) [6][7][8]. NAFLD heritability was initially evaluated through candidate gene studies. Such studies are designed to examine the association of a phenotype with single-nucleotide polymorphisms (SNPs) in selected genes; however, they have weak statistical power [9].
Following candidate gene studies, GWAS have become the default methodology for testing the associations between diseases (phenotypes of interest) and millions of SNPs throughout the genome. In recent years, GWAS have dramatically improved our understanding of the genetic factors related to NAFLD susceptibility, progression, and outcomes [10]. GWAS have led to the identification of several variants that are significantly associated with NAFLD. For example, one well-known genetic risk factor for NAFLD is a coding variant in patatin-like phospholipase domain containing protein 3 (PNPLA3), an I-to-M substitution at position 148 (chr22:43928847, rs738409 C>G) [11]. This rs738409 variant has been repeatedly found to be associated with NAFLD or elevated hepatic fat content [7,8,12]. Additionally, NAFLD susceptibility is significantly associated with four other genes: transmembrane 6 superfamily member 2 (TM6SF2), membrane-bound O-acyltransferase domain containing 7 (MBOAT7), glucokinase regulator (GCKR), and hydroxysteroid 17β-dehydrogenase 13 (HSD17B13) [9]. More recently, next-generation sequencing approaches, such as whole-genome sequencing and whole-exome sequencing, have emerged as post-GWAS advances in genetic research. However, unlike monogenic diseases, heritability in complex diseases like NAFLD is affected by thousands of common genetic variants and thus does not follow Mendelian inheritance. GWAS have been used to uncover thousands of genetic variants that influence the risks for complex human traits and diseases, and are thus more appropriate for elucidating the genetic factors related to NAFLD.
In the present review, we describe the five GWAS-identified risk variants that exhibit the most well-established associations with NAFLD [Table 1]. We also identify and discuss genetic associations between NAFLD and cardiovascular diseases, and suggest potential applications of genomic data for precision medicine.

PATATIN-LIKE PHOSPHOLIPASE DOMAIN CONTAINING PROTEIN 3
PNPLA3 p.I148M (chr22:43928847, rs738409 C>G) was the first NAFLD-related variant identified using GWAS [7], and has exhibited a robust and well-replicated association with NAFLD in several studies [13][14][15]. PNPLA3 is highly expressed in the liver and adipose tissues. Its expression is regulated by insulin through a signaling pathway that includes LXR and SREBP-1c [16], and is thus increased with feeding in animal studies [17]. The PNPLA3 protein hydrolyzes triglycerides and retinyl esters [18]. The variant rs738409 C>G causes an isoleucine-to-methionine substitution at amino acid position 148 in PNPLA3, which results in impaired retinyl ester release and reduced hydrolase activity, causing fat accumulation within hepatocytes and hepatic stellate cells [19].
Studies of PNPLA3 have transformed our knowledge of hepatic steatosis, revealing that lipid remodeling in intracellular lipid droplets is a common pathway underlying NAFLD progression, regardless of the environmental triggers. While the wild-type PNPLA3 protein is rapidly degraded, the variant protein has no lipase activity, thereby leading to triglyceride accumulation in the liver [19,20]. This can induce liver damage and inflammation, and can block the release of several extracellular proteins that protect against liver fibrosis, including matrix metalloproteinases and tissue inhibitor of metalloproteinases [21]. The G allele of rs738409 is significantly associated with NAFLD activity score (NAS, P = 0.004), steatosis (P = 0.03), lobular inflammation (P = 0.005), portal inflammation (P = 2.5 × 10⁻⁴), and fibrosis (P = 7.7 × 10⁻⁶) [22]. Moreover, homozygosity of this variant is reportedly linked to a 10-fold increased risk of developing NAFLD-associated HCC in the European population [23]. Overall, these findings indicate that the G allele of rs738409 increases susceptibility to the whole spectrum of NAFLD, from steatosis to NASH (an inflammation-associated form of NAFLD), fibrosis, and HCC.
The PNPLA3 gene could also be responsible for the different prevalence rates of NAFLD between ethnic groups. Different populations showed diverse odds ratios (ORs) for the variant rs738409 C>G, ranging from 2.08 to 18.23 [combined OR: 3.41 (2.57-4.52), P < 0.00001] [7,8,11,12]. According to the Genome Aggregation Database browser (gnomAD, https://gnomad.broadinstitute.org/), the G allele of rs738409 has a frequency of 27.1% in the general population, but occurs at a lower frequency in persons of African ethnicity (26.1%), and at a higher frequency in persons of Latino ethnicity (54.9%), which may have an impact on NAFLD risk in Latino populations [7,12]. Accordingly, compared to other ethnic groups, persons of Latino ethnicity are reportedly more likely to progress to more severe forms of NAFLD [24]. The effect of PNPLA3 on NAFLD has also been described in East Asian cohorts. In two Japanese cohorts, the G risk allele of the rs738409 variant was significantly associated with NAFLD. Therefore, the rs738409 C>G variant in PNPLA3 is strongly related to NAFLD progression in both Latino and East Asian cohorts.
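To make the reported ORs and confidence intervals easier to interpret, the following is a minimal sketch of how an OR and its Wald-type 95% confidence interval can be derived from a 2 × 2 table of carriers and non-carriers among cases and controls. The function name and the counts are hypothetical and are not taken from any of the cited studies; published estimates are typically adjusted for covariates and will not match this unadjusted calculation.

```python
import math

def odds_ratio_ci(case_carriers, case_noncarriers,
                  ctrl_carriers, ctrl_noncarriers, z=1.96):
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    z=1.96 gives an approximate 95% interval on the log-odds scale.
    """
    odds_ratio = (case_carriers * ctrl_noncarriers) / (case_noncarriers * ctrl_carriers)
    # Standard error of log(OR): square root of the summed reciprocal cell counts
    se_log_or = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                          + 1 / ctrl_carriers + 1 / ctrl_noncarriers)
    log_or = math.log(odds_ratio)
    lower = math.exp(log_or - z * se_log_or)
    upper = math.exp(log_or + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only (not data from any cited study)
example_or, example_lower, example_upper = odds_ratio_ci(120, 80, 90, 110)
```

An interval that excludes 1 corresponds to a nominally significant association at the 5% level, which is how the bracketed ranges following each OR in this review can be read.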

TRANSMEMBRANE 6 SUPERFAMILY MEMBER 2
Transmembrane 6 superfamily member 2 (TM6SF2) is a protein that localizes to the endoplasmic reticulum-Golgi apparatus of hepatocytes, and is involved in the hepatocytic secretion of triglyceride-rich lipoproteins via the very-low-density lipoprotein secretion pathway. The TM6SF2 polymorphism rs58542926 (chr19:19268740, C>T) involves a C-to-T substitution at nucleotide 499, encoding a glutamate-to-lysine change at codon 167 (E167K). The variant rs58542926 leads to reduced TM6SF2 expression, and is thus associated with increased hepatic lipid content. In a multi-ancestry study, the rs58542926 polymorphism was related to increased serum liver enzyme [alanine transaminase (ALT)] levels and decreased serum lipid levels (total cholesterol and triglycerides) [27]. Interestingly, rs58542926 has also been linked to a decreased risk of cardiovascular events based on the decreased circulating lipoprotein levels [28,29]. The relationship between TM6SF2 and serum liver enzyme (ALT) has also been identified in other large cohorts (n > 80,000) [27]. Moreover, the variant rs58542926 has been associated with increased liver fibrosis (P = 5.57 × 10⁻⁵), independently of PNPLA3 I148M [30].
The T risk allele of rs58542926 occurs with a lower frequency (0.06969, gnomAD) in East Asia; thus, we explored the association of rs58542926 with NAFLD in East Asian studies. A Korean study reported that the co-existence of the risk alleles of rs738409 and rs58542926 was associated with an increased risk of NASH [31]. Similar findings were reported in a Chinese cohort [32]. These results indicate that the rs58542926 variant in TM6SF2 is associated with NAFLD, even in East Asia, where the allele frequency is low.

MEMBRANE-BOUND O-ACYLTRANSFERASE DOMAIN CONTAINING 7
Membrane-bound O-acyltransferase domain containing 7 (MBOAT7) is a protein involved in phosphatidylinositol remodeling with arachidonic acid in the Lands cycle. MBOAT7 is mainly expressed in the liver, including in hepatic sinusoidal cells, hepatic stellate cells, and hepatocytes. In several studies, the T allele of rs641738 in MBOAT7 (chr19:54173068, rs641738 C>T) has been reported to increase the risk of developing the whole spectrum of NAFLD. Each T allele was associated with an increased risk of the development of hepatic steatosis

GLUCOKINASE REGULATOR
Glucokinase regulator (GCKR) controls de novo lipogenesis by regulating the glucose influx into hepatocytes, which boosts the lipogenic pathway by providing further substrate for liver biosynthesis. Several variants in the GCKR gene are reportedly associated with NAFLD [8,36]. The rs1260326 variant (chr2:27508073, C>T) encoding P446L has been considered as a causal variant for this association. In NAFLD patients, the T allele of rs1260326 is significantly associated with the hepatic fibrosis stage as compared to the F1 stage [OR: 2.06 (1.02-1.14), P = 0.0008] [37]. The rs780094 variant in GCKR has also been significantly associated with computed tomography-proven and biopsy-proven NAFLD in a genome-wide association study (OR: 1.45, P = 2.59 × 10⁻⁸) [8], and in a meta-analysis of five studies comprising 2,091 NAFLD cases and 3,003 controls [OR: 1.25 (1.14-1.36), P < 0.00001] [38].
The T risk allele of rs1260326 is associated with higher GCKR expression [39]. Unlike the wild-type GCKR protein, the GCKR P446L protein is not sustained by fructose-6-phosphate, resulting in enhanced hepatic uptake of glucose, glucokinase activity [40], and de novo lipogenesis [41]. Interestingly, the risk allele of rs1260326 is also associated with decreased serum glucose levels and reduced T2DM risk [42,43].

HYDROXYSTEROID 17β-DEHYDROGENASE 13
The protective effect of HSD17B13 is mediated by reduced activity of the enzyme, which is involved in the conversion of retinol to retinoic acid [47]. Retinoic acid reportedly suppresses fibrosis in NAFLD. This means that the protective effect of HSD17B13 is not due to changes in hepatic fat accumulation, but rather to changes in the enzyme's lipid droplet-associated retinol dehydrogenase activity.

ASSOCIATION BETWEEN NAFLD-PREDISPOSING POLYMORPHISMS AND CARDIOVASCULAR DISEASE
Several studies report that cardiovascular disease (CVD) is the most common cause of mortality among NAFLD patients [48][49][50]. This can be explained by the fact that NAFLD and CVD share common pathological pathways, such as inflammation, endothelial dysfunction, and oxidative stress [51,52]. Therefore, we explored the relationship between CVD and two of the most well-validated NAFLD-associated loci: PNPLA3 and TM6SF2.
A meta-analysis study, including 60,801 coronary heart disease (CHD) cases and 123,504 controls, determined that the rs738409 variant in PNPLA3 showed a protective effect against CVD [OR: 0.92 (0.87-0.97), P = 0.002] [53]. In another study, the G allele of rs738409 was inversely related to CHD in 576 patients who underwent elective coronary angiography (P = 0.02) [54]. However, this trend of association between the rs738409 G risk allele and CHD has been inconsistent. Another study found no association between the G risk allele of rs738409 and CHD [OR: 0.98 (0.95-1.02), P = 0.79] [55]. Interestingly, in a study including 1,103 premature CHD patients and 1,469 healthy controls, the presence of the G allele of rs738409 was associated with increased risk of premature CHD development among T2DM patients [OR: 1.20 (1.011-1.421), P = 0.042] [56]. In addition to CHD, the G risk allele of rs738409 was reportedly linked to a greater risk of increased thickness of the carotid artery intima-media in 162 patients with biopsy-proven NAFLD [OR: 2.94 (1.12-7.70), P = 0.02], and this finding was validated in 267 patients with biopsy-proven or clinical NAFLD [57].
The rs58542926 variant in TM6SF2, which is associated with fatty liver, is also reportedly protective against CVD by lowering serum lipid levels (total cholesterol, LDL-cholesterol, and triglycerides) [58]. This means that the T allele of rs58542926 in TM6SF2 confers protection against CVD at the expense of a higher risk of NAFLD. A meta-analysis confirmed that the rs58542926 T allele is associated with a tendency of decreased CVD risk [OR: 0.951 (0.92-0.98), P = 0.005] [53]. Moreover, in a smaller cross-sectional study, this allele was related to a decreased risk of carotid artery plaques [OR: 0.49 (0.25-0.94)] [29].
Despite these studies, the shared genetic causality between CVD and NAFLD remains unclear. The four genes that are most commonly reported to be associated with NAFLD (PNPLA3, TM6SF2, MBOAT7, and GCKR) have been analyzed by weighted fixed-effects statistical modeling, revealing no association of NAFLD with CVD [OR: 1.00 (0.99-1.91), P = 0.93] [59]. These findings indicate that more complex relationships may exist between NAFLD and CVD. Exploring the roles of NAFLD-associated variants in CVD risk demonstrates that biologically relevant insights can be obtained by identifying individual pleiotropic loci that affect different outcomes. Further studies are needed to clarify whether these loci causally link NAFLD and CVD.
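As a generic illustration of what a weighted fixed-effects combination involves, the sketch below pools several per-variant ORs by inverse-variance weighting on the log-odds scale. It is not the procedure used in the cited analysis [59], whose details are not given here; the function name and the input values are hypothetical, and the standard errors are back-calculated from the 95% confidence limits under the assumption that the intervals are symmetric on the log scale.

```python
import math

def fixed_effect_pool(estimates, z=1.96):
    """Inverse-variance fixed-effect pooling of odds ratios.

    estimates: iterable of (odds_ratio, ci_lower, ci_upper) tuples,
    where the limits are assumed to be 95% Wald limits.
    Returns (pooled_or, pooled_lower, pooled_upper).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for odds_ratio, lower, upper in estimates:
        # Recover the standard error of log(OR) from the interval width
        se = (math.log(upper) - math.log(lower)) / (2 * z)
        weight = 1.0 / se ** 2
        weighted_sum += weight * math.log(odds_ratio)
        total_weight += weight
    pooled_log_or = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (math.exp(pooled_log_or),
            math.exp(pooled_log_or - z * pooled_se),
            math.exp(pooled_log_or + z * pooled_se))

# Hypothetical per-variant estimates for illustration only
hypothetical_estimates = [(1.10, 0.95, 1.27), (0.92, 0.80, 1.06), (1.01, 0.90, 1.13)]
pooled_or, pooled_lower, pooled_upper = fixed_effect_pool(hypothetical_estimates)
```

Because more precise estimates receive larger weights, variants with opposing effects on CVD can largely cancel out in a pooled estimate, which is one way an overall null result can coexist with variant-specific associations.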

APPLICATIONS FOR PRECISION MEDICINE
There are several other genetic variants associated with NAFLD [Supplementary Table 1]. These associations with SNPs from GWAS are generally reported as P values and/or effect sizes. However, these metrics do not fully reflect the SNP's ability to differentiate between the control and the phenotype of interest. To apply genetic information for disease prediction, it is important to focus not only on the statistical significance of a variant's association but also on the area under the receiver operating characteristic (ROC) curve, which summarizes the true and false positive rates for a binary outcome [60].
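The area under the ROC curve can be understood as the probability that a randomly chosen case has a higher predictor value than a randomly chosen control. The following minimal sketch computes it in this pairwise form, assuming a single numeric predictor such as a risk allele count; the function name and the example data are hypothetical and serve only to illustrate the metric.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise case-control comparison.

    scores: numeric predictor values (e.g., risk allele counts).
    labels: 1 for cases, 0 for controls.
    Ties between a case and a control contribute 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("Both cases and controls are required")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Hypothetical risk allele counts (0, 1, or 2) and disease status
example_scores = [2, 1, 1, 0, 1, 0, 0, 2]
example_labels = [1, 1, 0, 0, 1, 0, 0, 0]
example_auroc = auroc(example_scores, example_labels)
```

A value of 0.5 indicates no discrimination and 1.0 indicates perfect separation. Because a single SNP takes only three genotype values, many case-control pairs are tied, which limits the discrimination achievable with one variant even when its association is highly significant.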
Another emerging metric is the polygenic risk score (PRS), which reflects the risk accumulation based on multiple SNPs, and can be calculated as a weighted sum of the disease risk alleles carried by an individual [61]. PRS use would be a sensible approach in the study of NAFLD, as both common and rare SNPs are related to NAFLD risk, irrespective of clinical risk factors [62]. However, there is little evidence of the clinical application of PRSs. In a related approach, Krawczyk et al. [63] reported that the summed number of risk alleles (0-5) for three genes (PNPLA3, TM6SF2, and MBOAT7) was significantly correlated with the individual risks of increased hepatic triglyceride content and elevated serum liver enzyme levels (AST and ALT). This study further suggests that the historical concept of genetic risk score (GRS), which includes the contemporary concept of a continuous spectrum of NAFLD risk, could also be useful for predicting NAFLD development and progression. Several NAFLD risk scoring models that incorporate both genetic and clinical information have also been proposed [64]. For example, Hyysalo et al. [65] developed a model for predicting NASH, which combines clinical variables and genetic information, based on European cohorts with biopsy-proven NAFLD. In another study, NAFLD-HCC was identified based on genotype information, age, sex, obesity, T2DM, and severe fibrosis, showing an AUROC of 0.96 ± 0.04 (89% specificity and 96% sensitivity) [33].
Additional research is needed before PRSs, GRSs, or prediction models using both clinical variables and genetic information can be effectively applied for NAFLD investigation in clinical practice. This knowledge of genetic loci is potentially useful for risk stratification in patients with NAFLD. Moreover, there are presently no approved drugs for NAFLD treatment [66]. There is thus an urgent need for research and development of therapies that target the products of these genes in NAFLD patients with specific genetic variants, which could provide insight into personalized treatments for NAFLD [Figure 1].
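The two scoring approaches described above can be sketched as follows: an unweighted count of risk alleles, and a PRS calculated as a weighted sum of risk allele counts. The per-allele weights below are hypothetical placeholders expressed as log odds ratios, not effect sizes reported by any cited study, and the function names are introduced here for illustration only.

```python
import math

# Hypothetical per-allele weights (log odds ratios); placeholders for
# illustration only, not effect sizes from any study cited in this review.
EXAMPLE_WEIGHTS = {
    "rs738409": math.log(2.0),    # PNPLA3
    "rs58542926": math.log(1.5),  # TM6SF2
    "rs641738": math.log(1.2),    # MBOAT7
}

def risk_allele_count(genotypes, variants):
    """Unweighted score: total number of risk alleles across the variants."""
    return sum(genotypes[rsid] for rsid in variants)

def polygenic_risk_score(genotypes, weights):
    """Weighted score: sum over variants of weight x risk allele count."""
    for rsid in weights:
        if genotypes[rsid] not in (0, 1, 2):
            raise ValueError(f"Risk allele count for {rsid} must be 0, 1, or 2")
    return sum(beta * genotypes[rsid] for rsid, beta in weights.items())

# Hypothetical individual: risk allele counts per variant
example_genotypes = {"rs738409": 2, "rs58542926": 1, "rs641738": 0}
unweighted_score = risk_allele_count(example_genotypes, EXAMPLE_WEIGHTS)
weighted_score = polygenic_risk_score(example_genotypes, EXAMPLE_WEIGHTS)
```

The unweighted count treats every risk allele as equally important, whereas the weighted sum lets variants with larger effects contribute more; which form predicts better in a given cohort is an empirical question.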

TRANSLATIONAL IMPLICATIONS AND CHALLENGES
As discussed above, several attempts have been made to predict NAFLD and/or NASH using genetic information alone or in combination with clinical information. The results of these efforts can be applied to the development of a new scoring model with better diagnostic performance compared to the previous models. Notably, a prediction model developed by combining serum metabolites, serum biochemical parameters, and genotype information was reported to discriminate NASH from NAFL with a good diagnostic performance [67]. Such a model would be appealing for clinical translation, considering that the current gold standard for NASH diagnosis is liver biopsy, which is an invasive method. However, there are several challenges hindering the clinical translation of genetic information in NAFLD.
Firstly, most studies performed to evaluate the diagnostic accuracy of predictive models for NAFLD and/or NASH risk have been based on a cross-sectional design. Although these results can be useful, a cross-sectional design is not optimal for investigating models based on genetic information. Indeed, unlike classical factors such as biochemical results (AST, ALT), genetic variants have the strength of being stable over time. Thus, if an ideal prediction model based on genetic information is properly established, it could be possible to stratify NAFLD risk before the disease develops or progresses, thus enabling intervention at an early stage or early age. However, to properly apply this concept, results should be accumulated from multiple longitudinal studies.
Secondly, another important issue to consider when utilizing genetic information is the interaction between environment and gene. For example, with regard to rs738409 in the PNPLA3 gene, it has been reported that the variant's effect is especially amplified in the setting of obesity [68]. This suggests that adiposity (environment) can influence how specific genetic information influences the full spectrum of NAFLD. Considering this interaction between gene and environment, a prediction model based exclusively on genetic information may not exhibit sufficient predictive power, while a model with integration of relevant NAFLD-associated clinical factors would be more likely to reach significant predictive power.
Overall, prediction models that use both genetic information and relevant clinical factors derived from longitudinal studies can achieve sufficient predictive power for NAFLD risk stratification at the individual level.

CONCLUSION
The identification of genes associated with NAFLD development and progression is expected to provide important insights into its pathophysiology, as well as to guide disease risk stratification and open new opportunities for timely therapeutic intervention. Several genetic variants have been implicated in NAFLD development and progression, and here we focused on the five genes whose associations with NAFLD have been most extensively replicated. These genetic risk variants may improve the accuracy of NAFLD diagnosis, and may also be useful for the identification of high-risk NAFLD patients who have unfavorable prognoses. An understanding of these NAFLD-associated genetic risk factors will help identify individuals at risk, and potentially guide the provision of appropriate treatments based on an individual's risk and likelihood of disease progression.