Risk prediction modelling for hepatocellular carcinoma (HCC) has been a focus of research in the last decade. Prediction models facilitate HCC risk stratification, so that patients at high risk of HCC can receive more appropriate management and HCC surveillance. In the early days, these models were mostly developed in treatment-naïve chronic hepatitis B patients. In recent years, more prediction models have been derived and validated in patients who have received antiviral treatment, who account for the majority of patients at increased risk of HCC. Various statistical methods are adopted in developing and validating a risk prediction model, most commonly Cox proportional hazards regression, the time-dependent receiver operating characteristic (ROC) curve and the area under the ROC curve. Even in well-validated models, there may be some pitfalls,
Development and validation of hepatocellular carcinoma (HCC) risk prediction models remain a hot area of liver research. Their importance is not just academic but also practical. The pressing need for accurate and applicable HCC risk prediction models is intensified by the World Health Organization's goal of eliminating hepatitis B virus (HBV) infection by 2030. This initiative calls for action to reduce chronic viral hepatitis incidence and mortality by 80% and 65%, respectively^{[1]}. As the majority of mortality from chronic hepatitis B (CHB) is secondary to HCC^{[2,3]}, accurate HCC risk prediction is a key component of secondary prevention of HCC^{[4]}.
HCC carries a high mortality rate despite advances in treatment^{[5]} and represents the third most frequent cause of cancer death globally (782,000 deaths in 2018)^{[2]}. Chronic HBV infection is a key risk factor for HCC, accounting for approximately 50% of cases worldwide and as high as 70%-80% of cases in regions where HBV is highly endemic^{[6]}. HCC surveillance facilitates early diagnosis and makes curative treatment possible^{[7]}. However, regular surveillance with transabdominal ultrasound scanning, with or without tumour markers, every 6 months in all CHB patients would be a significant burden on healthcare resources^{[8]}. This is especially true in the Asia-Pacific region, as the majority of the HCC disease burden (85%) is located in low- and middle-income countries with a high prevalence of HBV^{[9]}. Accurate HCC models enable risk stratification of the huge number of CHB patients, so that healthcare resources can be targeted to patients who are at risk.
There are more than a dozen well-validated HCC prediction models; some were developed mainly in untreated CHB patients, whereas others were intended for nucleos(t)ide analogue (NA)-treated patients^{[4,10]}. In this review article, we present a focused discussion of the key statistical strategies adopted in the development and validation of HCC prediction models.
Although semiparametric Cox proportional hazards (PH) regression is widely used for developing a prediction model of a time-to-event outcome, the sample size requirements and follow-up durations of the derivation and validation datasets must be carefully considered. Of note, in Cox models the effective sample size is defined by the number of events. A rule of thumb is to have at least 10 events per variable (i.e., per candidate variable or, more accurately, per parameter to be estimated) for deriving a model, and a minimum of 100 outcome events for validation cohorts^{[11]}. Candidate prognostic factors should be chosen a priori on the basis of clinical knowledge, literature review, data quality and availability, and cost constraints. Often, a univariate analysis, using either the log-rank test or Cox regression, is applied to all predictors, and those potential variables with a P value below a prespecified threshold are then entered into the multivariable model.
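The rule of thumb above can be expressed as a simple pre-modelling check. The following is a minimal sketch (a hypothetical helper, not code from the original article): it verifies that the derivation cohort provides at least 10 events per candidate parameter and that the validation cohort contains at least 100 outcome events.

```python
# Minimal sketch (hypothetical helper): checking rule-of-thumb sample-size
# requirements before fitting a Cox proportional hazards model.

def check_sample_size(n_events_derivation, n_candidate_params,
                      n_events_validation, epv_threshold=10,
                      min_validation_events=100):
    """Return (derivation_ok, validation_ok):
    derivation needs >= 10 events per candidate parameter,
    validation needs >= 100 outcome events."""
    events_per_variable = n_events_derivation / n_candidate_params
    return (events_per_variable >= epv_threshold,
            n_events_validation >= min_validation_events)

# Example: 120 HCC events with 8 candidate parameters gives EPV = 15,
# and 150 validation events exceed the minimum of 100.
print(check_sample_size(120, 8, 150))  # (True, True)
```

A cohort with only 50 events for 8 parameters (EPV ~6) would fail the derivation check, suggesting fewer candidate predictors or a larger cohort is needed.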
To assess model fit, martingale residuals can be examined to check the assumption of a linear effect of continuous covariates on the log hazard rate. If the linearity assumption is violated, non-linear relationships can be investigated using fractional polynomials or restricted cubic splines. Schoenfeld residuals, in contrast, are used to test the proportional hazards assumption, either graphically or analytically. A risk score (a linear combination of the model predictors weighted by their regression coefficients) is calculated for each subject, followed by determining an optimal cut-off value to stratify individuals into risk categories based on a predefined decision rule. The sensitivity and specificity at the optimal cut-off are subsequently estimated, and Kaplan-Meier curves together with the log-rank test can be used to evaluate the different risk profiles.
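To illustrate the last step, the Kaplan-Meier estimator can be computed for each risk stratum. The sketch below is a bare pure-Python implementation on toy data (not the authors' code): at each observed event time, survival is multiplied by one minus the ratio of events to subjects still at risk.

```python
# Illustrative pure-Python Kaplan-Meier estimator (a sketch on toy data) for
# plotting survival within each risk group after score stratification.

def kaplan_meier(times, events):
    """times: follow-up times; events: 1 = HCC occurred, 0 = censored.
    Returns a list of (time, survival probability) at each event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(e for tt, e in data if tt == t)       # events at time t
        c = sum(1 for tt, e in data if tt == t) - d   # censored at time t
        if d > 0:
            surv *= 1 - d / n_at_risk                 # KM product-limit step
            curve.append((t, surv))
        n_at_risk -= d + c
        while i < len(data) and data[i][0] == t:      # skip all rows at time t
            i += 1
    return curve

# Toy data: events at t = 2, 3, 5; censoring at t = 3 and 8.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
print(curve)  # [(2, 0.8), (3, 0.6), (5, 0.3)]
```

Curves computed per risk stratum in this way can then be compared with the log-rank test.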
In addition, the time-dependent receiver operating characteristic curve ROC(t) and area under the ROC curve AUC(t) for survival data can be computed at specific times of interest to assess the predictive power of the model^{[14]}. Other metrics of model discrimination can also be computed, including, among others, Harrell's concordance index (C-index) and Uno's concordance statistic. It may be preferable to report Uno's concordance statistic, as the C-index is affected by censoring^{[15]}. For calibration, which is often neglected, a measure proposed by Grønnesby and Borgan can readily be carried out by comparing the observed and predicted numbers of events after dividing the predicted risk scores into G groups {where G = integer part of max(2, min(10, number of failures/40))} to assess the overall goodness-of-fit of the Cox model^{[16,17]}. A calibration slope should also be presented routinely for both internal and external validation; a value close to 1 indicates good calibration. Conducting internal validation is crucial, preferably by bootstrap resampling^{[18]}. This technique can not only evaluate the stability of the selected predictors in a multivariable model, but also correct the prognostic index obtained from the original sample for optimism. For external validation, the 'final' model derived from the derivation cohort is applied to a new population to judge generalizability and transportability (some executable STATA codes can be found in the
CU-HCC and liver stiffness measurement (LSM)-HCC scores
Statistical strategies for HCC risk scores

Scores  Formulae  Statistical strategies

Untreated patients

CU-HCC  Age > 50 (+3) + serum albumin ≤ 35 g/L (+20) + serum total bilirubin > 18 umol/L (+1.5) + HBV DNA 4-6 log_{10} IU/mL (+1) OR > 6 log_{10} IU/mL (+4) + cirrhosis (+15)  Cox proportional hazards model

LSM-HCC  Age > 50 (+10) + serum albumin ≤ 35 g/L (+1) + HBV DNA ≥ 4 log_{10} IU/mL (+5) + liver stiffness measurement 8-12 kPa (+8) OR > 12 kPa (+12)  Cox proportional hazards model

Treated patients

PAGE-B  Age ≥ 30 (+2 to +10) + Male (+6) + Platelet < 200 (+6 to +9)  Cox proportional hazards model

mPAGE-B  Age ≥ 30 (+3 to +11) + Male (+2) + Platelet < 250 (+2 to +5) + Albumin < 40 g/L (+1 to +3)  Cox proportional hazards model

HCC: hepatocellular carcinoma; LSM: liver stiffness measurement; HBV: hepatitis B virus
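To make the point assignment concrete, the CU-HCC formula in the table above can be coded directly. The sketch below uses the table's weights as given; the patient values are hypothetical.

```python
# Computing the CU-HCC score from the formula in the table (a sketch;
# patient values below are hypothetical). Higher score = higher HCC risk.

def cu_hcc_score(age, albumin_g_per_L, bilirubin_umol_per_L,
                 hbv_dna_log10, cirrhosis):
    score = 0.0
    if age > 50:
        score += 3
    if albumin_g_per_L <= 35:
        score += 20
    if bilirubin_umol_per_L > 18:
        score += 1.5
    if 4 <= hbv_dna_log10 <= 6:       # HBV DNA 4-6 log10 IU/mL
        score += 1
    elif hbv_dna_log10 > 6:           # HBV DNA > 6 log10 IU/mL
        score += 4
    if cirrhosis:
        score += 15
    return score

# Hypothetical 58-year-old cirrhotic patient, albumin 33 g/L,
# bilirubin 22 umol/L, HBV DNA 5 log10 IU/mL:
print(cu_hcc_score(58, 33, 22, 5, True))  # 40.5
```

The same pattern applies to the other scores in the table: each risk factor contributes an integer (or half-integer) weight, and the total is compared against pre-specified cut-offs.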
The development of both the CU-HCC and LSM-HCC scores started with identifying significant risk factors for HCC. One approach is to include all categorized risk factors such as age, gender, and albumin (
There are several summary measures for determining the optimal cut-off values of a risk score, including cost analysis, likelihood ratios, and receiver operating characteristic (ROC) analysis. The choice of cut-off method depends greatly on the medical condition. The LSM-HCC score was categorized into low-risk and high-risk groups with the cut-off value giving the highest sum of sensitivity and specificity, which is equivalent to maximizing Youden's index (
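The "highest sum of sensitivity and specificity" criterion can be sketched as a small exhaustive search over candidate cut-offs (toy data; this is an illustration, not the published derivation):

```python
# Choosing a cut-off by maximizing Youden's index J = sensitivity +
# specificity - 1, as for the LSM-HCC score. Illustrative toy data only.

def best_youden_cutoff(scores, outcomes):
    """scores: risk scores; outcomes: 1 = developed HCC, 0 = did not.
    Returns (cutoff, youden_index); score >= cutoff means high risk."""
    best = (None, -1.0)
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 1)
        fn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 1)
        tn = sum(1 for s, y in zip(scores, outcomes) if s < cut and y == 0)
        fp = sum(1 for s, y in zip(scores, outcomes) if s >= cut and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        if j > best[1]:
            best = (cut, j)
    return best

cutoff, j = best_youden_cutoff([5, 8, 12, 20, 25], [0, 0, 1, 1, 1])
print(cutoff, j)  # 12 1.0
```

In this toy example, a cut-off of 12 separates cases and non-cases perfectly, so J reaches its maximum of 1. Note this simple search ignores censoring; in survival settings the time-dependent ROC analogue is used.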
Procedure for HCC risk score development

Step  Description

1  Categorizing all continuous risk factors into clinically meaningful categorical variables
2  Implementing the Fine-Gray subdistribution hazard model to model the cumulative incidence of the event of interest, as the Cox proportional hazards model overestimates the cumulative incidence in the presence of competing risks
3  Assigning zero weights to the reference levels of the categorical variables
4  Defining weights as the estimated regression coefficients multiplied by 10 and rounded to the nearest integer
5  Deriving the optimal cut-off values of an HCC risk score by maximizing Youden's index

HCC: hepatocellular carcinoma
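Steps 3-4 of the table can be written as a one-line transformation. The coefficients below are made up for illustration (they are not the published model's estimates):

```python
# Steps 3-4 as code: weights equal the regression coefficients (from the
# Fine-Gray or Cox model) multiplied by 10 and rounded to the nearest
# integer; reference levels carry an implicit coefficient of 0, so weight 0.
# Coefficients here are hypothetical, purely for illustration.

def coefficients_to_weights(coefs):
    """coefs: {category: estimated regression coefficient}."""
    return {cat: round(10 * b) for cat, b in coefs.items()}

weights = coefficients_to_weights({
    "age>50": 0.32,        # hypothetical log-hazard ratios
    "albumin<=35": 2.04,
    "cirrhosis": 1.46,
})
print(weights)  # {'age>50': 3, 'albumin<=35': 20, 'cirrhosis': 15}
```

Multiplying by 10 before rounding preserves one decimal place of the coefficients while keeping the final score a sum of small integers.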
Current first-line oral HBV antiviral treatment suppresses HBV DNA replication effectively and prevents disease progression in CHB patients, yet it does not completely eliminate the risk of HCC development^{[23,24]}. Motivated by the modest performance of risk scores derived in untreated patients when applied to treated patients, especially in Caucasian populations^{[23]}, the PAGE-B score
The PAGE-B score is calculated by summing the integer points that correspond to particular categories of the included risk factors. Based on a multivariable Cox proportional hazards model, the authors demonstrated that advanced age, male gender, and a low platelet count are the three key risk factors for predicting HCC development within the next five years^{[26]}. Instead of relying on a complex Cox model-based equation, they adopted the method described by Sullivan
First, a reference value (e.g., the midpoint of the category for a continuous covariate) is assigned to each category of every risk factor. Next, a base category is selected as the reference category for each risk factor; usually the category with the lowest risk is chosen, and it receives 0 points in the points system. The next step is to determine how far each category is from the base category in terms of the regression coefficients estimated by the original multivariable Cox regression. For each category of a continuous covariate, the distance is calculated as the product of the regression coefficient (i.e., the natural logarithm of the adjusted hazard ratio) and the difference between the reference value of that category and the reference value of the base category. For a categorical covariate, the distance of each category from the base category is simply the estimated regression coefficient of that category. After that, a constant representing the number of regression units that corresponds to one point in the points system is chosen. The points for each category of each risk factor then equal its calculated distance divided by this constant, rounded to the nearest integer. Finally, the HCC risk score is calculated as the sum of the integer points of the categories into which a patient falls.
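The procedure just described can be sketched for a single continuous covariate. All numbers below (the coefficient, reference values, and the constant B) are hypothetical, chosen only to show the mechanics; they are not the published PAGE-B estimates.

```python
# Sullivan-style points system, sketched with hypothetical numbers.
# For a continuous covariate, each category's distance from the base
# category is beta * (reference value - base reference value); points
# are the distance divided by a chosen constant B, rounded to an integer.

def sullivan_points(beta, ref_values, base_category, B):
    """beta: Cox regression coefficient per unit of the covariate;
    ref_values: {category: representative value (e.g., midpoint)};
    B: number of regression units corresponding to one point."""
    base = ref_values[base_category]
    return {cat: round(beta * (v - base) / B)
            for cat, v in ref_values.items()}

# Hypothetical: age with beta = 0.06 per year, base category 16-29 years
# (midpoint 23), and one point = 0.3 regression units (B = 0.3).
points = sullivan_points(0.06,
                         {"16-29": 23, "30-39": 35, "40-49": 45},
                         "16-29", 0.3)
print(points)  # {'16-29': 0, '30-39': 2, '40-49': 4}
```

Choosing B controls the granularity of the score: a smaller B spreads the same regression distances over more points, at the cost of a less memorable scale.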
Existing HCC risk scores were mostly developed using traditional regression methods, specifically Cox proportional hazards regression. A points system is usually adopted, giving integer points to the categories of each risk factor. In the past, it was reasonable to reduce a complex regression equation to a discrete scoring system so that clinicians could use the score with ease. Yet, as a trade-off, continuous covariates have to be divided into categories. Statistically speaking, part of the information carried by the covariates is lost through categorization. The overall performance of the risk score also depends on the choice of cut-offs. Sometimes the values of the covariates themselves, for example platelet counts, are more objective than the cut-offs, especially if the cut-offs are estimated from the derivation data. Data-driven cut-offs may not be generalizable to other patient populations if there are unmeasured differences between populations. With the advancement of technology, even complex equations can nowadays be calculated easily by a computer next to clinicians as they see their patients; all they need to do is input the value of each covariate, if the computer does not retrieve the values automatically. It is expected that in the future, instead of points systems, complex equations achieving even higher accuracy, derived by big-data approaches such as machine learning or deep learning algorithms, will play a more important role in the prediction of HCC.
After calculating an HCC risk score, researchers have to explain to clinicians and patients what the value means. Traditionally, cut-offs for an HCC risk score are determined based on diagnostic accuracy to classify patients into low, intermediate, and high risk of HCC development, and the cumulative incidence of HCC in each risk stratum is then estimated by survival analysis. A drawback of this way of determining cut-offs is that the criteria used do not match the clinical purpose, hence the limited use of HCC risk scores in clinical practice. Indeed, most of the low cut-offs of existing HCC risk scores achieve a high negative predictive value (NPV) to exclude a meaningful proportion of patients with low HCC risk^{[32]}. HCC risk scores have the potential to guide HCC surveillance in the clinical setting, especially among non-cirrhotic patients, by identifying patients who have a low HCC risk in the near future^{[10]}. HCC risk scores can be more useful if a low cut-off is selected based on a low annual incidence of HCC in the low-risk group, for instance, below the threshold suggested by the American Association for the Study of Liver Diseases for cost-effective HCC surveillance in CHB patients, i.e., 0.2% per year^{[33]}.
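Checking a candidate low cut-off against the 0.2%-per-year threshold amounts to computing a crude incidence rate in the low-risk stratum. The numbers below are illustrative, not from any published cohort:

```python
# Sketch with illustrative numbers: does the low-risk stratum fall below
# the 0.2%-per-year surveillance threshold? A crude person-years rate
# (events / total follow-up) is used here for simplicity.

def annual_incidence(n_events, person_years):
    """Crude annual HCC incidence as events per person-year."""
    return n_events / person_years

# Hypothetical low-risk stratum: 3 HCC cases over 2,500 person-years.
low_risk_rate = annual_incidence(3, 2500)
print(low_risk_rate)            # 0.0012, i.e., 0.12% per year
print(low_risk_rate < 0.002)    # True: below the 0.2%/year threshold
```

If the rate sat above the threshold, the cut-off would be raised (shrinking the low-risk group) until the stratum's incidence justified less intensive surveillance.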
Missing data are another important issue in developing a risk score. Many HCC risk scores involve laboratory measurements that may be missing in some patients. If ignored, a risk score developed solely on complete cases can suffer from selection bias and reduced precision of the effect estimates. Missing data should preferably be handled by statistical methods such as multiple imputation to avoid bias. It is worth noting that, apart from the PAGE-B score, existing HCC risk scores usually did not state explicitly how missing data were handled, which can potentially affect their generalizability.
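The pooling step of multiple imputation can be sketched with Rubin's rules (toy numbers, not the article's data): the pooled estimate is the mean across imputed datasets, and its total variance combines the average within-imputation variance W with the between-imputation variance B.

```python
# Minimal sketch of Rubin's rules for pooling an estimate (e.g., a log
# hazard ratio) across m multiply imputed datasets. Toy numbers only.

from statistics import mean, variance

def rubins_rules(estimates, variances):
    """estimates: point estimate from each imputed dataset;
    variances: its squared standard error from each dataset."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    W = mean(variances)            # average within-imputation variance
    B = variance(estimates)        # between-imputation variance
    total_var = W + (1 + 1 / m) * B
    return q_bar, total_var

est, total_var = rubins_rules([0.50, 0.54, 0.52], [0.010, 0.012, 0.011])
```

The (1 + 1/m) inflation of B is what charges the analysis for the uncertainty introduced by imputing, which a single imputation or complete-case analysis would silently ignore.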
With the knowledge of the common statistical tests and strategies that have been adopted in the various HCC prediction models, the future is directed towards a more personalised approach. Continuous optimisation of the predictive accuracy of the models will be achieved by incorporating more serial parameters, as well as on-treatment data in NA-treated patients. HCC risk levels may change over time: patients get older, while the natural history is modified by NA treatment, which leads to viral suppression, improvement in liver biochemistry, and regression of cirrhosis. Hence, accurate models should be able to identify such bidirectional changes in HCC risk over time. While accuracy remains the most important aspect of an ideal prediction model, applicability and usability are just as important in order to translate HCC risk into clinical practice. Prediction models may be built into computer systems for patient management with automated retrieval of the relevant clinical parameters. The most up-to-date HCC risk level would then guide the optimal HCC surveillance intervals or modalities, with timely alerts in the computer system.
Responsible for the interpretation of data and critical revision of the manuscript: Yip TCF, Hui VWK, Tse YK, Wong GLH
Not applicable.
This work was supported by the Commissioned Grant from Health and Medical Research Fund (HMRF) of the Food and Health Bureau (Reference no: 15160551) awarded to Wong GLH.
Yip TCF has served as a speaker for Gilead Sciences; Wong GLH has served as an advisory committee member for Gilead Sciences and Janssen, and as a speaker for Abbott, AbbVie, Bristol-Myers Squibb, Echosens, Gilead Sciences, Janssen and Roche; Hui VWK and Tse YK declared no conflicts of interest.
Not applicable.
Not applicable.
© The Author(s) 2021.