**What’s known on the subject? and What does the study add?**

In recent years, machine learning has been used to build a variety of prediction models. It is a technique in which algorithms use past experience to improve their decisions without the need for human intervention. In our study, different machine learning algorithms were used to build a prediction model for the risk of infertility in men.

#### Introduction

The World Health Organization defines infertility as 12 months of frequent, unprotected intercourse without pregnancy (1). Infertility is a medical and social problem that affects about 15% of couples, and in 40% of these couples the cause is a male factor (2). Infertility is a worldwide problem; in Turkiye alone, an estimated 10-15% of couples are infertile (3). Male infertility is a highly heterogeneous disorder in which genetic factors play an important role. Karyotypic abnormalities, cystic fibrosis transmembrane conductance regulator gene mutations and microdeletions on the Y chromosome are well-known genetic causes in azoospermic or severely oligozoospermic men (4,5). Diverse external factors also contribute to infertility, including age, smoking and obesity (3).

Prediction uses the variables in a dataset to conduct analysis and find patterns that describe the structure of the data in a form humans can interpret (6). Machine learning is a fast-growing field that explores how computers can automatically learn to recognize complex data structures and draw conclusions from a set of observed data (7).

Nowadays, machine learning applications are part of our daily lives in different areas, for example web searches, spam/email filtering, face recognition programs, and speech recognition programs (8). Machine learning has been used to classify different kinds of medical data, with promising results across different data sets. In addition, the gathering and inventorying of more complex data types, the discovery of new diseases, and the development of new diagnostic methods have raised the need for machine learning methods in medicine, where they provide new ways of interpreting the complex data sets that researchers face (9,10).

Machine learning is divided into subfields that deal with different types of learning tasks. Supervised learning is the most commonly used in practice and can be grouped into classification and regression. The number of classification algorithms, each with different features, grows day by day; commonly used ones include decision trees (DT), K nearest neighbor (KNN), Naive Bayes (NB), support vector machines (SVM) and random forest (RF) (11,12).

Many different algorithms can be used in research. The main question is which algorithm will fit a given data set well. In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent algorithms alone. Superlearner (SL) allows researchers to combine multiple algorithms to outperform a single algorithm in non-parametric statistical models. Therefore, there is no need to decide in advance which single technique to use for prediction. Instead, several candidate learners are used together with different weights, which are determined by cross-validation. Cross-validation is an important evaluation technique used to assess the generalization performance of a machine learning model (13-15).

This study focuses on identifying the risk factors for male infertility with machine learning algorithms. It aims to compare different machine learning classifiers under different training and testing proportions. Additionally, the results are compared with those of the SL algorithm to assess its advantages.

#### Materials and Methods

In this study, informed consent was obtained from the patients, and ethics approval was given by the Ondokuz Mayıs University Medical Research Ethical Committee (2017/208, issued June 22, 2017). The dataset was collected from 587 infertile and 57 fertile patients between 2007 and 2018 and has been published partially in two separate studies (16,17). It contains a total of eleven attributes (ten attributes and one class attribute), covering age, hormone analysis, follicle-stimulating hormone (FSH) level, luteinizing hormone (LH) level, routine semen parameters, total testosterone level, sperm concentration, and genetic variations. Five of the attributes are categorical and five are numerical.

In the pre-processing step, the data set was checked for missing values. The attribute gr/gr+b2/b3 was dropped from the analysis, and the numerical data were scaled with Z-score normalization. First, 80% of the collected data was used to train the algorithms and the remaining 20% was used to test their performance; split ratios of 70-30% and 60-40% were also used.
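The scaling and splitting steps described above can be sketched as follows. This is only an illustration in Python rather than the R code used in the study; the function names, the example values, and the choice of the sample standard deviation are assumptions, not details taken from the analysis.

```python
import random
import statistics

def z_score(values):
    """Scale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1); an assumption
    return [(v - mean) / sd for v in values]

def train_test_split(rows, train_fraction, seed=0):
    """Shuffle the rows and cut them into a training and a testing part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Made-up hormone values for illustration only, not study data.
example_column = [2.0, 4.0, 6.0, 8.0, 10.0]
scaled = z_score(example_column)

# 385 stands for the number of patients kept after removing missing values.
patient_ids = list(range(385))
for fraction in (0.8, 0.7, 0.6):
    train, test = train_test_split(patient_ids, fraction)
    print(fraction, len(train), len(test))
```

Fixing the random seed keeps the split reproducible, which matters when the same partitions are reused to compare several classifiers.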

After removing the missing values, the final data set consisted of 329 (85.5%) infertile and 56 (14.5%) fertile patients. We performed the classification using R, which is open-source statistical software. In the pre-processing step, the “plyr” and “ggplot2” packages were used, and the analyses were carried out with the “caret”, “SuperLearner”, “e1071” and “rpart” packages for classification. A 10-fold cross-validation method was used to test the validity of the analysis.

**Machine Learning Algorithms Used for Classification**

This study focuses on six machine learning algorithms: DT, RF, NB, KNN, SVM and an ensemble method called SL.

**1. Decision Tree**

The algorithm uses a tree-like model, which starts at the root and builds the tree by choosing the most informative attribute at each step (18). The root node and the internal nodes are labeled with the name of an attribute, the branches are labeled with the values of that attribute, and each leaf node is labeled with a class. The leaves correspond to the decision outcomes (19). As the attribute selection measure, the attribute with the highest gain ratio is chosen. The training data set is used to build the DT with the C4.5 algorithm. For each node in the tree, the attribute that divides the sample into the best subsets, that is, the one with the highest gain ratio, is determined. The algorithm can be used for continuous variables as well (20). The rpart (recursive partitioning and regression trees) package is used for classification trees (21).
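To make the gain ratio criterion concrete, the following minimal Python sketch computes it for a categorical attribute as information gain divided by split information, which is the usual C4.5 definition. The function names and the toy labels are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(items).values())

def gain_ratio(attribute_values, labels):
    """Information gain of the attribute divided by its split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class after splitting on the attribute.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(labels) - remainder
    split_information = entropy(attribute_values)
    if split_information == 0:
        return 0.0  # attribute has a single value and cannot split the data
    return information_gain / split_information

# Toy example: one attribute separates the classes, the other does not.
labels = ["infertile", "infertile", "fertile", "fertile"]
informative = ["low", "low", "normal", "normal"]
uninformative = ["a", "b", "a", "b"]
print(gain_ratio(informative, labels), gain_ratio(uninformative, labels))
```

At each node, a C4.5-style tree would evaluate this quantity for every candidate attribute and split on the one with the largest value.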

**2. Random Forest**

This algorithm is a type of ensemble learning that uses a combination of tree estimators. Its principle is to build each tree on a random sample of the data and to consider a random subset of the features when splitting each node. The samples are drawn with replacement, which is known as bootstrapping, and the final model is the majority vote of the trees created in the forest (22).

From the original data set, a sample of size N is drawn to construct each tree. Once the attributes have been selected, the algorithm partitions the covariates recursively. The best split is chosen as the one that optimizes the classification and regression tree (CART) splitting criterion, which is the Gini index, along the mtry preselected directions (23). This process is repeated until each branch contains fewer observations than a pre-specified node size. After this step, the prediction at a new point is computed by averaging the observations falling into the same branch as the new point. Each of the M trees gives a prediction, and the forest prediction is simply the majority class of the M tree predictions (22).
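Three building blocks of the procedure above, the Gini index, bootstrap sampling and the majority vote, can be illustrated with a short Python sketch. It is not a full random forest; the helper names and the toy values are assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini index of a node: 0 for a pure node, larger for mixed nodes."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                  # stand-in for patient records
sample = bootstrap_sample(rows, rng)    # some rows repeat, some are left out

print(gini(["fertile", "fertile", "infertile", "infertile"]))  # mixed node
print(gini(["infertile", "infertile", "infertile"]))           # pure node
print(majority_vote(["infertile", "fertile", "infertile"]))
```

In a complete implementation, each tree would be grown on its own bootstrap sample, with the Gini index evaluated only on a random subset of mtry features at every split.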

**3. Naive Bayes**

The NB classifier is based on Bayes’ theorem. It is a probabilistic model that estimates the conditional probabilities of the dependent variable from the training data and uses them for classification. This classifier assumes that the attributes are independent of each other and equally important (24). It predicts the class membership probability of examples by using the naive conditional independence assumption (25). The Bayesian generalized linear model (bglm) is a Bayesian function for generalized linear modeling with different distributions (26).
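As an illustration of the naive conditional independence assumption, the sketch below implements a small NB classifier for categorical attributes in Python. The Laplace smoothing, the `n_values` parameter (number of levels per attribute) and the toy records are assumptions added for the example and do not describe the study's implementation.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count classes and, per class, how often each attribute value occurs."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (attribute index, class)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts

def predict_proba(model, row, n_values=2):
    """Class membership probabilities under the independence assumption."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(value | class).
            score *= (value_counts[(i, label)][value] + 1) / (count + n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical records: (smoking status, sperm concentration category).
rows = [("smoker", "low"), ("smoker", "low"),
        ("nonsmoker", "normal"), ("nonsmoker", "normal")]
labels = ["infertile", "infertile", "fertile", "fertile"]
model = train_naive_bayes(rows, labels)
probabilities = predict_proba(model, ("smoker", "low"))
print(probabilities)
```

The product over attributes is exactly where the independence assumption enters: each attribute contributes its own conditional probability, regardless of the values of the others.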

**4. K Nearest-neighbor**

This classifier learns by comparing a given test example with the training examples that resemble it. The samples of the training data set are described by *n* attributes, so that each example represents a point in *n*-dimensional space. The algorithm searches for the K training samples nearest to the unknown example (7).

The performance of a KNN classifier depends on the choice of K and the distance metric. Without prior knowledge, this classifier uses the Euclidean distance to measure the closeness between examples. As in other classifiers, the class label is assigned by majority vote (27). Usually, the K parameter is chosen experimentally: models are fitted with different numbers of nearest neighbors, and the value with the best accuracy is used to define the classifier (28).
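A minimal Python sketch of this procedure, using Euclidean distance, a majority vote and an experimental choice of K, is given below. The points, labels and candidate values of K are invented for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Label a query point by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(point, query), label)   # Euclidean distance
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

def knn_accuracy(train_points, train_labels, test_points, test_labels, k):
    """Fraction of test points that receive the correct label."""
    hits = sum(
        knn_predict(train_points, train_labels, point, k) == label
        for point, label in zip(test_points, test_labels)
    )
    return hits / len(test_labels)

# Toy, already scaled two-attribute data.
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["fertile"] * 3 + ["infertile"] * 3
test_points = [(0.2, 0.2), (5.2, 5.2)]
test_labels = ["fertile", "infertile"]

# Choose K experimentally: keep the candidate with the best accuracy.
best_k = max((1, 3, 5), key=lambda k: knn_accuracy(
    train_points, train_labels, test_points, test_labels, k))
print(best_k, knn_predict(train_points, train_labels, (0.5, 0.5), 3))
```

Because distances are sensitive to the scale of each attribute, a sketch like this presupposes that the numerical attributes have been normalized beforehand.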

**5. Support Vector Machine**

This algorithm is mostly used for classifying linear and non-linear patterns. Linear patterns can be easily separated in low dimensions, whereas non-linear patterns cannot. For the latter, a set of mathematical functions known as kernels is used. The basic idea of SVM is to use an optimal hyperplane for classification in order to solve linearly separable patterns. The optimal hyperplane is the one, among the set of hyperplanes that classify the patterns, that maximizes the margin, that is, the distance from the hyperplane to the closest point of each pattern. By maximizing the margin, it can correctly classify the given patterns (29).

For patterns that are not linearly separable, the kernel function returns the inner product between two points in a higher-dimensional feature space. The training occurs in the feature space, and the data points appear only inside dot products with other points. This is called the “kernel trick,” by which the non-linear pattern becomes linearly separable (30). The kernel function converts the data into the desired form, and different kernels are used for different non-linear patterns (31).
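The kernel trick can be checked numerically on a small case. The Python sketch below uses a quadratic kernel, chosen here only because its feature map can be written out explicitly, and shows that evaluating the kernel in the original two-dimensional space gives the same number as an inner product in a three-dimensional feature space. A radial basis function kernel is included as well; the value of `gamma` and the example points are assumptions.

```python
import math

def dot(x, y):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(x, y))

def quadratic_kernel(x, y):
    """k(x, y) = (x . y)^2, computed directly in the original 2-D space."""
    return dot(x, y) ** 2

def quadratic_feature_map(x):
    """Explicit 3-D feature map whose inner product equals quadratic_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

x, y = (1.0, 2.0), (3.0, 4.0)
in_input_space = quadratic_kernel(x, y)
in_feature_space = dot(quadratic_feature_map(x), quadratic_feature_map(y))
print(in_input_space, in_feature_space)   # the two values agree
print(rbf_kernel(x, x), rbf_kernel(x, y))
```

For the radial basis kernel the corresponding feature space is infinite-dimensional, so the kernel can only be evaluated this way, never through an explicit feature map.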

**6. Superlearner**

This algorithm is a cross-validation-based method that chooses one optimal learner, or a weighted combination of several, and performs asymptotically as well as or better than any candidate learner. The prediction algorithm applies a set of candidate learners to the observed data and can include as many candidate learners as is computationally feasible (13,14). Different algorithms, such as RF, SVM and NB, can be used as candidate learners in SL (14).

The training set is used to train the estimators and the validation set to estimate their performance. For the SL algorithm, the cross-validation selector picks the learner with the best performance on the validation set. In v-fold cross-validation, the learning sample is divided into v mutually exclusive sets of as nearly equal size as possible. Each of the v sets serves once as the validation sample, with its complement as the training sample, which gives v splits of the learning sample into a training sample and a corresponding validation sample. For each of the v splits, the predictor is fitted on the training set and its risk is estimated on the corresponding validation set. For each learner, the risks over the v validation sets are averaged, resulting in the cross-validated risk. The predictor with the minimum cross-validated risk is selected. The calculated risk is a measure of performance, and the model with the minimum risk is the model with the fewest prediction errors. The algorithm provides a weighted model built from the candidate learners. If the model is obtained with a single learner, this gives the discrete SL algorithm. There is no limit on the number of candidate learners, which is the main advantage of this algorithm (15).
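The cross-validated risk and the discrete SL selection can be sketched in Python as follows. The two candidate learners (a prevalence predictor and a one-nearest-neighbor rule), the squared-error risk and the toy data are assumptions chosen to keep the example short; the weighted SL, which additionally estimates a coefficient for each learner, is not implemented here.

```python
def make_folds(n, v):
    """Assign indices 0..n-1 to v folds of nearly equal size."""
    return [list(range(i, n, v)) for i in range(v)]

def cv_risk(learner, xs, ys, v):
    """Average squared-error risk of a learner over v validation folds."""
    fold_risks = []
    for fold in make_folds(len(xs), v):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = learner(train_x, train_y)      # fit on the training part
        errors = [(ys[i] - predict(xs[i])) ** 2 for i in fold]
        fold_risks.append(sum(errors) / len(errors))
    return sum(fold_risks) / v

def prevalence_learner(train_x, train_y):
    """Always predicts the proportion of positives seen in training."""
    rate = sum(train_y) / len(train_y)
    return lambda x: rate

def nearest_neighbor_learner(train_x, train_y):
    """Predicts the label of the closest training value."""
    def predict(x):
        i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
        return train_y[i]
    return predict

# Toy data: one numeric attribute, binary outcome (1 = infertile).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
candidates = {"prevalence": prevalence_learner,
              "1-NN": nearest_neighbor_learner}
risks = {name: cv_risk(learner, xs, ys, v=5)
         for name, learner in candidates.items()}
discrete_sl = min(risks, key=risks.get)  # learner with minimum CV risk
print(risks, discrete_sl)
```

Selecting the single learner with the smallest cross-validated risk corresponds to the discrete SL; the weighted version would instead combine the cross-validated predictions of all candidates.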

**Performance Evaluation**

The performance of the algorithms selected for the study is evaluated using the area under the curve (AUC), which provides a common criterion for comparing the performance of all algorithms. The AUC measures the entire area under the receiver operating characteristic (ROC) curve. The ROC curve is a graph showing the performance of a classification model at all classification thresholds (32). In addition, performance metrics derived from the confusion matrix, namely accuracy, sensitivity and specificity, were evaluated for the algorithms (33).

Accuracy = (TP + TN) / (TP + FP + FN + TN) × 100 (1)

Sensitivity = TP / (TP + FN) × 100 (2)

Specificity = TN / (FP + TN) × 100 (3)

In the equations, TP is the number of true positives, FN the number of false negatives, TN the number of true negatives, and FP the number of false positives (34).
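Equations (1)-(3) and the AUC can be computed as in the Python sketch below. The AUC is obtained here through its equivalent pairwise formulation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is one of several ways to compute it. All counts and scores are invented for illustration and are not results of the study.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, sensitivity and specificity in percent, as in Eqs. (1)-(3)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn) * 100
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (fp + tn) * 100
    return accuracy, sensitivity, specificity

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical confusion matrix counts.
accuracy, sensitivity, specificity = confusion_metrics(tp=8, fp=1, fn=2, tn=9)
print(accuracy, sensitivity, specificity)

# Hypothetical predicted probabilities and true classes (1 = infertile).
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(example_auc)
```

Unlike accuracy, sensitivity and specificity, the AUC does not depend on a single classification threshold, which is why it is convenient as a common criterion across algorithms.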

**Statistical Analysis**

The genetic data for the diagnosis of infertility were evaluated with supervised machine learning algorithms. The C4.5, KNN, NB, SVM and RF algorithms were used as classifiers and compared with the SL algorithm according to the AUC performance criterion. The C4.5 decision tree algorithm was implemented using the J48 decision tree algorithm, the KNN algorithm using Euclidean distance, the NB algorithm using the NB classifier, the SVM algorithm using a radial basis kernel, and the RF algorithm using bootstrapping, while the SL algorithm was implemented by weighting the candidate learners simultaneously; all are available in R. The models were trained for the different split ratios, and 10-fold cross-validation was used.

#### Results

The overall performance of all classifiers on the dataset at the different split ratios is shown in Table 1. At the 80-20% split ratio, the RF algorithm showed the best accuracy among all classifiers, whereas SVM, with an AUC of 95%, was the best classifier by AUC. At the 70-30% split ratio, SVM showed the best performance, 95%, whereas at the 60-40% split ratio RF showed the best performance, 94%. The sensitivity and specificity results also show good performance at all proportions.

Figure 1 shows the importance of the variables after analysing the data set. Sperm concentration ranks first, followed by the FSH and LH hormones. According to these findings, sy1291, gr/gr2 and b2/b3 are the important genetic factors.

Using the SL algorithm, the predictive model built from the risks and coefficients of the different algorithms yielded an AUC of 96%, followed by discrete SL and RF with an AUC of 95%. The coefficient is the weight that SL puts on a model in the weighted average. The lowest risk was yielded by RF, as shown in Table 2. As seen from the table, bglm makes no contribution to the analysed model. The weighted model consists of RF, KNN and rpart. SL therefore performed as the best algorithm, with an AUC of 97% (Table 3), followed by discrete SL and RF, each with an AUC of 96%.

#### Discussion

In this study, a machine learning-based prediction model for infertility in men was developed based on genetic data. The study demonstrated that the RF algorithm has higher accuracy than the NB, SVM, DT and KNN algorithms, irrespective of the split ratio. The results also showed that the split ratio can change which classifier performs best. Accuracy was highest for RF at a split ratio of 80-20%, whereas the NB classifier showed a poorer accuracy of 89%. Noi and Kappas (35) showed that the larger the training sample, the higher the accuracy; they obtained 90-95% accuracy when analyzing different data sizes and split ratios for balanced and unbalanced data sets. Our findings support this result. At the split ratio of 60-40%, the highest performance was achieved by SVM, RF and KNN. In our dataset, the highest performance was obtained by SVM with an RBF kernel and by the RF classifier, which supports findings in the literature (36). Consistent with these results, performance on the genetic data set improved with the RF algorithm. RF is an important algorithm for medical data sets (37,38). Two of the biggest problems in machine learning are which algorithm to use and which split ratio is ideal for the training and testing data. This study addresses these questions by comparing different classifiers with the SL algorithm, which applies weighted candidate learners to the model.

The SL algorithm picks one or more optimal learners from the candidate learners to build the model. RF is the candidate learner that receives the largest weight in the SL model because it has the lowest risk. KNN and rpart receive the next largest weights as candidate learners. According to these findings, the best performance, an AUC of 97%, is obtained by the SL algorithm. In a previous study by van der Laan et al. (15), different candidate learners such as RF, the least squares method, least angle regression and deletion/substitution/addition were applied to a diabetes data set, and the smallest risk was obtained by deletion/substitution/addition.

The variable importance analyses show that sperm concentration is the most important variable. The polymorphisms follow in the order sy1291, gr/gr, and b2/b3. Indeed, Kumar and Singh (39) state that an important factor in infertility is semen parameter values outside normal limits. The variable importance results of our infertility data analysis therefore support the literature. For example, Hicks et al. (40) used sperm videos in a male infertility prediction study. As mentioned, sperm parameters play an important role in infertility. The algorithms reported in that study are simple linear regression, RFs, Gaussian process, sequential minimal optimization regression, elastic net, and random trees. There, the error rate for RF differs from that of the other algorithms. RF is an ensemble learning algorithm in which multiple models are combined to solve a particular problem (41).

One in six couples worldwide experiences infertility (42). It has been reported that the emotional status of couples who consult a physician for infertility deteriorates and that their susceptibility to depression increases (42). About a quarter of couples cannot continue their infertility treatment because of its burden (43). Infertility has a complex nature and affects many areas, such as the emotional condition of couples, other health problems and national health system expenditures, so we think that predicting it is of great importance for clinicians. The development of models with high predictive ability will therefore also improve clinical approaches to infertility treatment. When applied to a patient’s record of infertility risk factors, the findings of this study can be used to predict the risk of infertility in men. The predictive model can be integrated into existing health information systems, where urologists can use it to predict patients’ risk of infertility in real time.

#### Conclusion

The results of the study show that the split ratio affects performance and can also change which algorithm should be used. The SL algorithm is a weighted model that consists of different candidate learners. According to the results, the algorithm with the highest performance is also the one with the minimum risk.

A researcher builds a model using different algorithms, and different classifiers show different performances. However, there are very many algorithms in the literature, and choosing the best one requires time and expertise. At this stage, SL is an important tool and is recommended both for achieving high performance and as a guide to the researcher. In this study, the model was obtained using five candidate learners, and their performances were compared. SL saves the researcher time and reduces the expertise needed when analysing data sets. Nevertheless, different models can be established by evaluating different algorithms. In later studies, we plan to try combinations of different algorithms and to use larger samples. A simulation study would also be worthwhile.

**Ethics**

**Ethics Committee Approval:** In this study, informed consent was obtained from the patients, and ethics approval was given by the Ondokuz Mayıs University Medical Research Ethical Committee (2017/208, issued June 22, 2017).

**Informed Consent:** Informed consent was obtained from the patients in this study.

**Peer-review:** Externally peer-reviewed.

**Authorship Contributions**

Surgical and Medical Practices: S.K., L.T., E.K., Concept: S.K., L.T., E.K., Design: S.K., L.T., E.K., Data Collection or Processing: S.K., L.T., E.K., Analysis or Interpretation: S.K., L.T., E.K., Literature Search: S.K., L.T., E.K., Writing: S.K., L.T., E.K.

**Conflict of Interest:** No conflict of interest was declared by the authors.

**Financial Disclosure:** The authors declare that they have no relevant financial disclosures.