Deep learning techniques for ordinal classification have recently gained significant attention. Predicting an ordinal variable, that is, a variable that demonstrates a natural relationship between categories, is of relevance for a number of real-world problems in various fields of knowledge. For example, a medical diagnosis can occur at different stages of the disease. Applying standard classifiers to ordered labels can lead to errors in distant categories, when errors in an ordinal problem ideally tend to be produced in adjacent classes because of their similarity. To address this issue, we propose a soft labelling approach based on generalised triangular distributions, which are asymmetric and different for each class. The parameters of these distributions are determined using a metaheuristic and are specifically adapted to the given problem. Moreover, this approach enables the model to avoid errors in distant classes (e.g. classifying a patient with a severe disease as healthy). A comprehensive comparison was performed using eight datasets and five performance metrics. The main advantage of the proposed soft-labelling approach is that it adapts the distributions to each problem, resulting in greater flexibility and better performance. The results and statistical analysis show that the proposed methodology significantly outperforms all other methods.
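As a rough illustration of the soft-labelling idea, the sketch below replaces a one-hot target with a triangular profile centred on the true class. It is a simplification under stated assumptions: the triangle here is symmetric and its width (`spread`, a hypothetical parameter) is fixed by hand, whereas the abstract describes asymmetric, per-class distributions whose parameters are tuned by a metaheuristic. The function names are illustrative only.

```python
import numpy as np


def triangular_soft_labels(num_classes, true_class, spread=1.0):
    """Soft label vector with a symmetric triangular profile (assumed shape)."""
    ranks = np.arange(num_classes)
    # Weight decays linearly with ordinal distance and hits zero at spread + 1.
    weights = np.maximum(0.0, 1.0 - np.abs(ranks - true_class) / (spread + 1.0))
    return weights / weights.sum()


def soft_cross_entropy(probabilities, soft_labels, eps=1e-12):
    """Cross-entropy between predicted probabilities and a soft target."""
    return float(-np.sum(soft_labels * np.log(probabilities + eps)))


soft = triangular_soft_labels(num_classes=5, true_class=2)

# Two hypothetical predictions: one errs towards an adjacent class,
# the other towards a distant class.
near_miss = np.array([0.05, 0.60, 0.25, 0.05, 0.05])
far_miss = np.array([0.60, 0.05, 0.25, 0.05, 0.05])

loss_near = soft_cross_entropy(near_miss, soft)
loss_far = soft_cross_entropy(far_miss, soft)
```

With this target, a prediction that leans towards an adjacent class receives a lower loss than one that leans towards a distant class, which is the behaviour the abstract argues for.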
This paper proposes a novel methodology for recovering missing time series data, a crucial task for subsequent Machine Learning (ML) analyses. The methodology is specifically applied to Significant Wave Height (SWH) time series in the field of marine engineering. The proposed approach involves two phases. Firstly, the SWH time series for each buoy is independently reconstructed using three transfer function models: regression-based, correlation-based, and distance-based. The distance-based transfer function exhibits the best overall performance. Secondly, Evolutionary Artificial Neural Networks (EANNs) are utilised for the final recovery of each time series, using as inputs highly correlated buoys that have been intermediately recovered. The EANNs are evolved considering two metrics, the novel squared error relevance area, which balances the importance of extreme and around-mean values, and the well-known mean squared error. The study considers SWH time series data from 15 buoys in two coastal zones in the United States. The results demonstrate that the distance-based transfer function is generally the best transfer function, and that EANNs outperform a range of state-of-the-art ML techniques in 12 out of the 15 buoys, with a number of connections comparable to linear models. Furthermore, the proposed methodology outperforms the two most popular approaches for time series reconstruction, BRITS and SAITS, for all buoys except one. Therefore, the proposed methodology provides a promising approach, which may be applied to time series from other fields, such as wind or solar energy farms in the field of green energy.
The sanitary emergency caused by COVID-19 has compromised countries and generated a worldwide health and economic crisis. To provide support to the countries’ responses, numerous lines of research have been developed. The spotlight was put on effectively and rapidly diagnosing and predicting the evolution of the pandemic, one of the most challenging problems of the past months. This work contributes to the existing literature by developing a two-step methodology to analyze the transmission rate, designing models applied to territories with similar pandemic behavior characteristics. Virus transmission is considered as bacterial growth curves to understand the spread of the virus and to make predictions about its future evolution. Hence, an analytical clustering procedure is first applied to create groups of locations where the virus transmission rate behaved similarly in the different outbreaks. A curve decomposition process based on an iterative polynomial process is then applied, obtaining meaningful forecasting features. Information of the territories belonging to the same cluster is merged to build models capable of simultaneously predicting the 14-day incidence in several locations using Evolutionary Artificial Neural Networks. The methodology is applied to Andalusia (Spain), although it is applicable to any region across the world. Individual models trained for a specific territory are carried out for comparison purposes. The results demonstrate that this methodology achieves statistically similar, or even better, performance for most of the locations. In addition to being extremely competitive, the main advantage of the proposal lies in its complexity cost reduction. The total number of parameters to be estimated is reduced up to 93.51% for the short term and 93.31% for the mid-term forecasting, respectively. Moreover, the number of required models is reduced by 73.53% and 58.82% for the short- and mid-term forecasting horizons.
This paper tackles the Donor-Recipient (D-R) matching for Liver Transplantation (LT). Typically, D-R matching is performed following the knowledge of a team of experts guided by the use of a prioritisation system. One of the most extended, the Model for End-stage Liver Disease (MELD), aims to decrease the mortality in the waiting list. However, it does not take into account the result of the transplant. In this sense, with the aim of developing a system able to bear in mind the survival benefit, we propose to treat the problem as an ordinal classification one. The organ survival will be predicted at four different thresholds. The results achieved demonstrate that ordinal classifiers are capable of outperforming nominal approaches in the state-of-the-art. Finally, this methodology can help experts make more informed decisions about the appropriateness of assigning a recipient for a specific donor, maximising the probability of post-transplant survival in LT.
This work presents a novel ordinal Deep Learning (DL) approach to the Time Series Ordinal Classification (TSOC) field. TSOC consists in classifying time series with labels showing a natural order between them. This particular property of the output variable should be exploited to boost the performance for a given problem. This paper presents a novel DL approach in which time series are encoded as 3-channel images using Gramian Angular Field and Markov Transition Field. A soft labelling approach, which considers the probabilities generated by a unimodal distribution for obtaining soft labels that replace crisp labels in the loss function, is applied to a ResNet18 model. Specifically, beta and triangular distributions have been applied. They have been compared against three state-of-the-art deep learners in the Time Series Classification (TSC) field using 13 univariate and multivariate time series datasets. The approach considering the triangular distribution (O-GAMTFT) outperforms all the techniques benchmarked.
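To make the image encoding more concrete, the following is a minimal sketch of a Gramian Angular Field for a single univariate series, using the standard definition (rescale to [-1, 1], take the arccosine, then form pairwise sums or differences of angles). How the abstract's approach combines such fields with the Markov Transition Field into three channels is not specified here, so this covers only one ingredient; the function name and the `kind` argument are assumptions.

```python
import numpy as np


def gramian_angular_field(series, kind="summation"):
    """Encode a 1-D series as a square image via a Gramian Angular Field."""
    x = np.asarray(series, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("series must not be constant")
    # Rescale to [-1, 1] so that arccos is defined everywhere.
    scaled = np.clip(2.0 * (x - lo) / (hi - lo) - 1.0, -1.0, 1.0)
    phi = np.arccos(scaled)
    if kind == "summation":
        # GASF: cos(phi_i + phi_j)
        return np.cos(phi[:, None] + phi[None, :])
    if kind == "difference":
        # GADF: sin(phi_i - phi_j)
        return np.sin(phi[:, None] - phi[None, :])
    raise ValueError("kind must be 'summation' or 'difference'")


gasf = gramian_angular_field([0.0, 1.0, 2.0], kind="summation")
gadf = gramian_angular_field([0.0, 1.0, 2.0], kind="difference")
```

A series of length n yields an n-by-n image, so long series are usually shortened (for example by piecewise aggregation) before encoding.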
Time Series Classification (TSC) is an extensively researched field from which a broad range of real-world problems can be addressed, obtaining excellent results. One family of approaches that performs well is the so-called dictionary-based techniques. The Temporal Dictionary Ensemble (TDE) is the current state-of-the-art dictionary-based TSC approach. In many TSC problems we find a natural ordering in the labels associated with the time series. This characteristic is referred to as ordinality, and can be exploited to improve the performance of the methods. The area dealing with ordinal time series is the Time Series Ordinal Classification (TSOC) field, which is still unexplored. In this work, we present an ordinal adaptation of the TDE algorithm, known as ordinal TDE (O-TDE). For this, a comprehensive comparison using a set of 18 TSOC problems is performed. Experiments conducted show the improvement achieved by the ordinal dictionary-based approach in comparison to four other existing nominal dictionary-based techniques.
In this paper we have tackled the problem of long-term air temperature prediction with eXplainable Artificial Intelligence (XAI) models. Specifically, we have evaluated the performance of an Artificial Neural Network (ANN) architecture with sigmoidal neurons in the hidden layer, trained by means of an evolutionary algorithm (Evolutionary ANNs, EANNs). This XAI model architecture (XAI-EANN) has been applied to the long-term air temperature prediction at different sub-regions of the South of the Iberian Peninsula. In this case, the average August air temperature has been predicted from ERA5 Reanalysis data variables, obtaining good predictions skills and explainable models in terms of the input climatological variables considered. A cluster analysis has been first carried out in terms of the average air temperature in the zone, in such a way that a number of sub-regions with different air temperature behaviour have been defined. The proposed XAI-EANN model architecture has been applied to each of the defined sub-regions, in order to find significant differences among them, which can be explained with the XAI-EANN models obtained. Finally, a comprehensive comparison against some state-of-the-art techniques has also been carried out, concluding that there are statistically significant differences in terms of accuracy in favour of the proposed XAI-EANN model, which also benefits from being an XAI model.
Background: The Model for End-stage Liver Disease (MELD) and its sodium-corrected variant (MELD-Na) have created gender disparities in accessing liver transplantation. We aimed to derive and validate the Gender-Equity Model for liver Allocation (GEMA) and its sodium-corrected variant (GEMA-Na) to amend such inequities. Methods: In this cohort study, the GEMA models were derived by replacing creatinine with the Royal Free Hospital glomerular filtration rate (RFH-GFR) within the MELD and MELD-Na formulas, with re-fitting and re-weighting of each component. The new models were trained and internally validated in adults listed for liver transplantation in the UK (2010–20; UK Transplant Registry) using generalised additive multivariable Cox regression, and externally validated in an Australian cohort (1998–2020; Royal Prince Alfred Hospital [Australian National Liver Transplant Unit] and Austin Hospital [Victorian Liver Transplant Unit]). The study comprised 9320 patients: 5762 patients for model training, 1920 patients for internal validation, and 1638 patients for external validation. The primary outcome was mortality or delisting due to clinical deterioration within the first 90 days from listing. Discrimination was assessed by Harrell’s concordance statistic. Findings: 449 (5·8%) of 7682 patients in the UK cohort and 87 (5·3%) of 1638 patients in the Australian cohort died or were delisted because of clinical deterioration within 90 days. GEMA showed improved discrimination in predicting mortality or delisting due to clinical deterioration within the first 90 days after waiting list inclusion compared with MELD (Harrell’s concordance statistic 0·752 [95% CI 0·700–0·804] vs 0·712 [0·656–0·769]; p=0·001 in the internal validation group and 0·761 [0·703–0·819] vs 0·739 [0·682–0·796]; p=0·036 in the external validation group), and GEMA-Na showed improved discrimination compared with MELD-Na (0·766 [0·715–0·818] vs 0·742 [0·686–0·797]; p=0·0058 in the internal validation group and 0·774 [0·720–0·827] vs 0·745 [0·690–0·800]; p=0·014 in the external validation group). The discrimination capacity of GEMA-Na was higher in women than in the overall population, both in the internal (0·802 [0·716–0·888]) and external validation cohorts (0·796 [0·698–0·895]). In the pooled validation cohorts, GEMA resulted in a score change of at least 2 points compared with MELD in 1878 (52·8%) of 3558 patients (25·0% upgraded and 27·8% downgraded). GEMA-Na resulted in a score change of at least 2 points compared with MELD-Na in 1836 (51·6%) of 3558 patients (32·3% upgraded and 19·3% downgraded). In the whole cohort, 3725 patients received a transplant within 90 days of being listed. Of these patients, 586 (15·7%) would have been differently prioritised by GEMA compared with MELD; 468 (12·6%) patients would have been differently prioritised by GEMA-Na compared with MELD-Na. One in 15 deaths could potentially be avoided by using GEMA instead of MELD and one in 21 deaths could potentially be avoided by using GEMA-Na instead of MELD-Na. Interpretation: GEMA and GEMA-Na showed improved discrimination and a significant re-classification benefit compared with existing scores, with consistent results in an external validation cohort. Their implementation could save a clinically meaningful number of lives, particularly among women, and could amend current gender inequities in accessing liver transplantation. Funding: Junta de Andalucía and EDRF.
This first section introduces the topic presented and the related state-of-the-art developments. Time series data mining (TSDM) mainly consists of the following tasks: anomaly detection (Blázquez-García et al., 2020), classification (Ismail-Fawaz et al., 2019), analysis and preprocessing (Hamilton, 1994), segmentation (Keogh et al., 2004), clustering (Liao, 2005) and prediction (Weigend, 2018). More concretely, this chapter is focused on the applications of time series preprocessing, segmentation and prediction to real-world problems.
Many types of research have been carried out with the aim of combating the COVID-19 pandemic since the first outbreak was detected in Wuhan, China. Anticipating the evolution of an outbreak helps to devise suitable economic, social and health care strategies to mitigate the effects of the virus. For this reason, predicting the SARS-CoV-2 transmission rate has become one of the most important and challenging problems of the past months. In this paper, we apply a two-stage mid and long-term forecasting framework to the epidemic situation in eight districts of Andalusia, Spain. First, an analytical procedure is performed iteratively to fit polynomial curves to the cumulative curve of contagions. Then, the extracted information is used for estimating the parameters and structure of an evolutionary artificial neural network with hybrid architectures (i.e., with different basis functions for the hidden nodes) while considering single and simultaneous time horizon estimations. The results obtained demonstrate that including polynomial information extracted during the training stage significantly improves the mid- and long-term estimations in seven of the eight considered districts. The increase in average accuracy (for the joint mid- and long-term horizon forecasts) is 37.61% and 35.53% when considering the single and simultaneous forecast approaches, respectively.
Background and aims: The model for end stage liver disease (MELD) and its sodium-corrected variant (MELD-Na) have created gender disparities in accessing liver transplantation (LT). We derived and validated a new model that replaced creatinine with the Royal Free glomerular filtration rate (PMID: 27779785) within the MELD and MELD-Na formulas. Method: The “Gender-Equity Model for liver Allocation” (GEMA) and its sodium-corrected variant (GEMA-Na) were trained and internally validated in adults listed for LT in the United Kingdom (2010–2020) using generalized additive multivariate Cox regression. The models were externally validated in an Australian cohort (1998–2020). The primary outcome was mortality or delisting due to clinical deterioration at 90 days. The Greenwood-Nam-D’Agostino test was used to test calibration. Results: The study comprised 9,320 patients: 5,762 patients for model training, 1,920 patients for internal validation, and 1,638 patients for external validation. The prevalence of the primary outcome ranged from 5.3% to 6%. In the internal validation cohort, GEMA and GEMA-Na showed a Harrell’s c-statistic = 0.752 and 0.766, respectively, for the primary outcome, which were significantly higher than those of the MELD score (0.712) and the MELD-Na score (0.742). Results were consistent in the external validation cohort. Among women, these differences were more pronounced (see Harrell’s c-statistics in the table). GEMA and GEMA-Na were adequately calibrated and prioritized 43.9% and 41.8% of LT patients differently, respectively. Patients prioritized by GEMA-Na were more often women, had a higher prevalence of ascites and showed triple the risk of the primary outcome compared to patients prioritized by MELD-Na. One in 15 deaths would be avoided by using GEMA instead of MELD, and 1 in 21 deaths would be avoided by using GEMA-Na instead of MELD-Na. Among women, 1 in 8 deaths would be avoided in either situation.
Conclusion: GEMA-Na predicts mortality or delisting due to clinical deterioration in patients awaiting LT more accurately than MELD-Na and its implementation may amend gender disparities.
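For readers unfamiliar with the discrimination measure reported above, here is a minimal sketch of Harrell's concordance statistic for right-censored data. It uses the textbook pairwise definition with a quadratic loop and is not the code used in the study; the toy times, event flags and risk scores are invented purely for illustration.

```python
def harrell_c(times, events, risks):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0  # earlier event had the higher risk score
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: days to event or censoring, event flag, risk score.
toy_times = [10, 30, 60, 90]
toy_events = [1, 1, 0, 1]
toy_risks = [0.9, 0.8, 0.3, 0.1]

c_perfect = harrell_c(toy_times, toy_events, toy_risks)
c_reversed = harrell_c(toy_times, toy_events, toy_risks[::-1])
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported values of roughly 0.71 to 0.77.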
In this paper, an approach based on a time series clustering technique is presented by extracting relevant features from the original temporal data. A curve characterization is applied to the daily contagion rates of the 34 sanitary districts of Andalusia, Spain. By determining the maximum incidence instant and two inflection points for each wave, an outbreak curve can be described by six intensity features, defining its initial and final phases. These features are used to derive different groups using state-of-the-art clustering techniques. The experimentation carried out indicates that $k=3$ is the optimum number of descriptive groups of intensities. According to the resulting clusters for each wave, the pandemic behavior in Andalusia can be visualised over time, showing the most affected districts in the pandemic period considered. Additionally, in order to perform a pandemic overview of the whole period, the approach is also applied to joint information of all the considered periods.
The prediction of wave height and flux of energy is essential for most ocean engineering applications. To simultaneously predict both wave parameters, this paper presents a novel approach using short-term time prediction horizons (6h and 12h). Specifically, the methodology proposed presents a twofold simultaneity: 1) both parameters are predicted by a single model, applying the multi-task learning paradigm, and 2) the prediction tasks are tackled for several neighbouring ocean buoys with such single model by the development of a zonal strategy. Multi-Task Evolutionary Artificial Neural Network (MTEANN) models are applied to two different zones located in the United States, considering measurements collected by three buoys in each zone. Zonal MTEANN models have been compared in a two-phased procedure: 1) against the three individual MTEANN models specifically trained for each buoy of the zone, and 2) against some state-of-the-art regression techniques. Results achieved show that the proposed zonal methodology obtains not only better performance than the individual MTEANN models, but it also requires a lower number of connections. Besides, the zonal MTEANN methodology outperforms state-of-the-art regression techniques. Hence, the proposed approach results in an excellent method for predicting both significant wave height and flux of energy at short-term prediction time horizons.
Programming has traditionally been an engineering competence, but recently it is acquiring significant importance in several areas, such as Life Sciences, where it is considered to be essential for problem solving based on data analysis. Therefore, students in these areas need to improve their programming skills related to the data analysis process. Similarly, engineering students with proven technical ability may lack the biological background which is likewise fundamental for problem-solving. Using hackathon and teamwork-based tools, students from both disciplines were challenged with a series of problems in the area of Life Sciences. To solve these problems, we established work teams that were trained before the beginning of the competition. Their results were assessed in relation to their approach in obtaining the data, performing the analysis and finally interpreting and presenting the results to solve the challenges. The project succeeded, meaning students solved the proposed problems and achieved the goals of the activity. This would have been difficult to address with teams made from the same field of study. The hackathon succeeded in generating a shared learning and a multidisciplinary experience for their professional training, being highly rewarding for both students and faculty members.
Machine learning (ML) is the field of science that combines knowledge from artificial intelligence, statistics and mathematics with the aim of giving computers the ability to learn from data without being explicitly programmed to do so. It falls under the umbrella of Data Science and is usually developed by Computer Engineers, who become what are known as Data Scientists. Developing the necessary competences in this field is not a trivial task, and applying innovative methodologies such as gamification can smooth the initial learning curve. In this context, communities offering platforms for open competitions such as Kaggle can be used as a motivating element. The main objective of this work is to gamify the classroom with the idea of providing students with valuable hands-on experience by means of addressing a real problem, as well as the possibility to cooperate and compete simultaneously to acquire ML competences. The innovative teaching experience, carried out over two years, provided strong motivation, improved the students' learning capacity and encouraged the continuous updating of knowledge that Computer Engineers face.
This work analyzes the performance of several state-of-the-art Time Series Classification (TSC) techniques in the cryptocurrency returns modeling field. The data used in this study comprises the closing prices of 6 of the principal cryptocurrencies, collected with a frequency of 5 minutes from January 1st to September 21st, 2021. The aim of this work is twofold: 1) to study the weak form of the Efficient Market Hypothesis (EMH) and 2) to examine the veracity behind the theory of the Random Walk Model (RWM). For this, two datasets are built. The first uses autoregressive values, whereas the second dataset is constructed by introducing randomized past values from the time series. Then, a comparison of the performances achieved by the different TSC techniques is carried out. Results obtained show a pronounced difference in terms of performance obtained by all the TSC models when applied to the original dataset against the randomized one. The results achieved by the models applied to the original dataset are significantly better in terms of Area Under ROC Curve (AUC) and Recall. Therefore, the EMH is rejected in its weak form, and indisputable evidence against the RWM in a high-frequency scope is provided.
Time-series clustering is the process of grouping time series with respect to their similarity or characteristics. Previous approaches usually combine a specific distance measure for time series and a standard clustering method. However, these approaches do not take the similarity of the different subsequences of each time series into account, which can be used to better compare the time-series objects of the dataset. In this article, we propose a novel technique of time-series clustering consisting of two clustering stages. In a first step, a least-squares polynomial segmentation procedure is applied to each time series, which is based on a growing window technique that returns different-length segments. Then, all of the segments are projected into the same dimensional space, based on the coefficients of the model that approximates the segment and a set of statistical features. After mapping, a first hierarchical clustering phase is applied to all mapped segments, returning groups of segments for each time series. These clusters are used to represent all time series in the same dimensional space, after defining another specific mapping process. In a second and final clustering stage, all the time-series objects are grouped. We consider internal clustering quality to automatically adjust the main parameter of the algorithm, which is an error threshold for the segmentation. The results obtained on 84 datasets from the UCR Time Series Classification Archive have been compared against three state-of-the-art methods, showing that the performance of this methodology is very promising, especially on larger datasets.
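The first stage described above can be sketched as follows: a window grows point by point while a least-squares polynomial still fits it within an error threshold, and a new segment starts when the fit degrades. This is only an illustrative reading of "growing window" segmentation; the error measure (root mean squared residual), the default degree, and the `min_length` parameter are assumptions rather than details taken from the article.

```python
import numpy as np


def growing_window_segments(series, degree=1, error_threshold=0.1, min_length=3):
    """Split a series into half-open (start, end) segments of varying length."""
    y = np.asarray(series, dtype=float)
    n = len(y)
    segments = []
    start = 0
    while start < n:
        end = min(start + min_length, n)
        # Try to extend the window by one point at a time.
        while end < n:
            t = np.arange(end + 1 - start)
            window = y[start:end + 1]
            coeffs = np.polyfit(t, window, degree)
            residual = window - np.polyval(coeffs, t)
            if np.sqrt(np.mean(residual ** 2)) > error_threshold:
                break  # adding this point breaks the fit; close the segment
            end += 1
        segments.append((start, end))
        start = end
    return segments


# Toy series: an upward line followed by a steeper downward line.
toy_series = list(range(10)) + [9 - 3 * k for k in range(1, 11)]
toy_segments = growing_window_segments(toy_series)
```

Each resulting segment could then be described by its fitted coefficients and a few summary statistics, which is what allows segments of different lengths to be compared in a common space.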
Data Science is the area that comprises the development of scientific methods, processes, and systems for extracting knowledge from previously collected data, aiming to analyse the procedures being carried out currently. The professional profile associated with this field is the Data Scientist, a role generally filled by Computer Engineers, as the skills and competencies acquired during their training are well suited to what this job requires. Due to the need for training new Data Scientists, among other goals, there are different emerging platforms where they can acquire extensive experience, such as Kaggle. The main objective of this teaching experience is to provide students with practical experience on a real problem, as well as the possibility of cooperating and competing at the same time. Thus, the acquisition and development of the necessary competencies in Data Science are carried out in a highly motivating environment. The development of activities related to this profile has had a direct impact on the students, with motivation, learning capacity and the continuous updating of knowledge to which Computer Engineers are subjected being fundamental.
Time Series Ordinal Classification (TSOC) is an as yet unexplored field of machine learning consisting in the classification of time series whose labels follow a natural order relationship between them. In this context, a well-known approach for time series nominal classification was previously used: the Shapelet Transform (ST). The exploitation of the ordinal information was included in two steps of the ST algorithm: 1) by using the Pearson’s determination coefficient (R²) for computing the quality of the shapelets, which favours shapelets with better ordering, and 2) by applying an ordinal classifier instead of a nominal one to the transformed dataset. For this, the distance between labels was represented by the absolute value of the difference between the corresponding ranks, i.e. by the L1 norm. In this paper, we study the behaviour of different Lp norms for representing class distances in ordinal regression, evaluating 9 different Lp norms with 7 ordinal time series datasets from the UEA-UCR time series classification repository and 10 different ordinal classifiers. The results achieved demonstrate that the Pearson’s determination coefficient using the L1.9 norm in the computation of the difference between the shapelet and the time series labels achieves a significantly better performance when compared to the rest of the approaches, in terms of both Correct Classification Rate (CCR) and Average Mean Absolute Error (AMAE).
Activation functions are used in neural networks as a tool to introduce non-linear transformations into the model and, thus, enhance its representation capabilities. They also determine the output range of the hidden layers and the final output. Traditionally, artificial neural networks mainly used the sigmoid activation function as the depth of the network was limited. Nevertheless, this function tends to saturate the gradients when the number of hidden layers increases. For that reason, in recent years, most of the works published related to deep learning and convolutional networks use the Rectified Linear Unit (ReLU), given that it provides good convergence properties and speeds up the training process thanks to the simplicity of its derivative. However, this function has some known drawbacks that gave rise to new proposals of alternative activation functions based on ReLU. In this work, we describe, analyse and compare different recently proposed alternatives to test whether these functions improve the performance of deep learning models with respect to the standard ReLU.
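As a small illustration of what "alternatives based on ReLU" can look like, the sketch below implements ReLU alongside two widely known variants, Leaky ReLU and ELU. These two were chosen here as familiar examples; the abstract does not list which functions the work actually compares, so they should not be read as its selection.

```python
import numpy as np


def relu(x):
    """Standard ReLU: zero for negative inputs, identity otherwise."""
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs instead of zero."""
    return np.where(x > 0, x, alpha * x)


def elu(x, alpha=1.0):
    """ELU: smooth exponential saturation towards -alpha for negative inputs."""
    # expm1 is evaluated on the clipped negative part only, avoiding overflow.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


inputs = np.array([-2.0, 0.0, 3.0])
relu_out = relu(inputs)
leaky_out = leaky_relu(inputs)
elu_out = elu(inputs)
```

Both variants keep a non-zero response for negative inputs, which addresses the commonly cited drawback that ReLU units with negative pre-activations receive no gradient.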
Classification and regression techniques are two of the main tasks considered by the Machine Learning area. They mainly depend on the target variable to predict. In this context, ordinal classification represents an intermediate task, which is focused on the prediction of nominal variables where the categories follow a specific intrinsic order given by the problem. Nevertheless, the integration of different algorithms able to solve ordinal classification problems is often unavailable in most existing Machine Learning software, which hinders the use of new approaches. Therefore, this paper focuses on the incorporation of an ordinal classification algorithm (NSLVOrd) in one of the most complete ordinal regression frameworks, the "Ordinal Regression and Classification Algorithms framework (ORCA)", by using both fuzzy rules and the JFML library. The use of NSLVOrd in the ORCA tool is shown, together with a case study on a real database in which the obtained results are promising.
Donor-Recipient (D-R) matching is one of the main challenges to be fulfilled nowadays. Due to the increasing number of recipients and the small number of donors in liver transplantation, the allocation method is crucial. In this paper, to establish a fair comparison, the United Network for Organ Sharing database was used with 4 different end-points (3 months, and 1, 2 and 5 years), with a total of 39,189 D-R pairs and 28 donor and recipient variables. Modelling techniques were divided into two groups: 1) classical statistical methods, including Logistic Regression (LR) and Naïve Bayes (NB), and 2) standard machine learning techniques, including Multilayer Perceptron (MLP), Random Forest (RF), Gradient Boosting (GB) or Support Vector Machines (SVM), among others. The methods were compared with standard scores, MELD, SOFT and BAR. For the 5-year end-point, LR (AUC = 0.654) outperformed several machine learning techniques, such as MLP (AUC = 0.599), GB (AUC = 0.600), SVM (AUC = 0.624) or RF (AUC = 0.644), among others. Moreover, LR also outperformed standard scores. The same pattern was reproduced for the other 3 end-points. Complex machine learning methods were not able to improve the performance of liver allocation, probably due to the implicit limitations associated with the collection process of the database.
This paper evaluates the performance of different evolutionary neural network models in a problem of solar radiation prediction at Toledo, Spain. The prediction problem has been tackled exclusively from satellite-based measurements and variables, which avoids the use of data from ground stations or atmospheric soundings. Specifically, three types of neural computation approaches are considered: neural networks with sigmoid-based neurons, radial basis function units and product units. In all cases these neural computation algorithms are trained by means of evolutionary algorithms, leading to robust and accurate models for solar radiation prediction. The results obtained in the solar radiation estimation at the radiometric station of Toledo show an excellent performance of evolutionary neural networks tested. The structure sigmoid unit-product unit with evolutionary training has been shown as the best model among all tested in this paper, able to obtain an extremely accurate prediction of the solar radiation from satellite images data, and outperforming all other evolutionary neural networks tested, and alternative Machine Learning approaches such as Support Vector Regressors or Extreme Learning Machines.
Time series ordinal classification is one of the less studied problems in time series data mining. This problem consists in classifying time series with labels that show a natural order between them. In this paper, an approach is proposed based on the Shapelet Transform (ST) specifically adapted to ordinal classification. ST consists of two different steps: 1) the shapelet extraction procedure and its evaluation; and 2) the classifier learning using the transformed dataset. In this way, regarding the first step, 3 ordinal shapelet quality measures are proposed to assess the shapelets extracted, and, for the second step, an ordinal classifier is applied once the transformed dataset has been constructed. An empirical evaluation is carried out, considering 7 ordinal datasets from the UEA & UCR Time Series Classification (TSC) repository. The results show that a support vector ordinal classifier applied to the ST using the Pearson’s correlation coefficient (R²) is the combination achieving the best results in terms of two evaluation metrics: accuracy and average mean absolute error. A final comparison against three of the most popular and competitive nominal TSC techniques is performed, demonstrating that ordinal approaches can achieve higher performances even in terms of accuracy.
Purpose of review: Machine Learning (ML) techniques play an important role in organ transplantation. Analysing the main tasks for which they are being applied, together with the advantages and disadvantages of their use, can be of crucial interest for clinical practitioners. Recent findings: In the last 10 years, there has been an explosion of interest in the application of ML techniques to organ transplantation. Several approaches have been proposed in the literature aiming to find universal models by considering multicenter cohorts or cohorts from different countries. Moreover, deep learning has recently been applied as well, demonstrating a notable ability to deal with vast amounts of information. Summary: Organ transplantation can benefit from ML, which can improve the current procedures for donor-recipient matching or improve standard scores. However, correct preprocessing is needed to provide consistent and high quality databases for ML algorithms, with the aim of obtaining robust and fair approaches to support expert decision-making systems.
Nominal time series classification has been widely developed over recent years. However, to the best of our knowledge, ordinal classification of time series is an unexplored field, and this paper proposes a first approach in the context of the shapelet transform (ST). For those time series datasets where there is a natural order between the labels and the number of classes is higher than 2, nominal classifiers are not capable of achieving the best results, because the models impose the same misclassification cost on all errors, regardless of the difference between the predicted and the ground-truth labels. To address this, we consider four different shapelet quality measures, three of them of an ordinal nature. The first one is the widely known Information Gain (IG), proved to be very competitive for ST methods, whereas the remaining three measures try to boost the order information by refining the quality measure. These three measures are a reformulation of the Fisher score, the Spearman’s correlation coefficient (ρ), and finally, the Pearson’s correlation coefficient (R²). An empirical evaluation is carried out, considering 7 ordinal datasets from the UEA & UCR time series classification repository, 4 classifiers (2 of them of nominal nature, whereas the other 2 are of ordinal nature) and 2 performance measures (correct classification rate, CCR, and average mean absolute error, AMAE). The results show that, for both performance metrics, the ST quality metric based on R² is able to obtain the best results, especially for AMAE, for which the differences are statistically significant in favour of R².
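The two performance measures mentioned above can be illustrated with a short sketch. It assumes the usual definitions (CCR as the fraction of correct predictions, AMAE as the mean absolute error computed per true class and then averaged over classes) and that labels are encoded as consecutive integers; the function names are chosen here for illustration.

```python
def ccr(y_true, y_pred):
    """Correct classification rate: fraction of exactly correct predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def amae(y_true, y_pred):
    """Average mean absolute error: MAE computed separately for each true
    class and then averaged, so that minority classes weigh as much as
    majority ones. Labels are assumed to be integer ranks (0, 1, 2, ...)."""
    per_class = []
    for c in sorted(set(y_true)):
        errors = [abs(t - p) for t, p in zip(y_true, y_pred) if t == c]
        per_class.append(sum(errors) / len(errors))
    return sum(per_class) / len(per_class)

# Imbalanced toy example: the majority class is always right, while the
# two minority samples are misclassified by one and two ranks respectively.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 0]
print(ccr(y_true, y_pred), amae(y_true, y_pred))
```

In the toy example the CCR stays fairly high because the majority class dominates, whereas AMAE penalises the distant errors on the minority classes, which is why both measures are usually reported together for ordinal problems.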
In this paper we tackle a problem of convective situations analysis at Adolfo-Suarez Madrid-Barajas International Airport (Spain), based on Ordinal Regression algorithms. The diagnosis of convective clouds is key in a large airport like Barajas, since these meteorological events are associated with strong winds and local precipitation, which may affect air and land operations at the airport. In this work, we deal with a 12-h time horizon in the analysis of convective clouds, using as input variables data from a radiosonde station and also from numerical weather models. The information about the objective variable (presence of convective clouds at the airport) has been obtained from the Madrid-Barajas METAR and SPECI aeronautical reports. We treat the problem as an ordinal regression task, where there exists a natural order among the classes. Moreover, the classification problem is highly imbalanced, since there are very few convective cloud events compared to clear days. Thus, a process of oversampling is applied to the database in order to obtain a better balance of the samples for this specific problem. A considerable number of ordinal regression methods are then tested in the experimental part of the work, showing that the best approach for this problem is the SVORIM algorithm, based on the Support Vector Machine strategy, but adapted for ordinal regression problems. The SVORIM algorithm shows good accuracy in the case of thunderstorms and Cumulonimbus clouds, which represent a real hazard for airport operations.
In the last decade, the sound quality of electric induction motors has been a hot topic in research. In particular, due to the high number of applications of these motors, the population is exposed to physical and psychological discomfort caused by their noise emission. Therefore, it is necessary to minimise its psychological impact on the population. The main goal of this work is thus to evaluate the use of multitask artificial neural networks as a modelling technique for simultaneously predicting psychoacoustic parameters of induction motors. Several inputs are used, such as the electrical magnitudes of the motor power signal and the number of poles, instead of separating the noise of the electric motor from the environmental noise. Two different kinds of artificial neural networks are proposed to evaluate the acoustic quality of induction motors, using the equivalent sound pressure, the loudness, the roughness and the sharpness as outputs. Specifically, two different topologies have been considered: simple models and more complex models. The former are more interpretable, while the latter lead to higher accuracy at the cost of hiding the cause-effect relationship. Focusing on the simple interpretable models, product unit neural networks achieved the best results: 38.77 for MSE and 13.11 for SEP. The main benefit of this product unit model is its simplicity, since only 10 input variables are used, outlining the effective transfer mechanism of multitask artificial neural networks to extract common features of multiple tasks. Finally, a deep analysis of the acoustic quality of induction motors is done using the best product unit neural networks.
The prediction of convective cloud formation is a very important problem in different areas such as agriculture, natural hazard prevention or transport-related facilities, among others. In this paper we evaluate the capacity of different types of evolutionary artificial neural networks to predict the formation of convective clouds, tackling the problem as a classification task. We use data from Madrid-Barajas airport, including variables and indices derived from the Madrid-Barajas airport radiosonde station. As objective variable, we use the cloud information contained in the METAR and SPECI meteorological reports from the same airport and we consider a prediction time-horizon of 12 hours. The performance of different types of evolutionary artificial neural networks is discussed and analysed, including three types of basis functions (Sigmoidal Unit, Product Unit and Radial Basis Function), and two types of models: a mono-objective evolutionary algorithm with two objective functions and a multi-objective evolutionary algorithm optimised by the two objective functions simultaneously. We show that some of the developed neuro-evolutionary models obtain high quality solutions to this problem, which is characterised by a high class imbalance.
Several European countries have established criteria for prioritising initiation of treatment in patients infected with the hepatitis C virus (HCV) by grouping patients according to clinical characteristics. Based on neural network techniques, our objective was to identify those factors for HIV/HCV co-infected patients (to which clinicians have given careful consideration before treatment uptake) that have not been included among the prioritisation criteria. This study was based on the Spanish HERACLES cohort (NCT02511496) (April-September 2015, 2940 patients) and involved application of different neural network models with different basis functions (product-unit, sigmoid unit and radial basis function neural networks) for automatic classification of patients for treatment. An evolutionary algorithm was used to determine the architecture and estimate the coefficients of the model. This machine learning methodology found that radial basis neural networks provided a very simple model in terms of the number of patient characteristics to be considered by the classifier (in this case, six), returning a good overall classification accuracy of 0.767 and a minimum sensitivity (for the classification of the minority class, untreated patients) of 0.550. Finally, the area under the ROC curve was 0.802, which proved to be exceptional. The parsimony of the model makes it especially attractive, using just eight connections. The independent variable “recent PWID” is compulsory due to its importance. The simplicity of the model means that it is possible to analyse the relationship between patient characteristics and the probability of belonging to the treated group.
This paper presents a novel approach to tackle simultaneously short- and long-term energy flux prediction (specifically, at 6h, 12h, 24h and 48h time horizons). The methodology proposed is based on the Multi-Task Learning paradigm in order to solve the four problems with a single model. We consider Multi-Task Evolutionary Artificial Neural Networks (MTEANN) with four outputs, one for each time prediction horizon. For this purpose, three buoys located in the Gulf of Alaska are considered. Measurements collected by these buoys are used to obtain the target values of energy flux, whereas only reanalysis data are used as input values, allowing the methodology to be applied to other locations. The performance of three different basis functions (Sigmoidal Unit, Radial Basis Function and Product Unit) is compared against some popular state-of-the-art approaches such as Extreme Learning Machines and Support Vector Regressors. The results show that the MTEANN methodology using Sigmoidal Units in the hidden layer and a linear output achieves the best performance. In this way, the multi-task methodology is an excellent and lower-complexity approach for energy flux prediction at both short- and long-term prediction time horizons. Furthermore, the results also confirm that reanalysis data are sufficient to describe the problem tackled.
The aim of this study is to develop and validate a machine learning (ML) model for predicting survival after liver transplantation based on pre-transplant donor and recipient characteristics. For this purpose, we consider a database from the United Network for Organ Sharing (UNOS), containing 29 variables and 39,095 donor-recipient pairs, describing liver transplantations performed in the United States of America from November 2004 until June 2015. The dataset contains more than 74% censoring, which makes this a challenging and difficult problem. Several methods, including proportional-hazards regression models and ML methods such as Gradient Boosting, were applied, using 10 donor characteristics, 15 recipient characteristics and 4 shared variables associated with the donor-recipient pair. In order to measure the performance of the seven state-of-the-art methodologies, three different evaluation metrics are used, the concordance index (ipcw) being the most suitable for this problem. The results achieved show that, for each measure, a different technique obtains the highest value, with all techniques performing almost the same, but, if we focus on ipcw, Gradient Boosting outperforms the rest of the methods.
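To make the idea of a concordance index under censoring concrete, the following sketch computes the plain (unweighted) Harrell's concordance index. Note that this is a simpler relative of the ipcw variant used in the abstract above, which additionally reweights pairs by the inverse probability of censoring; the function name and the handling of ties are assumptions made here.

```python
def concordance_index(times, events, risks):
    """Unweighted Harrell's C-index (not the IPCW-weighted variant).

    times  : observed follow-up times
    events : 1/True if the event was observed, 0/False if censored
    risks  : predicted risk scores (higher means earlier expected event)

    A pair (i, j) is comparable when subject i had an observed event
    strictly before time j. Tied observation times are ignored here.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means that every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and heavy censoring reduces the number of comparable pairs, which is one motivation for the weighted variant.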
Shapelets are phase independent subseries that can be used to discriminate between time series. Shapelets have proved to be very effective primitives for time series classification. The two most prominent shapelet based classification algorithms are the shapelet transform (ST) and learned shapelets (LS). One significant difference between these approaches is that ST is data driven, whereas LS searches the entire shapelet space through stochastic gradient descent. The weakness of the former is that full enumeration of possible shapelets is very time consuming. The problem with the latter is that it is very dependent on the initialisation of the shapelets. We propose hybridising the two approaches through a pipeline that includes a time constrained data driven shapelet search which is then passed to a neural network architecture of learned shapelets for tuning. The tuned shapelets are extracted and formed into a transform, which is then classified with a rotation forest. We show that this hybrid approach is significantly better than either approach in isolation, and that the resulting classifier is not significantly worse than a full shapelet search.
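The core operation shared by shapelet-based methods, the distance between a shapelet and a series, can be sketched as below. This is a minimal illustration, assuming a plain Euclidean distance minimised over all alignments; many implementations also normalise each subseries first, which is omitted here, and the function names are chosen for illustration.

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any subseries of
    the same length. Taking the minimum over all start positions is what
    makes the shapelet phase independent."""
    m = len(shapelet)
    if m > len(series):
        raise ValueError("shapelet is longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        sq = sum((series[start + k] - shapelet[k]) ** 2 for k in range(m))
        best = min(best, sq)
    return math.sqrt(best)

def shapelet_transform(dataset, shapelets):
    """Map each series to a vector of distances, one per shapelet."""
    return [[shapelet_distance(s, sh) for sh in shapelets] for s in dataset]
```

The transformed dataset has one column per shapelet, so any standard classifier can then be trained on it.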
Wave height prediction is an important task for ocean and marine resource management. Traditionally, regression techniques are used for this prediction, but estimating continuous changes in the corresponding time series can be very difficult. With the purpose of simplifying the prediction, wave height can be discretised in consecutive intervals, resulting in a set of ordinal categories. Although this discretisation could be performed using the criterion of an expert, the prediction could be biased towards the opinion of the expert, and the obtained categories could be unrepresentative of the data recorded. In this paper, we propose a novel automated method to categorise the wave height based on selecting the most appropriate distribution from a set of well-suited candidates. Moreover, given that the categories resulting from the discretisation show a clear natural order, we propose to use different ordinal classifiers. The methodology is tested on real wave height data collected from two buoys located in the Gulf of Alaska. We also incorporate reanalysis data in order to increase the accuracy of the predictors. The results confirm that this kind of discretisation is suitable for the time series considered and that the ordinal classifiers achieve outstanding results in comparison with nominal techniques.
Desiccant wheels (DW) could be a serious alternative to conventional dehumidification systems based on direct expansion units, which depend on electrical energy. The main objective of this work was to evaluate the use of multitask artificial neural networks (ANNs) as a modelling technique for DWs activated at low temperature with low computational load and good accuracy. Two different ANN models were developed to predict two output variables: outlet process air temperature and humidity ratio. The results show that a sigmoid unit neural network obtained 0.390 and 2.987 for MSE and SEP, respectively. These results outline the effective transfer mechanism of multitask ANNs to extract common features of multiple tasks, being useful for modelling a DW activated at low temperature. On the other hand, moisture removal capacity of the DW and its performance were analysed under several inlet air conditions, showing an increase under process air conditions close to saturation air.
The prediction of low-visibility events is very important in many human activities, and crucial in transportation facilities such as airports, where they can have a severe impact on flight scheduling and safety. The design of accurate predictors for low-visibility events can be approached by modelling future visibility conditions based on past values of different input variables, recorded at the airport. The use of autoregressive time series forecasters involves adjusting the order of the model (number of past series values or size of the sliding window), which usually depends on the dynamical nature of the time series. Moreover, the same window size is normally used for all the data, though it would be reasonable to use different sliding windows. In this paper, we propose a hybrid prediction model for daily low-visibility events, which combines fixed-size and dynamic windows, and adapts its size according to the dynamics of the time series. Moreover, visibility is labelled using three ordered categories (FOG, MIST and CLEAR), and the prediction is then carried out by means of ordinal classifiers, in order to take advantage of the ordinal nature of low-visibility events. We evaluate the model using a dataset from Valladolid airport (Spain), where radiation fog is very common in autumn and winter months. The considered dataset includes five different meteorological input variables (wind speed and direction, temperature, relative humidity and QNH - pressure adjusted at mean sea level) and the Runway Visual Range (RVR), which is used to characterise the low-visibility events at the airport. The results show that the proposed hybrid window model with ordinal classification leads to very robust prediction performance at a daily time horizon, improving on the results obtained by the persistence model and the alternative prediction schemes tested.
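The fixed-size sliding window mentioned above can be sketched as follows. This only illustrates the fixed-window part of an autoregressive set-up; the dynamic window, whose adaptation rule is not detailed in the abstract, is not attempted here, and the function name and arguments are hypothetical.

```python
def sliding_window_dataset(series, labels, window):
    """Build an autoregressive dataset with a fixed window size.

    Each pattern contains the `window` values preceding time t, and its
    target is the label observed at time t."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(list(series[t - window:t]))
        y.append(labels[t])
    return X, y

# Toy example with hypothetical values and the three ordered categories.
values = [900, 650, 300, 120, 800]
labels = ["CLEAR", "MIST", "FOG", "FOG", "CLEAR"]
X, y = sliding_window_dataset(values, labels, window=2)
print(X, y)
```

The window size plays the role of the model order discussed in the abstract: a larger window gives each pattern more history at the cost of fewer patterns and more inputs.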
Time series clustering is the process of grouping time series with respect to their similarity or characteristics. Previous approaches usually combine a specific distance measure for time series and a standard clustering method. However, these approaches do not take the similarity of the different subsequences of each time series into account, which can be used to better compare the time series objects of the dataset. In this paper, we propose a novel technique of time series clustering based on two clustering stages. In a first step, a least squares polynomial segmentation procedure is applied to each time series, which is based on a growing window technique that returns different-length segments. Then, all the segments are projected into the same dimensional space, based on the coefficients of the model that approximates the segment and a set of statistical features. After mapping, a first hierarchical clustering phase is applied to all mapped segments, returning groups of segments for each time series. These clusters are used to represent all time series in the same dimensional space, after defining another specific mapping process. In a second and final clustering stage, all the time series objects are grouped. We consider internal clustering quality to automatically adjust the main parameter of the algorithm, which is an error threshold for the segmentation. The results obtained on 84 datasets from the UCR Time Series Classification Archive have been compared against two state-of-the-art methods, showing that the performance of this methodology is very promising.
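The growing window segmentation step described above can be sketched as follows. This is a simplified illustration rather than the authors' procedure: it assumes a degree-1 (linear) least squares fit, a sum of squared residuals as the error, and hypothetical names for the functions and the `max_error` threshold.

```python
def linear_fit_error(values):
    """Sum of squared residuals of the least squares line through `values`,
    using 0, 1, 2, ... as the time index."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in enumerate(values))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((v - (intercept + slope * x)) ** 2 for x, v in enumerate(values))

def growing_window_segments(series, max_error, min_len=2):
    """Split `series` into half-open segments (start, end).

    Each window starts with `min_len` points and keeps growing while the
    fit that includes the next point stays within `max_error`."""
    segments = []
    n = len(series)
    start = 0
    while start < n:
        end = min(start + min_len, n)
        while end < n and linear_fit_error(series[start:end + 1]) <= max_error:
            end += 1
        segments.append((start, end))
        start = end
    return segments

# A rise followed by a fall is split where the linear fit breaks down.
print(growing_window_segments([0, 1, 2, 3, 4, 4, 3, 2, 1, 0], max_error=0.1))
```

Because the window stops growing as soon as the error threshold is exceeded, segments naturally have different lengths, and the threshold controls how coarse the resulting representation is.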
Wave height prediction is an important task for ocean and marine resource management. Traditionally, regression techniques are used for this prediction, but estimating continuous changes in the corresponding time series can be very difficult. With the purpose of simplifying the prediction, wave height can be discretised in consecutive intervals, resulting in a set of ordinal categories. Although this discretisation could be performed using the criterion of an expert, the prediction could be biased towards the opinion of the expert, and the obtained categories could be unrepresentative of the data recorded. In this paper, we propose a novel automated method to categorise the wave height based on selecting the most appropriate distribution from a set of well-suited candidates. Moreover, given that the categories resulting from the discretisation show a clear natural order, we propose to use different ordinal classifiers instead of nominal ones. The methodology is tested on real wave height data collected from two buoys located in the Gulf of Alaska and South Kodiak. We also incorporate reanalysis data in order to increase the accuracy of the predictors. The results confirm that this kind of discretisation is suitable for the time series considered and that the ordinal classifiers achieve outstanding results in comparison with nominal techniques.
Very low visibility events caused by fog are a recurrent problem in certain areas close to rivers and large mountains, and they strongly affect human activity in different ways. This type of event can entail very significant material and even human costs. One of the sectors most affected by very low visibility conditions is transport, mainly air transport, whose activity is seriously reduced, causing delays, cancellations and, in the worst case, terrible accidents. At Valladolid airport, low-visibility situations due to fog are very frequent, especially in the months considered to be winter (November, December, January and February). This directly affects the way in which flights operate at this airport. It is therefore very important to know the possible short-term fog conditions in order to apply safety and organisational procedures within the airport. This paper proposes the use of different dynamic and static window models together with machine learning classifiers for the prediction of fog levels. Instead of tackling the problem as a regression task, the variable of interest for characterising the visibility level at the airport (Runway Visual Range, RVR) is discretised into 3 categories, which makes the resulting classification models more robust. The results indicate that a combination of a dynamic window with a static window, together with classification models based on Gradient Boosted Trees, is the methodology that provides the best results.
The amount of data available in time series has recently been increasing exponentially, making time series preprocessing and analysis difficult. This paper adapts different methods for time series representation, which are based on time series segmentation. Specifically, we consider a particle swarm optimization algorithm (PSO) and its barebones exploitation version (BBePSO). Moreover, a new variant of the BBePSO algorithm is proposed, which takes into account the positions of the particles throughout the generations, where those close in time are given more importance. This methodology is referred to as weighted BBePSO (WBBePSO). The solutions obtained by all the algorithms are finally hybridised with a local search algorithm, combining simple segmentation strategies (Top-Down and Bottom-Up). WBBePSO is tested on 13 time series and compared against the rest of the algorithms, showing that it leads to the best results and obtains consistent representations.
Time series segmentation can be approached using metaheuristic procedures such as genetic algorithms (GAs), with the purpose of automatically finding segments and determining similarities in the time series with the lowest possible clustering error. In this way, segments belonging to the same cluster must have similar properties, and the dissimilarity between segments of different clusters should be the highest possible. In this paper we tackle a specific problem of significant wave height time series segmentation, with application in coastal and ocean engineering. The basic idea in this case is that similarity between segments can be used to characterise those segments with high significant wave heights, and thus to predict them. A recent metaheuristic, the Coral Reef Optimization (CRO) algorithm, is proposed for this task, and we analyse its performance by comparing it with that of a GA on three wave height time series collected by three real buoys (two of them in the Gulf of Alaska and another one in Puerto Rico). The results show that the CRO performs better than the GA in this problem of time series segmentation, due to the better exploration of the search space obtained with the CRO.
In recent years there has been a large increase in data series distributed over time, that is, time series. This growth has brought with it an interest in their clustering, the process of grouping series so that the series in the same group are very similar to each other and very different from those in other groups. When time series are very long, or contain noise or missing values, many current methods obtain solutions that are not acceptable. This paper presents a new clustering method for time series, based on polynomials and using a method known as Growing Window. In this way, the series is simplified to a set of linear coefficients of variable degree, so that the different segments can subsequently be grouped and the centroids of each cluster found. In the end, the series is reduced to n x d elements, where n is the number of clusters and d is the degree of the polynomial used in the approximation, and it is with this representation that the final clustering is performed. The aim of this new methodology is to reduce the dimensionality of the time series, the sensitivity of the clustering and the missing data.
This paper proposes a reservoir computing architecture for predicting wind power ramp events (WPREs), which are strong increases or decreases of wind speed in a short period of time. This is a problem of high interest, because WPREs increase the maintenance costs of wind farms and hinder energy production. The standard echo state network architecture is modified by replacing the linear regression used to compute the reservoir outputs with a nonlinear support vector machine, and past ramp function values are combined with reanalysis data to perform the prediction. Another novelty of the study is that we predict three types of events (negative ramps, non-ramps and positive ramps), instead of performing a binary classification of ramps, given that the type of ramp can be crucial for the correct maintenance of the farm. The proposed model obtains satisfactory results, being able to correctly predict around 70% of WPREs and outperforming other models.