Model Validity Tests for RBF Network

M. Y. Mashor
School of Electrical and Electronic Engineering,
University Science of Malaysia,
Perak Branch Campus,
31750 Tronoh, Perak,
Malaysia.
E-mail : yusof@eng.usm.my

Abstract

Model validation is an important step in the system identification process. However, the theoretical derivation of model validity tests for neural networks such as the RBF network is very complicated. The current study investigates the capability of some widely used model validity tests, namely one step ahead prediction, model predicted output, mean squared error and correlation tests. This paper also explores how well these validity tests provide insight into network model deficiencies.

Key Words : Model validation, system identification, RBF network, prediction, mean square error, correlation tests.

1. Introduction

Radial basis function (RBF) networks have been proved theoretically to possess a universal approximation property by Poggio and Girosi (1990). The networks have also been successfully applied in various fields such as system identification (Chen et al., 1991, 1992; Elanayar and Shin, 1994; Pottmann and Seborg, 1992; Ye and Loh, 1993), pattern recognition (Arad et al., 1994; Galicki et al., 1997; Howell and Buxton, 1997), robotics (Feng, 1993; Gorinevsky and Connolly, 1994), medicine (Linkens and Nie, 1993) and business (Tan et al., 1992). In the present study, some model validity tests are investigated for the RBF network in relation to system identification. Unlike conventional parametric models, neural network models are highly non-linear, hence a definitive proof of network validity tests is very difficult. However, identification using neural networks involves learning or estimating mathematical descriptions of systems. Thus, the fundamental results of model validation methods for conventional system identification may also be used for the RBF network.

Model validity tests such as one step ahead prediction, model predicted output, mean squared error, chi-square and correlation tests have been widely used, especially in parametric system identification (Chen et al., 1990; Korenberg et al., 1988; Billings and Voon, 1986; Billings et al., 1992). Due to the complex structure of the RBF network, a mathematical proof of the suitability of these tests for validating an RBF network model would be very complicated. In the present study, the suitability of these validity tests is investigated using simulation.

In this study, the RBF network is trained using a hybrid training algorithm similar to the method introduced by Chen et al. (1992). The method uses exactly the same clustering algorithm and weight estimation method as Chen et al. (1992), but the training is performed off-line, i.e. the RBF centres are positioned before the weights are estimated. Off-line training was selected so that the effect of bad initial centre positions can be reduced. The hybrid training algorithm was selected because many reported results on the RBF network are based on similar training algorithms, so the discussion in this paper will be particularly applicable to those works.

2. RBF Network with Linear Input Connections

An RBF network with m outputs and nh hidden nodes can be expressed as:

$$\hat{y}_i(t) = w_{i0} + \sum_{j=1}^{n_h} w_{ij}\,\phi\big(\|v(t) - c_j(t)\|\big), \qquad i = 1, \ldots, m \tag{1}$$

where wij, wi0 and cj(t) are the connection weights, bias connection weights and RBF centres respectively, v(t) is the input vector to the RBF network, composed of lagged inputs, lagged outputs and lagged prediction errors, and φ(·) is a non-linear basis function. ‖·‖ denotes a distance measure that is normally taken to be the Euclidean norm.

Since neural networks are highly non-linear, even a linear system has to be approximated using the non-linear neural network model. However, modelling a linear system using a non-linear model can never be better than using a linear model. Considering this argument, the RBF network with additional linear input connections is used. The proposed network allows the network inputs to be connected directly to the output node via weighted connections to form a linear model in parallel with the non-linear standard RBF model as shown in Figure 1.

The new RBF network with m outputs, n inputs, nh hidden nodes and nl linear input connections can be expressed as:

$$\hat{y}_i(t) = w_{i0} + \sum_{j=1}^{n_l} \lambda_{ij}\,v_{l_j}(t) + \sum_{j=1}^{n_h} w_{ij}\,\phi\big(\|v(t) - c_j(t)\|\big), \qquad i = 1, \ldots, m \tag{2}$$

where the λ's and vl's are the weights and the input vector for the linear connections respectively, and the remaining symbols are as defined for equation (1). The input vector for the linear connections may consist of past inputs, outputs and noise lags. Since the λ's enter the network output linearly, they can be estimated using the same algorithm as for the w's. As the additional linear connections only introduce a linear model, no significant computational load is added to the standard RBF network training. Furthermore, the number of required linear connections is normally much smaller than the number of hidden nodes in the RBF network. In the present study, the Givens least squares algorithm with additional linear input connection features is used to estimate the w's and λ's. Refer to Chen et al. (1992) or Mashor (1995) for the implementation of the Givens least squares algorithm.
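As an illustration of how the parallel linear path combines with the standard RBF path, the following is a minimal sketch of the forward computation for a single input vector. It assumes the thin-plate-spline basis with φ(0) taken as 0, and the function names, argument names and array shapes are hypothetical choices made for this example rather than part of the original implementation.

```python
import numpy as np

def thin_plate_spline(r):
    """Thin-plate-spline basis phi(r) = r^2 * log(r), taking phi(0) = 0."""
    r = np.asarray(r, dtype=float)
    out = np.zeros_like(r)
    positive = r > 0
    out[positive] = r[positive] ** 2 * np.log(r[positive])
    return out

def rbf_linear_output(v, centres, W, w0, v_lin, Lam):
    """Output of an RBF network with parallel linear input connections.

    v       : (n,)     input vector to the hidden layer
    centres : (nh, n)  RBF centres c_j
    W       : (m, nh)  hidden-to-output weights w_ij
    w0      : (m,)     bias weights w_i0
    v_lin   : (nl,)    input vector for the linear connections
    Lam     : (m, nl)  linear connection weights lambda_ij
    """
    distances = np.linalg.norm(centres - v, axis=1)  # Euclidean norm to each centre
    hidden = thin_plate_spline(distances)            # hidden node outputs
    return w0 + Lam @ v_lin + W @ hidden             # linear model + RBF model

# Toy dimensions and random parameters, purely for illustration.
rng = np.random.default_rng(0)
n, nh, nl, m = 4, 6, 2, 1
centres = rng.normal(size=(nh, n))
W = rng.normal(size=(m, nh))
w0 = np.zeros(m)
Lam = rng.normal(size=(m, nl))
v = rng.normal(size=n)
v_lin = v[:nl]
y_hat = rbf_linear_output(v, centres, W, w0, v_lin, Lam)
```

Because the hidden node outputs and the linear inputs both enter the output linearly, the w's and λ's can be stacked into one parameter vector and estimated together by any linear least squares routine.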


Figure 1. The RBF network with linear input connections

3. Model Validity Tests

A poorly fitted neural network model often predicts badly and can be biased. This may occur due to incorrect input node assignments, noisy data, insufficient hidden nodes, inappropriate values of design parameters etc. Model validity tests are procedures designed to detect model deficiencies. There are several ways of testing a model. In the present study one step ahead prediction (OSA), model predicted output (MPO), mean squared error (MSE) and correlation tests will be used to test the fitted network model.

3.1 One Step Ahead Prediction and Model Predicted Output

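One step ahead prediction has been used by many authors to measure the predictive capability of a fitted model. OSA is given as:

$$\hat{y}(t) = f_s\big(u(t-1), \ldots, u(t-n_u),\; y(t-1), \ldots, y(t-n_y),\; \varepsilon(t-1), \ldots, \varepsilon(t-n_e)\big) \tag{3}$$

and the residual or prediction error is defined as

$$\varepsilon(t) = y(t) - \hat{y}(t) \tag{4}$$

where f_s(·) is a non-linear function, in this case the RBF network, and nu, ny and ne are the maximum lags of the input, output and prediction error respectively. Another test that often gives a better measurement of the predictive capability of a fitted model is model predicted output. MPO can be expressed as:

$$\hat{y}_d(t) = f_s\big(u(t-1), \ldots, u(t-n_u),\; \hat{y}_d(t-1), \ldots, \hat{y}_d(t-n_y),\; 0, \ldots, 0\big) \tag{5}$$

and the deterministic error or deterministic residual is:

$$\varepsilon_d(t) = y(t) - \hat{y}_d(t) \tag{6}$$

It is normal for OSA to be good even when the model is biased, underfitted or overfitted, because ŷ(t) is predicted from the past actual outputs y(t-1), ..., y(t-ny) rather than from past predicted outputs. Unlike OSA, MPO estimates the output from the past values of the predicted output, ŷd(t-1), ..., ŷd(t-ny), so the output is estimated entirely from the fitted model.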
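The difference between the two tests can be sketched in a few lines. The example below is an illustration only: it assumes a model with a single output lag and a single input lag and no noise terms, and the first-order system, the slightly biased model and all function names are hypothetical.

```python
import numpy as np

def one_step_ahead(model, y, u):
    """OSA: predict y(t) from the measured y(t-1) and u(t-1)."""
    y_hat = np.empty_like(y)
    y_hat[0] = y[0]  # no past data for the first sample
    for t in range(1, len(y)):
        y_hat[t] = model(y[t - 1], u[t - 1])
    return y_hat

def model_predicted_output(model, y, u):
    """MPO: predict y(t) from the model's own previous prediction and u(t-1)."""
    y_d = np.empty_like(y)
    y_d[0] = y[0]  # initialise with the first measured output
    for t in range(1, len(y)):
        y_d[t] = model(y_d[t - 1], u[t - 1])
    return y_d

# Hypothetical noise-free first-order system.
rng = np.random.default_rng(1)
u = rng.uniform(-1.0, 1.0, size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.9 * y[t - 1] + 0.5 * u[t - 1]

def biased_model(y_prev, u_prev):
    """A fitted model whose output coefficient is slightly wrong (0.99, not 0.9)."""
    return 0.99 * y_prev + 0.5 * u_prev

osa_mse = np.mean((y - one_step_ahead(biased_model, y, u)) ** 2)
mpo_mse = np.mean((y - model_predicted_output(biased_model, y, u)) ** 2)
print(f"OSA MSE = {osa_mse:.5f}, MPO MSE = {mpo_mse:.5f}")
```

In OSA the measured output corrects the model at every step, so the coefficient error contributes only one step of error. In MPO the error is fed back through the model and accumulates, which is why MPO is the more demanding test.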

To illustrate these concepts, consider system S1, which is a tension leg data set. 1000 data samples were taken from the system for analysis. Initially the network was trained using an adequate input vector, where the input vector was assigned as:

and .

Other parameters were set as , , clustering gain and the number of centres nh = 45. Refer to Chen et al. (1992) or Mashor (1995) for the definition of these parameters. The resulting OSA and MPO, shown in Figures (2a) and (2b), are both good. When the network was trained with the linear input,

while leaving other specifications unchanged, the OSA and MPO in Figures (3a) and (3b) were produced. In this case the OSA plot is good but the MPO plot becomes unstable over the testing data set (from 601 to 1000 data samples). This example shows that OSA cannot always detect the deficiency in a fitted model. On the other hand MPO computed over testing data set will normally reveal model deficiency.


(a). One step ahead prediction

(b). Model predicted output

Figure 2. Predicted outputs of the network model with proper input vector for system S1

 


(a). One step ahead prediction
(b). Model predicted output

Figure 3. Predicted outputs of the network model with improper input vector for system S1

Network overfitting is also hard to detect with the OSA test. For example, suppose system S1 is now over-specified by assigning:

,
and

The MPO is now very bad especially over the testing data set, refer to Figure (4b). In this case, the network tends to include noise as part of the process model that often leads to a more complex model. This type of overfitted network model is often harder to detect because the network normally produces a reasonable OSA but MPO is normally very bad. The MPO in Figure (4b) is much worse than the MPO produced using the network with the appropriate structure in Figure (2b). Since the network is trained to minimise the residual over the training data set, it is not surprising that even an overfitted network will produce a good OSA over the training data set. However, OSA over the testing data set will normally deteriorate if the model is heavily overfitted. This situation is illustrated by the OSA plot in Figure (4a).


(a). One step ahead prediction

(b). Model predicted output
Figure 4. Predicted outputs of the RBF network model with the overfitted structure

3.2 Mean Squared Error

Mean squared error (MSE) is an iterative method of model validation in which the model is tested by calculating the mean squared error at each training step. The mean squared error at the t-th training step is given by:

$$E_{MSE}(t, \Theta(t)) = \frac{1}{n_d} \sum_{i=1}^{n_d} \big(y(i) - \hat{y}(i, \Theta(t))\big)^2 \tag{7}$$

where E_MSE(t, Θ(t)) and ŷ(i, Θ(t)) are the mean squared error and the OSA for the estimated parameters Θ(t) after t training steps respectively, and nd is the number of data samples used to calculate the MSE. The data for calculating the MSE can be the same as the training data or a different set of data (the testing data set).

The MSE plot indicates how fast the prediction error and the network parameters converge as the number of training data increases. This can also be used to determine the number of data samples required to train a network. The MSE will normally decrease as more data are used, but beyond a certain number of samples it no longer decreases significantly. However, the number of data samples required to train the network depends on the training algorithm as well as the network architecture. An example of MSE evolution, generated for system S1, is shown in Figure (5). From the plot, the MSE converges after about 300 data samples, which indicates that the network requires about 300 data samples to be trained properly.
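An MSE evolution curve of this kind can be sketched as follows. This is a simplified illustration: it refits a batch least squares estimate on the first t training samples instead of using a recursive algorithm, the regressors are random stand-ins for hidden node outputs, and every name and number in it is a hypothetical choice for this example.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error between measured and predicted outputs."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

def mse_evolution(Z_train, y_train, Z_eval, y_eval, steps):
    """MSE over the evaluation set after fitting to the first t training samples."""
    curve = []
    for t in steps:
        # Batch refit on the first t samples (stand-in for t recursive updates).
        w, *_ = np.linalg.lstsq(Z_train[:t], y_train[:t], rcond=None)
        curve.append(mse(y_eval, Z_eval @ w))
    return np.array(curve)

# Synthetic linear-in-the-weights data with additive noise.
rng = np.random.default_rng(2)
Z = rng.normal(size=(1000, 5))                 # stand-in regressors
w_true = np.array([1.0, -0.5, 0.3, 0.0, 2.0])  # hypothetical true weights
y = Z @ w_true + 0.1 * rng.normal(size=1000)

steps = list(range(10, 601, 10))
# Train on samples 1..600, evaluate on the testing set 601..1000.
curve = mse_evolution(Z[:600], y[:600], Z[600:], y[600:], steps)
```

Plotting `curve` against `steps` gives the evolution plot; the point where it flattens suggests how many samples the estimator needs. Passing the training data as the evaluation set gives the training-set version of the same curve.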


Figure 5. MSE evolution for system S1.

In general, a good model will produce a good MSE; however, a good MSE does not always imply that the model is good. For instance, an overfitted model will normally produce a good MSE over the training data although the model cannot predict very well and may be biased. This problem may be avoided by splitting the data into two sets, a training set and a testing set. The MSE calculated using the testing data set often provides a better measurement of the predictive capability of a fitted model. This can be seen in Figures (6a) and (6b), where the networks have overfitted input nodes and hidden nodes respectively. The plots show that if the MSE is calculated using the testing data set then the effect of overfitting becomes clear. In the case of hidden node overfitting (refer to Figure 6b), the MSE decreases with an increasing number of hidden nodes for the training data set but increases for the testing data set. In other words, the network loses its generalisation property. Thus, computing the MSE over the testing data set is more helpful for avoiding overfitting.


(a). Input nodes
(b). Hidden nodes
Figure 6. Variation of MSE with the overfitted input nodes and hidden nodes for system S1

The MSE plot also gives a rough idea of the appropriate number of hidden nodes and maximum input lag. The MSE plots for S1 against the maximum input lag and the number of hidden nodes, Figures (6a) and (6b), indicate that the network should have 80 hidden nodes and a maximum lag of 8, that is nu = 8 and ny = 8. With these specifications the network gives the optimum MSE over both the training and testing data sets. Therefore, the MSE test can be used to avoid underfitting and overfitting in the RBF network.

3.3 Correlation Tests

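A non-linear model is considered unbiased if the residual, ε(t), is unpredictable from, i.e. uncorrelated with, all linear and non-linear combinations of past inputs and outputs. Billings and Voon (1986) proved that for a certain class of non-linear systems the following conditions should hold if the fitted non-linear model is adequate:

$$
\begin{aligned}
\Phi_{\varepsilon\varepsilon}(\tau) &= E[\varepsilon(t-\tau)\,\varepsilon(t)] = \delta(\tau) \\
\Phi_{u\varepsilon}(\tau) &= E[u(t-\tau)\,\varepsilon(t)] = 0 \quad \forall \tau \\
\Phi_{u^2\varepsilon}(\tau) &= E\big[\big(u^2(t-\tau) - \overline{u^2}(t)\big)\,\varepsilon(t)\big] = 0 \quad \forall \tau \\
\Phi_{u^2\varepsilon^2}(\tau) &= E\big[\big(u^2(t-\tau) - \overline{u^2}(t)\big)\,\varepsilon^2(t)\big] = 0 \quad \forall \tau \\
\Phi_{\varepsilon(\varepsilon u)}(\tau) &= E[\varepsilon(t)\,\varepsilon(t-1-\tau)\,u(t-1-\tau)] = 0 \quad \tau \ge 0
\end{aligned} \tag{8}
$$

where \overline{u^2}(t) and E[·] are the mean value of u²(t) and the expectation respectively. In practice, the correlation will never be exactly zero for all lags, but the model is considered adequate if the correlation tests lie within the 95% confidence limits, defined as ±1.96/√N, where N is the number of training data. The autocorrelation of the residual will also never be an ideal delta function but is considered adequate if the autocorrelation plot enters the 95% confidence limits before lag one. For a detailed interpretation of the correlation tests, refer to Billings and Voon (1986).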

The correlation between two sequences ψ1(t) and ψ2(t) is normally computed using the following equation:

$$\hat{\Phi}_{\psi_1\psi_2}(\tau) = \frac{\sum_{t=1}^{N-\tau} \psi_1(t)\,\psi_2(t+\tau)}{\sqrt{\sum_{t=1}^{N} \psi_1^2(t)\,\sum_{t=1}^{N} \psi_2^2(t)}} \tag{9}$$

The denominator is used to normalise the estimate such that its values always lie between -1 and 1 irrespective of the signal strength. It is very difficult to prove definitively for neural networks that the tests in equation (8) will detect every possible model deficiency.
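A correlation test of this kind can be sketched as below. The sketch assumes the common normalised sample estimate (the sum of lagged products divided by the square root of the product of the two sums of squares), zero-mean sequences, and the usual ±1.96/√N band for the 95% confidence limits. The signals are synthetic and all names are hypothetical.

```python
import numpy as np

def normalised_cross_correlation(x, y, max_lag=20):
    """Correlation between x(t) and y(t + tau) for tau = 0..max_lag.

    Both sequences are assumed to be zero mean; the normalisation keeps
    every value between -1 and 1.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    scale = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    return np.array([np.sum(x[: n - tau] * y[tau:]) / scale
                     for tau in range(max_lag + 1)])

def confidence_limit(n):
    """Half-width of the 95% confidence band for n samples."""
    return 1.96 / np.sqrt(n)

rng = np.random.default_rng(3)
N = 1000
u = rng.uniform(-1.0, 1.0, size=N)          # zero-mean uniform input
eps_white = rng.normal(scale=0.1, size=N)   # residual of an adequate model
eps_biased = eps_white.copy()
eps_biased[2:] += 0.5 * u[:-2]              # residual still containing a u(t-2) term

limit = confidence_limit(N)
phi_ok = normalised_cross_correlation(u, eps_white)
phi_bad = normalised_cross_correlation(u, eps_biased)
lags_outside = np.flatnonzero(np.abs(phi_bad) > limit)
print("lags outside the 95% limits:", lags_outside)
```

The residual that still contains the omitted input term produces a large value at lag 2, the lag of the missing term. For the adequate residual, roughly one lag in twenty can still fall outside the band by chance, so isolated small excursions should not be over-interpreted.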

Noise can enter a system internally or externally. For a linear system, internal noise can always be translated to be additive at the output, whereas for a non-linear system internal noise can introduce cross-product terms between the input, output and noise. Both types of noise can create problems and will normally induce bias, but internal noise is often more difficult to handle. The problem of noisy data can normally be eliminated by fitting an appropriate noise model.

The theoretical analysis of this problem can be presented under a few assumptions. Assume that the centres of the RBF network have been fixed and the correct design parameters have been specified; the problem of RBF weight estimation then reduces to a linear least squares problem. An RBF network with a single output and nh hidden nodes can be expressed as:

$$y(t) = \sum_{j=1}^{n_h} w_j\,\phi\big(\|v(t) - c_j(t)\|\big) + \varepsilon(t) \tag{10}$$

where wj, cj and ε(t) are the connection weights, RBF centres and prediction error respectively; φ(·) is a basis function, selected to be the thin-plate-spline; and v(t) is an input vector that may consist of past inputs, past outputs or past prediction errors.

The term φ(‖v(t) - cj(t)‖) is the output of the j-th hidden node, which becomes available before the w's are estimated. If this term is represented by zj(t), equation (10) can be expressed as a general regression model:

$$y(t) = \sum_{j=1}^{n_h} w_j\,z_j(t) + \varepsilon(t) \tag{11}$$

where zj(t) and ε(t) are the regressors and the prediction errors respectively. In matrix form equation (11) can be written as:

$$Y = ZW + \Xi \tag{12}$$

where Y = [ y(1) ... y(N) ]^T, W = [ w1 ... wnh ]^T, Ξ = [ ε(1) ... ε(N) ]^T, N is the number of training data and

$$Z = \begin{bmatrix} z_1(1) & \cdots & z_{n_h}(1) \\ \vdots & \ddots & \vdots \\ z_1(N) & \cdots & z_{n_h}(N) \end{bmatrix} \tag{13}$$

The solution of equation (12) is given by the least squares estimate (Goodwin and Payne, 1977):

$$\hat{W} = (Z^T Z)^{-1} Z^T Y \tag{14}$$

Substituting equation (12) into equation (14) and rearranging the terms yields:

$$\hat{W} = W + (Z^T Z)^{-1} Z^T \Xi \tag{15}$$

It is clear from equation (15) that for the estimate Ŵ to be unbiased the term (Z^T Z)^{-1} Z^T Ξ must be zero. This can only happen if Z^T Ξ = 0, which means that the estimate will only approach the true value if the elements of the regressor matrix Z are uncorrelated with Ξ. Because Z is built from past inputs, past outputs and past prediction errors, for a model to be unbiased the prediction error should be uncorrelated with all combinations of past inputs, past outputs and past prediction errors. These requirements are the same as the requirements for the correlation tests in equation (8). Hence the correlation tests may be used to detect bias in RBF networks.
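The bias argument can be checked numerically. The sketch below uses the standard least squares estimate, under which the estimation error equals (ZᵀZ)⁻¹Zᵀ times the residual vector, and compares a residual that is exactly orthogonal to the regressors with one that is correlated with the first regressor. The regressors are random stand-ins for hidden node outputs and all names and values are hypothetical.

```python
import numpy as np

def least_squares(Z, Y):
    """W_hat = (Z^T Z)^{-1} Z^T Y, computed by solving the normal equations."""
    return np.linalg.solve(Z.T @ Z, Z.T @ Y)

rng = np.random.default_rng(4)
N, nh = 500, 3
Z = rng.normal(size=(N, nh))       # stand-in for hidden node outputs z_j(t)
W = np.array([1.0, -2.0, 0.5])     # hypothetical true weights

# Case 1: residual made orthogonal to every column of Z, so Z^T Xi = 0.
raw = rng.normal(size=N)
xi_orth = raw - Z @ least_squares(Z, raw)
bias_orth = least_squares(Z, Z @ W + xi_orth) - W

# Case 2: residual correlated with the first regressor.
xi_corr = 0.5 * Z[:, 0]
bias_corr = least_squares(Z, Z @ W + xi_corr) - W

print("bias with uncorrelated residual:", bias_orth)
print("bias with correlated residual:  ", bias_corr)
```

In the first case the estimate recovers the true weights to numerical precision; in the second the weight of the correlated regressor is shifted by the full 0.5, which is the kind of bias the correlation tests are intended to reveal.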

The capability of the correlation tests to detect model deficiencies will be illustrated by using the following system (called system S2) for the node assignment problem:

1000 data pairs were generated by using a uniformly distributed zero mean white noise sequence u(t) between [-1, +1]. The output of the system was corrupted by a coloured noise formed by a function of a Gaussian white noise, e(t) that has zero mean and variance of 0.05. A network with 50 hidden nodes, , , and clustering gain were used to model the system.

Initially the network was trained by using the correct input vector, where the input vector was assigned as:

The resulting correlation tests, shown in Figure (7a), are all satisfied, hence the model can be considered adequate. When the input vector was assigned by excluding the u(t-2) term, the correlation tests in Figure (7b) were produced. Two of the correlation plots are well outside the confidence limits at lag 2, indicating that a term (or input node) of lag 2 has been omitted from the input vector and that the missing term is an odd-powered term of u. Since u(t), which is uniformly distributed, has an even probability density function (PDF), the residual must be corrupted by an odd-powered u term. The missing term cannot be a noise term or an output term because the autocorrelation test of the residuals is satisfied.


(a). correct input vector
(b). without u(t-2)


(c). without y(t-2)
(d). without u²(t-1)


(e). without noise model

Figure 7. Correlation tests for the input node assignment problem

When the input vector was assigned by excluding the y(t-2) term, the correlation tests in Figure (7c) were produced. Two of the correlation plots are well outside the 95% confidence limits at lag 2, indicating that a y or e term at lag 2 has been omitted. By assigning the input vector to exclude the u²(t-1) term, the correlation tests in Figure (7d) were produced, where two of the tests are outside the 95% confidence limits. This indicates that an even-powered u term is missing from the model, in this case u²(t-1). By assigning the input vector to exclude the e(t-1) term, the correlation tests in Figure (7e) were produced. All the tests except the autocorrelation of the residuals are satisfied, suggesting that the process model is correct but the noise model is biased. The autocorrelation plot enters the 95% confidence limits at lag 2, indicating that a noise term at lag 1 is missing.

This example suggests that the correlation tests can be used to detect missing terms in the RBF network input vector. For a simulation example, where the input vector is known, it is easy to obtain a good model that satisfies all the correlation tests, provided that the number of hidden nodes and the other design parameters are specified correctly. However, in practice, where the true system is unknown, the detection of network deficiency is more challenging and interpretation of the correlation tests is not always so straightforward. Model deficiency may arise not only from an incorrect input vector but also from any incorrect design parameter of the RBF network. However, correlation tests are designed to detect all possible deficiencies in the model irrespective of the cause of the deficiency.

Correlation tests are quite reliable for detecting bias in an RBF network model in cases where OSA and MPO normally fail. To illustrate this idea, consider system S3, the heat exchanger data (Billings and Fadhil, 1985). The network was trained using 40 centres, β0 = 0.99, β(0) = 0.95, η(0) = 0.9, nu = 3, ny = 2 plus a bias input, and 600 data samples were used for training. The MPO of the model is shown in Figure (8a), where the network gives a reasonable prediction over both the training and testing data sets. However, the correlation tests in Figure (8b) suggest that the model is biased, since almost all the correlation plots lie outside the 95% confidence limits.


(a). Model predicted output
(b). Correlation tests
Figure 8. Validation tests for system S3

4. Conclusions

It is very difficult to prove definitively for neural networks that model validation procedures will detect network deficiencies. Model validity tests such as correlation tests, OSA, MPO and mean squared error have been shown to provide some helpful information about the fitted network model. OSA, which is commonly used to measure the predictive capability of a fitted model, has been shown to be inadequate for validating the network model: the OSA plot remained good even when the network model was highly biased, and even the OSA plot over the testing data set deteriorated only slightly. A better validation test for the predictive capability of an RBF network model is MPO, which normally detects most model deficiencies.

MSE can be used to indicate how fast the parameters of the network converge to their final values. Thus, it indicates the minimum number of data samples that should be used to train the network. The results also suggest that the MSE plot over the testing data set can be used to find the optimum number of input nodes and hidden nodes for the RBF network.

The results in section 3.3 suggest that the correlation tests are adequate for detecting missing input terms in the RBF network input vector. In practice, where the true system is unknown, the detection of a deficiency is more challenging and interpretation of the correlation tests is not always so straightforward. Model deficiency may arise not only from an incorrect input vector but also from any incorrect design parameter or from poorly placed RBF centres. Fortunately, correlation tests are designed to detect all possible deficiencies in the model irrespective of the cause of the deficiency.

References

  1. Arad, N., Dyn, N., Reisfeld, D., and Yeshurun, Y., 1994, “Image warping by radial basis functions: application to facial expressions”, CVGIP: Graphical Models and Image Processing, 56 (2), 161-172.

  2. Billings, S.A., and Fadhil, M.B., 1985, “The practical identification of system with non-linearities”, Proc. 7th IFAC Symp. on Identification and System Parameter Estimation, York, U.K., 155-160.

  3. Billings, S.A., and Voon, W.S.F., 1986, “Structure detection and model validity tests in the identification of non-linear systems”, Proc. IEE, Part D, 127, 272-285.

  4. Billings, S.A., Jamaluddin, H.B. and Chen, S., 1992, “Properties of neural networks with applications to modelling non-linear dynamical systems”, Int. J. of Control, 55, 193-224.

  5. Chen, S., Billings, S.A., Cowan, C.F.N., and Grant, P.M., 1990, “Practical identification of NARMAX models using radial basis functions”, Int. J. Control, 52, 1327-1350.

  6. Chen, S., Cowan, C.F.N., and Grant, P.M., 1991, “Orthogonal least squares learning algorithm for radial basis function networks”, IEEE Trans. on Neural Networks, 2, 302-309.

  7. Chen, S., Billings, S.A. and Grant, P.M., 1992, “Recursive hybrid algorithm for non-linear system identification using radial basis function networks”, Int. J. of Control, 55, 1051-1070.

  8. Elanayar V.T., S., and Shin, Y.C., 1994, “Radial basis function neural network for approximation and estimation of non-linear stochastic dynamic systems”, IEEE Trans. on Neural Networks, 5 (4), 594-603.

  9. Feng, G., 1993, “Improved tracking control for robots using neural networks”, Proc. of the 1993 American Control Conf., 1, 69-73.

  10. Galicki, M., Witte, H., J., Eiselt, M., and Griessbach, G., 1997, “Common optimisation of adaptive preprocessing units and a neural network during the learning period. Application in EEG pattern recognition”, Neural Networks, 10(6), 1153-1163.

  11. Goodwin, G.C. and Payne, R.L., 1977, Dynamic System Identification: Experiment Design and Data Analysis, Academic Press, New York.

  12. Gorinevsky, D., and Connolly, T.H., 1994, “Comparison of some neural network and scattered data approximation: The inverse manipulator kinematics example”, Neural Computation, 6 (3), 521-542.

  13. Howell, A.J., and Buxton, H., 1997, “Recognising simple behaviours using time delay RBF networks”, Neural Processing Letters, 5, 97-104.

  14. Korenberg, M.J., Billings, S.A., Liu, Y.P. and McIlroy, P.J., 1988, “Orthogonal parameters estimation algorithm for non-linear stochastic systems”, Int. J. Control, 48 (1), 193-210.

  15. Linkens, D.A., and Nie, J., 1993, “Fuzzified RBF network-based learning control: Structure and self-construction”, IEEE Int. Conf. on Neural Networks, 2, 1016-1021.

  16. Mashor, M.Y., 1995, System identification using radial basis function network, PhD thesis, University of Sheffield, United Kingdom.

  17. Poggio, T., and Girosi, F., 1990, “Networks for approximation and learning”, Proc. of IEEE, 78 (9), 1481-1497.

  18. Pottmann, M., and Seborg, D.E., 1992, “Identification of non-linear processes using reciprocal multiquadratic functions”, J. of Process Control, 2 (4), 189-203.

  19. Tan, P.Y., Lim, G., Chua, K., Wong, F.S., and Neo, S., 1992, “Comparative studies among neural nets, radial basis functions and regression methods”, ICARCV '92. Second Int. Conf. on Automation, Robotics and Computer Vision, 1, NW-3.3/1-6.

  20. Ye, X., and Loh, N.K., 1993, “Dynamic system identification using recurrent radial basis function network”, Proc. of the 1993 American Control Conf., 3, 2912-2916.

Assumption University of Thailand
Huamark, Bangkok 10240 , Thailand