gmentro.py
- Calculate entropy by Gaussian mixture fitting
gmentro.py
[-h
] [-d
] [-q
] [-c
] [-w
] [-J
J] [--order
1
|1.5
|2
|full
] [--center
] [--centeronly
] [--stop
cv
|aicsd
]
[--overfit
OVERFIT] [--sdelta
SDELTA] [--emt
EMT] [--cols
COLS] [--slice
SLICE] [--maxk
MAXK] [--ncand
NCAND] [--unit
J
|c
|e
]
[--odir
directory] [--version
] datafilename
gmentro.py
estimates entropy by Gaussian mixture fitting. The data
file named datafilename must contain an n x d matrix with n
lines and d columns, representing n samples of d variables.
Comment lines starting with #
are allowed in the data file.
-h
-d
-q
-c
-w
-J
J
-q
)--order
1
|1.5
|2
|full
1
only calculates 1-D
entropies and sums them up. 1.5
calculates the first-order
approximation and corrects it with the quasiharmonic mutual information.
2
also calculates pairwise mutual information values and subtracts their
sum from the first-order entropy to obtain the second-order entropy.
The sum of mutual information values will also be reported. full
calculates a
full-dimensional entropy. Default: full
--center
--centeronly
--stop
cv
|aicsd
cv
refers to the cross-validation stopping criterion;
aicsd
uses a combination of the Akaike Information Criterion and entropy change
upon adding another component (see --sdelta
). See section STOPPING CRITERION below.
Default: cv
.--overfit
OVERFIT
--sdelta
SDELTA
--stop
aicsd
.
See section STOPPING CRITERION below. Default: 0.2.--emt
EMT
--cols
COLS
--slice
SLICE
--maxk
MAXK
--ncand
NCAND
--unit
J
|c
|e
J
for J/K/mol, c
for cal/K/mol, e
for
natural units (nats). Default: J
.--odir
directory
--version
The program writes its output to the console, and it also generates 3 output files whose names are formed from the data file name after cutting off its last 4 characters. For example, if the input file is named datafile.dat then the following output files will be generated:
datafile.gme.out
datafile.gme.log
datafile.gme.npz
If the -J
J option is specified to provide a job number then the name
of each output file will include the job number after the .gme.
part, i.e. datafile.gme.
J.out
.
The .out
and .log
files will contain a header section where the
program parameters are summarized. This is followed by the calculation
results. Lines containing textual information will start with a #
character for easier parsing (except in debug mode).
For full-dimensional calculations, the output will be two or three columns,
with the first column being the number of Gaussian components and the
second column the calculated entropy with the given number of
components. This can be used to plot the entropy vs. the number of
components easily, e.g. with gnuplot
. If the cross-validation stopping
criterion is used (as by default), the second and third columns will contain
the entropy as calculated from the training and testing set, respectively.
Generally, the value in the second column should be used as the estimated entropy.
By default, the program uses the cross-validation stopping criterion. This means that the input ensemble is randomly divided into two equal parts which will serve as training and testing sets. The Gaussian mixture fitting is performed on the training set and the log-likelihood is evaluated on the testing set upon adding each new Gaussian component. If this log-likelihood decreases upon adding a new component, the component is discarded and the calculation is stopped.
The cross-validation stopping criterion is very robust, but because the input ensemble is divided randomly into two parts, the estimated entropy will be different upon running the program several times. You can calculate a mean and a standard deviation from the results.
As an alternative to the cross-validation stopping criterion, the
program can use a combination of two stopping criteria: the Akaike
Information Criterion (AIC) and an entropy change criterion. This can
be activated by using the --stop aicsd
option. The program will stop
when either of the two criteria is met.
The AIC is defined by
AIC = 2*npar - 2*LL
where LL is the log-likelihood of the mixture with the given sample, and npar is the number of parameters in the mixture:
npar = k-1 + k*d + k*d*(d+1)/2
which includes the number of weights (k-1), the means of the Gaussian components (k*d parameters), and the covariance matrices (k*d*(d+1)/2 parameters).
The algorithm stops when the AIC increases compared to the previous step, i.e. if AIC(k+1) > AIC(k) then the k-component Gaussian mixture generated in the previous step will be accepted as the best estimate, and the associated entropy will be reported.
The entropy change criterion will cause the program to stop when the
calculated entropy differs by less than SDELTA from the one
calculated in the previous step. The value of SDELTA can be set by
the --sdelta
SDELTA command line option; its default value is 0.1.
Because this is a relatively small value, the AIC will often stop the
program before the entropy change falls below this value.
Note that the algorithm also stops if the log-likelihood does not increase upon performing partial EM on the newly added Gaussian component. Therefore, the program can detect, for example, if the sample was generated from only a single Gaussian distribution.
The program can display a number of warnings as it runs. The most common warnings are:
"x out of n likelihoods are too small"
"No appropriate candidates found. Trying more candidates..."
"Failed to find candidates. Result has not converged."
"No parent can be split, sample too small"
Most convergence problems can be avoided by using the cross-validation
stopping criterion (as by default). When using the --stop aicsd
option, i.e. using a combination of AIC and entropy change stopping
criteria, convergence problems may appear, especially with
undersampled distributions. The Akaike Information Criterion still
prevents convergence problems in most cases because it stops the
algorithm before the number of components could grow too high.
However, for small samples and/or small number of variables, the AIC
may not be able to detect the optimum model, and the entropy change
criterion will be dominant. In this case, if the SDELTA value
specified with the --sdelta
option is too small, the algorithm may
fail to converge. For undersampled distributions, the program may keep
adding more and more components, with the entropy decreasing nearly
linearly until the calculation finally reports that no more components
can be added. This typically indicates insufficient sampling; the data
points probably occupy a random intricate shape in the
high-dimensional space. To solve the problem, increase SDELTA.
Alternatively, increase sample size or reduce the number of
dimensions. In some cases, the ensemble may actually have an overly
complex shape which cannot be approximated well by a Gaussian mixture
with a reasonable number of components.
gmentro.py datafile.dat
datafile.dat
,
using all default parameters.gmentro.py --center datafile.dat
datafile.dat
before
calculating the entropy. The data in the file must be angles measured
in degrees. After the centering, the entropy will be calculated as
usual.gmentro.py --centeronly datafile.dat
datafile.centered.dat
then exit. No entropy will be calculated. The
centered data can then be used in further runs of gmentro.py
.gmentro.py --maxk 1 datafile.dat
datafile.dat
gmentro.py --order 2 datafile.dat
datafile.dat
. All 1-D entropies and all 2-D entropies and mutual
information values, as well as their sums will be reported.gmentro.py --cols 1-4,8-10 --every 3 datafile.dat
gmentro.py --aicsd --sdelta 0.1 --emt 1-e6 datafile.dat
datafile.dat
using a combination of AIC
and entropy change stopping criterion. Calculation will stop when either the AIC
criterion is met or the entropy change upon adding a new component is less than 0.1 J/K/mol.
Execution time may be longer and overfitting may occur.gmentro.py --ncand 50 datafile.dat
It is advisable to run the entropy calculation several times to see
how much the results vary. To start a batch of 6 runs of gmentro.py
on the same data, the following Bourne shell script could be used:
for i in 1 2 3 4 5 6 ; do
gmentro.py -J $i datafile.dat &
done
The output files will be named datafile.gme
.X.yyy where X goes
from 1 to 6 and yyy is one of out, log, and npz.
The -J
J option is also useful when gmentro.py
is run several times
on the same data with different parameters. By adding a job number or
job id string, output files from different runs can have different
names, thereby avoiding the overwriting of the output files from the
previous run.
Entropy calculation on angle data using Gaussian mixtures requires
that the angle distributions be centered. The distribution centering
can be performed using the --center
command line option, which
assumes that the angles in the data file are provided in degrees.
Centering can be omitted if the angle data in the data file are
already pre-centered. Pre-centering can be performed using the
--centeronly
option. Note that entropies calculated on non-centered
angle data will be meaningless.
Written by Andras Szilagyi (szilagyi.andras at ttk.mta.hu). Much of the code was adapted from the Matlab code downloaded from http://lear.inrialpes.fr/people/verbeek/software.php. The reference for the greedy EM method for Gaussian mixture fitting is: Verbeek JJ, Vlassis N, Krose B.: Efficient greedy learning of gaussian mixture models. Neural Comput. 2003 Feb;15(2):469-85. Please contact the author with bug reports, comments, etc.
Please cite gmentro.py
as follows:
Gyimesi G, Zavodszky P, Szilagyi A:\ Calculation of configurational entropy differences from conformational ensembles using Gaussian mixtures.\ J. Chem. Theory Comput., 13(1):29-41. (2017) DOI: 10.1021/acs.jctc.6b00837\ PMID: 27958758
The program can be downloaded from http://gmentropy.szialab.org.