CLMPROTOCOLS2(5) - Linux man page online | File formats

16 May 2014
clmprotocols(5)                  FILE FORMATS                  clmprotocols(5)

NAME
       clmprotocols - Work flows and protocols for mcl and friends

DESCRIPTION
       A guide to doing analysis with mcl and its helper programs.

TOC
       1      General pardon
       1.1    For whom is mcl and for whom is this FAQ?
       1.2    For whom is mcl and for whom is this FAQ?
       2      General questions
       2.1    For whom is mcl and for whom is this FAQ?
       2.2    For whom is mcl and for whom is this FAQ?

FAQ
   General pardon
       1.1    For whom is mcl and for whom is this FAQ?
              For everybody with an appetite for graph clustering.

       1.2    For whom is mcl and for whom is this FAQ?
              For everybody with an appetite for graph clustering.

   General questions
       2.1    For whom is mcl and for whom is this FAQ?
              For everybody with an appetite for graph clustering.

       2.2    For whom is mcl and for whom is this FAQ?
              For everybody with an appetite for graph clustering.

Network representation
       The clustering program mcl expects the name of a file as its first
       argument. If the --abc option is used, the file is assumed to adhere
       to a simple format where a network is specified edge by edge, one
       line and one edge at a time. Each line describes an edge as two
       labels and a numerical value, all separated by white space. The
       labels and the value respectively identify the two nodes and the
       edge weight. The format is called ABC-format, where 'A' and 'B'
       represent the two labels and 'C' represents the edge weight. The
       latter is optional; if omitted the edge weight is set to one. If
       ABC-format is used, the output is returned as a listing of clusters,
       each cluster given as a line of white-space separated labels.

       MCL can also utilize a second representation, which is a stringent
       and unambiguous format for both input and output.
       This is called matrix format and it is required when using other
       programs in the mcl suite, for example when comparing and analysing
       clusterings using clm(1) or when extracting and transforming
       networks using mcx(1). Native mode (matrix format) is entered simply
       by not specifying --abc.

       The recommended approach using mcl is to convert an external format
       to ABC-format. The program mcxload(1) reads the latter and creates a
       native network file and a dictionary file that maps network nodes to
       labels. All applications in the MCL suite, including mcl itself, can
       read this native network file format. Label output can be obtained
       using mcxdump(1). The workflow is thus:

          # External format has been converted to file (abc format)
          mcxload -abc --stream-mirror -write-tab -o data.mci
          mcl data.mci -I 1.4
          mcl data.mci -I 2
          mcl data.mci -I 4
          mcxdump -icl -tabr -o
          mcxdump -icl -tabr -o
          mcxdump -icl -tabr -o

       In this example the cluster output is stored in native format and
       dumped to labels using mcxdump. The stored output can now be used to
       learn more about the clusterings. An example is the following, where
       clm(1) is applied in mode dist to gauge the distance between
       different clusterings.

          clm dist --chain{14,20,40}

Loading large networks
       If you deal with very large networks (say with hundreds of millions
       of edges), it is recommended to use binary format (cf mcxio(5)).
       This is achieved by adding --write-binary to the mcxload command
       line. The resulting file is no longer human-readable, but it will be
       faster to read by a factor of roughly ten to twenty compared to
       standard MCL-edge network format, and by a factor of around fifty
       compared to label format. All MCL-edge programs are able to read
       binary format, and the speed of reading will be in the order of
       millions of edges per second, compared to, for example, roughly 100K
       edges per second for label format. Memory usage for mcxload can be
       lowered by replacing the option --stream-mirror with -ri max.
Converting between formats
   Converting label format to tabular format
       Label format, two or three (including weight) columns:

          mcxload -abc --stream-mirror -write-tab -o data.mci
          mcxdump -imx data.mci -tab --dump-table

       Simple Interaction File (SIF) format:

          mcxload -sif data.sif --stream-mirror -write-tab -o data.mci
          mcxdump -imx data.mci -tab --dump-table

       These two examples are very similar and differ only in the way the
       input to mcxload is specified.

Clustering similarity graphs encoded in BLAST results
       A specific instance of the workflow above is the clustering of
       proteins based on their sequence similarities. In the most typical
       scenario the external format is BLAST output, which needs to be
       transformed to ABC format. In the examples below the input is in
       columnar BLAST format obtained with the blast -m8 option. It
       requires a version of mcl at least as recent as 09-061. First we
       create an ABC-formatted file using the external columnar BLAST
       format, which is assumed to be in a file called seq.cblast.

          cut -f 1,2,11 seq.cblast >

       The columnar format in the file seq.cblast has, for a given BLAST
       hit, the sequence labels in the first two columns and the associated
       E-value in column 11. It is parsed by the standard UNIX cut(1)
       utility. The format must have been created with the BLAST -m8 option
       so that no comment lines are present. Alternatively these can be
       filtered out using grep. The newly created file is loaded by
       mcxload(1), which writes both a network file seq.mci and a
       dictionary file.

          mcxload -abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' -o seq.mci -write-tab

       The --stream-mirror option ensures that the resulting network will
       be undirected, as recommended when using mcl. Omitting this option
       would result in a directed network, as BLAST E-values generally
       differ between two sequences. The default course of action for
       mcxload(1) is to use the best value found between a pair of labels.
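       The combined effect of the --stream-neg-log10 and ceil(200)
       transformations in the mcxload command above can be illustrated
       outside mcxload. The following is a minimal awk sketch and not
       mcxload's implementation: the function name neg_log10_cap and the
       sample labels are hypothetical, and an E-value of zero is assumed to
       map directly to the cap.

```shell
#!/bin/sh
# Sketch only: mimic -log10(E) followed by a cap, applied to the third
# column of tab-separated ABC lines read from standard input.
# neg_log10_cap is a hypothetical helper name; the cap defaults to 200.
neg_log10_cap() {
    awk -v cap="${1:-200}" '
    BEGIN { FS = "\t" }
    {
        e = $3 + 0
        # An E-value of 0 has no finite logarithm; assume it maps to the cap.
        w = (e > 0) ? -log(e) / log(10) : cap
        if (w > cap) w = cap
        printf "%s\t%s\t%.2f\n", $1, $2, w
    }'
}

# Example with made-up sequence labels and E-values.
printf 'seqA\tseqB\t1e-50\nseqA\tseqC\t1e-250\nseqB\tseqC\t0.0\n' | neg_log10_cap 200
```

       In this sketch an E-value of 1e-50 becomes an edge weight of 50,
       while both 1e-250 and 0.0 end up at the cap of 200.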
       The next option, --stream-neg-log10, transforms the numerical values
       in the input (the BLAST E-values) by taking the logarithm in base 10
       and subsequently negating the sign. Finally, the transformed values
       are capped so that any E-value below 1e-200 is set to a maximum
       allowed edge weight of 200.

       To obtain clusterings from seq.mci and the dictionary file one has
       two choices. The first is to generate an abstract clustering
       representation and from that obtain the label output, as follows.
       Below the -o option is not used, so mcl will create meaningful and
       unique output names by itself. The default way of doing this is to
       prepend the prefix out. and to append a suffix encoding the
       inflation value used, with inflation encoded using two digits of
       precision and the decimal separator removed.

          mcl seq.mci -I 1.4
          mcl seq.mci -I 2
          mcl seq.mci -I 4
          mcl seq.mci -I 6
          mcxdump -icl out.seq.mci.I14 -tabr -o dump.seq.mci.I14
          mcxdump -icl out.seq.mci.I20 -tabr -o dump.seq.mci.I20
          mcxdump -icl out.seq.mci.I40 -tabr -o dump.seq.mci.I40
          mcxdump -icl out.seq.mci.I60 -tabr -o dump.seq.mci.I60

       Now the stored clusterings can be used, for example, to compute the
       distances between the encoded clusterings with clm dist, to compute
       a set of strictly reconciled nested clusterings with clm order, or
       to compute an efficiency criterion with clm info.

       Alternatively, label output can be obtained directly from mcl as
       follows.

          mcl seq.mci -I 1.4 -use-tab
          mcl seq.mci -I 2 -use-tab
          mcl seq.mci -I 4 -use-tab
          mcl seq.mci -I 6 -use-tab

Clustering expression data
       The clustering of expression data constitutes another workflow. In
       this case the external format usually is a tabular file format
       containing labels for genes or probes and numerical values measuring
       the expression values or fold changes across a series of conditions
       or experiments. Such tabular files can be processed by mcxarray(1),
       which comes installed with mcl.
       The program computes correlations (either Pearson or Spearman)
       between genes, and creates an edge between genes if their
       correlation exceeds the specified cutoff. From this mcxarray(1)
       creates both a network file and a dictionary file. In the example
       below, the input file is in tabular format with one row of column
       headers (e.g. tags for experiments) and one column of row
       identifiers (e.g. probe or gene identifiers).

          mcxarray -data -skipr 1 -skipc 1 -o expr.mci -write-tab --pearson -co 0.7 -tf 'abs(),add(-0.7)'

       This uses the Pearson correlation, ignoring values below 0.7. The
       remaining values in the interval [0.7, 1] are remapped to the
       interval [0, 0.3]. This is recommended so that the edge weights will
       have increased contrast between them, as mcl is affected by relative
       differences (ratios) between edge weights rather than absolute
       differences. To illustrate this, values 0.75 and 0.95 are mapped to
       0.05 and 0.25, with respective ratios 0.79 and 0.2. The network file
       expr.mci and the dictionary file can now be used as before.

       It is possible to investigate the effect of the correlation cutoff
       as follows. First a network is generated at a very low threshold,
       and this network is analysed using mcx query.

          mcxarray -data -skipr 1 -skipc 1 -o expr20.mcx --write-binary --pearson -co 0.2 -tf 'abs()'
          mcx query -imx expr20.mcx --vary-correlation

       The output is in a tabular format describing the properties of the
       network at increasing correlation thresholds. Examples are the size
       of the biggest component, the number of orphan nodes (not connected
       to any other node), and the mean and median node degrees. A good way
       to choose the cutoff is to balance the number of singletons and the
       median node degree. Both should preferably not be too high. For
       example, the number of orphan nodes should be less than ten percent
       of the total number of nodes, and the median node degree should be
       at most one hundred neighbours.
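       The contrast argument made earlier in this section, that
       subtracting the cutoff increases the relative difference between
       edge weights, can be checked with a little arithmetic. The awk
       sketch below is only an illustration: the function name weight_ratio
       is hypothetical, and it simply compares the ratio of two correlation
       values before and after subtracting the cutoff.

```shell
#!/bin/sh
# Sketch only: compare the ratio of two correlation values before and
# after subtracting a cutoff, as the add(-0.7) transformation does.
# weight_ratio is a hypothetical helper: weight_ratio LOW HIGH CUTOFF
weight_ratio() {
    awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN {
        printf "before: %.2f / %.2f = %.2f\n", a, b, a / b
        printf "after:  %.2f / %.2f = %.2f\n", a - c, b - c, (a - c) / (b - c)
    }'
}

# The two correlation values and the cutoff used in the text.
weight_ratio 0.75 0.95 0.7
```

       For the values 0.75 and 0.95 with cutoff 0.7 this reports a ratio of
       0.79 before the remapping and 0.20 after it, so the weaker edge
       counts for considerably less relative to the stronger one.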
Reducing node degrees in the graph
       A good way to lower node degrees in a network is to require that an
       edge is among the best k edges (those of highest weight) for both
       nodes incident to the edge, for some value of k. This is achieved by
       using knn(k) in the argument to the -tf option to mcl or mcx alter.
       To give an example, a graph was formed on translations in Ensembl
       release 57 on 2.6M nodes. The similarities were obtained from BLAST
       scores, leading to a graph with a total edge count of 300M, with
       best-connected nodes of degree respectively 11148, 9083, 9070, 9019
       and 8988, and with mean node degree 233. These degrees are
       unreasonable. The graph was subjected to mcx query to investigate
       the effect of varying k-NN parameters. A good heuristic is to choose
       a value that does not significantly change the number of singletons
       in the input graph. In the example it meant that -tf 'knn(160)' was
       feasible, leading to a mean node degree of 98.

       A second approach to reduce node degrees is to employ the -ceil-nb
       option. This ranks nodes by node degree, highest first. Nodes are
       considered in order of rank, and edges of low weight are removed
       from the graph until a node satisfies the node degree threshold
       specified by -ceil-nb.

SEE ALSO
       mcxio(5).

AUTHOR
       Stijn van Dongen.

clmprotocols 14-137               16 May 2014               clmprotocols(5)