16 May 2014
clmprotocols(5)                           FILE FORMATS                            clmprotocols(5)

   1. NAME
   2. DESCRIPTION
   3. TOC
   4. FAQ
   5. Network representation
   6. Loading large networks
   7. Converting between formats
   8. Clustering similarity graphs encoded in BLAST results
   9. Clustering expression data
  10. Reducing node degrees in the graph
  11. SEE ALSO
  12. AUTHOR

  NAME
          clmprotocols - Work flows and protocols for mcl and friends

  DESCRIPTION
          A guide to doing analysis with mcl and its helper programs.

  TOC
  1...... General pardon
   1.1... For whom is mcl and for whom is this FAQ?

  FAQ
                                              General pardon

   1.1    For whom is mcl and for whom is this FAQ?

          For everybody with an appetite for graph clustering.

  Network representation
          The clustering program mcl expects the name of a file as its first argument.  If the
          --abc option is used, the file is assumed to adhere to a simple format where a network
          is specified edge by edge, one edge per line.  Each line describes an edge as two
          labels and a numerical value, all separated by white space. The labels identify the
          two nodes and the value gives the edge weight. The format is called ABC-format, where
          'A' and 'B' represent the two labels and 'C' represents the edge weight. The weight is
          optional; if omitted the edge weight is set to one.  If ABC-format is used, the output
          is returned as a listing of clusters, each cluster given as a line of white-space
          separated labels.
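
          As an illustration of the ABC layout described above, the following sketch writes a
          small network and checks each line with awk. The file name example.abc and the helper
          check_abc are hypothetical, and the check (two or three fields, numeric third field)
          is a simplification, not a description of how mcl parses its input.

```shell
# Write a tiny network in ABC format: label A, label B, optional weight C.
cat > example.abc <<'EOF'
cat hat 0.2
hat bat 0.16
bat cat 1.0
bit fit
EOF

# check_abc FILE: report lines that do not look like ABC (hypothetical helper).
check_abc() {
  awk '
    NF < 2 || NF > 3 { printf "line %d: expected 2 or 3 fields\n", NR; bad = 1; next }
    NF == 3 && $3 !~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ {
      printf "line %d: weight is not a number\n", NR; bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

check_abc example.abc && echo "example.abc looks like ABC format"
```

          The last line of example.abc omits the weight, which the text above says defaults to
          one.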

          MCL can also utilize a second representation, which is a stringent and unambiguous
          format for both input and output.  This is called matrix format and it is required
          when using other programs in the mcl suite, for example when comparing and analysing
          clusterings using clm(1) or when extracting and transforming networks using mcx(1).
          Native mode (matrix format) is entered simply by not specifying --abc.

          The recommended approach using mcl is to convert an external format to ABC-format. The
          program mcxload(1) reads the latter and creates a native network file and a dictionary
          file that maps network nodes to labels. All applications in the MCL suite, including
          mcl itself, can read this native network file format. Label output can be obtained
          using mcxdump(1). The workflow is thus:

             #  External format has been converted to file data.abc (abc format)

             mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci

             mcl data.mci -I 1.4
             mcl data.mci -I 2
             mcl data.mci -I 4

             mcxdump -icl out.data.mci.I14 -tabr data.tab -o dump.data.mci.I14
             mcxdump -icl out.data.mci.I20 -tabr data.tab -o dump.data.mci.I20
             mcxdump -icl out.data.mci.I40 -tabr data.tab -o dump.data.mci.I40

          In this example the cluster output is stored in native format and dumped to labels
          using mcxdump. The stored output can now be used to learn more about the clusterings.
          An example is the following, where clm(1) is applied in mode dist to gauge the distance
          between different clusterings.

             clm dist --chain out.data.mci.I{14,20,40}

  Loading large networks
          If you deal with very large networks (say with hundreds of millions of edges), it is
          recommended to use binary format (cf mcxio(5)).  This is achieved by adding
          --write-binary to the mcxload command line. The resulting file is no longer
          human-readable, but it can be read ten to twenty times faster than standard MCL-edge
          network format and around fifty times faster than label format.  All MCL-edge programs
          are able to read binary format, and reading speed will be in the order of millions of
          edges per second, compared to, for example, roughly 100K edges per second for label
          format.
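
          To put these rates in perspective, the arithmetic can be sketched with shell integer
          arithmetic. The network size of 300 million edges is an assumed figure chosen for
          illustration; the label-format rate and the fifty-fold factor are the rough figures
          quoted above, so the results are order-of-magnitude estimates only.

```shell
edges=300000000                    # assumed network size (illustration only)
label_rate=100000                  # edges per second for label format, as quoted above
binary_rate=$((label_rate * 50))   # "around fifty-fold" faster, as quoted above

label_secs=$((edges / label_rate))
binary_secs=$((edges / binary_rate))
echo "label format:  about ${label_secs} s"
echo "binary format: about ${binary_secs} s"
```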

          Memory usage for mcxload can be lowered by replacing the option --stream-mirror with
          -ri max.

  Converting between formats
          Converting label format to tabular format
          Label format, two or three (including weight) columns:

             mcxload -abc data.abc --stream-mirror -write-tab data.tab -o data.mci
             mcxdump -imx data.mci -tab data.tab --dump-table

          Simple Interaction File (SIF) format:

             mcxload -sif data.sif --stream-mirror -write-tab data.tab -o data.mci
             mcxdump -imx data.mci -tab data.tab --dump-table

          It can be noted that these two examples are very similar, and differ only in the way
          the input to mcxload is specified.

  Clustering similarity graphs encoded in BLAST results
          A specific instance of the workflow above is the clustering of proteins based on their
          sequence similarities. In the most typical scenario the external format is BLAST
          output, which needs to be transformed to ABC format.  In the examples below the input
          is in columnar BLAST format obtained with the BLAST -m8 option.  The examples require
          a version of mcl at least as recent as 09-061.  First we create an ABC-formatted file
          from the external columnar BLAST format, which is assumed to be in a file called
          seq.cblast.

             cut -f 1,2,11 seq.cblast > seq.abc

          The columnar format in the file seq.cblast has, for a given BLAST hit, the sequence
          labels in the first two columns and the associated E-value in column 11. It is parsed
          by the standard UNIX cut(1) utility. The format must have been created with the BLAST
          -m8 option so that no comment lines are present. Alternatively these can be filtered
          out using grep.  The newly created seq.abc file is loaded by mcxload(1), which writes
          both a network file seq.mci and a dictionary file seq.tab.
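
          If the BLAST output does contain comment lines, the grep filtering mentioned above
          can be combined with the cut step. A minimal sketch, assuming that comment lines
          start with '#'; the file names sample.cblast and sample.abc and the two hits in them
          are made up for illustration and are not real BLAST results.

```shell
# Made-up sample in 12-column tabular layout, preceded by one comment line.
{
  printf '# sample comment line\n'
  printf 'seqA\tseqB\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n'
  printf 'seqA\tseqC\t45.0\t180\t99\t2\t5\t184\t7\t186\t2e-20\t95.1\n'
} > sample.cblast

# Drop comment lines, then keep query label, subject label and E-value.
grep -v '^#' sample.cblast | cut -f 1,2,11 > sample.abc
cat sample.abc
```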

             mcxload -abc seq.abc --stream-mirror --stream-neg-log10 -stream-tf 'ceil(200)' \
                   -o seq.mci -write-tab seq.tab

          The --stream-mirror option ensures that the resulting network will be undirected, as
          recommended when using mcl. Omitting this option would result in a directed network,
          as the BLAST E-value for a pair of sequences generally depends on which of the two is
          the query. The default course of action for mcxload(1) is to use the best value found
          between a pair of labels. The next option, --stream-neg-log10, transforms the
          numerical values in the input (the BLAST E-values) by taking the logarithm in base 10
          and subsequently negating the sign. Finally, -stream-tf 'ceil(200)' caps the
          transformed values, so that any E-value below 1e-200 is set to the maximum allowed
          edge weight of 200.
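
          The arithmetic of this transformation can be sketched in awk. The helper
          neg_log10_cap is hypothetical and only mimics the negated base-10 logarithm and the
          cap at 200 on made-up ABC lines; it is not how mcxload is implemented, and treating
          an E-value of zero as the cap is an assumption of this sketch.

```shell
# neg_log10_cap [CAP]: read ABC lines on stdin and replace the weight w
# by min(-log10(w), CAP). Hypothetical helper, arithmetic illustration only.
neg_log10_cap() {
  awk -v cap="${1:-200}" '{
    if ($3 + 0 <= 0) w = cap                    # logarithm undefined: use the cap (assumption)
    else { w = -log($3) / log(10); if (w > cap) w = cap }
    printf "%s\t%s\t%.1f\n", $1, $2, w
  }'
}

printf 'seqA\tseqB\t1e-100\nseqA\tseqC\t2e-20\nseqB\tseqD\t1e-250\nseqB\tseqE\t0.0\n' |
  neg_log10_cap 200
```

          Smaller E-values indicate stronger hits, so after the transformation a larger edge
          weight corresponds to a stronger similarity.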

          To obtain clusterings from seq.mci and seq.tab one has two choices. The first is to
          generate an abstract clustering representation and from that obtain the label output,
          as follows.  Below the -o option is not used, so mcl will create meaningful and unique
          output names by itself. The default way of doing this is to prepend the prefix out.
          and to append a suffix encoding the inflation value used, with inflation encoded using
          two digits of precision and the decimal separator removed.

             mcl seq.mci -I 1.4
             mcl seq.mci -I 2
             mcl seq.mci -I 4
             mcl seq.mci -I 6

             mcxdump -icl out.seq.mci.I14 -tabr seq.tab -o dump.seq.mci.I14
             mcxdump -icl out.seq.mci.I20 -tabr seq.tab -o dump.seq.mci.I20
             mcxdump -icl out.seq.mci.I40 -tabr seq.tab -o dump.seq.mci.I40
             mcxdump -icl out.seq.mci.I60 -tabr seq.tab -o dump.seq.mci.I60
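
          The naming scheme described above can be sketched as a small shell helper that
          prints the mcxdump command lines instead of typing them by hand. The helper
          mcl_out_name is hypothetical; it reproduces the four names used above, but rounding
          the inflation value times ten to an integer is an assumption about values with more
          than one decimal, not documented mcl behaviour.

```shell
# mcl_out_name INPUT INFLATION: print the expected default output name
# (hypothetical helper; the rounding rule is an assumption).
mcl_out_name() {
  awk -v f="$1" -v i="$2" 'BEGIN { printf "out.%s.I%d\n", f, int(i * 10 + 0.5) }'
}

# Print (not run) one mcxdump command per inflation value.
for I in 1.4 2 4 6; do
  out=$(mcl_out_name seq.mci "$I")
  echo "mcxdump -icl $out -tabr seq.tab -o dump.${out#out.}"
done
```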

          Now the file out.seq.mci.I14 and its associates can be used, for example, to compute
          the distances between the encoded clusterings with clm dist, to compute a set of
          strictly reconciled nested clusterings with clm order, or to compute an efficiency
          criterion with clm info.

          Alternatively, label output can be obtained directly from mcl as follows.

             mcl seq.mci -I 1.4  -use-tab seq.tab
             mcl seq.mci -I 2  -use-tab seq.tab
             mcl seq.mci -I 4  -use-tab seq.tab
             mcl seq.mci -I 6  -use-tab seq.tab

  Clustering expression data
          The clustering of expression data constitutes another workflow. In this case the
          external format usually is a tabular file format containing labels for genes or probes
          and numerical values measuring the expression values or fold changes across a series
          of conditions or experiments. Such tabular files can be processed by mcxarray(1),
          which comes installed with mcl. The program computes correlations (either Pearson or
          Spearman) between genes, and creates an edge between genes if their correlation
          exceeds the specified cutoff. From this mcxarray(1) creates both a network file and a
          dictionary file. In the example below, the file expr.data is in tabular format with
          one row of column headers (e.g. tags for experiments) and one column of row
          identifiers (e.g. probe or gene identifiers).

             mcxarray -data expr.data -skipr 1 -skipc 1 -o expr.mci -write-tab expr.tab \
                   --pearson -co 0.7 -tf 'abs(),add(-0.7)'

          This uses the Pearson correlation, ignoring values below 0.7.  The remaining values in
          the interval [0.7, 1] are remapped to the interval [0, 0.3]. This is recommended so
          that the edge weights will have increased contrast between them, as mcl is affected by
          relative differences (ratios) between edge weights rather than absolute differences.
          To illustrate this, values 0.75 and 0.95 are mapped to 0.05 and 0.25; the ratio
          between them is 0.79 before the mapping and 0.2 after it.  The network file expr.mci
          and the dictionary file expr.tab can now be used as before.
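
          The computation described above can be sketched in awk on a made-up table. The file
          sample.expr, its values and the helper pearson_edges are hypothetical; the sketch
          applies the cutoff to the signed correlation and then subtracts it, which is an
          assumption about the order of operations and not a description of how mcxarray is
          implemented. The final line recomputes the two ratios for the values 0.75 and 0.95.

```shell
# Made-up expression table: one header row, one column of identifiers.
printf 'probe\te1\te2\te3\te4\n'  > sample.expr
printf 'g1\t1\t2\t3\t4\n'        >> sample.expr
printf 'g2\t2\t4\t6\t9\n'        >> sample.expr
printf 'g3\t4\t3\t2\t1\n'        >> sample.expr

# pearson_edges FILE CUTOFF: print "a b r r-CUTOFF" for row pairs with r >= CUTOFF.
pearson_edges() {
  awk -F '\t' -v co="$2" '
    function pearson(a, b,    j, sa, sb, saa, sbb, sab, den) {
      for (j = 1; j <= m; j++) {
        sa += v[a, j]; sb += v[b, j]
        saa += v[a, j] * v[a, j]; sbb += v[b, j] * v[b, j]
        sab += v[a, j] * v[b, j]
      }
      den = sqrt((m * saa - sa * sa) * (m * sbb - sb * sb))
      return den == 0 ? 0 : (m * sab - sa * sb) / den
    }
    NR == 1 { next }                                   # skip the header row
    { id[++n] = $1; m = NF - 1                         # first column holds the identifier
      for (j = 1; j <= m; j++) v[n, j] = $(j + 1) }
    END {
      for (a = 1; a <= n; a++)
        for (b = a + 1; b <= n; b++) {
          r = pearson(a, b)
          if (r >= co) printf "%s\t%s\t%.3f\t%.3f\n", id[a], id[b], r, r - co
        }
    }
  ' "$1"
}

pearson_edges sample.expr 0.7

# Ratio between 0.75 and 0.95 before and after subtracting 0.7.
awk 'BEGIN { printf "%.2f %.2f\n", 0.75 / 0.95, (0.75 - 0.7) / (0.95 - 0.7) }'
```

          In the sample only the pair g1, g2 passes the cutoff; g3 is negatively correlated
          with both and is dropped.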

          It is possible to investigate the effect of the correlation cutoff as follows.  First a
          network is generated at a very low threshold, and this network is analysed using mcx
          query.

             mcxarray -data expr.data -skipr 1 -skipc 1 -o expr20.mcx --write-binary \
                   --pearson -co 0.2 -tf 'abs()'
             mcx query -imx expr20.mcx --vary-correlation

          The output is in a tabular format describing the properties of the network at
          increasing correlation thresholds. Examples are the size of the biggest component, the
          number of orphan nodes (not connected to any other node), and the mean and median node
          degrees.  A good way to choose the cutoff is to balance the number of orphan nodes and
          the median node degree. Both should preferably not be too high.  For example the
          number of orphan nodes should be less than ten percent of the total number of nodes,
          and the median node degree should be at most one hundred neighbours.
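
          The rule of thumb above can be sketched as an awk filter. The table below is made up
          and is NOT actual mcx query output: its three columns (cutoff, number of orphan
          nodes, median node degree), the file name sample.vary, the node count of 5000 and the
          helper pick_cutoffs are all assumptions for illustration.

```shell
# Hypothetical summary table: cutoff, orphan nodes, median node degree.
cat > sample.vary <<'EOF'
0.20 0 950
0.40 12 410
0.60 85 96
0.70 180 44
0.80 1400 9
EOF

# pick_cutoffs FILE NODES: list cutoffs with fewer than 10% orphan nodes
# and a median node degree of at most 100 (hypothetical helper).
pick_cutoffs() {
  awk -v n="$2" '$2 < 0.10 * n && $3 <= 100 { print $1 }' "$1"
}

pick_cutoffs sample.vary 5000
```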

  Reducing node degrees in the graph
          A good way to lower node degrees in a network is to require that an edge is among the
          best k edges (those of highest weight) for both nodes incident to the edge, for some
          value of k. This is achieved by using knn(k) in the argument to the -tf option to mcl
          or mcx alter.  To give an example, a graph was formed on translations in Ensembl
          release 57 on 2.6M nodes.  The similarities were obtained from BLAST scores, leading to
          a graph with a total edge count of 300M, with the five best-connected nodes having
          degrees 11148, 9083, 9070, 9019 and 8988, and with mean node degree 233.  These
          degrees are unreasonably high.  The graph was subjected to mcx query to investigate
          the effect of varying k-NN parameters. A good heuristic is to choose a value that does
          not significantly change the number of singletons in the input graph.  In the example
          it meant that -tf 'knn(160)' was feasible, leading to a mean node degree of 98.
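
          The criterion stated above (an edge must be among the best k edges of both its end
          nodes) can be illustrated with sort and awk on a made-up network. The helper
          mutual_knn is hypothetical and expects mirrored ABC input, with each edge listed in
          both directions; it is a sketch of the criterion only, not of the knn(k)
          implementation in mcl, and edges of equal weight are ranked arbitrarily.

```shell
# mutual_knn K: read mirrored ABC lines on stdin; keep an edge only if it
# ranks among the K heaviest edges of both of its end nodes.
mutual_knn() {
  LC_ALL=C sort -k1,1 -k3,3nr |
  awk -v k="$1" '
    $1 != prev { prev = $1; rank = 0 }                 # new source node: restart ranking
    ++rank <= k { keep[$1, $2] = $3; order[++n] = $1 SUBSEP $2 }
    END {
      for (i = 1; i <= n; i++) {
        split(order[i], p, SUBSEP)
        if ((p[2], p[1]) in keep) print p[1], p[2], keep[order[i]]
      }
    }
  '
}

# Made-up network: node a has three neighbours, node d only one.
cat > sample.knn <<'EOF'
a b 0.9
b a 0.9
a c 0.8
c a 0.8
a d 0.1
d a 0.1
b c 0.7
c b 0.7
EOF

mutual_knn 2 < sample.knn
```

          With k set to 2 the edge between a and d is removed, because it is only the third
          best edge of node a, even though it is the best edge of node d.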

          A second approach to reduce node degrees is to employ the -ceil-nb option.  This ranks
          nodes by node degree, highest first. Nodes are considered in order of rank, and edges
          of low weight are removed from the graph until a node satisfies the node degree
          threshold specified by -ceil-nb.

  SEE ALSO
          mcxio(5).

  AUTHOR
          Stijn van Dongen.

  clmprotocols 14-137                        16 May 2014                            clmprotocols(5)