AFNIX-TXT(3) - Linux man page online | Library functions

Standard text processing module.

2017-11-22
txt(3) AFNIX Module txt(3)

NAME

txt - standard text processing module

STANDARD TEXT PROCESSING MODULE

The Standard Text Processing module is an original implementation of an object collection dedicated to text processing. Although text scanning is the most common operation performed in the field of text processing, the module also provides specialized objects to store and index text data. Text sorting and transliteration are also part of this module.

Scanning concepts

Text scanning is the ability to extract lexical elements, or lexemes, from a stream. A scanner, or lexical analyzer, is the principal object used to perform this task. A scanner is built by adding special objects that act as pattern matchers. When a pattern is matched, a special object called a lexeme is returned.

Pattern object

A Pattern object is a special object that acts as a model for the string to match. There are several ways to build a pattern. The simplest way is with a regular expression. Another type of pattern is a balanced pattern. In its first form, a pattern object is created with a regular expression object.

    # create a pattern object
    const pat (afnix:txt:Pattern "$d+")

In this example, the pattern object is built to detect integer objects.

    pat:check "123" # true
    pat:match "123" # 123

The check method returns true if the input string matches the pattern. The match method returns the string that matches the pattern. Since the pattern object can also operate with a stream object, the match method is appropriate to match a particular string. The pattern object is, as usual, available with the appropriate predicate.

    afnix:txt:pattern-p pat # true

Another form of pattern object is the balanced pattern. A balanced pattern is determined by a starting string and an ending string. There are two types of balanced pattern: the single balanced pattern and the recursive balanced pattern. The single balanced pattern is appropriate for lexical elements that are delimited by a character.
For example, the classical C-string is a single balanced pattern with the double quote character.

    # create a balanced pattern
    const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
    pat:check "<xml>" # true
    pat:match "<xml>" # xml

In the case of the C-string, the pattern might be more appropriately defined with an additional escape character. This character is used by the pattern matcher to grab characters that might otherwise be taken as part of the pattern definition.

    # create a balanced pattern
    const pat (afnix:txt:Pattern "STRING" "'" '\\')
    pat:check "'hello'" # true
    pat:match "'hello'" # "hello"

In this form, a balanced pattern with an escape character is created. The same string is used for both the starting and ending string. Another constructor that takes two strings can be used if the starting and ending strings are different.

The last pattern form is the balanced recursive form. In this form, a starting string and an ending string are used to delimit the pattern, and a recursive use of the starting and ending strings is allowed. In order to have an exact match, the number of starting strings must equal the number of ending strings. For example, the C-comment pattern can be viewed as a recursive balanced pattern.

    # create a c-comment pattern
    const pat (afnix:txt:Pattern "STRING" "/*" "*/" )

Lexeme object

The Lexeme object is the object built by a scanner that contains the matched string. A lexeme is therefore a tagged string. Additionally, a lexeme can carry extra information such as a source name and index.

    # create an empty lexeme
    const lexm (afnix:txt:Lexeme)
    afnix:txt:lexeme-p lexm # true

The default lexeme is created without any value. A value can be set with the set-value method and retrieved with the get-value method.

    lexm:set-value "hello"
    lexm:get-value # hello

The set-tag and get-tag methods are similar and operate with an integer. The source name and index are set and retrieved in the same way.
    # check for the source
    lexm:set-source "world"
    lexm:get-source # world
    # check for the source index
    lexm:set-index 2000
    lexm:get-index # 2000

Text scanning

Text scanning is the ability to extract lexical elements or lexemes from an input stream. Generally, the lexemes are the results of a matching operation which is defined by a pattern object. As a result, a scanner is defined by the scanner object itself plus one or several pattern objects.

Scanner construction

By default, a scanner is created without pattern objects. The length method returns the number of pattern objects. As usual, a predicate is associated with the scanner object.

    # the default scanner
    const scan (afnix:txt:Scanner)
    afnix:txt:scanner-p scan # true
    # the length method
    scan:length # 0

The scanner construction proceeds by adding pattern objects. Each pattern can be created independently and later added to the scanner. For example, a scanner that reads real, integer and string values can be defined as follows:

    # create the scanner patterns
    const REAL    (afnix:txt:Pattern "REAL" [$d+.$d*])
    const STRING  (afnix:txt:Pattern "STRING" "\"" '\\')
    const INTEGER (afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
    # add the patterns to the scanner
    scanner:add INTEGER REAL STRING

The order in which the patterns are added defines the priority at which a token is recognized. The symbol name for each pattern is optional, since functional programming permits the creation of patterns directly. This writing style makes the scanner definition easier to read.

Using the scanner

Once constructed, the scanner can be used as is. A stream is generally the best way to operate. If the scanner reaches the end of the stream or cannot recognize a lexeme, the nil object is returned. With a loop, it is easy to get all lexemes.
    while (trans valid (is:valid-p)) {
      # try to get the lexeme
      trans lexm (scanner:scan is)
      # check for nil lexeme and print the value
      if (not (nil-p lexm)) (println (lexm:get-value))
      # update the valid flag
      valid:= (and (is:valid-p) (not (nil-p lexm)))
    }

In this loop, it is necessary first to check for the end of the stream. This is done with the help of the special loop construct that initializes the valid symbol. As soon as the lexeme is built, it can be used. The lexeme holds the value as well as its tag.

Text sorting

Sorting is one of the primary functions implemented inside the text processing module. There are three sorting functions available in the module.

Ascending and descending order sorting

The sort-ascent function operates with a vector object and sorts the elements in ascending order. Any kind of object can be sorted as long as it supports a comparison method. The elements are sorted in place by using a quick sort algorithm.

    # create an unsorted vector
    const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
    # sort the vector in place
    afnix:txt:sort-ascent v-i
    # print the vector
    for (e) (v-i) (println e)

The sort-descent function is similar to the sort-ascent function except that the objects are sorted in descending order.

Lexical sorting

The sort-lexical function operates with a vector object and sorts the elements in ascending order using a lexicographic ordering relation. Objects in the vector must be literal objects or an exception is raised.

Transliteration

Transliteration is the process of changing characters by mapping one to another. The transliteration process operates with a character source and produces a target character with the help of a mapping table. The transliteration process is not necessarily reversible, as often indicated in the literature.

Literate object

The Literate object is a transliteration object that is bound by default to the identity function mapping. As usual, a predicate is associated with the object.
    # create a transliterate object
    const tl (afnix:txt:Literate)
    # check the object
    afnix:txt:literate-p tl # true

The transliteration process can also operate with an escape character in order to map a double character sequence into a single one, as usually found inside programming languages.

    # create a transliterate object by escape
    const tl (afnix:txt:Literate '\\')

Transliteration configuration

The set-map method configures the transliteration mapping table while the set-escape-map method configures the escape mapping table. The mapping is done by setting the source character and the target character. For instance, to map the tabulation character to a white space, the mapping table is set as follows:

    tl:set-map '\t' ' '

The escape mapping table operates the same way. It should be noted that the mapping algorithm first translates the input character, possibly yielding an escape character, and then the escape mapping takes place. Note also that the set-escape method can be used to set the escape character.

    tl:set-map '\t' ' '

Transliteration process

The transliteration process is done either with a string or an input stream. In the first case, the translate method operates with a string and returns a translated string. On the other hand, the read method returns a character when operating with a stream.

    # set the mapping characters
    tl:set-map 'h' 'w'
    tl:set-map 'e' 'o'
    tl:set-map 'l' 'r'
    tl:set-map 'o' 'd'
    # translate a string
    tl:translate "helo" # word
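
The two-table mapping described above can be sketched outside AFNIX. The following Python fragment is only an illustration of the technique, not the module's implementation: the class name LiterateSketch and its method names are hypothetical, and the handling of a dangling escape at the end of the input is an assumption. It applies the regular table first and, when the result is the escape character, consumes one more character and translates it with the escape table.

```python
class LiterateSketch:
    """Toy transliterator: identity mapping unless a table entry overrides it."""

    def __init__(self, escape=None):
        self.escape = escape        # optional escape character
        self.table = {}             # regular mapping table
        self.escape_table = {}      # mapping table used after an escape

    def set_map(self, source, target):
        self.table[source] = target

    def set_escape_map(self, source, target):
        self.escape_table[source] = target

    def translate(self, text):
        out = []
        chars = iter(text)
        for ch in chars:
            # first translate the input character with the regular table
            mapped = self.table.get(ch, ch)
            if self.escape is not None and mapped == self.escape:
                # an escape was produced: consume one more character and
                # translate it with the escape table instead
                nxt = next(chars, None)
                if nxt is None:
                    out.append(mapped)  # assumption: keep a dangling escape
                else:
                    out.append(self.escape_table.get(nxt, nxt))
            else:
                out.append(mapped)
        return "".join(out)


tl = LiterateSketch(escape="\\")
for src, dst in zip("helo", "word"):
    tl.set_map(src, dst)            # h->w, e->o, l->r, o->d
tl.set_escape_map("n", "\n")        # backslash followed by n gives a newline

plain = tl.translate("helo")
escaped = tl.translate("a\\nb")
print(plain)
print(repr(escaped))
```

Each character is looked up exactly once, so a target such as the letter o produced from e is not mapped again. This is also why such a mapping is not reversible in general.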

STANDARD TEXT PROCESSING REFERENCE

Pattern
The Pattern class is a pattern matching class based either on a regular expression or on balanced strings. In the regex mode, the pattern is defined with a regex, and a match is said to occur when a regex match is achieved. In the balanced string mode, the pattern is defined with a start and an end pattern string. The balanced mode can be single or recursive. Additionally, an escape character can be associated with the class. A name and a tag are also bound to the pattern object as a means to ease the integration within a scanner.

Predicate
    pattern-p

Inheritance
    Object

Constructors
    Pattern (none)
        The Pattern constructor creates an empty pattern.
    Pattern (String|Regex)
        The Pattern constructor creates a pattern object associated with a regular expression. The argument can be either a string or a regular expression object. If the argument is a string, it is converted into a regular expression object.
    Pattern (String String)
        The Pattern constructor creates a balanced pattern. The first argument is the start pattern string. The second argument is the end balanced string.
    Pattern (String String Character)
        The Pattern constructor creates a balanced pattern with an escape character. The first argument is the start pattern string. The second argument is the end balanced string. The third argument is the escape character.
    Pattern (String String Boolean)
        The Pattern constructor creates a recursive balanced pattern. The first argument is the start pattern string. The second argument is the end balanced string.

Constants
    REGEX
        The REGEX constant indicates that the pattern is a regular expression.
    BALANCED
        The BALANCED constant indicates that the pattern is a balanced pattern.
    RECURSIVE
        The RECURSIVE constant indicates that the pattern is a recursive balanced pattern.

Methods
    check -> Boolean (String)
        The check method checks the pattern against the input string. If the verification is successful, the method returns true, false otherwise.
    match -> String (String|InputStream)
        The match method attempts to match an input string or an input stream. If a match occurs, the matching string is returned. If the input is a string, the end of the string is used as the end condition. If the input is a stream, the end of the stream is used as the end condition.
    set-tag -> none (Integer)
        The set-tag method sets the pattern tag. The tag can be further used inside a scanner.
    get-tag -> Integer (none)
        The get-tag method returns the pattern tag.
    set-name -> none (String)
        The set-name method sets the pattern name. The name is the symbol identifier for that pattern.
    get-name -> String (none)
        The get-name method returns the pattern name.
    set-regex -> none (String|Regex)
        The set-regex method sets the pattern regex either with a string or with a regex object. If the method completes successfully, the pattern type is switched to the REGEX type.
    set-escape -> none (Character)
        The set-escape method sets the pattern escape character. The escape character is used only in balanced mode.
    get-escape -> Character (none)
        The get-escape method returns the escape character.
    set-balanced -> none (String|String String)
        The set-balanced method sets the pattern balanced string. With one argument, the same balanced string is used for starting and ending. With two arguments, the first argument is the starting string and the second is the ending string.

Lexeme
The Lexeme class is a literal object that is designed to hold a matched pattern. A lexeme consists of a string (i.e. the lexeme value), a tag and optionally a source name (i.e. file name) and a source index (line number).

Predicate
    lexeme-p

Inheritance
    Literal

Constructors
    Lexeme (none)
        The Lexeme constructor creates an empty lexeme.
    Lexeme (String)
        The Lexeme constructor creates a lexeme by value. The string argument is the lexeme value.

Methods
    set-tag -> none (Integer)
        The set-tag method sets the lexeme tag. The tag can be further used inside a scanner.
    get-tag -> Integer (none)
        The get-tag method returns the lexeme tag.
    set-value -> none (String)
        The set-value method sets the lexeme value. The lexeme value is generally the result of a matching operation.
    get-value -> String (none)
        The get-value method returns the lexeme value.
    set-index -> none (Integer)
        The set-index method sets the lexeme source index. The lexeme source index can be, for instance, the source line number.
    get-index -> Integer (none)
        The get-index method returns the lexeme source index.
    set-source -> none (String)
        The set-source method sets the lexeme source name. The lexeme source name can be, for instance, the source file name.
    get-source -> String (none)
        The get-source method returns the lexeme source name.

Scanner
The Scanner class is a text scanner, or lexical analyzer, that operates on an input stream and can match one or several patterns. The scanner is built by adding patterns to the scanner object. With an input stream, the scanner object attempts to build a buffer that matches at least one pattern. When such a match occurs, a lexeme is built. When building a lexeme, the pattern tag is used to mark the lexeme.

Predicate
    scanner-p

Inheritance
    Object

Constructors
    Scanner (none)
        The Scanner constructor creates an empty scanner.

Methods
    add -> none (Pattern*)
        The add method adds 0 or more pattern objects to the scanner. The priority of a pattern is determined by the order in which the patterns are added.
    length -> Integer (none)
        The length method returns the number of pattern objects in this scanner.
    get -> Pattern (Integer)
        The get method returns a pattern object by index.
    check -> Lexeme (String)
        The check method checks that a string is matched by the scanner and returns the associated lexeme.
    scan -> Lexeme (InputStream)
        The scan method scans an input stream until a pattern is matched. When a match occurs, the associated lexeme is returned.

Literate
The Literate class is a transliteration mapping class.
Transliteration is the process of changing characters by mapping one to another. The transliteration process operates with a character source and produces a target character with the help of a mapping table. This transliteration object can also operate with an escape table. In the presence of an escape character, an escape mapping table is used instead of the regular one.

Predicate
    literate-p

Inheritance
    Object

Constructors
    Literate (none)
        The Literate constructor creates a default transliteration object.
    Literate (Character)
        The Literate constructor creates a default transliteration object with an escape character. The argument is the escape character.

Methods
    read -> Character (InputStream)
        The read method reads a character from the input stream and translates it with the help of the mapping table. A second character might be consumed from the stream if the first character is an escape character.
    getu -> Character (InputStream)
        The getu method reads a Unicode character from the input stream and translates it with the help of the mapping table. A second character might be consumed from the stream if the first character is an escape character.
    reset -> none (none)
        The reset method resets all the mapping tables and installs a default identity mapping.
    set-map -> none (Character Character)
        The set-map method sets an entry of the mapping table by using a source and a target character. The first character is the source character. The second character is the target character.
    get-map -> Character (Character)
        The get-map method returns the character mapped to a source character. The source character is the argument.
    translate -> String (String)
        The translate method translates a string by transliteration and returns a new string.
    set-escape -> none (Character)
        The set-escape method sets the escape character.
    get-escape -> Character (none)
        The get-escape method returns the escape character.
    set-escape-map -> none (Character Character)
        The set-escape-map method sets an entry of the escape mapping table by using a source and a target character. The first character is the source character. The second character is the target character.
    get-escape-map -> Character (Character)
        The get-escape-map method returns the escape character mapped to a source character. The source character is the argument.

Functions
    sort-ascent -> none (Vector)
        The sort-ascent function sorts the vector argument in ascending order. The vector is sorted in place.
    sort-descent -> none (Vector)
        The sort-descent function sorts the vector argument in descending order. The vector is sorted in place.
    sort-lexical -> none (Vector)
        The sort-lexical function sorts the vector argument in lexicographic order. The vector is sorted in place.
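
The difference between the three sort functions can be illustrated with a short Python sketch. It is not the AFNIX implementation: the function names below are hypothetical Python counterparts, the in-place quicksort is a generic textbook variant, and the assumption that lexicographic sorting compares the string representation of each literal is mine, not something this page states.

```python
def quick_sort(items, less):
    """Sort the list in place with quicksort, using less(a, b) as the order."""
    def sort_range(lo, hi):
        if lo >= hi:
            return
        pivot = items[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:
            # skip elements already on the correct side of the pivot
            while less(items[i], pivot):
                i += 1
            while less(pivot, items[j]):
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        sort_range(lo, j)
        sort_range(i, hi)

    sort_range(0, len(items) - 1)


def sort_ascent(items):
    quick_sort(items, lambda a, b: a < b)


def sort_descent(items):
    quick_sort(items, lambda a, b: a > b)


def sort_lexical(items):
    # assumption: literals are ordered by their string representation
    quick_sort(items, lambda a, b: str(a) < str(b))


numbers = [7, 5, 3, 4, 1, 8, 0, 9, 2, 6]
sort_ascent(numbers)
print(numbers)

reverse = [7, 5, 3, 4, 1, 8, 0, 9, 2, 6]
sort_descent(reverse)
print(reverse)

mixed = [10, 9, 2, 33]
sort_lexical(mixed)
print(mixed)
```

Under that assumption the lexical sort of 10, 9, 2, 33 yields 10, 2, 33, 9, because "10" precedes "2" as a string, whereas the ascending sort would yield 2, 9, 10, 33.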
AFNIX 2017-11-22 txt(3)