presage 0.9.1
#include <tokenizer.h>
Classes
class StreamGuard
Public Member Functions
Tokenizer (std::istream &stream, const std::string blankspaces, const std::string separators)
virtual ~Tokenizer ()
virtual int countTokens ()=0
virtual bool hasMoreTokens () const =0
virtual std::string nextToken ()=0
virtual double progress () const =0
void blankspaceChars (const std::string)
std::string blankspaceChars () const
void separatorChars (const std::string)
std::string separatorChars () const
void lowercaseMode (const bool)
bool lowercaseMode () const
std::string streamToString () const
Protected Member Functions
bool isBlankspace (const int character) const
bool isSeparator (const int character) const
Protected Attributes
std::istream & stream
std::ios::iostate sstate
std::streamoff offbeg
std::streamoff offend
std::streamoff offset
Private Attributes
std::string blankspaces
std::string separators
bool lowercase
The Tokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time.
The parsing process is controlled by two character classification sets: blankspace characters and separator characters.
Each byte read from the input stream is regarded as a character in the range '\u0000' through '\u00FF'.
In addition, an instance has a flag that controls lowercase mode (see lowercaseMode()).
Because countTokens, hasMoreTokens, nextToken, and progress are pure virtual, a typical application first constructs an instance of a concrete subclass (ForwardTokenizer or ReverseTokenizer), supplying the input stream to be tokenized, the set of blankspaces, and the set of separators. It then loops, calling nextToken for as long as hasMoreTokens returns true.
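The loop described above can be sketched without linking against presage. `SketchTokenizer` and `collectTokens` below are hypothetical stand-ins written for this example only; they mirror the documented `hasMoreTokens()`/`nextToken()` shape and assume that both blankspace and separator characters simply end a token, which may differ from what ForwardTokenizer actually does.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in, NOT presage's ForwardTokenizer. Assumption: every
// blankspace or separator character is a token boundary and is discarded.
class SketchTokenizer {
public:
    SketchTokenizer(std::istream& stream,
                    const std::string& blankspaces,
                    const std::string& separators)
        : stream_(stream), boundaries_(blankspaces + separators) {
        skipBoundaries();  // position on the first token, if any
    }

    // True while at least one more character (hence one token) is available.
    bool hasMoreTokens() const {
        return stream_.peek() != std::char_traits<char>::eof();
    }

    // Reads characters up to the next boundary, then skips that boundary run.
    std::string nextToken() {
        std::string token;
        int c;
        while ((c = stream_.peek()) != std::char_traits<char>::eof() &&
               !isBoundary(c)) {
            token.push_back(static_cast<char>(stream_.get()));
        }
        skipBoundaries();
        return token;
    }

private:
    bool isBoundary(int c) const {
        return boundaries_.find(static_cast<char>(c)) != std::string::npos;
    }

    void skipBoundaries() {
        int c;
        while ((c = stream_.peek()) != std::char_traits<char>::eof() &&
               isBoundary(c)) {
            stream_.get();
        }
    }

    std::istream& stream_;
    std::string boundaries_;
};

// The "typical application" loop: construct, then drain with
// hasMoreTokens()/nextToken().
std::vector<std::string> collectTokens(std::istream& in,
                                       const std::string& blankspaces,
                                       const std::string& separators) {
    SketchTokenizer tokenizer(in, blankspaces, separators);
    std::vector<std::string> tokens;
    while (tokenizer.hasMoreTokens()) {
        tokens.push_back(tokenizer.nextToken());
    }
    return tokens;
}
```

With a real presage build, the same loop would presumably be written against a ForwardTokenizer or ReverseTokenizer object instead of the stand-in class.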
Definition at line 64 of file tokenizer.h.
Tokenizer::Tokenizer (std::istream &stream, const std::string blankspaces, const std::string separators)
Definition at line 27 of file tokenizer.cpp.
References blankspaceChars(), blankspaces, offbeg, offend, offset, separatorChars(), separators, sstate, and stream.
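The constructor is listed as touching `offbeg`, `offend`, `offset`, `sstate`, and `stream`. One plausible way to obtain such values with the standard stream API is sketched below. `StreamExtent` and `measureStream` are hypothetical names, and this is an assumption about what the bookkeeping could look like, not a copy of tokenizer.cpp.

```cpp
#include <ios>
#include <istream>
#include <sstream>

// Hypothetical helper type; field names echo the documented attributes.
struct StreamExtent {
    std::streamoff offbeg;  // offset of the beginning of the stream
    std::streamoff offend;  // offset of the end of the stream
    std::streamoff offset;  // read position the stream had on entry
};

// Measures the stream, then restores its read position and state flags.
StreamExtent measureStream(std::istream& stream) {
    StreamExtent extent;
    const std::ios::iostate saved = stream.rdstate();  // remember flags
    stream.clear();                                    // allow tellg/seekg

    extent.offset = stream.tellg();
    stream.seekg(0, std::ios::end);
    extent.offend = stream.tellg();
    stream.seekg(0, std::ios::beg);
    extent.offbeg = stream.tellg();

    stream.seekg(extent.offset, std::ios::beg);  // back to where we started
    stream.clear(saved);                         // restore original flags
    return extent;
}
```

This only works on seekable streams such as `std::istringstream` or `std::ifstream`; on a non-seekable stream `tellg()` reports failure.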
Tokenizer::~Tokenizer () [virtual]
Definition at line 53 of file tokenizer.cpp.
std::string Tokenizer::blankspaceChars () const
Gets blankspace characters.
Definition at line 66 of file tokenizer.cpp.
References blankspaces.
Referenced by Tokenizer().
void Tokenizer::blankspaceChars (const std::string chars)
virtual int Tokenizer::countTokens () [pure virtual]
Returns the number of tokens left.
Implemented in ForwardTokenizer, and ReverseTokenizer.
virtual bool Tokenizer::hasMoreTokens () const [pure virtual]
Tests if there are more tokens.
Implemented in ForwardTokenizer, and ReverseTokenizer.
bool Tokenizer::isBlankspace (const int character) const [protected]
Definition at line 91 of file tokenizer.cpp.
References blankspaces.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
bool Tokenizer::isSeparator (const int character) const [protected]
Definition at line 101 of file tokenizer.cpp.
References separators.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
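Both classification helpers take the character as an `int` and consult one of the stored strings. A minimal sketch of such a membership test follows. `inCharSet` is a hypothetical free function, and rejecting values outside 0..0xFF (for example an end-of-file marker) is an assumption made here, not documented behaviour.

```cpp
#include <string>

// Hypothetical helper: true if `character` occurs in `set`.
// Values outside the single-byte range (such as EOF) never match.
bool inCharSet(const std::string& set, int character) {
    if (character < 0 || character > 0xFF) {
        return false;
    }
    const char ch = static_cast<char>(static_cast<unsigned char>(character));
    return set.find(ch) != std::string::npos;
}
```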
bool Tokenizer::lowercaseMode () const
Gets lowercase mode.
Definition at line 86 of file tokenizer.cpp.
References lowercase.
Referenced by ForwardTokenizer::nextToken(), and ReverseTokenizer::nextToken().
void Tokenizer::lowercaseMode (const bool value)
Sets lowercase mode.
Definition at line 81 of file tokenizer.cpp.
References lowercase.
Referenced by ContextChangeDetector::change(), ContextTracker::getToken(), ContextTracker::learn(), and main().
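The getter is consulted from both `nextToken()` implementations, which suggests that tokens are lowercased on the way out when the mode is on. The sketch below shows one way that step could look. `toLowerIf` is a hypothetical helper, and per-byte `std::tolower` in the current C locale is an assumption.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: returns `token` lowercased when `lowercase` is true,
// unchanged otherwise. Works byte by byte, matching the single-byte
// character model described above.
std::string toLowerIf(std::string token, bool lowercase) {
    if (lowercase) {
        for (char& ch : token) {
            // Cast through unsigned char: std::tolower requires a value
            // representable as unsigned char (or EOF).
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        }
    }
    return token;
}
```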
virtual std::string Tokenizer::nextToken () [pure virtual]
Returns the next token.
Implemented in ForwardTokenizer, and ReverseTokenizer.
virtual double Tokenizer::progress () const [pure virtual]
Returns progress percentage.
Implemented in ForwardTokenizer, and ReverseTokenizer.
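The cross-references show the `progress()` implementations reading the stream offsets, so progress can be thought of as the current offset relative to the stream's extent. The sketch below is for a forward scan. `progressFraction` is a hypothetical function, and returning a fraction in [0, 1] rather than a value scaled to 0..100 is an assumption, since this page does not state the unit.

```cpp
#include <ios>

// Hypothetical helper for a forward scan: fraction of the stream consumed.
// Assumption: result is in [0, 1]; an empty stream counts as fully consumed.
double progressFraction(std::streamoff offbeg,
                        std::streamoff offend,
                        std::streamoff offset) {
    const std::streamoff span = offend - offbeg;
    if (span <= 0) {
        return 1.0;
    }
    double fraction = static_cast<double>(offset - offbeg) /
                      static_cast<double>(span);
    if (fraction < 0.0) fraction = 0.0;  // clamp positions before the start
    if (fraction > 1.0) fraction = 1.0;  // clamp positions past the end
    return fraction;
}
```

A reverse scan would presumably measure from `offend` towards `offbeg` instead.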
std::string Tokenizer::separatorChars () const
Gets separator characters.
Definition at line 76 of file tokenizer.cpp.
References separators.
Referenced by Tokenizer().
void Tokenizer::separatorChars (const std::string chars)
std::string Tokenizer::streamToString () const [inline]
Definition at line 129 of file tokenizer.h.
std::string Tokenizer::blankspaces [private]
Definition at line 174 of file tokenizer.h.
Referenced by blankspaceChars(), isBlankspace(), and Tokenizer().
bool Tokenizer::lowercase [private]
Definition at line 177 of file tokenizer.h.
Referenced by lowercaseMode().
std::streamoff Tokenizer::offbeg [protected]
Definition at line 166 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ForwardTokenizer::ForwardTokenizer(), ReverseTokenizer::hasMoreTokens(), ReverseTokenizer::nextToken(), ReverseTokenizer::progress(), and Tokenizer().
std::streamoff Tokenizer::offend [protected]
Definition at line 167 of file tokenizer.h.
Referenced by ReverseTokenizer::countTokens(), ForwardTokenizer::hasMoreTokens(), ForwardTokenizer::nextToken(), ReverseTokenizer::nextToken(), ReverseTokenizer::progress(), ForwardTokenizer::progress(), ReverseTokenizer::ReverseTokenizer(), and Tokenizer().
std::streamoff Tokenizer::offset [protected]
Definition at line 168 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ReverseTokenizer::countTokens(), ForwardTokenizer::ForwardTokenizer(), ReverseTokenizer::hasMoreTokens(), ForwardTokenizer::hasMoreTokens(), ReverseTokenizer::nextToken(), ForwardTokenizer::nextToken(), ForwardTokenizer::progress(), ReverseTokenizer::progress(), ReverseTokenizer::ReverseTokenizer(), and Tokenizer().
std::string Tokenizer::separators [private]
Definition at line 175 of file tokenizer.h.
Referenced by isSeparator(), separatorChars(), and Tokenizer().
std::ios::iostate Tokenizer::sstate [protected]
Definition at line 165 of file tokenizer.h.
Referenced by Tokenizer(), and ~Tokenizer().
std::istream& Tokenizer::stream [protected]
Definition at line 164 of file tokenizer.h.
Referenced by ForwardTokenizer::countTokens(), ReverseTokenizer::countTokens(), ForwardTokenizer::nextToken(), ReverseTokenizer::nextToken(), ReverseTokenizer::ReverseTokenizer(), Tokenizer(), and ~Tokenizer().