XML parsers
SAX base parser
-
template<typename HandlerT, typename ConfigT = sax_parser_default_config>
class sax_parser : public orcus::sax::parser_base
SAX parser for XML documents.
This parser is bare-bones: it only parses the document and reports all encountered elements and attributes, without checking that opening and closing elements are properly paired. The user is responsible for checking whether the document is well-formed in terms of element scopes.
This parser additionally records the begin and end offset positions of each element.
- Template Parameters:
HandlerT – Handler type with member functions for event callbacks. Refer to sax_handler.
ConfigT – Parser configuration.
Public Functions
-
sax_parser(std::string_view content, handler_type &handler)
-
~sax_parser() = default
-
void parse()
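Because sax_parser does not verify element pairing, a handler can do that check itself. The following is a minimal sketch and not part of orcus: `element` is a stand-in for orcus::sax::parser_element reduced to assumed `ns` and `name` fields, and `scope_checker` is a hypothetical handler fragment that shows only the two element callbacks.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Stand-in for orcus::sax::parser_element (assumption: only the namespace
// alias and the local name are modelled here).
struct element
{
    std::string_view ns;
    std::string_view name;
};

// Hypothetical handler fragment that tracks open elements on a stack.
class scope_checker
{
    std::vector<std::string> m_scopes;

    // Build "ns:name", or just "name" when no namespace alias is given.
    static std::string qualified(const element& e)
    {
        std::string s;
        if (!e.ns.empty())
        {
            s.append(e.ns);
            s.push_back(':');
        }
        s.append(e.name);
        return s;
    }

public:
    void start_element(const element& elem)
    {
        m_scopes.push_back(qualified(elem));
    }

    void end_element(const element& elem)
    {
        // A closing element must match the most recently opened one.
        if (m_scopes.empty() || m_scopes.back() != qualified(elem))
            throw std::runtime_error("mismatched closing element");
        m_scopes.pop_back();
    }

    // True when every opened element has been closed.
    bool balanced() const { return m_scopes.empty(); }
};
```

After parsing, a call to `balanced()` would tell whether any element was left open. A real handler would also need the remaining callbacks listed under sax_handler.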
-
struct sax_parser_default_config
Public Static Attributes
-
static uint8_t baseline_version = 10
An integer value representing a baseline XML version. A value of 10 corresponds with version 1.0 whereas a value of 11 corresponds with version 1.1.
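The encoding of the baseline version can be illustrated with a small self-contained sketch. It does not use orcus: `default_config`, `xml11_config` and `baseline_version_string` are hypothetical names, and the assumption is only that a custom configuration exposes a static `baseline_version` member in the same way as sax_parser_default_config.

```cpp
#include <cstdint>
#include <string>

// Mirrors the documented default: 10 stands for XML 1.0.
struct default_config
{
    static constexpr std::uint8_t baseline_version = 10;
};

// Hypothetical custom configuration selecting XML 1.1 as the baseline.
struct xml11_config
{
    static constexpr std::uint8_t baseline_version = 11;
};

// Render the encoded value as "major.minor": the tens digit is the major
// version and the ones digit is the minor version.
template<typename ConfigT>
std::string baseline_version_string()
{
    constexpr unsigned v = ConfigT::baseline_version;
    return std::to_string(v / 10) + "." + std::to_string(v % 10);
}
```

With these definitions, `baseline_version_string<xml11_config>()` yields "1.1".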
-
class sax_handler
Public Functions
-
inline void doctype(const orcus::sax::doctype_declaration &dtd)
Called when a doctype declaration <!DOCTYPE … > is encountered.
- Parameters:
dtd – struct containing doctype declaration data.
-
inline void start_declaration(std::string_view decl)
Called when <?… is encountered, where the ‘…’ may be an arbitrary identifier. One common declaration is <?xml, which is typically given at the start of an XML stream.
- Parameters:
decl – name of the identifier.
-
inline void end_declaration(std::string_view decl)
Called when the closing ?> of a <?… ?> declaration is encountered.
- Parameters:
decl – name of the identifier.
-
inline void start_element(const orcus::sax::parser_element &elem)
Called at the start of each element.
- Parameters:
elem – information of the element being parsed.
-
inline void end_element(const orcus::sax::parser_element &elem)
Called at the end of each element.
- Parameters:
elem – information of the element being parsed.
-
inline void characters(std::string_view val, bool transient)
Called when a segment of text content is parsed. Each text segment is a direct child of an element. An element may have multiple text segments when it also has child elements that are siblings of the text, or when the text is split by a comment.
- Parameters:
val – value of the text content.
transient – when true, the text content has been converted and is stored in a temporary buffer due to the presence of one or more encoded characters. In that case the passed value must either be converted to a non-text value immediately or be interned within the scope of the callback.
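One way to honor the transient flag is to copy transient values into storage owned by the handler. The sketch below is self-contained and does not use orcus; `text_collector` is a hypothetical handler fragment showing only the characters() callback.

```cpp
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical handler fragment that keeps text content beyond the callback.
class text_collector
{
    // Owns copies of transient values. A deque never relocates existing
    // elements on emplace_back, so views into them stay valid.
    std::deque<std::string> m_pool;
    std::vector<std::string_view> m_texts;

public:
    void characters(std::string_view val, bool transient)
    {
        if (transient)
        {
            // The temporary buffer may be reused once this call returns,
            // so copy the value and refer to the copy instead.
            m_pool.emplace_back(val);
            val = m_pool.back();
        }

        // Non-transient values are stored as views only; they remain
        // valid only as long as the memory they point into stays alive.
        m_texts.push_back(val);
    }

    const std::vector<std::string_view>& texts() const { return m_texts; }
};
```

Non-transient values are not copied here, which assumes the content they refer to outlives the collector.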
-
inline void attribute(const orcus::sax::parser_attribute &attr)
Called upon parsing of an attribute of an element. Note that when the attribute’s transient flag is set, the attribute value is stored in a temporary buffer due to presence of one or more encoded characters, and must be processed within the scope of the callback.
- Parameters:
attr – struct containing attribute information.
-
struct parser_element
Element properties passed by sax_parser to its handler’s start_element() and end_element() calls.
-
struct parser_attribute
Attribute properties passed by sax_parser to its handler’s attribute() call. When an attribute value is “transient”, it has been converted due to presence of encoded character(s) and has been stored in a temporary buffer. The handler must assume that the value will not survive after the callback function ends.
SAX namespace parser
-
template<typename HandlerT>
class sax_ns_parser
SAX-based XML parser with extra namespace handling.
It uses an instance of xmlns_context passed by the caller to validate and convert namespace values into identifiers. The namespace identifier of each encountered element is always provided, even if the element does not specify a namespace explicitly.
This parser keeps track of element scopes and detects non-matching element pairs.
- Template Parameters:
HandlerT – Handler type with member functions for event callbacks. Refer to sax_ns_handler.
Public Functions
-
sax_ns_parser(std::string_view content, xmlns_context &ns_cxt, handler_type &handler)
-
~sax_ns_parser() = default
-
void parse()
Start parsing the document.
- Throws:
orcus::malformed_xml_error – when it encounters a non-matching closing element.
-
class sax_ns_handler
Public Functions
-
inline void doctype(const orcus::sax::doctype_declaration &dtd)
Called when a doctype declaration <!DOCTYPE … > is encountered.
- Parameters:
dtd – struct containing doctype declaration data.
-
inline void start_declaration(std::string_view decl)
Called when <?… is encountered, where the ‘…’ may be an arbitrary identifier. One common declaration is <?xml, which is typically given at the start of an XML stream.
- Parameters:
decl – name of the identifier.
-
inline void end_declaration(std::string_view decl)
Called when the closing ?> of a <?… ?> declaration is encountered.
- Parameters:
decl – name of the identifier.
-
inline void start_element(const orcus::sax_ns_parser_element &elem)
Called at the start of each element.
- Parameters:
elem – information of the element being parsed.
-
inline void end_element(const orcus::sax_ns_parser_element &elem)
Called at the end of each element.
- Parameters:
elem – information of the element being parsed.
-
inline void characters(std::string_view val, bool transient)
Called when a segment of text content is parsed. Each text segment is a direct child of an element. An element may have multiple text segments when it also has child elements that are siblings of the text, or when the text is split by a comment.
- Parameters:
val – value of the text content.
transient – when true, the text content has been converted and is stored in a temporary buffer due to the presence of one or more encoded characters. In that case the passed value must either be converted to a non-text value immediately or be interned within the scope of the callback.
-
inline void attribute(std::string_view name, std::string_view val)
Called upon parsing of an attribute of a declaration. The value of an attribute is assumed to be transient and thus should be consumed within the scope of this callback.
- Todo:
Perhaps we should pass the transient flag here as well like all the other places.
- Parameters:
name – name of an attribute.
val – value of an attribute.
-
inline void attribute(const orcus::sax_ns_parser_attribute &attr)
Called upon parsing of an attribute of an element. Note that when the attribute’s transient flag is set, the attribute value is stored in a temporary buffer due to the presence of encoded characters, and must be processed within the scope of the callback.
- Parameters:
attr – struct containing attribute information.
-
struct sax_ns_parser_element
-
struct sax_ns_parser_attribute
SAX token parser
-
template<typename HandlerT>
class sax_token_parser
SAX parser that tokenizes element and attribute names while parsing. All pre-defined element and attribute names are translated into integral identifiers via use of tokens. The user of this class needs to provide a pre-defined set of element and attribute names at construction time.
This parser internally uses sax_ns_parser.
- Template Parameters:
HandlerT – Handler type with member functions for event callbacks. Refer to sax_token_handler.
Public Functions
-
sax_token_parser(std::string_view content, const tokens &_tokens, xmlns_context &ns_cxt, handler_type &handler)
-
~sax_token_parser() = default
-
void parse()
-
class sax_token_handler
Public Functions
-
inline void declaration(const orcus::xml_declaration_t &decl)
Called immediately after the entire XML declaration has been parsed.
- Parameters:
decl – struct containing the attributes of the XML declaration.
-
inline void start_element(const orcus::xml_token_element_t &elem)
Called at the start of each element.
- Parameters:
elem – struct containing the element’s information as well as all the attributes that belong to the element.
-
inline void end_element(const orcus::xml_token_element_t &elem)
Called at the end of each element.
- Parameters:
elem – struct containing the element’s information as well as all the attributes that belong to the element.
-
inline void characters(std::string_view val, bool transient)
Called when a segment of text content is parsed. Each text segment is a direct child of an element. An element may have multiple text segments when it also has child elements that are siblings of the text, or when the text is split by a comment.
- Parameters:
val – value of the text content.
transient – when true, the text content has been converted and is stored in a temporary buffer due to the presence of one or more encoded characters. In that case the passed value must either be converted to a non-text value immediately or be interned within the scope of the callback.
Namespace
-
class xmlns_repository
Central XML namespace repository that stores all namespaces that are used in the current session.
Warning
This class is not copyable but is movable; however, the moved-from object is not usable after the move.
Public Functions
-
xmlns_repository(const xmlns_repository&) = delete
-
xmlns_repository &operator=(const xmlns_repository&) = delete
-
xmlns_repository()
-
xmlns_repository(xmlns_repository &&other)
-
~xmlns_repository()
-
xmlns_repository &operator=(xmlns_repository&&)
-
void add_predefined_values(const xmlns_id_t *predefined_ns)
Add a set of predefined namespace values to the repository.
- Parameters:
predefined_ns – predefined set of namespace values. This is a null-terminated array of xmlns_id_t. This xmlns_repository instance will assume that the instances of these xmlns_id_t values will be available throughout its life cycle; caller needs to ensure that they won’t get deleted before the corresponding xmlns_repository instance is deleted.
-
xmlns_context create_context()
Create a context object associated with this namespace repository.
Warning
Since this context object references values stored in the repository, make sure that it does not outlive the repository object itself.
- Returns:
context object to use for a new XML stream.
-
xmlns_id_t get_identifier(size_t index) const
Get XML namespace identifier from its numerical index.
- Parameters:
index – numeric index of namespace.
- Returns:
valid namespace identifier, or XMLNS_UNKNOWN_ID if not found.
-
std::string get_short_name(xmlns_id_t ns_id) const
See xmlns_context::get_short_name() for the explanation of this method, which works identically to it.
-
class xmlns_context
XML namespace context. A new context should be used for each XML stream since the namespace keys themselves are not interned. Don’t hold an instance of this class any longer than the life cycle of the XML stream it is used in.
An empty key value, i.e. "", is associated with a default namespace.
Public Functions
-
xmlns_context()
-
xmlns_context(xmlns_context&&)
-
xmlns_context(const xmlns_context &r)
-
~xmlns_context()
-
xmlns_context &operator=(const xmlns_context &r)
-
xmlns_context &operator=(xmlns_context &&r)
-
xmlns_id_t push(std::string_view alias, std::string_view uri)
Push a new namespace alias-value pair to the stack.
- Parameters:
alias – namespace alias to push onto the stack. If the same alias is already present, this overwrites it until it gets popped off the stack.
uri – namespace name to associate with the alias.
- Returns:
normalized namespace identifier for the namespace name.
-
void pop(std::string_view alias)
Pop a namespace alias from the stack.
- Parameters:
alias – namespace alias to pop from the stack.
-
xmlns_id_t get(std::string_view alias) const
Get the current namespace identifier for a specified namespace alias.
- Parameters:
alias – namespace alias to get the current namespace identifier for.
- Returns:
current namespace identifier associated with the alias.
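The push(), pop() and get() semantics described above can be pictured with a toy model. This is not the orcus implementation: `alias_stack` is a hypothetical class that stores namespace names as plain strings instead of xmlns_id_t, and returning an empty string for an unknown alias is an assumption of the sketch.

```cpp
#include <map>
#include <string>
#include <string_view>
#include <vector>

// Toy model of alias scoping; each alias has its own stack of bindings.
class alias_stack
{
    // The last entry of each vector is the current binding for that alias.
    std::map<std::string, std::vector<std::string>> m_map;

public:
    // Bind an alias to a namespace name, shadowing any earlier binding.
    void push(std::string_view alias, std::string_view uri)
    {
        m_map[std::string(alias)].emplace_back(uri);
    }

    // Drop the most recent binding, restoring the one it shadowed.
    void pop(std::string_view alias)
    {
        auto it = m_map.find(std::string(alias));
        if (it == m_map.end() || it->second.empty())
            return;
        it->second.pop_back();
    }

    // Return the current binding, or an empty string when there is none.
    std::string get(std::string_view alias) const
    {
        auto it = m_map.find(std::string(alias));
        if (it == m_map.end() || it->second.empty())
            return std::string();
        return it->second.back();
    }
};
```

Pushing the same alias twice and popping once brings back the first binding, which matches the "overwrites it until it gets popped off the stack" wording. The empty alias models the default namespace.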
-
size_t get_index(xmlns_id_t ns_id) const
Get a unique index value associated with a specified identifier. An index value is guaranteed to be unique regardless of contexts.
- Parameters:
ns_id – a namespace identifier to obtain index for.
- Returns:
index value associated with the identifier.
-
std::string get_short_name(xmlns_id_t ns_id) const
Get a ‘short’ name associated with a specified identifier. A short name is a string value conveniently short enough for display purposes, but still guaranteed to be unique to the identifier it is associated with.
Note
The xmlns_repository class has a method of the same name, and that method works identically to this method.
- Parameters:
ns_id – a namespace identifier to obtain short name for.
- Returns:
short name for the specified identifier.
-
std::string_view get_alias(xmlns_id_t ns_id) const
Get an alias currently associated with a given namespace identifier.
- Parameters:
ns_id – namespace identifier.
- Returns:
alias name currently associated with the given namespace identifier, or an empty string if the given namespace is currently not associated with any aliases.
-
std::vector<xmlns_id_t> get_all_namespaces() const
-
void dump(std::ostream &os) const
-
void dump_state(std::ostream &os) const
Dump the internal state for debugging in YAML format.
-
void swap(xmlns_context &other) noexcept
Common
-
struct doctype_declaration
Document type declaration passed by sax_parser to its handler’s doctype() call.
Public Members
-
keyword_type keyword
-
std::string_view root_element
-
std::string_view fpi
-
std::string_view uri
-
char orcus::sax::decode_xml_encoded_char(const char *p, size_t n)
Given an encoded name (such as ‘quot’ and ‘amp’), return a single character that corresponds with the name. The name shouldn’t include the leading ‘&’ and trailing ‘;’.
- Parameters:
p – pointer to the first character of encoded name
n – length of encoded name
- Returns:
single character that corresponds with the encoded name. ‘\0’ is returned if decoding fails.
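The contract described above can be sketched with a stand-alone function. `decode_named_entity` is a hypothetical illustration, not the orcus implementation; it covers the five entity names predefined by XML and returns ‘\0’ for anything else.

```cpp
#include <cstddef>
#include <string_view>

// Illustrative re-implementation of the documented contract; not orcus code.
// The name is passed without the leading '&' and the trailing ';'.
char decode_named_entity(const char* p, std::size_t n)
{
    const std::string_view name(p, n);

    if (name == "amp")
        return '&';
    if (name == "lt")
        return '<';
    if (name == "gt")
        return '>';
    if (name == "quot")
        return '"';
    if (name == "apos")
        return '\'';

    // Unknown name: signal failure with the null character.
    return '\0';
}
```

A caller would compare the result against ‘\0’ before using it.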
-
std::string orcus::sax::decode_xml_unicode_char(const char *p, size_t n)
Given an encoded unicode value (such as #20A9), return a UTF-8 string that corresponds with the unicode value. The value shouldn’t include the leading ‘&’ and trailing ‘;’.
- Parameters:
p – pointer to the first character of encoded name
n – length of encoded name
- Returns:
string that corresponds with the encoded value. An empty string is returned if decoding fails.
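The conversion from a Unicode code point to a UTF-8 string can be sketched as follows. `encode_utf8` is a hypothetical helper, not an orcus function; it covers only the encoding step and assumes the numeric value has already been parsed from the encoded text.

```cpp
#include <cstdint>
#include <string>

// Encode a Unicode code point as UTF-8. Returns an empty string for values
// above U+10FFFF or in the surrogate range, which are not encodable.
std::string encode_utf8(std::uint32_t cp)
{
    std::string out;

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return out;

    if (cp < 0x80)
    {
        // 1 byte: 0xxxxxxx
        out.push_back(static_cast<char>(cp));
    }
    else if (cp < 0x800)
    {
        // 2 bytes: 110xxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else if (cp < 0x10000)
    {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    else
    {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }

    return out;
}
```

For the code point 0x20A9 used in the description, this produces the three bytes E2 82 A9.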