Welcome to xml-miner’s documentation!
XML/TRXML Selector
Description
This package provides two scripts: mine-xml and mine-trxml.
mine-xml selects tags from xml/mxml files and saves the selected values to a file.
mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.
Requirements
Python 3.6+
Installation
pip install xml-miner
Usage
Use xml selector script
The xml selector supports:
- one or more tagnames:
  - selector could be one tagname: name
  - or comma separated tagnames: langskill,compskill,softskills
- multiple sources:
  - e.g. select from xml dir, xml files, mxml file, or directly from annotation server
examples:
#select from xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name
#select from xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv
#select directly from annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
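As a rough illustration of what selecting tags by name amounts to, the sketch below uses only Python's standard xml.etree.ElementTree. It is not xml-miner's implementation: the sample document, the select_tags helper, the sample.xml filename and the choice to strip whitespace are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real xml/mxml inputs will look different.
SAMPLE = """<begin>
  <name>Chao Li</name>
  <compskill>java</compskill>
  <langskill>dutch</langskill>
</begin>"""


def select_tags(xml_string, selector):
    """Return (tag, text) pairs for every element whose tag appears in the
    comma separated selector, in document order."""
    wanted = set(selector.split(","))
    root = ET.fromstring(xml_string)
    return [
        (elem.tag, (elem.text or "").strip())
        for elem in root.iter()
        if elem.tag in wanted
    ]


rows = select_tags(SAMPLE, "compskill,langskill")
for tag, value in rows:
    # filename, value, field name: the column order described for --with_field_name
    print(f"sample.xml\t{value}\t{tag}")
```

A single tagname such as name works the same way, because splitting a selector without commas yields a one-element list.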
Use trxml selector script
The trxml selector supports:
- one or more selectors:
  - selector can be one field: name.0.name
  - or comma separated fields: name.0.name,address.0.address
- single or multi item:
  - can select a field from one item, e.g. experienceitem.3.experience
  - or select the field value of all items, e.g. experienceitem.experience (or experienceitem.*.experience)
- multiple sources:
  - e.g. select from trxml dir, trxml files, or mtrxml file
examples:
# one selector, single item
mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv
# one selector, multiple item
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv
# more selectors, single item
mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv
# more selectors, multiple item
mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv
Development
To install the package and its dependencies, run the following from the project root directory:
python setup.py install
To work on the code and develop the package, run the following from the project root directory:
python setup.py develop
To run unit tests, execute the following from the project root directory:
python setup.py test
selector and output details:
mine-xml:
- input: documents, selector(s), output
- output:
  - default (parameter with_field_name not set): filename, field_value
    e.g. select all names with selector name

    filename  value
    xxxx      Chao Li

  - parameter with_field_name set: filename, field_value, field_name
    e.g. select skills with selector compskill,langskill,otherskill

    filename  value  field
    xxxx      java   compskill
    xxxx      dutch  langskill

mine-trxml:
- input:
  - documents, selector(s), output, or
  - documents, itemgroup, fields, output
- single selector:
  - single item (name.0.name): filename, field

    filename  name.0.name
    xxxx      Chao Li

  - multi items (skill.*.skill): filename, item_index, field

    filename  item_index  field
    xxxx      0           java
    xxxx      1           dutch

- multiple selectors:
  - single item: filename, field1, field2 …
    each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

    filename  name.0.lastname  name.0.firstname  address.0.country
    xxxx      Li               Chao              China
    xxxx      Lee              Richard           USA

  - multi items: filename, item_index, field1, field2 …
    each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

    filename  skill  skill  type       date
    xxxx      0      java   compskill  2001-2005
    xxxx      1      dutch  langskill  2002-
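The multi-item layout (filename, item_index, then one column per field) can be sketched with the standard csv module. This is only an illustration of the row shape: the multi_item_rows helper, the in-memory items and the tab delimiter are assumptions, not xml-miner's own code or a statement about its actual delimiter.

```python
import csv
import io


def multi_item_rows(filename, items, fields):
    """Yield one row per item: filename, item_index, then one value per field."""
    for index, item in enumerate(items):
        yield [filename, index] + [item.get(field, "") for field in fields]


# Hypothetical items of one document's "skill" itemgroup.
items = [
    {"skill": "java", "type": "compskill", "date": "2001-2005"},
    {"skill": "dutch", "type": "langskill", "date": "2002-"},
]
fields = ["skill", "type", "date"]

buffer = io.StringIO()
# Tab delimiter is an assumption, chosen because the examples write .tsv files.
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["filename", "item_index"] + fields)
writer.writerows(multi_item_rows("xxxx", items, fields))
print(buffer.getvalue())
```

Emitting one row per item keeps the values of a single item on the same line, so fields such as skill and date stay associated with each other.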
Installation
Stable release
To install xml-miner, run this command in your terminal:
$ pip install xml-miner
This is the preferred method to install xml-miner, as it will always install the most recent stable release.
If you don’t have pip installed, the Python installation guide can walk you through the process.
From sources
The sources for xml-miner can be downloaded from the GitHub repo.
You can either clone the public repository:
$ git clone git://github.com/tilaboy/xml-miner
Or download the tarball:
$ curl -OL https://github.com/tilaboy/xml-miner/tarball/master
Once you have a copy of the source, you can install it with:
$ python setup.py install
xml_miner
xml_miner package
Subpackages
xml_miner.data_utils package
Submodules
xml_miner.data_utils.asclient module
A module to communicate with the TK annotation server

class xml_miner.data_utils.asclient.ASClient(host: str, port: str, as_user: str = '', as_pass: str = '')
Bases: object
Python version of the annotation server client
- BUFFER_SIZE = 2048
- ENCODING = 'utf-8'
- check_user_password()
  send the username and password to the server and confirm the login
- close_socket()
  shut down the connection
- get_docs(query='') → Iterator[str]
  get all queried documents
- get_ids(query: str = '') → List[str]
  get the document ids of all the queried documents
- make_connection() → bool
  build a connection to the Annotation Server
- prepare_message(message: str) → str
  prepare the query message:
  - add a new line
  - encode with utf8
- send_and_receive(query: str) → str
  send the query to the AS and decode the received response
- socket_output() → str
  receive the response from the Annotation Server and decode it
xml_miner.data_utils.data_loader module
A module to load input xml/trxml from different sources

class xml_miner.data_utils.data_loader.DataLoader(data_generator=None)
Bases: object
DataLoader: load data from different resources and generate xml input for downstream tasks
- TRXML_HEADER = '<TextractorResult '
- XML_HEADER = '<begin '
- classmethod load_from_as(host, port, query, as_user='', as_pass='')
  create the document loader object from the AnnotationServer
  params:
  - host (string): hostname of the AnnotationServer
  - port (int): port of the AnnotationServer
  - query (string): query to select documents
  - as_user: AnnotationServer username
  - as_pass: AnnotationServer password
  output:
  - DataLoader object: an iterator object to generate xml strings
- classmethod load_from_dir(input_dir)
  create the document loader object from a directory
  params:
  - input_dir (string): directory containing xml files
  output:
  - DataLoader object: an iterator object to generate xml strings
- classmethod load_from_mtrxml(input_mxml)
  create the document loader object from an mtrxml file
  params:
  - input_mxml (string): an mtrxml file
  output:
  - DataLoader object: an iterator object to generate xml strings
- classmethod load_from_mxml(input_mxml)
  create the document loader object from an mxml file
  params:
  - input_mxml (string): an mxml file
  output:
  - DataLoader object: an iterator object to generate xml strings

xml_miner.data_utils.data_loader.load_from_file(input_dir, files)
load documents from a list of files read from a directory
params:
- input_dir (string): directory containing all the files
- files (list): a list of files
output:
- xml string: an iterator object to generate xml strings

xml_miner.data_utils.data_loader.load_from_string(xml_string, header_line)
load a document string; the string might contain multiple xml documents
params:
- xml_string (string): input xml strings
output:
- xml string: an iterator object to generate xml strings
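One way a string holding several documents could be split on a header line is sketched below. It is a guess at the general idea, not the library's load_from_string: the split_documents helper and the sample input are hypothetical, and the sketch assumes every document starts on a line beginning with the header and that nothing precedes the first one.

```python
from typing import Iterator

XML_HEADER = "<begin "  # the value documented for DataLoader.XML_HEADER


def split_documents(text: str, header_line: str = XML_HEADER) -> Iterator[str]:
    """Yield one string per document, starting a new document at every
    line that begins with header_line."""
    current = []
    for line in text.splitlines(keepends=True):
        if line.startswith(header_line) and current:
            # A new header closes the document collected so far.
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


# Hypothetical two-document input.
mxml = (
    '<begin id="1">\n<name>Chao Li</name>\n</begin>\n'
    '<begin id="2">\n<name>Richard Lee</name>\n</begin>\n'
)
docs = list(split_documents(mxml))
print(len(docs))
```

Writing the splitter as a generator means a large file can be processed one document at a time instead of being held in memory as a list.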
xml_miner.data_utils.data_saver module
A module to output the selected values in the chosen format
Module contents
DataLoader and DataSaver classes

class xml_miner.data_utils.DataLoader(data_generator=None)
Bases: object
DataLoader: load data from different resources and generate xml input for downstream tasks
xml_miner.selectors package
Submodules
xml_miner.selectors.selector_utils module
utility functions and constants used by the selector and selectors classes

xml_miner.selectors.selector_utils.selector_attribute(selectors, attribute_name) → str
fetch the selector attribute, and check that it is consistent across all selectors
params:
- selectors: a list of selector objects
- attribute_name: name of the attribute
output:
- attribute_value: string

xml_miner.selectors.selector_utils.valid_field_name(tag_name: str = '') → bool
simple validation function
params:
- tag_name: string
output:
- True/False

xml_miner.selectors.trxml_selector module
selector class for trxml

class xml_miner.selectors.trxml_selector.TRXMLSelector(selector: str)
Bases: xml_miner.selectors.xml_selector.XMLSelector
trxml selector: a subclass of XMLSelector with methods to select values from trxml
- field_value_from_item(item) → str
  given an item and a field_name, get the value of that field
- parse_trxml_selector()
  convert the trxml selector to (itemgroup, index, field)
  params:
  - selector: string
  output:
  - itemgroup, index, field
  conversion rules:
  - ig.index.field -> (ig, index, field)
  - ig.*.field -> (ig, *, field)
  - ig.field -> (ig, *, field)
- select_field_with_xpath(xml_tree)
  select the field using the selector xpath
- select_value_with_xpath(xml_tree) → str
  get the value of the field where the selector matches
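The conversion rules listed for parse_trxml_selector can be sketched as a small standalone function. The split_trxml_selector name, the string type used for the index and the error handling are assumptions for illustration; the library's own method may differ in all three.

```python
from typing import Tuple


def split_trxml_selector(selector: str) -> Tuple[str, str, str]:
    """Split a trxml selector into (itemgroup, index, field).

    ig.index.field -> (ig, index, field)
    ig.*.field     -> (ig, "*", field)
    ig.field       -> (ig, "*", field)
    """
    parts = selector.split(".")
    if len(parts) == 3:
        itemgroup, index, field = parts
    elif len(parts) == 2:
        itemgroup, field = parts
        index = "*"  # no index given: select from all items
    else:
        raise ValueError(f"unsupported selector: {selector!r}")
    if index != "*" and not index.isdigit():
        raise ValueError(f"index must be a number or '*': {selector!r}")
    return itemgroup, index, field


print(split_trxml_selector("name.0.name"))
print(split_trxml_selector("experienceitem.experience"))
```

Normalising the two-part form to an explicit "*" lets later code treat experienceitem.experience and experienceitem.*.experience identically.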
xml_miner.selectors.trxml_selectors module
TRXML Selectors class

class xml_miner.selectors.trxml_selectors.TRXMLSelectors(selectors: List[str], trxml_selector_type=None, shared_itemgroup_name=None)
Bases: object
TRXMLSelectors: an array of TRXMLSelector objects, with methods to select values at the trxml document level or from each item
- classmethod from_itemgroup_and_fields(itemgroup: str, fields: str)
  construct from itemgroup and fields, only for trxml
  input:
  - ItemGroup, e.g. experienceitem
  - Fields, e.g. jobtitle,startdate,enddate
- classmethod from_selector_string(selector_string: str)
  construct the selectors from a string
  input:
  - selector string
- select_trxml_fields(trxml)
  select values from all fields matching the selectors

xml_miner.selectors.xml_selector module
XML selector
xml_miner.selectors.xml_selectors module
XML Selectors class

class xml_miner.selectors.xml_selectors.XMLSelectors(selectors: List[str])
Bases: object
XMLSelectors: an array of XMLSelector objects, with a method to select values from an xml object
- classmethod from_selector_string(selector_string: str)
  construct the xml selectors from an input string
- select_xml_fields(xml_tree)
  select all values that match the selectors
Module contents
xml selectors and trxml selectors classes

class xml_miner.selectors.XMLSelectors(selectors: List[str])
Bases: object
XMLSelectors: an array of XMLSelector objects, with a method to select values from an xml object

class xml_miner.selectors.TRXMLSelectors(selectors: List[str], trxml_selector_type=None, shared_itemgroup_name=None)
Bases: object
TRXMLSelectors: an array of TRXMLSelector objects, with methods to select values at the trxml document level or from each item
xml_miner.xml package
Submodules
xml_miner.xml.base_xml module
XML class: render an xml file or xml string to an xml tree object

class xml_miner.xml.base_xml.XML(top_level_obj=None)
Bases: object
XML: general xml class; the xml tree object can be generated from an xml file or an xml string
- classmethod from_file(xml_file: str)
  create an xml object from a filename
  params:
  - xml_file (string): xml file
  output:
  - xml object: ElementTree object
- classmethod from_string(xml_string: str)
  create an xml object from xml_string
  params:
  - xml_string (string): xml string
  output:
  - xml object: ElementTree object
- static text_from_element(element)
  the text value of an xml element
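When reading an element's text with the standard xml.etree.ElementTree, an empty element has text set to None rather than an empty string, and find returns None when no element matches. The sketch below shows one way to normalise both cases to an empty string; element_text is a hypothetical helper and is not claimed to match text_from_element.

```python
import xml.etree.ElementTree as ET


def element_text(element):
    """Return the element's text, or '' when the element is missing or empty."""
    if element is None or element.text is None:
        return ""
    return element.text


# Hypothetical sample: one filled tag, one empty tag, one tag that is absent.
root = ET.fromstring("<begin><name>Chao Li</name><phone></phone></begin>")
print(repr(element_text(root.find("name"))))
print(repr(element_text(root.find("phone"))))
print(repr(element_text(root.find("email"))))
```

Returning a string in every case keeps downstream output code free of None checks.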
xml_miner.xml.tk_trxml module
TRXML class: render field or strings to trxml class, and select using xpath

class xml_miner.xml.tk_trxml.TKTRXML(top_level_obj=None)
Bases: xml_miner.xml.base_xml.XML
TRXML: render field or strings to trxml class, and select using xpath
- filename
  filename of the original file, normally stored as an attribute of the top level tag
- working_entity
  the xml element that searches are applied to
- xml_entity
  the xml part of the tree

xml_miner.xml.tk_xml module
TKXML class: render field or strings to xml class, and select values

class xml_miner.xml.tk_xml.TKXML(top_level_obj=None)
Bases: xml_miner.xml.base_xml.XML
TKXML class: xml tree object generated from an xml file or an xml string
- working_entity
  the xml element that searches are applied to
Module contents
xml and trxml classes

class xml_miner.xml.TKXML(top_level_obj=None)
Bases: xml_miner.xml.base_xml.XML
TKXML class: xml tree object generated from an xml file or an xml string

class xml_miner.xml.TKTRXML(top_level_obj=None)
Bases: xml_miner.xml.base_xml.XML
TRXML: render field or strings to trxml class, and select using xpath
Submodules
xml_miner.mine_trxml module
the trxml selector script

xml_miner.mine_trxml.get_args()
get arguments

xml_miner.mine_trxml.main()
apply selectors to trxml files

xml_miner.mine_xml module
the xml selector script

xml_miner.mine_xml.get_args()
get arguments

xml_miner.mine_xml.main()
apply selectors to xml files
xml_miner.miner module
apply selectors to the input data, and output the selected values to a csv file

class xml_miner.miner.CommonMiner(selectors)
Bases: object
CommonMiner: shared class for both xml and trxml
- static normalize_string(line: str) → str
  normalize the selected values:
  - replace newline characters with ‘__NEWLINE__’
  - replace tab characters with 4 spaces
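Reading the normalize_string description as replacing newlines with the __NEWLINE__ placeholder and tabs with four spaces (an interpretation of the description, not a confirmed detail), the normalization can be sketched as follows. The normalize_value name is hypothetical, and carriage returns are deliberately left untouched here.

```python
def normalize_value(line: str) -> str:
    """Keep a selected value on a single line of delimited output."""
    # A newline would split the record across two rows.
    line = line.replace("\n", "__NEWLINE__")
    # A tab would be read as an extra column in tab separated output.
    return line.replace("\t", "    ")


print(normalize_value("senior\tdeveloper\nAmsterdam"))
```

Using a visible placeholder rather than dropping the newline means the original line breaks can be restored later if needed.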
class xml_miner.miner.TRXMLMiner(selectors, itemgroup=None, fields=None)
Bases: xml_miner.miner.CommonMiner
TRXMLProcessor: iterate over the trxml files and select values; output the selected values to a file and print a summary
- load_data(source)
  load the data into a data generator
  params:
  - source: data source
  output:
  - yield trxml
- mine(source)
  iterate over the input data (trxml obj), apply the selectors to each trxml, and output the selected values to a csv file
  params:
  - source: data source
  output:
  - generate selected values per doc
- mine_and_save(source: str, output_file: str)
  iterate over the input data (trxml obj), apply the selectors to each trxml, and output the selected values to a csv file
  params:
  - source (string): data source
  - output_file (string): the output filename
- read_selectors(selector: str, itemgroup: str = '', fields: str = '')
  read the selector strings and construct the selectors object
  params:
  - selector: input selector strings
  - itemgroup: input itemgroup strings
  - fields: input fields strings
  output:
  - selectors: TRXMLSelectors object

class xml_miner.miner.XMLMiner(selectors, with_field_name=False)
Bases: xml_miner.miner.CommonMiner
XMLProcessor: iterate over the xml files and select values; output the selected values to a file and print a summary
- load_data(source: str, query: str = None, as_user: str = None, as_pass: str = None)
  load the data into a data generator
  params:
  - source: data source
  - annotation server parameters: query, as_user, as_pass
  output:
  - yield xml
- mine(source: str, query: str = None, as_user: str = None, as_pass: str = None)
  iterate over the input data (xml obj), apply the selectors to each xml, and yield the selected values
  params:
  - source: data source
  - annotation server parameters: query, as_user, as_pass
  output:
  - iterate over the selected fields per doc
- mine_and_save(source: str, output_file: str, query: str = None, as_user: str = None, as_pass: str = None)
  iterate over the selected values and save/print them to the output
  params:
  - source: data source
  - output_file (string): the output filename
  - annotation server parameters: query, as_user, as_pass
  output file format:
  - no field name: filename, value
  - with field name: filename, value, field_name
- read_selectors(selector: str)
  read the selector strings and construct the selectors object
  params:
  - selector: input selector strings
  output:
  - selectors: XMLSelectors object
0.0.5 (2019-10-14)
- bug fix: ElementTree xpath find returns None if the value is an empty string; restore it to an empty string
0.0.4 (2019-09-11)
- bug fix: reading always uses utf8, and does not continue if the encoding of one document fails
0.0.3 (2019-08-11)
- expand the miner.py module to generate matched phrases per doc
0.0.2 (2019-08-09)
- added support for CI
0.0.1 (2019-08-09)
- make two scripts: mine-xml and mine-trxml
0.0.0 (2019-08-06)
- add the first version of mine_xml and mine_trxml