Welcome to xml-miner’s documentation!

XML/TRXML Selector

Description

This package provides two scripts: mine-xml and mine-trxml.

mine-xml selects tags from xml/mxml files and saves the selected values to a file.

mine-trxml selects fields from trxml/mtrxml files and saves the selected values to a file.

Requirements

Python 3.6+

Installation

pip install xml-miner

Usage

Use xml selector script

The xml selector supports:
  • one or more tag names:
    • the selector can be a single tag name, e.g. name
    • or comma-separated tag names, e.g. langskill,compskill,softskill
  • multiple sources:
    • an xml directory, xml files, an mxml file, or an annotation server queried directly
Examples:
# select from xml directory
mine-xml --source tests/xmls/ --selector name --output_file name.tsv
mine-xml --source tests/xmls/ --selector langskill,compskill,softskill --output_file skill.tsv --with_field_name

# select from xml file or mxml file
mine-xml --source tests/sample.mxml --selector experience --output_file experience.tsv

# select directly from annotation server
mine-xml --source localhost:50249 --selector name --output_file name.tsv --query "set Data2018"
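
The tag selection that mine-xml performs can be pictured with a small standard-library sketch. It is not the package's implementation: the helper name select_tags, the sample document, and the file name cv1.xml are assumptions made for illustration. The sketch splits a comma-separated selector into tag names and collects the text of every matching element as rows of file name, value, and tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real xml files will look different.
SAMPLE = """<begin>
  <name>Chao Li</name>
  <compskill>java</compskill>
  <langskill>dutch</langskill>
</begin>"""


def select_tags(filename, xml_string, selector):
    """Return (filename, value, tag) rows for every tag named in the selector."""
    tree = ET.fromstring(xml_string)
    rows = []
    for tag in selector.split(","):
        # iter() walks the whole tree, so nested tags are found as well
        for element in tree.iter(tag):
            rows.append((filename, element.text or "", tag))
    return rows


rows = select_tags("cv1.xml", SAMPLE, "langskill,compskill")
for row in rows:
    print("\t".join(row))
```

Rows come out in selector order, so the langskill value is listed before the compskill value here.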

Use trxml selector script

The trxml selector supports:
  • one or more selectors:
    • the selector can be a single field, e.g. name.0.name
    • or comma-separated fields, e.g. name.0.name,address.0.address
  • single or multiple items:
    • select a field from one item, e.g. experienceitem.3.experience
    • or select the field value from all items, e.g. experienceitem.experience (or experienceitem.*.experience)
  • multiple sources:
    • a trxml directory, trxml files, or an mtrxml file
Examples:
# one selector, single item
mine-trxml --source tests/trxmls/ --selector name.0.name --output_file name.tsv

# one selector, multiple items
mine-trxml --source tests/sample.mxml --selector experienceitem.experience --output_file experience.tsv

# several selectors, single item
mine-trxml --source tests/trxmls/ --selector name.0.name,address.0.address,phone.0.phone --output_file personal.tsv

# several selectors, multiple items
mine-trxml --source tests/sample.mxml --itemgroup experienceitem --fields experience,experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.*.experience,experienceitem.*.experiencedate --output_file experience.tsv
mine-trxml --source tests/sample.mxml --selector experienceitem.experience,experienceitem.experiencedate --output_file experience.tsv

Development

To install the package and its dependencies, run the following from the project root directory:

python setup.py install

To work on the code and develop the package, run the following from the project root directory:

python setup.py develop

To run unit tests, execute the following from the project root directory:

python setup.py test

Selector and output details:

  • mine-xml

    input: documents, selector(s), output

    output:

    • default (parameter with_field_name not set): filename, field_value

      e.g. select all names with the selector name

      filename    value
      xxxx        Chao Li

    • parameter with_field_name set: filename, field_value, field_name

      e.g. select skills with the selector compskill,langskill,otherskill

      filename    value    field
      xxxx        java     compskill
      xxxx        dutch    langskill

  • mine-trxml

    input, either of:

    • documents, selector(s), output
    • documents, itemgroup, fields, output

    single selector:

    • single item (name.0.name): filename, field

      filename    name.0.name
      xxxx        Chao Li

    • multiple items (skill.*.skill): filename, item_index, field

      filename    item_index    field
      xxxx        0             java
      xxxx        1             dutch

    multiple selectors:

    • single item: filename, field1, field2 …

      each selector points to a field of a specific item with a numeric index, e.g. name.0.lastname,name.0.firstname,address.0.country

      filename    name.0.lastname    name.0.firstname    address.0.country
      xxxx        Li                 Chao                China
      xxxx        Lee                Richard             USA

    • multiple items: filename, item_index, field1, field2 …

      each selector points to a field from all items in an itemgroup, e.g. skill.skill,skill.type,skill.date

      filename    skill    skill    type         date
      xxxx        0        java     compskill    2001-2005
      xxxx        1        dutch    langskill    2002-
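
The multi-item layout above can be mimicked with plain Python data. This sketch assumes a document has already been parsed into a dictionary that maps an item group name to a list of items; that structure and the function name rows_for_all_items are illustrative only and are not part of xml-miner.

```python
def rows_for_all_items(filename, doc, itemgroup, fields):
    """Build one row per item: filename, item index, then each requested field."""
    rows = []
    for index, item in enumerate(doc.get(itemgroup, [])):
        # a missing field becomes an empty string so every row has the same width
        values = [item.get(field, "") for field in fields]
        rows.append([filename, str(index)] + values)
    return rows


# Hypothetical parsed document with one item group named "skill".
doc = {
    "skill": [
        {"skill": "java", "type": "compskill", "date": "2001-2005"},
        {"skill": "dutch", "type": "langskill", "date": "2002-"},
    ]
}

for row in rows_for_all_items("xxxx", doc, "skill", ["skill", "type", "date"]):
    print("\t".join(row))
```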

Installation

Stable release

To install xml-miner, run this command in your terminal:

$ pip install xml-miner

This is the preferred method to install xml-miner, as it will always install the most recent stable release.

If you don’t have pip installed, this Python installation guide can guide you through the process.

From sources

The sources for xml-miner can be downloaded from the Github repo.

You can either clone the public repository:

$ git clone git://github.com/tilaboy/xml-miner

Or download the tarball:

$ curl -OL https://github.com/tilaboy/xml-miner/tarball/master

Once you have a copy of the source, you can install it with:

$ python setup.py install

xml_miner

xml_miner package

Subpackages

xml_miner.data_utils package
Submodules
xml_miner.data_utils.asclient module

A module to communicate with the TK annotation server

class xml_miner.data_utils.asclient.ASClient(host: str, port: str, as_user: str = '', as_pass: str = '')

Bases: object

Python version of annotation server client

BUFFER_SIZE = 2048
ENCODING = 'utf-8'
check_user_password()

send the username and password to the server and confirm the login

close_socket()

shut down the connection

get_docs(query='') → Iterator[str]

get all queried documents

get_ids(query: str = '') → List[str]

get all document ids of all the queried documents

make_connection() → bool

Open a connection to the Annotation Server

prepare_message(message: str) → str

prepare the query message: append a newline and encode it as UTF-8
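
As a rough illustration of the two steps named above, the following sketch appends a newline and encodes the text as UTF-8. It is a standalone function written for this page, not the ASClient method itself, and returning bytes is an assumption about what is sent over the socket.

```python
ENCODING = "utf-8"  # same value as the ENCODING constant listed above


def prepare_message(message: str) -> bytes:
    """Terminate the query with a newline and encode it for sending."""
    return (message + "\n").encode(ENCODING)


payload = prepare_message("set Data2018")
print(payload)
```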

send_and_receive(query: str) → str

send the query to AS and decode the received response

socket_output() → str

Receive response from Annotation Server and decode

xml_miner.data_utils.data_loader module

A module to load input xml/trxml from different sources

class xml_miner.data_utils.data_loader.DataLoader(data_generator=None)

Bases: object

DataLoader: - load data from different resources - generate xml input for downstream tasks

TRXML_HEADER = '<TextractorResult '
XML_HEADER = '<begin '
classmethod load_from_as(host, port, query, as_user='', as_pass='')

create the document loader object from AnnotationServer

params:
host (string): hostname of the AnnotationServer
port (int): port of the AnnotationServer
query (string): query to select documents
as_user: AnnotationServer username
as_pass: AnnotationServer password
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_dir(input_dir)

create the document loader object from dir

params:
input_dir (string): directory containing xml files
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_mtrxml(input_mxml)

create the document loader object from mtrxml

params:
input_mxml (string): an mtrxml file
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_mxml(input_mxml)

create the document loader object from mxml

params:
input_mxml (string): an mxml file
output:
DataLoader object: an iterator object that generates xml strings
xml_miner.data_utils.data_loader.load_from_file(input_dir, files)

load documents from a list of files read from a directory

params:
input_dir (string): directory containing all files
files (list): a list of files
output:
xml string: an iterator object that generates xml strings
xml_miner.data_utils.data_loader.load_from_string(xml_string, header_line)

load a document string; the string might contain multiple xml documents

params:
xml_string (string): input xml string
output:
xml string: an iterator object that generates xml strings
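
One way to picture the splitting is to start a new document whenever a line begins with the header string. The generator below is a sketch under that assumption (documents concatenated one after another, each starting on its own line). It is not the package's code, and split_documents is a made-up name.

```python
from typing import Iterator

XML_HEADER = "<begin "  # header constant listed for DataLoader above


def split_documents(xml_string: str, header_line: str) -> Iterator[str]:
    """Yield each document found in a string holding several documents."""
    current = []
    for line in xml_string.splitlines(keepends=True):
        # a header line closes the previous document and opens a new one
        if line.startswith(header_line) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


mxml = (
    '<begin id="1">\n<name>A</name>\n</begin>\n'
    '<begin id="2">\n<name>B</name>\n</begin>\n'
)
docs = list(split_documents(mxml, XML_HEADER))
```

Because it is a generator, documents are produced one at a time, which keeps memory use low for large mxml inputs.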
xml_miner.data_utils.data_saver module

A module to write the selected values in the chosen format

class xml_miner.data_utils.data_saver.DataSaver(output_file)

Bases: object

DataSaver: - open/create the target output file - save the selected values to file with the corresponding format

close_stream()

close the file

store(row)

Store one row at a time
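
A saver with this shape can be sketched with the standard csv module. The class below is a hypothetical stand-in, named TsvSaver to avoid confusion with the real DataSaver, and it assumes tab-separated output to match the .tsv files used in the usage examples.

```python
import csv
import os
import tempfile


class TsvSaver:
    """Open an output file once and append one tab-separated row per call."""

    def __init__(self, output_file):
        self._stream = open(output_file, "w", encoding="utf-8", newline="")
        self._writer = csv.writer(
            self._stream, delimiter="\t", lineterminator="\n"
        )

    def store(self, row):
        """Write a single row, given as a list of strings."""
        self._writer.writerow(row)

    def close_stream(self):
        """Flush and close the underlying file."""
        self._stream.close()


path = os.path.join(tempfile.mkdtemp(), "name.tsv")
saver = TsvSaver(path)
saver.store(["filename", "value"])
saver.store(["xxxx", "Chao Li"])
saver.close_stream()

with open(path, encoding="utf-8", newline="") as handle:
    content = handle.read()
```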

Module contents

DataLoader and DataSaver classes

class xml_miner.data_utils.DataLoader(data_generator=None)

Bases: object

DataLoader: - load data from different resources - generate xml input for downstream tasks

TRXML_HEADER = '<TextractorResult '
XML_HEADER = '<begin '
classmethod load_from_as(host, port, query, as_user='', as_pass='')

create the document loader object from AnnotationServer

params:
host (string): hostname of the AnnotationServer
port (int): port of the AnnotationServer
query (string): query to select documents
as_user: AnnotationServer username
as_pass: AnnotationServer password
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_dir(input_dir)

create the document loader object from dir

params:
input_dir (string): directory containing xml files
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_mtrxml(input_mxml)

create the document loader object from mtrxml

params:
input_mxml (string): an mtrxml file
output:
DataLoader object: an iterator object that generates xml strings
classmethod load_from_mxml(input_mxml)

create the document loader object from mxml

params:
input_mxml (string): an mxml file
output:
DataLoader object: an iterator object that generates xml strings
class xml_miner.data_utils.DataSaver(output_file)

Bases: object

DataSaver: - open/create the target output file - save the selected values to file with the corresponding format

close_stream()

close the file

store(row)

Store one row at a time

xml_miner.selectors package
Submodules
xml_miner.selectors.selector_utils module

utility functions and constants used by the selector and selectors classes

xml_miner.selectors.selector_utils.selector_attribute(selectors, attribute_name) → str

fetch the selector attribute, and check the consistency of all selectors

params: - selectors: a list of selector objects - attribute_name: name of the attribute

output: attribute_value: string
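
The consistency check can be read as: collect the attribute from every selector and require a single shared value. The sketch below follows that reading. The Selector stand-in class, the attribute name itemgroup, and the choice of ValueError are assumptions, not the library's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Selector:
    """Hypothetical stand-in for a selector object."""
    itemgroup: str
    field: str


def selector_attribute(selectors, attribute_name) -> str:
    """Return the attribute value shared by all selectors, or raise if they differ."""
    values = {getattr(selector, attribute_name) for selector in selectors}
    if len(values) != 1:
        raise ValueError(
            "selectors disagree on %s: %s" % (attribute_name, sorted(values))
        )
    return values.pop()


shared = selector_attribute(
    [Selector("skill", "skill"), Selector("skill", "type")], "itemgroup"
)
```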

xml_miner.selectors.selector_utils.valid_field_name(tag_name: str = '') → bool

simple validation function:

params: - tag_name: string

output: - True/False

xml_miner.selectors.trxml_selector module

selector class for trxml

class xml_miner.selectors.trxml_selector.TRXMLSelector(selector: str)

Bases: xml_miner.selectors.xml_selector.XMLSelector

trxml selector:

  • subclass of XMLSelector
  • method to select values from trxml
field_value_from_item(item) → str

given an item and a field_name, get the value of that field

parse_trxml_selector()

converting the trxml selector to (itemgroup, index, field):

params:

  • selector: string

output:

  • itemgroup, index, field

conversion rules:

- ig.index.field    ->    (ig, index, field)
- ig.*.field        ->    (ig, *, field)
- ig.field          ->    (ig, *, field)
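
The three conversion rules can be expressed in a few lines. This is an illustrative standalone function, not the TRXMLSelector method, and the error raised for malformed selectors is an assumption.

```python
from typing import Tuple


def parse_trxml_selector(selector: str) -> Tuple[str, str, str]:
    """Convert a trxml selector into (itemgroup, index, field)."""
    parts = selector.split(".")
    if len(parts) == 3:
        # ig.index.field or ig.*.field
        itemgroup, index, field = parts
    elif len(parts) == 2:
        # ig.field is shorthand for every item, i.e. ig.*.field
        itemgroup, field = parts
        index = "*"
    else:
        raise ValueError("unsupported selector: %r" % selector)
    return itemgroup, index, field


for example in ("name.0.name", "skill.*.skill", "skill.skill"):
    print(example, "->", parse_trxml_selector(example))
```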
select_field_with_xpath(xml_tree)

select the field using the selector xpath

select_value_with_xpath(xml_tree) → str

get the value of the field where the selector matches

xml_miner.selectors.trxml_selectors module

TRXML Selectors class

class xml_miner.selectors.trxml_selectors.TRXMLSelectors(selectors: List[str], trxml_selector_type=None, shared_itemgroup_name=None)

Bases: object

TRXMLSelectors: - array of TRXMLSelector objects - method to select values at trxml document level or from each item

classmethod from_itemgroup_and_fields(itemgroup: str, fields: str)

construct from itemgroup and fields, only for trxml

input:
  • ItemGroup, e.g. experienceitem
  • Fields, e.g. jobtitle,startdate,enddate
classmethod from_selector_string(selector_string: str)

construct the selectors from string

input:
  • selector string
select_trxml_fields(trxml)

select values from all fields matching selectors

xml_miner.selectors.xml_selector module

XML selector

class xml_miner.selectors.xml_selector.XMLSelector(selector: str)

Bases: object

XMLSelector: - select all values of nodes matching the selector

select_all_fields(xml_tree)

select all fields matching the selector

select_all_values(xml_tree) → List[str]

select all values matching the selector

xml_miner.selectors.xml_selectors module

XML Selectors class

class xml_miner.selectors.xml_selectors.XMLSelectors(selectors: List[str])

Bases: object

XMLSelectors: - array of XMLSelector class - method to select values from xml object

classmethod from_selector_string(selector_string: str)

construct xml selector from input string

select_xml_fields(xml_tree)

select all values matching the selectors

Module contents

xml selectors and trxml selectors classes

class xml_miner.selectors.XMLSelectors(selectors: List[str])

Bases: object

XMLSelectors: - array of XMLSelector class - method to select values from xml object

classmethod from_selector_string(selector_string: str)

construct xml selector from input string

select_xml_fields(xml_tree)

select all values matching the selectors

class xml_miner.selectors.TRXMLSelectors(selectors: List[str], trxml_selector_type=None, shared_itemgroup_name=None)

Bases: object

TRXMLSelectors: - array of TRXMLSelector objects - method to select values at trxml document level or from each item

classmethod from_itemgroup_and_fields(itemgroup: str, fields: str)

construct from itemgroup and fields, only for trxml

input:
  • ItemGroup, e.g. experienceitem
  • Fields, e.g. jobtitle,startdate,enddate
classmethod from_selector_string(selector_string: str)

construct the selectors from string

input:
  • selector string
select_trxml_fields(trxml)

select values from all fields matching selectors

xml_miner.xml package
Submodules
xml_miner.xml.base_xml module

XML class: render xml file or xml strings to xml tree object

class xml_miner.xml.base_xml.XML(top_level_obj=None)

Bases: object

XML:
general xml class, xml tree object can be generated from - xml file - xml string
classmethod from_file(xml_file: str)

create xml object from filename

params:
xml_file (string): xml file
output:
xml object: ElementTree object
classmethod from_string(xml_string: str)

create xml object from xml_string

params:
xml_string (string): xml string
output:
xml object: ElementTree object
static text_from_element(element)

the text value of an xml element
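
With the standard library, parsing a string and reading an element's text looks roughly like the sketch below. Returning an empty string for an element without text is an assumption based on the 0.0.5 changelog entry, and the function here is a free function written for illustration, not the static method itself.

```python
import xml.etree.ElementTree as ET


def text_from_element(element) -> str:
    """Return the element text, or an empty string when there is none."""
    if element is None or element.text is None:
        return ""
    return element.text


root = ET.fromstring("<begin><name>Chao Li</name><phone></phone></begin>")
name = text_from_element(root.find("name"))
phone = text_from_element(root.find("phone"))      # element exists but is empty
missing = text_from_element(root.find("address"))  # element does not exist
```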

xml_miner.xml.tk_trxml module

TRXML class: render field or strings to trxml class, and select using xpath

class xml_miner.xml.tk_trxml.TKTRXML(top_level_obj=None)

Bases: xml_miner.xml.base_xml.XML

TRXML: - render field or strings to trxml class - and select using xpath

filename

filename of the original file

normally stored as an attribute of the top level tag

working_entity

the xml element to apply searching

xml_entity

the xml part of the tree

xml_miner.xml.tk_xml module

TKXML class: render file or strings to xml class, and select values

class xml_miner.xml.tk_xml.TKXML(top_level_obj=None)

Bases: xml_miner.xml.base_xml.XML

TKXML class:
xml tree object generated from - xml file - xml string
working_entity

the xml element to apply searching

Module contents

xml and trxml classes

class xml_miner.xml.TKXML(top_level_obj=None)

Bases: xml_miner.xml.base_xml.XML

TKXML class:
xml tree object generated from - xml file - xml string
working_entity

the xml element to apply searching

class xml_miner.xml.TKTRXML(top_level_obj=None)

Bases: xml_miner.xml.base_xml.XML

TRXML: - render field or strings to trxml class - and select using xpath

filename

filename of the original file

normally stored as an attribute of the top level tag

working_entity

the xml element to apply searching

xml_entity

the xml part of the tree

Submodules

xml_miner.mine_trxml module

the trxml selector script

xml_miner.mine_trxml.get_args()

get arguments

xml_miner.mine_trxml.main()

apply selectors to trxml files

xml_miner.mine_xml module

the xml selector script

xml_miner.mine_xml.get_args()

get arguments

xml_miner.mine_xml.main()

apply selectors to xml files

xml_miner.miner module

apply selector on input data, and output it to a csv file

class xml_miner.miner.CommonMiner(selectors)

Bases: object

CommonMiner:

shared class for both xml and trxml

static normalize_string(line: str) → str

normalize the selected values:
  • replace newline (\n) with '__NEWLINE__'
  • replace tab (\t) with 4 spaces '    '
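
A minimal sketch of this normalization, assuming the two replaced characters are the newline and the tab. The aim is to keep each selected value on a single line of a tab-separated output file.

```python
def normalize_string(line: str) -> str:
    """Replace newlines and tabs so a value fits in one tab-separated cell."""
    return line.replace("\n", "__NEWLINE__").replace("\t", "    ")


cleaned = normalize_string("Java\tdeveloper\nAmsterdam")
print(cleaned)
```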
class xml_miner.miner.TRXMLMiner(selectors, itemgroup=None, fields=None)

Bases: xml_miner.miner.CommonMiner

TRXMLMiner: - iterate over the trxml files and select values - output selected values to a file, and print summary

load_data(source)

load the data into a data generator

params:
  • source: data source
output:
  • yield trxml
mine(source)

iterate the input data (trxml obj), apply selector on each trxml, and yield the selected values

params:
source: data source
output:
generate selected values per doc
mine_and_save(source: str, output_file: str)

iterate the input data (trxml obj), apply selector on each trxml, and output the selected values to a csv file

params:
source (string): data source
output_file (string): the output filename
read_selectors(selector: str, itemgroup: str = '', fields: str = '')

read selector strings and construct selector object

params:
  • selector: input selector strings
  • itemgroup: input itemgroup strings
  • fields: input fields strings
output:
  • selectors: TRXMLSelectors object
class xml_miner.miner.XMLMiner(selectors, with_field_name=False)

Bases: xml_miner.miner.CommonMiner

XMLMiner: - iterate over the xml files and select values - output selected values to a file, and print summary

load_data(source: str, query: str = None, as_user: str = None, as_pass: str = None)

load the data into a data generator

params:
  • source: data source
  • annotation server parameters: query, as_user, as_pass
output:
  • yield xml
mine(source: str, query: str = None, as_user: str = None, as_pass: str = None)

iterate the input data (xml obj), apply selector on each xml, and yield the selected values

params:
  • source: data source
  • annotation server parameters: query, as_user, as_pass
output:
  • iterate over selected fields per doc
mine_and_save(source: str, output_file: str, query: str = None, as_user: str = None, as_pass: str = None)

iterate the selected values and save/print to output

params:
  • source: data source
  • output_file (string): the output filename
  • annotation server parameters: query, as_user, as_pass
output file format:
  • no field name: filename value
  • with field name: filename, value, field_name
read_selectors(selector: str)

read selector strings and construct selectors object

params:
  • selector: input selector strings
output:
  • selectors: XMLSelectors object

Module contents

Top-level package for xml-miner

xml_miner.define_logger(mod_name)

Set the default logging configuration

xml_miner.set_logging_level(level=30)

Change logging level
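
The default level=30 corresponds to logging.WARNING in the standard library. The sketch below shows one plausible shape for such helpers. The function bodies and the choice to adjust the root logger are assumptions, not the package source.

```python
import logging


def define_logger(mod_name):
    """Return a named logger; handlers and format are left at their defaults here."""
    return logging.getLogger(mod_name)


def set_logging_level(level=30):
    """Change the level of the root logger (30 is logging.WARNING)."""
    logging.getLogger().setLevel(level)


logger = define_logger("xml_miner.example")
set_logging_level(logging.DEBUG)  # show debug messages while investigating
```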

Credits

Development Lead

Contributors

  • Chao Li

0.0.5 (2019-10-14)

  • bug fix: ElementTree xpath find will return a None if value is an empty string, restore to empty string

0.0.4 (2019-09-11)

  • bug fix: reading always use utf8, and not continue reading if failed on encoding of one document

0.0.3 (2019-08-11)

  • expand miner.py module to generate matched phrases per doc

0.0.2 (2019-08-09)

  • added support for CI

0.0.1 (2019-08-09)

  • make two scripts: mine-xml and mine-trxml

0.0.0 (2019-08-06)

  • Add the first version of the mine_xml and mine_trxml
