xml_miner.data_utils package

Submodules

xml_miner.data_utils.asclient module

A module to communicate with the TK annotation server

class xml_miner.data_utils.asclient.ASClient(host: str, port: str, as_user: str = '', as_pass: str = '')

Bases: object

Python version of the annotation server client

BUFFER_SIZE = 2048
ENCODING = 'utf-8'
check_user_password()

Send the username and password to the server and confirm the login

close_socket()

Shut down the connection

get_docs(query='') → Iterator[str]

Get all documents matching the query

get_ids(query: str = '') → List[str]

Get the document IDs of all documents matching the query

make_connection() → bool

Establish a connection to the Annotation Server

prepare_message(message: str) → str

Prepare the query message:
- add a newline
- encode as UTF-8

send_and_receive(query: str) → str

Send the query to the Annotation Server (AS) and decode the received response

socket_output() → str

Receive the response from the Annotation Server and decode it
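
The send/receive cycle described above can be illustrated with the standard library alone. The following is a minimal sketch, not the ASClient implementation: it reuses the documented BUFFER_SIZE and ENCODING values, but the helper names `receive_line` and `round_trip`, the use of a local socket pair in place of a real Annotation Server, and the assumption that replies are newline-terminated are all hypothetical.

```python
import socket

BUFFER_SIZE = 2048   # same value as ASClient.BUFFER_SIZE
ENCODING = "utf-8"   # same value as ASClient.ENCODING


def prepare_message(message: str) -> bytes:
    """Append a newline and encode the message (assumed wire format)."""
    return (message + "\n").encode(ENCODING)


def receive_line(sock: socket.socket) -> str:
    """Read BUFFER_SIZE-byte chunks until a newline arrives, then decode."""
    chunks = []
    while True:
        chunk = sock.recv(BUFFER_SIZE)
        if not chunk:                  # peer closed the connection
            break
        chunks.append(chunk)
        if chunk.endswith(b"\n"):      # assumption: replies end with a newline
            break
    # Join the raw bytes first so multi-byte characters split across
    # chunks are decoded correctly.
    return b"".join(chunks).decode(ENCODING).rstrip("\n")


def round_trip(query: str) -> str:
    """Send a query over a local socket pair; the fake server upper-cases it."""
    client, server = socket.socketpair()
    try:
        client.sendall(prepare_message(query))
        request = receive_line(server)             # fake server reads the query
        server.sendall(prepare_message(request.upper()))
        return receive_line(client)                # client reads the reply
    finally:
        client.close()
        server.close()
```

Calling `round_trip("hello")` sends the encoded query through the pair and returns the decoded reply. A real client would connect to the configured host and port instead of using `socket.socketpair()`.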

xml_miner.data_utils.data_loader module

A module to load input XML/TRXML from different sources

class xml_miner.data_utils.data_loader.DataLoader(data_generator=None)

Bases: object

DataLoader:
- load data from different resources
- generate XML input for downstream tasks

TRXML_HEADER = '<TextractorResult '
XML_HEADER = '<begin '
classmethod load_from_as(host, port, query, as_user='', as_pass='')

Create the document loader object from the AnnotationServer

params:
host (string): hostname of the AnnotationServer
port (int): port of the AnnotationServer
query (string): query to select documents
as_user (string): AnnotationServer username
as_pass (string): AnnotationServer password
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_dir(input_dir)

Create the document loader object from a directory

params:
input_dir (string): directory containing XML files
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_mtrxml(input_mxml)

Create the document loader object from an mtrxml file

params:
input_mxml (string): an mtrxml file
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_mxml(input_mxml)

Create the document loader object from an mxml file

params:
input_mxml (string): an mxml file
output:
DataLoader object: an iterator object that generates XML strings
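
The constructors above all return an object that can be iterated to obtain XML strings. The pattern can be sketched as follows; `SketchLoader` and `load_from_list` are hypothetical names used only for illustration, and the real DataLoader may be implemented differently.

```python
from typing import Iterable, Iterator, Optional


class SketchLoader:
    """Hypothetical stand-in: an iterable wrapper with an alternate constructor."""

    def __init__(self, data_generator: Optional[Iterable[str]] = None):
        # Fall back to an empty source when no generator is supplied.
        self.data_generator = data_generator if data_generator is not None else iter(())

    def __iter__(self) -> Iterator[str]:
        return iter(self.data_generator)

    @classmethod
    def load_from_list(cls, documents: Iterable[str]) -> "SketchLoader":
        """Hypothetical constructor: wrap a lazy generator over in-memory strings."""
        return cls(doc for doc in documents)


loader = SketchLoader.load_from_list(["<begin id='1'/>", "<begin id='2'/>"])
docs = list(loader)  # consumes the underlying generator
```

Because the wrapped source is a generator, the loader in this sketch can be consumed only once; a second pass would need a new loader.
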
xml_miner.data_utils.data_loader.load_from_file(input_dir, files)

Load documents from a list of files read from a directory

params:
input_dir (string): directory containing all files
files (list): a list of files
output:
xml string: an iterator object that generates XML strings
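
A lazy file reader of this kind could look like the sketch below. This is an assumption about the general approach, not the package's code: the names `iter_file_contents` and `demo_load` are hypothetical, and UTF-8 input is assumed.

```python
import os
import tempfile
from typing import Iterator, List


def iter_file_contents(input_dir: str, files: List[str]) -> Iterator[str]:
    """Lazily yield the text of each listed file under input_dir."""
    for name in files:
        path = os.path.join(input_dir, name)
        with open(path, encoding="utf-8") as handle:   # assumption: UTF-8 files
            yield handle.read()


def demo_load() -> List[str]:
    """Write two small files to a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        for name, body in [("a.xml", "<begin id='a'/>"), ("b.xml", "<begin id='b'/>")]:
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
                handle.write(body)
        # Sort the listing so the order of documents is deterministic.
        return list(iter_file_contents(tmp, sorted(os.listdir(tmp))))
```

Only one file is held in memory at a time, which matters when the directory contains many large documents.
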
xml_miner.data_utils.data_loader.load_from_string(xml_string, header_line)

Load documents from a string; the string might contain multiple XML documents

params:
xml_string (string): input XML string
output:
xml string: an iterator object that generates XML strings
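
One plausible way to split a combined string into documents is to start a new document at every line that begins with the header. The sketch below assumes that this is the role of `header_line`, which the docstring above does not state; `split_documents` is a hypothetical name, and the sample input is invented for illustration.

```python
from typing import Iterator

XML_HEADER = "<begin "  # same value as DataLoader.XML_HEADER


def split_documents(xml_string: str, header_line: str) -> Iterator[str]:
    """Yield one document per line that starts with header_line."""
    current = []
    for line in xml_string.splitlines():
        if line.startswith(header_line) and current:
            # A new header closes the document collected so far.
            yield "\n".join(current)
            current = []
        if line.strip() or current:   # skip blank lines before a document starts
            current.append(line)
    if current:
        yield "\n".join(current)      # emit the final document


combined = "<begin id='1'>\n<a>x</a>\n</begin>\n<begin id='2'>\n<a>y</a>\n</begin>\n"
documents = list(split_documents(combined, XML_HEADER))
```

Passing TRXML_HEADER instead of XML_HEADER would split on `<TextractorResult ` lines in the same way, under the same assumption.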

xml_miner.data_utils.data_saver module

A module to output the selected values in the chosen format

class xml_miner.data_utils.data_saver.DataSaver(output_file)

Bases: object

DataSaver:
- open/create the target output file
- save the selected values to the file in the corresponding format

close_stream()

Close the file

store(row)

Store one row at a time
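
The open, store, close lifecycle can be sketched with the standard `csv` module. The output formats DataSaver supports are not listed here, so the tab-separated format below is an assumption, and `SketchSaver` and `demo_save` are hypothetical names.

```python
import csv
import os
import tempfile


class SketchSaver:
    """Hypothetical stand-in: open a file once, write one row per store() call."""

    def __init__(self, output_file: str):
        # newline="" lets the csv module control line endings.
        self.stream = open(output_file, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.stream, delimiter="\t")  # assumed TSV output

    def store(self, row):
        """Write a single row to the output file."""
        self.writer.writerow(row)

    def close_stream(self):
        """Close the underlying file."""
        self.stream.close()


def demo_save() -> str:
    """Write two rows to a temporary file and return the file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "out.tsv")
        saver = SketchSaver(path)
        try:
            saver.store(["doc_id", "value"])
            saver.store(["42", "engineer"])
        finally:
            saver.close_stream()   # always release the file handle
        with open(path, encoding="utf-8", newline="") as handle:
            return handle.read()
```

Wrapping the `store` calls in `try`/`finally` ensures the stream is closed even if writing a row fails.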

Module contents

DataLoader and DataSaver classes

class xml_miner.data_utils.DataLoader(data_generator=None)

Bases: object

DataLoader:
- load data from different resources
- generate XML input for downstream tasks

TRXML_HEADER = '<TextractorResult '
XML_HEADER = '<begin '
classmethod load_from_as(host, port, query, as_user='', as_pass='')

Create the document loader object from the AnnotationServer

params:
host (string): hostname of the AnnotationServer
port (int): port of the AnnotationServer
query (string): query to select documents
as_user (string): AnnotationServer username
as_pass (string): AnnotationServer password
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_dir(input_dir)

Create the document loader object from a directory

params:
input_dir (string): directory containing XML files
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_mtrxml(input_mxml)

Create the document loader object from an mtrxml file

params:
input_mxml (string): an mtrxml file
output:
DataLoader object: an iterator object that generates XML strings
classmethod load_from_mxml(input_mxml)

Create the document loader object from an mxml file

params:
input_mxml (string): an mxml file
output:
DataLoader object: an iterator object that generates XML strings
class xml_miner.data_utils.DataSaver(output_file)

Bases: object

DataSaver:
- open/create the target output file
- save the selected values to the file in the corresponding format

close_stream()

Close the file

store(row)

Store one row at a time