xml_miner.data_utils package
Submodules
xml_miner.data_utils.asclient module
A module for communicating with the TK annotation server.
-
class xml_miner.data_utils.asclient.ASClient(host: str, port: str, as_user: str = '', as_pass: str = '')
Bases: object
Python version of the annotation server client.
-
BUFFER_SIZE = 2048
-
ENCODING = 'utf-8'
-
check_user_password()
Send the username and password to the server and confirm the login.
-
close_socket()
Shut down the connection.
-
get_docs(query='') → Iterator[str]
Get all documents matching the query.
-
get_ids(query: str = '') → List[str]
Get the IDs of all documents matching the query.
-
make_connection() → bool
Establish a connection to the Annotation Server.
-
prepare_message(message: str) → str
Prepare the query message:
- add a newline
- encode with UTF-8
-
send_and_receive(query: str) → str
Send the query to the Annotation Server and decode the received response.
-
socket_output() → str
Receive the response from the Annotation Server and decode it.
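The request/response cycle described by the methods above (add a newline, encode as UTF-8, read the reply in 2048-byte chunks, decode) can be sketched with the standard library alone. This is a minimal illustration, not the actual ASClient code: `encode_query`, `receive_all`, and `demo_round_trip` are hypothetical names, the "server" is just the other end of a local socket pair, and the assumption that the response ends when the peer closes the connection is mine.

```python
import socket

BUFFER_SIZE = 2048   # same value as the documented class attribute
ENCODING = "utf-8"   # same value as the documented class attribute


def encode_query(message: str) -> bytes:
    """Append a newline and encode the query as UTF-8 bytes (hypothetical helper)."""
    return (message + "\n").encode(ENCODING)


def receive_all(sock: socket.socket) -> str:
    """Read BUFFER_SIZE chunks until the peer closes, then decode the result."""
    chunks = []
    while True:
        chunk = sock.recv(BUFFER_SIZE)
        if not chunk:  # empty bytes: the peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks).decode(ENCODING)


def demo_round_trip(query: str) -> str:
    """Send one query over a local socket pair and read back a fake reply."""
    client, server = socket.socketpair()
    try:
        client.sendall(encode_query(query))
        request = server.recv(BUFFER_SIZE)
        server.sendall(b"OK " + request)  # stand-in for a real server response
        server.close()                    # assumed end-of-response signal
        return receive_all(client)
    finally:
        client.close()
        server.close()
```

Decoding only after all chunks are joined avoids splitting a multi-byte UTF-8 character across two `recv` calls.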
-
xml_miner.data_utils.data_loader module
A module to load input xml/trxml from different sources.
-
class xml_miner.data_utils.data_loader.DataLoader(data_generator=None)
Bases: object
DataLoader:
- load data from different resources
- generate xml input for downstream tasks
-
TRXML_HEADER = '<TextractorResult '
-
XML_HEADER = '<begin '
-
classmethod load_from_as(host, port, query, as_user='', as_pass='')
Create the document loader object from the AnnotationServer.
- params:
  - host (string): hostname of the AnnotationServer
  - port (int): port of the AnnotationServer
  - query (string): query to select documents
  - as_user: AnnotationServer username
  - as_pass: AnnotationServer password
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_dir(input_dir)
Create the document loader object from a directory.
- params:
  - input_dir (string): directory containing xml files
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_mtrxml(input_mxml)
Create the document loader object from mxml.
- params:
  - input_mxml (string): an mxml file
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_mxml(input_mxml)
Create the document loader object from mxml.
- params:
  - input_mxml (string): an mxml file
- output:
  - DataLoader object: an iterator object that generates xml strings
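The pattern these classmethods describe, a constructor that takes a generator plus factory methods that build that generator from a particular source, can be sketched as follows. `MiniLoader` and `from_dir` are hypothetical stand-ins rather than the real DataLoader; filtering on the `.xml` suffix and reading files in sorted order are assumptions made only for this illustration.

```python
import os
from typing import Iterable, Iterator, Optional


class MiniLoader:
    """Hypothetical loader: wraps any iterable of xml strings."""

    def __init__(self, data_generator: Optional[Iterable[str]] = None):
        # Fall back to an empty iterable when no generator is supplied.
        self.data_generator = data_generator if data_generator is not None else ()

    def __iter__(self) -> Iterator[str]:
        return iter(self.data_generator)

    @classmethod
    def from_dir(cls, input_dir: str) -> "MiniLoader":
        """Build a loader that lazily yields the content of each .xml file."""

        def generate() -> Iterator[str]:
            for name in sorted(os.listdir(input_dir)):
                if name.endswith(".xml"):
                    path = os.path.join(input_dir, name)
                    with open(path, encoding="utf-8") as handle:
                        yield handle.read()

        return cls(generate())
```

Because the files are read inside a generator, nothing is opened until the loader is iterated, and a generator-backed loader can be consumed only once.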
-
-
xml_miner.data_utils.data_loader.load_from_file(input_dir, files)
Load documents from a list of files read from a directory.
- params:
  - input_dir (string): directory containing all files
  - files (list): a list of files
- output:
  - xml string: an iterator object that generates xml strings
-
xml_miner.data_utils.data_loader.load_from_string(xml_string, header_line)
Load documents from a string; the string might contain multiple xml files.
- params:
  - xml_string (string): input xml string
- output:
  - xml string: an iterator object that generates xml strings
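One way a string holding several documents could be split on a header line is sketched below. This is not the library's implementation: `split_documents` is a hypothetical function, and it assumes that every document starts on its own line beginning with the header (such as the documented `XML_HEADER` value `'<begin '`).

```python
from typing import Iterator, List

XML_HEADER = "<begin "  # same value as the documented DataLoader.XML_HEADER


def split_documents(xml_string: str, header_line: str) -> Iterator[str]:
    """Yield one string per document, starting a new one at each header line."""
    current: List[str] = []
    for line in xml_string.splitlines(keepends=True):
        # A header line closes the document collected so far, if any.
        if line.lstrip().startswith(header_line) and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:  # flush the last document
        yield "".join(current)
```

Under the same assumption, passing `'<TextractorResult '` as `header_line` would split trxml input the same way. Any text before the first header stays attached to the first document.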
xml_miner.data_utils.data_saver module
A module to output the selected values in the chosen format.
Module contents
DataLoader and DataSaver classes
-
class xml_miner.data_utils.DataLoader(data_generator=None)
Bases: object
DataLoader:
- load data from different resources
- generate xml input for downstream tasks
-
TRXML_HEADER = '<TextractorResult '
-
XML_HEADER = '<begin '
-
classmethod load_from_as(host, port, query, as_user='', as_pass='')
Create the document loader object from the AnnotationServer.
- params:
  - host (string): hostname of the AnnotationServer
  - port (int): port of the AnnotationServer
  - query (string): query to select documents
  - as_user: AnnotationServer username
  - as_pass: AnnotationServer password
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_dir(input_dir)
Create the document loader object from a directory.
- params:
  - input_dir (string): directory containing xml files
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_mtrxml(input_mxml)
Create the document loader object from mxml.
- params:
  - input_mxml (string): an mxml file
- output:
  - DataLoader object: an iterator object that generates xml strings
-
classmethod load_from_mxml(input_mxml)
Create the document loader object from mxml.
- params:
  - input_mxml (string): an mxml file
- output:
  - DataLoader object: an iterator object that generates xml strings
-