Data Science is an interdisciplinary field concerned with the processes and systems used to extract knowledge or insights from data in its various forms, whether structured or unstructured.
Data Science utilizes processes that act on data, such as analysis, extraction, transformation, loading, and the creation of data stores to contain the data. Effective use of tools is required to minimize errors and maximize performance. These tools include applications such as R and Tableau, as well as programming languages such as Python.
Metadata is data that provides information about other data. This information defines and describes the Attributes that comprise the data. These attributes give meaning to the data and define its behavior (how it can be used). The attributes can in turn contain their own attributes.
An Entity is an object described by the Metadata. For example, a Person can be defined as an Entity.
- The Person entity can be composed of attributes such as Name and Address.
- The Name attribute can have attributes such as datatype (Char) and data length (30).
- The Address attribute can in turn contain its own attributes, such as Street, which would have its own attributes of datatype (Char) and data length (45).
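The recursive Entity/Attribute structure above can be sketched in a few lines of Java. This is a minimal illustration, not any particular metadata tool's API; the class and field names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the recursive Entity/Attribute metadata model:
// an Attribute may itself carry nested attributes (Name -> datatype, length).
public class MetadataDemo {

    static class Attribute {
        final String name;
        final String value; // e.g. "Char" for a datatype attribute, null if composite
        final Map<String, Attribute> children = new LinkedHashMap<>();

        Attribute(String name, String value) {
            this.name = name;
            this.value = value;
        }

        Attribute add(Attribute child) {
            children.put(child.name, child);
            return this;
        }
    }

    // An Entity is simply a named collection of top-level attributes.
    static class Entity {
        final String name;
        final Map<String, Attribute> attributes = new LinkedHashMap<>();

        Entity(String name) { this.name = name; }

        Entity add(Attribute a) { attributes.put(a.name, a); return this; }
    }

    // Builds the Person entity exactly as described in the bullets above.
    static Entity buildPerson() {
        Attribute name = new Attribute("Name", null)
                .add(new Attribute("datatype", "Char"))
                .add(new Attribute("length", "30"));
        Attribute street = new Attribute("Street", null)
                .add(new Attribute("datatype", "Char"))
                .add(new Attribute("length", "45"));
        Attribute address = new Attribute("Address", null).add(street);
        return new Entity("Person").add(name).add(address);
    }

    public static void main(String[] args) {
        Entity person = buildPerson();
        System.out.println(person.attributes.get("Name")
                .children.get("datatype").value); // prints Char
    }
}
```

Nesting attributes inside attributes is what lets the same structure describe both the Person's shape and each field's datatype and length.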
Understanding the Metadata domain for the Data Science application is crucial.
Relationships between the domain's Entities have a direct impact on the processes.
The Metadata attributes affect how the data is manipulated, how it relates to other attributes, and how it is presented for processing and viewing.
The first stage in an effective Data Science application is understanding, defining and creating the Metadata datastores that will be utilized.
For professional IT projects, all software goes through a company-defined process in which documentation is created and approved before any development or changes are started. This often includes a BDD (business design document, if the software is meant for business-line usage) and a TDD (technical design document), which together constitute the Program Requirements. Individual program specifications can then be produced, adding the nonfunctional requirements. Diagram documentation, quite often using UML, is created to define the flow of data and processes. "A picture is worth a thousand words" is very evident here, and diagrams serve as discussion points for the interested parties.
Individual programs quite often go through a "walk-through" meeting, where the program is presented to and reviewed by those deemed to be SMEs (subject matter experts), to ensure that requirements are being met and that corporate standards are adhered to.
Approvals are performed by stakeholders such as DBAs and other systems staff, as well as by the owners of any downstream software that depends on or uses the results of the new or changed software. This is done to foresee any possible negative impact that would require a resolution, and to enforce standards.
For testing, Use Cases are imperative to ensure that all requirements are met. These are defined by the Business Analysts in consultation with the development team, so as to include the nonfunctional requirements needed for testing. They are also used when future changes are made, serving to ensure that past requirements are still being adhered to. Use cases are not static: they can be added to for new functionality, or changed if past requirements change.
While consulting at FISA (the NYC financial system), I was given the responsibility of ensuring that Accounting Interfaces from over 12 NYC agencies were processed successfully, in both format and content, by FISA's new accounting system (FMS).
Input XML files containing an Agency’s accounting interface will feed this process. These files will be validated for proper format and valid content.
Files failing validation will be rejected, while those deemed valid will be inserted into the appropriate Oracle table.
Both failed and successfully processed files will have an appropriate message written to the output HTML report.
Java is required. The JDOM class library will be used to parse the XML files. JDBC will be used for access to the Oracle database, where the accounting data resides. All reports need to be created in HTML and held on the Web Server for user access in a secure manner.
For an efficient process, database calls will only be made after the transaction passes the appropriate validation. All SQL SELECTs are required to utilize the primary key index.
Log4j will be used for error logging.
The input XML file is first validated for proper XML syntax using the JDOM API. Malformed XML files will be rejected and an error report written. No further processing will be performed for such a file.
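The well-formedness gate can be sketched as below. The real process uses JDOM's SAXBuilder, whose build() throws a JDOMException on malformed input; this self-contained sketch uses the JDK's built-in parser instead so it needs no external jar, but the control flow (reject, report, skip the file) is the same.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Well-formedness check: parse the payload and treat any parser
// exception as a rejection. Stand-in for JDOM's SAXBuilder.build().
public class XmlGate {

    // Returns true if the payload parses as well-formed XML.
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            // SAXException on bad XML; an empty file also fails here.
            // Real process: write the error to the HTML report and
            // perform no further processing for this file.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<txn><amt>10</amt></txn>")); // true
        System.out.println(isWellFormed("<txn><amt>10</txn>"));       // false (mis-nested)
    }
}
```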
Each data item tag within the well-formed XML file needs to be validated against the associated Oracle table column. Oracle's metadata repository (the data dictionary) will be used to select all the columns contained within the table. A tag name that has no matching column name will be rejected.
If the table contains a column that has no associated XML tag, then the record will also be rejected.
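The two-way tag/column check above is pure set matching once the column names are in hand. In the real process the column set would come from Oracle's data dictionary (e.g. a SELECT against ALL_TAB_COLUMNS via JDBC); here the sets are passed in so the matching logic can be shown on its own.

```java
import java.util.Set;
import java.util.TreeSet;

// Content validation: every XML tag must match a table column and every
// column must have a tag; either kind of mismatch rejects the record.
public class TagColumnCheck {

    // Returns the mismatched names (tags without a column plus columns
    // without a tag); an empty result means the record may be inserted.
    public static Set<String> mismatches(Set<String> tags, Set<String> columns) {
        Set<String> bad = new TreeSet<>();
        for (String t : tags)
            // Oracle stores column names in upper case, so compare that way.
            if (!columns.contains(t.toUpperCase())) bad.add(t);
        for (String c : columns)
            if (!containsIgnoreCase(tags, c)) bad.add(c);
        return bad;
    }

    private static boolean containsIgnoreCase(Set<String> set, String s) {
        for (String x : set) if (x.equalsIgnoreCase(s)) return true;
        return false;
    }
}
```

Collecting all mismatches, rather than stopping at the first, matches the requirement below that every rejected tag appear on the report.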
All files passing validation will have an SQL INSERT statement created, populated with data from the XML tags, and the row inserted into the Oracle table using JDBC.
- Nonfunctional requirements.
Malformed XML (or an empty file) will be caught as an Exception by JDOM.
All SQL Exceptions will be caught by the program, and a database ROLLBACK will be initiated.
The SQLCODE returned from the JDBC INSERT call will be interrogated for successful completion. Any exception will be written to the error report, a ROLLBACK issued if appropriate, and processing stopped.
All caught exceptions need to be reported, detailing the error with an appropriate message, the method and line number of the exception, and the process flow leading to that exception.
If a tag is rejected, processing will continue with the next tag, so that all rejects are written to the report file. However, any database Exception will cause the process to ROLLBACK and exit.
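The error-report detail required above (message plus the method and line number of the exception) can be recovered from the exception's stack trace; the report line format here is illustrative.

```java
// Formats one error-report line from a caught exception, including the
// method name and line number where it was thrown.
public class ErrorDetail {

    public static String describe(Exception e) {
        StackTraceElement top = e.getStackTrace()[0];
        return e.getClass().getSimpleName() + ": " + e.getMessage()
                + " at " + top.getMethodName()
                + " (line " + top.getLineNumber() + ")";
    }

    // Stand-in for a validation step that rejects a tag.
    static void failingStep() {
        throw new IllegalStateException("tag AMOUNT rejected");
    }

    public static void main(String[] args) {
        try {
            failingStep();
        } catch (Exception e) {
            // Real process: write this line to the HTML report via Log4j,
            // ROLLBACK if it is a database error, then continue with the
            // next tag or exit per the rules above.
            System.out.println(describe(e));
        }
    }
}
```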
A batch data processing framework that simplifies, standardizes and optimizes the development and execution process.
This results in a quicker time to market, with reductions in the problems and errors associated with not utilizing a framework.
Java Batch Enterprise (JBE) is a framework built on standard Java classes (POJOs), forming the basis of development and execution.
A Java Batch application utilizes JBE for easy-to-use, optimized data access to multiple Sources and Targets, with embedded HTML/Text reporting control.
• Optimized Batch Processing of enterprise data sources and targets, including ETL and HTML reporting.
• Reduced cost and time to market using industry-standard Java and JDBC.
No added software package to buy or depend on for its existence.
No need to train or hire high-priced personnel for maintenance: any professional Java developer can be used.
• Reduced development and maintenance time and errors.
• Reduced redeployment cost/time across development/execution platforms.
Will you need to move from a Windows development platform to UNIX or Linux in the future? JBE can be utilized anywhere the Java JVM runs, with Java tools for development.
• Standardization of development and execution with JBE as the standard.
How is it used?
Development is performed within any Java IDE, including open-source Eclipse and IDEs built upon Eclipse.
These include MyEclipse and the family of Rational Application Developer IDEs (Rational Developer for Z, WebSphere Application Developer).
Execution can be done via standard batch files (*.bat on Windows), shell scripts (UNIX/Linux/AIX/USS) or JCL if running on z/OS. A self-contained directory structure allows Java jar files (packaged programs) to be run by a simple drag-and-drop of a data file onto a *.bat file, which contains the executing program.
Alternatively, the *.bat file can include the data file name, with no need to drag and drop.
Execution can also be done within the Java IDE. This is an important ability for use in the development process.
For example, to produce an HTML report using an SQL statement as input (or an input File record), only one new class needs to be developed.
This new class would extend an existing Java class contained in JBE. The input SQL could be contained in an extended method of the new class, in a file referenced by the executing *.bat file, or in the main program's properties file.
Executing the jar/package of this new class would produce the HTML report.
If one needs control of each row returned from the SQL result set (or each input file record), a new method is created within the new class that overrides the behavior of the default report process. The application is then able to perform any Java operation on each column/field and decide whether to write out a report record, including with changed content.
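Since JBE's actual class and method names are not given above, the extend-and-override pattern it describes can be sketched with hypothetical names (HtmlReportJob, formatRow): the base class drives the report, and the one new class overrides a single method to control each row.

```java
import java.util.List;

// Illustrative sketch of the JBE pattern described above -- all names
// here are invented, not JBE's real API.
public class JbeSketch {

    // Stand-in for a JBE base class: runs every row through formatRow()
    // and assembles the HTML report.
    static abstract class HtmlReportJob {
        // Default behavior: emit every row unchanged as an HTML table row.
        protected String formatRow(List<String> columns) {
            return "<tr><td>" + String.join("</td><td>", columns) + "</td></tr>";
        }

        String run(List<List<String>> rows) {
            StringBuilder html = new StringBuilder("<table>");
            for (List<String> row : rows) {
                String r = formatRow(row);
                if (r != null) html.append(r); // null = suppress this row
            }
            return html.append("</table>").toString();
        }
    }

    // The "one new class": overrides formatRow to inspect each column
    // and decide what (if anything) to write out.
    static class NegativeAmountReport extends HtmlReportJob {
        @Override
        protected String formatRow(List<String> columns) {
            // Keep only rows whose second column (the amount) is negative.
            return columns.get(1).startsWith("-") ? super.formatRow(columns) : null;
        }
    }
}
```

With the default formatRow, every result-set row goes straight to the report; overriding it, as NegativeAmountReport does, gives the application full control over each row's content and inclusion.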