/*
U.S. Patent and Trademark Office (USPTO) code by Franz Nisswandt (franz.nisswandt@lexisnexis.com).
Data from USPTO (www.uspto.gov):
  USPTO Site: http://www.uspto.gov/products/catalog/patent_products/index.jsp
  Bulk Downloads in XML Format: http://www.google.com/googlebooks/uspto-patents.html
  DTD/XML descriptions: http://commondatastorage.googleapis.com/patents/docs/PADX-File-Description-v2.doc
  New assignments published daily

Purpose: Demonstrate some aggregation functions on large datasets created from XML data.

Whilst many complain that there is now a backlog of frivolous patents on ordinary business 
processes and computer algorithms, there was a time when the patent office almost closed due
to a perceived dearth of invention. This led Charles H. Duell, one-time commissioner of the 
US Patent Office, to rather famously (if perhaps apocryphally) declare: "Everything that can be invented has been invented."

Fortunately for us, he was wrong, and we have LOTS of patent data to explore.

The USPTO struck a deal with Google to provide (for free!) USPTO patent assignment recordation data.
Google has taken the images, OCRed them, converted them to XML, and made them available for 
free download (no registration required). New XML files are produced every day.

The file format is XML in accordance with the Patent Assignment Daily XML (PADX) Version 2.0 
Document. This 3-page document (from USPTO) should be downloaded and reviewed, as it describes 
the interesting data elements that are available, such as assignees (companies/individuals), 
state, country of origin, and even the reel of microfilm that contains the documents! You can 
mock up different record layouts/xpaths to extract different portions of the patent filings 
for useful analysis. This document is found here: 
http://commondatastorage.googleapis.com/patents/docs/PADX-File-Description-v2.doc

Here's the plan. First, we're going to download the patent XML file for 01/04/2011 and use it 
to validate our ECL attributes. Once this is complete, it will be left to the reader (instructions 
below) to download additional files and add them to superfiles. All data is in zip files: 
daily files for the current year (2011) and aggregate files for 1980-2010 are available at 
http://www.google.com/googlebooks/uspto-patents-assignments.html

Let's get started! 
It is recommended that you have access to a Linux box for file download/prep.
Also note the ~ in the prefix of the filenames throughout the exercise.

1) Navigate to the site (above) and click on ad20110104.zip. 
  This file contains patent data that was updated on 20110104.
  Save to your hard drive.
2) Unzip the file in the current directory.
3) Clean the file prior to spraying, removing DTD lines and other header stuff not needed for parsing.
  grep -v "^<!" ad20110104.xml | grep "^<" > uspto-filings-2011-01-04.xml
4) Upload the file, either with Upload/Download in the left panel of ECLWATCH or with scp (preferred, below).
  scp uspto-filings-2011-01-04.xml hpccdemo@[ip_of_dali]:
  The file will be dropped in mydropzone and show up under "Spray XML Files" in ECLWATCH.
5) Spray (XML) the file. In ECLWATCH, select Spray XML (on left frame).
  Leave defaults (8192 data length/etc).
  Select the file you just uploaded from your drop zone: uspto-filings-2011-01-04.xml
  Set the following values for the rest of the fields:
  RowTag: /us-patent-assignments/patent-assignments/patent-assignment
  SourceDirectory: /var/lib/HPCCSystems/mydropzone
  DestLogicalName : ~thor::in::patent::uspto-filings-2011-01-04.xml

6) Add to superfile (so we can simply add new files as we download).
  In ECLWatch, click "Browse Logical Files" in the left frame.
  Check the file you uploaded (uspto-filings-2011-01-04.xml) and click the button "Add To Superfile"
  When it prompts for the superfile name, CREATE a new superfile named:
  ~thor::in::patent::uspto-filings.xml
  This is the file we will reference from here on.
  
At this point, scroll down to the bottom, run the ECL, and look at the results.
When ready, you can download more files (other days, or larger file(s)) and add to our superfile.

==== More Data ====

There are 9 files on the site containing patent data from 1980-2010. Each file unzips to about 1 GB.
These can be programmatically downloaded (if you are interested, see below), or you can simply click on the file(s).

Follow the same instructions (above) to unzip, clean, copy to HPCC, and spray:

1) Grab the files from the site
  The following 3 lines should be executed on a single line
  for i in 1 2 3 4 5 6 7 8 9; do wget --output-document="ad20101231-0${i}.zip" 
  http://commondatastorage.googleapis.com/patents/retro/2010/ad20101231-0${i}.zip; 
  done;
2) Next, unzip them:
  for i in 1 2 3 4 5 6 7 8 9; do unzip ad20101231-0${i}.zip; done;
3) Next, clean out DTD headers/etc so they are ready for spraying.
  for i in 1 2 3 4 5 6 7 8 9; do grep -v "^<!" ad20101231-0${i}.xml | grep "^<" > uspto-filings-1980-2010-0${i}.xml; done;
4) Upload the files using scp (alternatively, use upload/download file on web interface).
  for i in 1 2 3 4 5 6 7 8 9; do scp uspto-filings-1980-2010-0${i}.xml hpccdemo@[ip_of_dali]:; done;
  -- note the colon at the end of the scp destination (it means the home directory)
  -- hpccdemo's "home directory" is our dropzone. How convenient.
5) Now, the file(s) should be visible to spray, so let's spray ONE of these large files.
  In ECLWATCH, select Spray XML (on left frame).
  Leave defaults (8192 data length/etc).
  Select uspto-filings-1980-2010-01.xml from your drop zone as the filename.
  Set the following values:
  RowTag: /patent-assignments/patent-assignment
  SourceDirectory: /var/lib/HPCCSystems/mydropzone
  DestLogicalName : ~thor::in::patent::uspto-filings-1980-2010-01.xml
6) After a successful spray, add the file(s) to our superfile '~thor::in::patent::uspto-filings.xml'.
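
  Optional: step 6 can also be scripted rather than clicked through in ECLWatch. The lines below
  are a minimal sketch, assuming the superfile from the first walkthrough already exists and the
  sub-file name matches the DestLogicalName used in step 5; adjust both names to your own files.
  Run it as a separate one-off job, not as part of this file.
    IMPORT Std;
    SEQUENTIAL(
      Std.File.StartSuperFileTransaction(),
      Std.File.AddSuperFile('~thor::in::patent::uspto-filings.xml',
                            '~thor::in::patent::uspto-filings-1980-2010-01.xml'),
      Std.File.FinishSuperFileTransaction()
    );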

Now, when you rerun this code, the PERSIST will detect "new data" and rebuild our output dataset:
  persist('~thor::in::patent::persist::uspto-assignee');

*/

IMPORT Std;

// patent-assignees/patent-assignee
//<!ELEMENT patent-assignee (name, address-1?, address-2?, city?, state?, country-name?, postcode?)>
/* 

  doc-number    A unique identifier that contains one of the following:
          8 digit numeric application number
          7 digit alphanumeric patent number
          10 digit numeric publication number (comprised of a 4 digit year followed by a 
          6 digit numeric number). The application number will be provided in the first occurrence, 
          followed by the patent number (if it exists), and then the publication number.  When a 
          patent-number has not been assigned, the publication number will follow the application number.  
          When both the patent number and the publication number exist in the assignment record, the 
          application number will appear first, followed by the patent number, followed by the publication number.    
  reel-no       1-6 digit number identifies the reel number to be used to locate the assignment on microfilm.
  frame-no      1-4 digit number that identifies the frame number to be used to locate the first image(page) of the assignment on microfilm. 
  last-update-date Identifies when the assignment record was last modified. Contains a date element with an 
          8 digit date in YYYYMMDD date format. 
  page-count    Identifies the total page count of the assignment (i.e., the number of pages captured on microfilm).
  purge-indicator “Y” indicates the assignment record has been deleted from the Historical database. 
          “N” is the default.
  recorded-date Identifies when the assignment was recorded in the USPTO. 
          Contains a date element with an 8 digit date in YYYYMMDD date format.
  execution-date  Identifies the date from the supporting legal documentation that the assignment was executed.  
          Contains a date element with an 8 digit date in YYYYMMDD date format.
  name          Identifies the party (individual name(s), or organization name) receiving an interest or transaction in a published application and/or issued/granted patent property.
  address-1     Identifies the 1st line of the address (typically the street address component of a mailing address) for the patent-assignee name element.
  address-2     Identifies the 2nd line of the address (typically the internal address component of a mailing address) for the patent-assignee name element.
  city          Identifies the city for the patent-assignee name element mailing address. 
  state         Identifies the state name (for US states) for the patent-assignee name element mailing address.
  country-name  Identifies the country name (for non US states) for the patent-assignee name element mailing address. 
  postcode      Identifies the postal code for the patent-assignee name element mailing address.
*/

r_assignee := RECORD
  STRING20 doc_number {xpath('//document-id/doc-number')};
  STRING20 reel_no {xpath('//reel-no')};
  STRING20 frame_no {xpath('//frame-no')};
  STRING8 last_update_date {xpath('//last-update-date/date')};
  STRING1 purge_indicator {xpath('//purge-indicator')};
  STRING8 recorded_date {xpath('//recorded-date/date')};
  INTEGER2 page_count {xpath('//page-count')};
  STRING name {xpath('//patent-assignee/name'), maxlength(50)};
  STRING invention_title {xpath('//invention-title'),maxlength(100)};
  STRING address_1 {xpath('//patent-assignee/address-1'), maxlength(50)};
  STRING address_2 {xpath('//patent-assignee/address-2'), maxlength(50)};
  STRING city {xpath('//patent-assignee/city'), maxlength(50)};
  STRING state {xpath('//patent-assignee/state'), maxlength(25)};
  STRING country_name {xpath('//patent-assignee/country-name'), maxlength(50)};
  STRING postcode {xpath('//patent-assignee/postcode'), maxlength(30)};
END;

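// Source XML file: if you have sprayed to a different logical file, change the name here.
// If you are still in the "test phase" (a single sprayed file, not yet added to a superfile),
// point fn_super at that logical file instead.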


fn_super := '~thor::in::patent::uspto-filings.xml'; // this was our superfile named above

//note we persist the heavy string manipulation/parsed output of this dataset for future use.
ds_assignee := DATASET(fn_super, r_assignee, XML('us-patent-assignments/patent-assignments/patent-assignment')) 
  : persist('~thor::in::patent::persist::uspto-assignee');

//NOTE: Here's how PERSIST works: at run time, the system compares the ECL code being executed and all source data files
//    making up that dataset against what was recorded when the PERSIST was last built. If NOTHING has changed and the
//    persisted dataset still exists (not deleted), the stored result is reused and execution proceeds straight to our output commands.
//    This is useful because you only have to process your gigabytes of XML ONCE.
//    When you add new patent XML files to your superfile, you will see a message like this in your WorkUnit info:
//    eclagent 0: Rebuilding PERSIST('~thor::in::patent::persist::uspto-assignee'): ECL has changed 
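
// A cheap sanity check (a sketch, not part of the original walkthrough): counting the parsed rows
// forces the PERSIST to be built or reused and confirms that the XML was read at all.
// The result name 'assignment_record_count' is only an illustrative label.
OUTPUT(COUNT(ds_assignee), NAMED('assignment_record_count'));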

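// Finally, something useful! Find patents with various company names.
// Note that Std.Str.Find is case-sensitive, so the search terms are given in upper case.
// find patents with IBM in name
output(ds_assignee(Std.Str.Find(name,'IBM',1) > 0)); 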
// find patents with GENERAL ELECTRIC in name
output(ds_assignee(Std.Str.Find(name,'GENERAL ELECTRIC',1) > 0)); 
// find patents with GOOGLE in name
output(ds_assignee(Std.Str.Find(name,'GOOGLE',1) > 0)); 
// find patents with LEXIS or NEXIS in name
output(ds_assignee(Std.Str.Find(name,'LEXIS',1) > 0 OR Std.Str.Find(name,'NEXIS',1) > 0)); 
// find patents with COMPUTER in invention_title
output(ds_assignee(Std.Str.Find(invention_title,'COMPUTER',1) > 0)); 
// find patents with CLUSTER in invention_title
output(ds_assignee(Std.Str.Find(invention_title,'CLUSTER',1) > 0));
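
// The header promises aggregation, so here is a minimal sketch of one: a cross-tab that counts
// assignment records per assignee state and lists the ten most frequent values. The names
// r_state_count, t_state_count and 'top_states' are illustrative, not from the original code.
// Records with no state element (presumably most non-US assignees) group under the empty string.
r_state_count := RECORD
  ds_assignee.state;
  UNSIGNED4 cnt := COUNT(GROUP);
END;
t_state_count := TABLE(ds_assignee, r_state_count, state);
OUTPUT(TOPN(t_state_count, 10, -cnt), NAMED('top_states'));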