  • Date published: 2016-10-31
  • Date modified: 2017-06-05
  • Publisher: Treasury Board of Canada Secretariat
  • Contact: open-ouvert@tbs-sct.gc.ca
  • Subject: Information and Communications
  • Keywords: open data inventory, Directive on Open Government
  • Open Data Inventory (CSV): http://open.canada.ca/data/dataset/4ed351cf-95d8-4c10-97ac-6b3511f359b7/resource/d0df95a8-31a9-46c9-853b-6952819ec7b4/download/inventory.csv
  • Open Data Inventory (HTML, English): http://open.canada.ca/en/search/inventory
  • Open Data Inventory (HTML, French): http://ouvert.canada.ca/fr/search/inventory
  • Data Dictionary (XLS): http://open.canada.ca/data/en/recombinant-dictionary/inventory

Open Data Inventory

Building a comprehensive data inventory as required by section 6.3 of the Directive on Open Government:

“Establishing and maintaining comprehensive inventories of data and information resources of business value held by the department to determine their eligibility and priority, and to plan for their effective release.”

Creating a data inventory is among the first steps in identifying federal data that is eligible for release. Departmental data inventories have been published on the Open Government portal, Open.Canada.ca, so that Canadians can see what federal data is collected and have the opportunity to indicate which data is of most interest to them, helping departments prioritize data releases based on both external demand and internal capacity.

The objective of the inventory is to provide a landscape of all federal data. While it is recognized that not all data is eligible for release due to the nature of its content, departments are responsible for identifying and including all datasets of business value as part of the inventory exercise, with the exception of datasets whose titles contain information that should not be released to the public due to security or privacy concerns. These titles have been excluded from the inventory.

Departments were provided with an open data inventory template with standardized elements to populate and upload to the metadata catalogue, the Open Government Registry. These elements are described in the data dictionary file.

Departments are responsible for maintaining up-to-date data inventories that reflect significant additions to their data holdings.

For purposes of this open data inventory exercise, a dataset is defined as: “An organized collection of data used to carry out the business of a department or agency, that can be understood alone or in conjunction with other datasets”.

  • Publisher - Current Organization Name: Treasury Board of Canada Secretariat
  • Publisher - Organization Section Name: Chief Information Officer Branch
  • Licence: Open Government Licence - Canada

Resources

Resource Name        Resource Type  Format  Language          Links
Open Data Inventory  Dataset        CSV     English / French  Access
Open Data Inventory  Website        HTML    English           Access
Open Data Inventory  Website        HTML    French            Access
Data Dictionary      Guide          XLS     English / French  Access

Comments (15)

Is it possible to obtain the databases of Canadian businesses across all sectors?

Hi open data peeps. Is there any intention of publishing this dataset in an open data format? DCAT seems like it's meant for this (https://www.w3.org/TR/vocab-dcat/). And this resource seems relevant ( https://project-open-data.cio.gov/v1.1/schema/).

Hello, Thank you for your comment. We definitely follow the best practices we have learned from our data.gov colleagues, and have therefore decided to apply DCAT as well. We worked with them to ensure our applications aligned, and have applied a mapping on all datasets added on open.canada.ca. You can find this mapping on the right hand side of every dataset record (See the JSON and XML links on the right). Regards, Momin The Open Government team

Hi, is it possible to access the «Nomenclature SH»? How can I map, for example, the numbers in «code_SH : 500790, pays : 556, état : 1000 ... » to actual country and state names? Thanks

Hi Claudia, Thank you for your question. Unfortunately, we don't quite understand your question or see how this relates to the dataset. Could you please clarify by responding to open-ouvert@tbs-sct.gc.ca, and we would be happy to help. Thanks!

Comments: I wanted to use this file as a jumping-off point to investigate and process the GoC data inventory using automated tools. Although the CSV can be handled intelligently by LibreOffice Calc, and likely by other spreadsheet software, the inventory is proving to be a pretty messy file that prevents other tools and scripting languages from consuming it, causing yet again a lot of time spent cleaning up data. A lot of problems can be detected just by opening it in a simple text editor or doing a 'cat' or 'less' for those on Linux (Mac too?). I think ensuring that some basic best practices are followed can help clean up this inventory a whole bunch:

* Following the IETF spec for CSV would be a good starting point (https://www.ietf.org/rfc/rfc4180.txt)
* Remove all carriage returns (hidden characters) often found in descriptions to ensure "Each record is located on a separate line, delimited by a line break (CRLF)" (RFC 4180); 767 instances of this error according to http://CSVLint.io
* Use proper quotation; embedded quotes should be of a different type: "Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields." (RFC 4180)

The solution for cleaning up this dataset for analytical tools and scripting (especially Python): find and replace all instances of "" with ' (182 occurrences as of today). It would be great if the dataset at source could be updated so tools like RapidMiner and Orange Canvas can use the HTTP resource directly instead of downloading. Although I am sure many spreadsheet users won't ever dive this deep, it would be good for GoC Open Data to show leadership in applying best practices, especially RFC 4180. Leveraging tools like https://csvlint.io can certainly help GoC's open data cause (see report on this file below).
I have also captured the issue with Orange Canvas here so they can improve their product: https://github.com/biolab/orange3/issues/2293 — however, it would be good to clean up this inventory as well.

Investigation procedure: I took a brute-force approach since there were so many best practices not used in the creation of this dataset. Many tools can handle common mistakes and annoyances, but we should ensure that data is as clean as possible going in. Garbage in = garbage out.

* Platform: Ubuntu Linux 16.04
* Downloaded the CSV
* Opened it in Orange Canvas (https://orange.biolab.si/); issues with Python not being able to create indices ("too many indices for array"), likely due to poor formatting of columns
* Created a sample dataset using the first 100 rows, which worked; same with the first 150, but the first 200 and beyond failed to load
* Opened it in RapidMiner: 8 warnings on the first 100 rows, various issues with inconsistent column formats
* Opened it in a text editor like gedit (https://wiki.gnome.org/Apps/Gedit)
* Scrolled down the first column and found many incorrect line starts (see https://www.ietf.org/rfc/rfc4180.txt)
* Tried some manual fixes, but it proved to take too long for the whole inventory:

ID, Edit
ODI-2016-00018, Removed 2 carriage returns (CR)
ODI-2016-00096, Removed 1 CR
ODI-2016-00190, Removed 1 CR
ODI-2016-00214, Removed 4 CR
ODI-2016-00216, Removed 2 CR
ODI-2016-00217, Removed 2 CR
ODI-2016-00219, Removed 3 CR
ODI-2016-00220, Removed 3 CR

* Tried a search and replace for carriage returns: searched for '\r' and replaced with nothing, but only the character was removed; records were still split across lines
* Went back to Orange Canvas and a text editor to systematically track down the problem record (row)
* Fixed record 'ODI-2016-00190,' at line 202 by removing a carriage return (CR) character, which is often searchable in text editors using '\r' as an escaped search pattern. No glory
* Went back to try to fix record 'ODI-2016-00018,' by removing a CR as well. No glory
* Created a sample file with the first 170 rows: worked
* First 190 rows: worked
* Then started removing single rows, working backwards from ID 'ODI-2016-00192,'
* Removed 'ODI-2016-00192,': failed
* Removed ODI-2016-00191: failed
* Removed ODI-2016-00190: failed
* Removed ODI-2016-00319: failed
* Removed ODI-2016-00189 and ODI-2016-00188 (I am getting impatient, but getting closer): still failed
* Removed ODI-2016-00186 and ODI-2016-00187: failed; 2 left to test before we are back to 190 records
* Removed ODI-2016-00185: failed
* Record ODI-2016-00184 might be the issue; let's remove that too just to double-check against our other 190-record file: worked
* Confirmed that record ODI-2016-00184 is giving us issues. But why?
* Read through the record; let's try some things
* Quoted the second and third columns, just because that is a best practice for text: failed
* The third column looks pretty messy; when using embedded quotes you should replace one set with single quotes. Using double-double quotes gets pretty silly and, in the end, caused the issue.

Original text:
,"The ""Areas of Non-Contributing Drainage within Total Gross Drainage Areas of the AAFC Watersheds Project - 2013"" dataset is a geospatial data layer containing polygon features representing the areas within the “total gross drainage areas” of each gauging station of the Agriculture and Agri-Food Canada (AAFC) Watersheds Project that DO NOT contribute to average runoff. A “total gross drainage area” is the maximum area that could contribute runoff for a single gauging station – the “areas of non-contributing drainage” are those parts of that “total gross drainage area” that DO NOT contribute to average runoff. For each “total gross drainage area” there can be none to several unconnected “areas of non-contributing drainage”. These polygons may overlap with those from other gauging stations’ “total gross drainage area”, as upstream land surfaces form part of multiple downstream gauging stations’ “total gross drainage areas”.",

Edited text:
,"The 'Areas of Non-Contributing Drainage within Total Gross Drainage Areas of the AAFC Watersheds Project - 2013' dataset is a geospatial data layer containing polygon features representing the areas within the “total gross drainage areas” of each gauging station of the Agriculture and Agri-Food Canada (AAFC) Watersheds Project that DO NOT contribute to average runoff. A “total gross drainage area” is the maximum area that could contribute runoff for a single gauging station – the “areas of non-contributing drainage” are those parts of that “total gross drainage area” that DO NOT contribute to average runoff. For each “total gross drainage area” there can be none to several unconnected “areas of non-contributing drainage”. These polygons may overlap with those from other gauging stations “total gross drainage area”, as upstream land surfaces form part of multiple downstream gauging stations’ “total gross drainage areas”.",

Notes: let's try removing the quote after "stations": failed. Let's try removing double double-quotes ("") and replacing with single quotes: SUCCESS!

* Opened the first 200 records and replaced "" with ': success
* Opened the first 500 records and did the same: 32 instances replaced, success
* Opened the first 1000, replaced all (67 instances): success
* Tried the whole inventory, replaced all (182 instances): success!
CSVLint (report for the inventory file: https://csvlint.io/validation/590ddb893036660004000010)

* Kept thinking I would find the issue with this file on my own, so I did not run the CSV through CSVLint until the end
* It turns out CSVLint found 767 errors, 2 warnings and 1 message for this inventory file

Comments:

* All text should be enclosed in double quotes separated by commas; the format is pretty inconsistent in this file
* Double-double quotes are the issue here; replace "" with '
* This is a pretty messy file; if some basic best practices had been followed, it would have taken less time to find the actual problem.
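The fix described in this comment — flatten the embedded newlines, turn the doubled double-quotes into apostrophes — can be sketched in Python. This is a rough illustration of the commenter's procedure, not an official cleanup script; it leans on the `csv` module (which copes with newlines inside quoted fields) rather than raw find-and-replace:

```python
import csv
import io

def clean_inventory(raw_text):
    """Re-serialize a messy CSV: csv.reader copes with newlines embedded
    inside quoted fields, so we flatten those to spaces, turn the
    problematic doubled double-quotes into apostrophes (after parsing,
    an embedded "" appears as a single " in the field), and quote every
    field on the way out, RFC 4180 style."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\r\n")
    for row in csv.reader(io.StringIO(raw_text)):
        writer.writerow(
            f.replace("\r", " ").replace("\n", " ").replace('"', "'")
            for f in row
        )
    return out.getvalue()
```

Applied to a record like ODI-2016-00184 above, the embedded newline and the `""..""` quoting both come out as plain single-line, apostrophe-quoted text.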

Hi Dave, thank you for your feedback. I have forwarded your comment to the team responsible for the Open Data inventories. Stay tuned for a response! Momin, the Open Government team.

Dave, here's their response: "We're looking into correcting the Content-Type header served and normalizing the embedded newlines within our generated CSV files so that automated tools like csvlint.io show our CSV files as correct. If your tools are having trouble processing our embedded newlines and quotes, here's an example script that downloads the dataset and removes those characters: https://gist.github.com/wardi/37e1d9922113a3252071665cda19b0b6 Thanks and I hope this helps!"

Where is the data dictionary for this dataset? What are the differences between Publisher, program alignment architecture, owner org, and owner org title?

Hi Stephen, the team is currently working on the data dictionary. Until then, here are the descriptions:

* Publisher – the name of the organization primarily responsible for publishing the dataset at the time of publication (if applicable, i.e. if different from the current name)
* program alignment architecture – the Program Alignment Architecture (PAA) is an inventory of each organization's programs; it provides an overview of the organization's responsibilities
* owner org – the acronym of the GC organization that uploaded the inventory
* owner org title – the title of the GC organization that uploaded the inventory

Hope that helps! Momin, the Open Government team.
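With those field descriptions, simple aggregations over the inventory become possible — for example, counting inventory records per uploading organization. Note the snake_case column names below (owner_org, owner_org_title) are assumptions for illustration; check the actual header spelling against the published data dictionary:

```python
import csv
import io
from collections import Counter

# Inline sample standing in for inventory.csv; the column spelling
# (owner_org, owner_org_title) is assumed, not confirmed.
SAMPLE = """\
owner_org,owner_org_title,title_en
tbs-sct,Treasury Board of Canada Secretariat,Open Data Inventory
aafc-aac,Agriculture and Agri-Food Canada,AAFC Watersheds Project
aafc-aac,Agriculture and Agri-Food Canada,Non-Contributing Drainage Areas
"""

def datasets_per_org(csv_text):
    """Count inventory records per uploading GC organization."""
    return Counter(
        row["owner_org"] for row in csv.DictReader(io.StringIO(csv_text))
    )
```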

I noted that under the Transport Canada inventory of 210 items, there was a reference to one specific region's surveillance plan. Would all regional surveillance plans not be categorized as a dataset?

Francis, could you specify which dataset you are referring to? A complete name or link would work. Momin, the Open Government team.

It might be good to standardize coding on a couple of the fields in the CSV Open Data inventory file. For example, the eligible_for_release field's values and counts:

Value  Count
1      164
true   168
True   8300
TRUE   11
Y      10
Yes    386

The language field is another example.
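A small sketch of the kind of normalization being suggested here — the canonical forms and accepted variants below are illustrative assumptions, not the portal's actual controlled vocabulary:

```python
def normalize_flag(value):
    """Collapse the observed variants of eligible_for_release
    (1, true, True, TRUE, Y, Yes) into one canonical token.
    Unrecognized values pass through unchanged so they can be
    flagged for review rather than silently rewritten."""
    v = value.strip().lower()
    if v in {"1", "true", "y", "yes"}:
        return "TRUE"
    if v in {"0", "false", "n", "no"}:
        return "FALSE"
    return value
```

Mapping this function over the column would collapse the six variants tallied above into a single value.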

Hi Rob, thanks for this suggestion. I will forward it to our systems team for review. Regards, Momin, the Open Government team.

Rob, here's the response from our systems team: “Thank you for your comment. It was our intention to standardize this element. However, unfortunately there were issues with its implementation that prevented us from effectively doing so. Moving forward, we will work to ensure that we standardize as many elements as possible, and use controlled vocabularies where applicable. Thanks!” Regards, Momin, Open Government team.