Horizontal Navigation Bar Page |
---|
title | Documentation Requirements |
---|
|
Panel |
---|
| This section provides information on CDR documentation requirements for CDR Contributors. Please explore this tab for an overview of the CDR Data Catalog, the minimum requested documentation from data contributors, an example of approved data dictionary documentation, and expectations for data contributors.

Overview of CDR Data Catalog

The CDR Data Catalog provides a location for data contributors to make documentation available to CDR users. The DAMOD team provides a general layout document for each schema that includes table names, column names, and datatypes for every source in Hive. This document (located under the CDR Table Layouts column) is produced when the data is made available in the CDR or when data definitions are updated. Data contributors to the CDR are responsible for providing the supplemental documentation necessary to support end users of their data. Examples of such documentation include:
- Data Dictionaries
- Data Models
- User Guides
- Training Documents
![Image of CDR Data Catalog table headers Image of CDR Data Catalog table headers](/download/attachments/134987761/image2021-1-27_17-8-19.png?version=1&modificationDate=1612904109238&api=v2)
Minimum Requested Documentation from Data Contributors

Available documentation on the CDR Data Catalog will vary from source to source. Some mature data sources will have excellent existing documentation/artifacts, while newer sources may provide less. At a minimum, the DAMOD team recommends that data contributors provide the following documents:
Document Name | Description | Necessary Information | Other Information |
---|
Data Dictionary | Data dictionaries can be in various formats, including Excel, PDF, and Word. We recommend that all documents provided be 508 compliant for users. We strongly recommend clear and detailed documentation for columns that contain discrete, non-inferable, or computed values. | | |
Data Model | Data models provide the keys for each table, giving users a mechanism for understanding how to join data across tables and identify unique records (if applicable to your source). | Surrogate Keys, Foreign Keys, Primary Keys | |
Example Approved Data Dictionary

The example below meets all of the required criteria: ![](/download/attachments/134987761/image2021-2-3_9-57-25.png?version=1&modificationDate=1612904109425&api=v2)
Table and column comments can be included in DDL to assist users with understanding data. The DAMOD team encourages all contributors to either create tables with comments or plan to alter tables with comments. These comments should match data dictionaries when possible.
In this example, table and column comments are being inserted into a new table. Code Block |
---|
language | sql |
---|
title | CREATE Table with Comments |
---|
| --Create Table:
--DROP TABLE IF EXISTS <Table_name>;
CREATE EXTERNAL TABLE <Table_name>
(
Column_1 datatype Comment 'Unique identifier for provider service location',
Column_2 datatype Comment 'Provider's Enrollment Identifier',
column_3 datatype Comment 'State in which the Medicare Administrative Contractor Operates',
column_n datatype Comment 'comment n'
)
COMMENT 'Table Description'
PARTITIONED BY (partition_column data_type, month int, day int)
STORED AS PARQUET
LOCATION 's3a://myBucket/myParquet/'; |
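Since table and column comments should match the data dictionary when possible, a quick consistency check can catch drift between the two. The sketch below is plain Python with hypothetical column names and descriptions; in practice, the table comments could be pulled from the metastore (e.g. via DESCRIBE <Table_name>):

```python
# Minimal sketch: flag columns whose Hive comments drift from the data dictionary.
# Column names and descriptions below are hypothetical examples.

def find_comment_mismatches(data_dictionary, table_comments):
    """Return columns whose table comment is missing or differs from the dictionary."""
    mismatches = {}
    for column, description in data_dictionary.items():
        comment = table_comments.get(column)
        if comment != description:
            mismatches[column] = {"expected": description, "actual": comment}
    return mismatches

data_dictionary = {
    "prvdr_srvc_loc_id": "Unique identifier for provider service location",
    "enrlmt_id": "Provider's Enrollment Identifier",
}
table_comments = {
    "prvdr_srvc_loc_id": "Unique identifier for provider service location",
    "enrlmt_id": "",  # comment was never populated
}

print(find_comment_mismatches(data_dictionary, table_comments))
```

Any flagged columns could then be corrected with the ALTER TABLE statements shown below.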
In this example, table and column comments are being inserted into an existing table via an ALTER TABLE statement. Code Block |
---|
language | sql |
---|
title | ALTER Table to Insert Comments if the Table Already Exists |
---|
| --ALTER Column Comments:
ALTER TABLE <Table_name> CHANGE Column_1 Column_1 datatype Comment 'Column Description';
--ALTER Table Comments:
ALTER TABLE <Table_name> SET TBLPROPERTIES ('comment' = 'Table Description-new'); |
Expectations - Data Contributors provide updated documentation when source data changes
- Data Contributors provide consistent versioning on documents
- Data Contributors provide accurate documentation for users
- Data Contributors support questions about the data and documentation
|
|
Horizontal Navigation Bar Page |
---|
title | Data Issue Reporting |
---|
|
Panel |
---|
| As a data contributor, you are responsible for informing the Data & Analytics team when there is an issue with your data. Please submit the form below within one business day of identifying an issue with your data. Please note that the following information is needed to complete this form: - Impacted Data
- Issue Description
- Current Actions Taken to Resolve
- Planned Resolution Date
- Data Owner Technical Point of Contact
Submit this form to the Data & Analytics team by emailing it to us at servicecentersos@cms.hhs.gov (routing to ADO-QualityNet Analytic-Support). Please note that you are also responsible for sending follow-ups and updates via email after this form is submitted. ![](/download/attachments/134987761/image2021-2-9_16-15-22.png?version=1&modificationDate=1612905323621&api=v2)
|
|
Horizontal Navigation Bar Page |
---|
title | Contributor BYOD Guide |
---|
|
Tabs Page |
---|
id | How to Become a CDR Data Contributor |
---|
title | Data Contributor |
---|
| How to Become a CDR Data Contributor

This page provides an overview of the Contributor Bring Your Own Data (BYOD) process for contributors that plan to make data available in the CDR for all organizations with an approved DUA. These sources and their documentation will be posted to the CDR Data Catalog.

Step-by-step guide

Background

The Centralized Data Repository (CDR) provides access to CCSQ data, including claims, provider, beneficiary, and other data, within a secure HCQIS cloud environment. The CDR increases the accessibility, security, quality, and timeliness of data. The goal of the CDR is to make data available from source systems with fewer transformations and better data quality. The result is a reduction in data duplication and data conflict, since all CCSQ/HCQIS users use the same data from the same source. The goal of Contributor Bring Your Own Data (BYOD) is to allow CDR data contributors to make data available directly to CDR users with less copying of data.

Data Onboarding Overview

This section describes how to onboard new datasets into the CDR and share them with other organizations. ![](/download/attachments/134987761/image2021-4-7_15-5-27.png?version=1&modificationDate=1622033984933&api=v2)
Footnote |
---|
Note: The Contributor Bring Your Own Data (BYOD) diagram shows an overview of the final-state architecture after an integration is completed. |
Tabs Container |
---|
|
Tabs Page |
---|
| How to Initiate a Request to Become a Contributor

Data sources that wish to onboard datasets into the CDR and become a contributor should submit a request via the CCSQ Data and Analytics Request Form. The full list of current CDR data contributors and available datasets can be found on the CDR Data Catalog. Once a request is approved, contributors can expect a kick-off meeting with an overview of the integration process. A collaboration Confluence page will be set up that outlines the onboarding steps, responsible parties, datasets, and timeline for the integration. After this phase, contributors and the CCSQ Data & Analytics Modernization team will collaborate on making the data available based on approved timelines. Once an integration is scheduled and a contributor is able to commit the required resources to the effort, the setup will typically take less than one programming iteration, or two weeks.
Contributor Responsibilities include: - Contributors must provide supporting documentation for their data sources. The minimum requested documentation for contributors can be found on our CDR Contributor Resources page. We recommend providing documentation at least two weeks before data access is granted so users have advance notice of the data that will be available. CMS may delay granting access to data if the minimum required documentation is not available and/or a plan is not established to provide it in the immediate future.
- Contributors are responsible for answering questions about their data when the DAMOD team does not have the information to answer them.
- Contributors are responsible for keeping the DAMOD team informed of data issues so proper communication with the user community can occur.
- Contributors cannot make changes to data without following the communication process outlined in the document below.
Info |
---|
| CDR contributors are responsible for completing the steps outlined in the process document as part of making data available. The timeline for integration can vary depending on CMS prioritization and available resources. |
|
Tabs Page |
---|
title | Source Parquet Files |
---|
| How to Create Source Parquet Files in S3

Data may be stored in many different database formats (Postgres, Redshift, Aurora, etc.) depending on the partnering organization. To make this data available in the CDR (Hive), it must be in an approved format that Hive can read. The preferred data format for partnering systems is Parquet, due to its superior performance and lower storage cost; however, CSV may also be used. This data should be stored in the partnering ADO's S3 bucket. It is also recommended to use directory versioning so that if Parquet files need to be re-created, they can be written to a different version directory without impacting the current files. The example below shows how data may be extracted from a Postgres database and written as Parquet files in S3. We recommend following Apache best practices, with Parquet row groups between 512 MB and 1 GB in size.
Code Block |
---|
language | py |
---|
firstline | 1 |
---|
title | Write Data to Parquet Files |
---|
| # Read data from a Postgres database over JDBC, specifying column data types on read
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("customSchema", "id DECIMAL(38, 0), name STRING") \
    .load()

# Option 1: register the DataFrame as a Spark table backed by files at the given path
jdbcDF3.write.option("path", "/data/home/schema_name/table_name/").saveAsTable("table_name")

# Option 2: export the data to the targeted S3 bucket as Parquet files
jdbcDF3.write.format("parquet").save("s3a://myBucket/myParquet/") |
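The row-group guidance above can be roughed out with simple arithmetic: given a total dataset size and a target file size, estimate how many output files to produce. A minimal sketch with hypothetical sizes:

```python
# Rough sketch: estimate the number of output files needed to keep Parquet
# files (and therefore row groups) near the recommended 512 MB - 1 GB range.
# The target size and dataset size below are hypothetical.
import math

TARGET_FILE_BYTES = 512 * 1024 * 1024  # low end of the recommended range

def estimated_file_count(total_bytes, target_bytes=TARGET_FILE_BYTES):
    """Number of output files so each lands near the target size (at least 1)."""
    return max(1, math.ceil(total_bytes / target_bytes))

# Hypothetical 10 GB dataset -> 20 files of roughly 512 MB each
print(estimated_file_count(10 * 1024 ** 3))  # 20
```

The resulting count could be passed to Spark's DataFrame.repartition(n) before writing, so each task emits one appropriately sized file.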
Info |
---|
| This code is intended for example/demonstration purposes only. There may be native functionality available with your data source (e.g. Redshift UNLOAD) that allows contributors to create Parquet files. There are multiple ways to create Parquet files. |
|
Tabs Page |
---|
title | Configure Bucket Policy |
---|
| Since the partnering ADO's source data (Parquet files) is stored in its own S3 bucket, cross-account access must be established to give the CDR the necessary access to that bucket. The CDR adopted the cross-account S3 bucket access method, using resource-based policies and AWS Identity and Access Management (IAM) policies, as documented by AWS Support. In this case, the partnering ADO is "Account A" and the CDR is "Account B". Data encryption is highly recommended, either with standard AWS server-side encryption or with custom key stores. Below is an example of a resource-based policy with a custom key store configuration from a partnering ADO.
Code Block |
---|
language | json |
---|
title | Partnering ADO Resource-based Policy |
---|
| {
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantS3BucketAccessToCDR",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::AccountBID:root"
      },
      "Action": [
        "s3:GetObject*",
        "s3:List*",
        "s3:SelectObjectContent",
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": [
        "arn:aws:s3:::AccountABucketName/*",
        "arn:aws:s3:::AccountABucketName",
        "arn:aws:kms:region:AccountAID:key/AccountAKMSKeyID"
      ]
    }
  ]
} |
Info |
---|
| The resource-based policy must be granted before external tables can be created in the CDR Hive metastore. |
|
Tabs Page |
---|
title | Hive Service Account & Databases |
---|
| How to Request Creation of Hive Service Account and Databases in the CDR

In order for the partnering ADO to register tables in the CDR, a service account and a source Hive database must be created. There is a standard format for all database names that is based on the CMS Data Taxonomy: the Hive database name consists of (Taxonomy)_(Line of Business)_(Dataset Name). We rely on data contributors for input on database names; however, all data sources should follow this same standardized format. The Hive service account will have read/write permissions to the databases specified in the request. As an initial part of the integration, a request should be submitted for the creation of the service account and the necessary Hive databases. - Ensure that any developer(s) needing access to Hive have requested and been approved for the Quality Analytics Role. This is recommended so contributors can validate that their data is made available correctly. The service account is for automated, system-to-system connections only.
- Submit a ServiceNow Request to ADO-CDR-Support for the creation of a Hive Service Account.
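The naming convention above can be made concrete with a small helper. In the sketch below (plain Python; the taxonomy, line of business, and dataset values are hypothetical, and actual names are agreed upon with the DAMOD team), a Hive database name is composed from the three required parts:

```python
# Sketch: compose a Hive database name as (Taxonomy)_(Line of Business)_(Dataset Name).
# Example values are hypothetical; real names come from the CMS Data Taxonomy.

def hive_database_name(taxonomy, line_of_business, dataset_name):
    """Join the three naming parts with underscores, lowercased for Hive."""
    parts = [taxonomy, line_of_business, dataset_name]
    cleaned = ["_".join(p.strip().lower().split()) for p in parts]
    return "_".join(cleaned)

print(hive_database_name("claims", "medicare", "part_b"))  # claims_medicare_part_b
```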
Info |
---|
| Once data is made available in Production and access is granted to users, data contributors can still make changes to data definitions. Any changes, such as altering tables, adding new tables, or dropping tables, will impact users. |
|
Tabs Page |
---|
title | Register Tables in CDR |
---|
| How to Register Tables in the CDR (Hive)

Once a service account and the necessary Hive databases are created, data contributors will be able to register tables in the CDR. Partnering organizations can make a JDBC connection to the CDR (Hive) and execute queries from their service account. Certain commands should not be run in production during working hours due to the possibility of user impact; please see below for authorized command execution times. Comments are recommended when creating tables/columns; however, they are not required. Depending on the individual AWS account's location, VPC peering may be necessary. For certain accounts, a transit gateway is already created.
Example Commands | Authorized Time to Run Commands (EST) |
Code Block |
---|
language | sql |
---|
firstline | 1 |
---|
title | Create Table |
---|
| --Create Table:
--DROP TABLE IF EXISTS <Table_name>;
CREATE EXTERNAL TABLE <Table_name>
(
Column_1 datatype Comment 'Column 1 description',
Column_2 datatype Comment 'Column 2 description',
column_3 datatype Comment 'Column 3 description',
column_n datatype Comment 'Column n description'
)
COMMENT 'Table Description'
PARTITIONED BY (partition_column data_type, month int, day int)
STORED AS PARQUET
LOCATION 's3a://myBucket/myParquet/'; |
Reference URL: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#specifying-storage-format-for-hive-tables | Anytime |
Code Block |
---|
language | sql |
---|
firstline | 1 |
---|
title | Alter Table |
---|
| --Alter Table:
--1. Rename a table:
ALTER TABLE table_name RENAME TO new_table_name;
--2. Change the storage location of a table (or a single partition):
ALTER TABLE table_name [PARTITION partition_spec] SET LOCATION 'new_location';
--3. Set table properties:
ALTER TABLE table_name SET TBLPROPERTIES table_properties;
--table_properties:
--  : (property_name = property_value, property_name = property_value, ... ); |
Reference URL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL | 0000 - 0600 |
Code Block |
---|
language | sql |
---|
firstline | 1 |
---|
title | MSCK Repair |
---|
| MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]; |
| 0000 - 0600 |
Info |
---|
title | Partitioning Guidance |
---|
| We recommend that all CDR contributors evaluate partitioning attribute(s) and assess performance before submitting a request to grant data access to users. CDR contributors should investigate use cases for their data and determine the best partitioning scheme based on commonly used filters and expected query patterns. Data with too many partitions can result in Parquet files that are too small, which will greatly impact query performance even on small tables. Generally, it is also recommended that the attributes chosen for partitioning have low cardinality. For example, data may be partitioned by a date attribute (e.g. MM/YYYY) if users commonly query for data within a certain date range. |
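As a rough aid to the low-cardinality rule, candidate partitioning attributes can be screened by counting distinct values in a data sample. The sketch below is plain Python with a hypothetical threshold and sample rows; on real data, the same check could be done in Spark with countDistinct:

```python
# Sketch: flag candidate partition columns whose cardinality is low enough.
# The threshold and sample data are hypothetical.

MAX_PARTITIONS = 1000  # hypothetical upper bound on acceptable partition count

def partition_candidates(rows, columns, max_partitions=MAX_PARTITIONS):
    """Return {column: distinct_count} for columns under the partition-count limit."""
    return {
        col: len({row[col] for row in rows})
        for col in columns
        if len({row[col] for row in rows}) <= max_partitions
    }

sample = [
    {"clm_id": 1, "state": "MD", "svc_month": "2021-01"},
    {"clm_id": 2, "state": "VA", "svc_month": "2021-01"},
    {"clm_id": 3, "state": "MD", "svc_month": "2021-02"},
]
# clm_id is unique per row (high cardinality), so it is a poor partition key;
# state and svc_month have few distinct values and are reasonable candidates.
print(partition_candidates(sample, ["clm_id", "state", "svc_month"], max_partitions=2))
```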
Info |
---|
| The LOCATION statement must contain "s3a" in order to specify the proper file system protocol. To read more about "s3a", please see the Apache Hadoop documentation. |
|
Tabs Page |
---|
title | Grant Access to Data |
---|
| How to Grant Access to Data for CDR Users

Once tables are registered in production, the partnering ADO should validate that the data displays as expected. Once the partnering ADO is prepared to make the data available to CDR users, a notification should be submitted to grant access to users. Access to the data is controlled based on DUA. If there is a specific list of organizations that should be granted access, the partnering ADO can specify this list in the request; however, all organizations must have a valid DUA. Organizations are expected to provide a list of valid EPPE entries for their dataset(s) that authorize organizations to receive access. - Submit a ServiceNow Request to ADO-CDR-Support to grant access to users for the specified Hive database(s)
- Submit all required CDR Contributor Source Documentation in the request so it can be uploaded to the CDR Data Catalog
|
Tabs Page |
---|
title | Changes to Data Definition Language |
---|
| How to Notify DAMOD Team of Changes to Data Definition Language

Once users are accessing your data in the CDR, any changes to data definitions will impact their analysis. It is important for data contributors to ensure that DDL is not updated without proper communication to the DAMOD team. Since data contributors have full read/write access via their Hive service account, they are able to make updates at any time; however, we request that contributors follow the documented processes. For any changes that impact existing user code/processes, we request 10 business days' notice. For any other changes, such as new columns or new tables, we request 2 business days' notice. The DAMOD team recommends the following guidance for communications:
Info |
---|
| Please include updates to applicable source documentation (e.g. data dictionary) when changes occur. The updated documentation will be posted to the CDR Data Catalog. |
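The notice periods above (10 business days for changes that impact existing user code/processes, 2 for other changes) can be sketched as a small helper that computes the earliest permissible change date. This is a plain-Python illustration; the change-type labels are hypothetical, weekends are skipped, and holidays are not handled:

```python
# Sketch: earliest date a DDL change may be applied, given the requested notice
# periods. Change-type labels are assumptions; holiday handling is omitted.
from datetime import date, timedelta

NOTICE_DAYS = {"breaking": 10, "additive": 2}

def earliest_change_date(notified_on, change_type):
    """Advance the requested number of business days, skipping weekends."""
    remaining = NOTICE_DAYS[change_type]
    day = notified_on
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday through Friday
            remaining -= 1
    return day

# Notify on Monday 2021-03-01; an additive change may land Wednesday 2021-03-03
print(earliest_change_date(date(2021, 3, 1), "additive"))  # 2021-03-03
```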
|
Tabs Page |
---|
| How to Notify DAMOD Team of Data Refreshes

All data refresh dates are tracked on the CDR Data Catalog. It is essential for data contributors to establish a process for communicating when data is refreshed. Once a notification occurs, the data modernization team updates the CDR Data Catalog to reflect the availability of the refreshed data. - Email a notification to CDR data-loads <da-data-aaaafckf6equhoywenqh3uvaiu@hcqis.slack.com> when data is refreshed and when the next data refresh is scheduled to occur (if not on a recurring schedule)
- AWS Simple Notification Service (SNS) can also be set up to automate the notification into DAMOD Slack or email.
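A refresh notification can also be assembled programmatically before being sent by email or published via SNS. A minimal sketch (plain Python; the field names are hypothetical, and the commented boto3 call shows where SNS publishing would plug in):

```python
# Sketch: build a data-refresh notification message.
# Field names are hypothetical; SNS publishing is shown only as a comment.

def refresh_notification(dataset, refreshed_on, next_refresh=None):
    """Compose the notification body for a completed data refresh."""
    lines = [f"Dataset: {dataset}", f"Refreshed on: {refreshed_on}"]
    if next_refresh:  # include only when the source is not on a recurring schedule
        lines.append(f"Next refresh scheduled: {next_refresh}")
    return "\n".join(lines)

message = refresh_notification("claims_medicare_part_b", "2021-04-01", "2021-05-01")
print(message)

# To automate via SNS instead of email (requires boto3 and an existing topic):
# import boto3
# boto3.client("sns").publish(TopicArn="<your-topic-arn>",
#                             Subject="CDR data refresh", Message=message)
```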
|
|
|
|
|