CDR Contributor Resources
This page provides resources and documentation for CDR Data Contributors. This Standard Operating Procedure (SOP) provides an overview of the Centralized Data Repository (CDR) data contribution process. Please explore the tabs below for information regarding Data Issue Reporting as well as CDR Data Catalog Documentation Requirements for data contributors. Horizontal Navigation Bar |
---|
id | CDR Contributor Resources |
---|
|
Horizontal Navigation Bar Page |
---|
title | Documentation Requirements |
---|
|
Panel |
---|
|
This section provides information regarding CDR documentation requirements for CDR Contributors. Please explore this tab for an overview of the CDR Data Catalog, the minimum requested documentation from Data Contributors, an example of approved data dictionary documentation, and expectations for data contributors.
Overview of CDR Data Catalog
The CDR Data Catalog provides a location for data contributors to make documentation available to CDR users. The DAMOD team provides a general layout document of each schema that includes table names, column names, and datatypes for every source in Hive. This document (located under the CDR Table Layouts column) is produced when the data is made available in the CDR or when data definitions are updated. Data contributors to the CDR are responsible for providing the supplemental documentation necessary to support end users of their data. Examples of such documentation may include:
Data Dictionaries
Data Models
User Guides
Training Documents
Image Removed
Minimum Requested Documentation from Data Contributors
Available documentation on the CDR Data Catalog will vary from source to source. Some mature data sources will have excellent existing documentation/artifacts, while newer sources may provide less. At a minimum, the DAMOD team recommends that data contributors provide the following documents:
Document Name | Description | Necessary Information | Other Information |
---|
Data Dictionary | Data dictionaries can be provided in various formats, including Excel, PDF, and Word. We recommend that all provided documents be 508 compliant for users. We strongly recommend clear and detailed documentation for columns that contain discrete, non-inferable, or computed values. | | |
Data Model | Data models provide the keys for each table, giving users a mechanism for understanding how to join data across tables and identify unique records (if applicable to your source). | Surrogate Keys, Foreign Keys, Primary Keys | |
Example Approved Data Dictionary
The example below meets all of the criteria described above.
Image Removed
Expectations
- Data Contributors provide updated documentation when source data changes
- Data Contributors provide consistent versioning on documents
- Data Contributors provide accurate documentation for users
- Data Contributors support questions about the data and documentation
Horizontal Navigation Bar Page |
---|
title | Data Issue Reporting |
---|
|
Panel |
---|
|
As a data contributor, you are responsible for informing the Data & Analytics team when there is an issue with your data. Please submit the form below within one business day of identifying an issue with your data. Please note that the following information will be needed in order to complete this form:
- Impacted Data
- Issue Description
- Current Actions Taken to Resolve
- Planned Resolution Date
- Data Owner Technical Point of Contact
Submit this form to the Data & Analytics team by emailing it to us at HCQIS_Data@hcqis.org. Please note that you are also responsible for sending follow-ups and updates via email after this form is submitted.
Image Removed
|
Horizontal Navigation Bar Page |
---|
|
Background
The Centralized Data Repository (CDR) provides access to CCSQ data, including claims, provider, beneficiary, and other data, within a secure HCQIS cloud environment. The CDR increases the accessibility, security, quality, and timeliness of data. The goal of the CDR is to make data available from source systems with fewer transformations and higher data quality. The result is a reduction in data duplication and data conflict, since all CCSQ/HCQIS users use the same data from the same source. The goal of Bring Your Own Data (BYOD) is to allow CDR data contributors to make data available directly to users with less copying of data.
Data Onboarding Overview
This section describes how to onboard new datasets into the CDR and share with other organizations. To initiate a request, please submit a CCSQ Data and Analytics Request Form.
Image Removed
How to Create Source Parquet Files in S3
Data may be stored in many different database formats (Postgres, Redshift, Aurora, etc.) depending on the partnering organization. To make this data available in the CDR (Hive), it must be converted to an approved format that Hive can read. This contribution process applies to contributors that plan to make data available in the CDR to all organizations with an approved data usage agreement (DUA). These sources and their approved documentation will be posted to the CDR Data Catalog.
Background
The CDR provides access to CCSQ data, including claims, provider, beneficiary, and other data, within a secure cloud environment. The goal of the CDR is to make data available from source systems with greater accessibility, security, quality, and timeliness. The goal of contribution is to allow CDR Data Contributors to make data available directly to CDR users with less data duplication.
Horizontal Navigation Bar |
---|
id | LakeHouse Data Migration |
---|
title | LakeHouse Data Migration |
---|
|
Horizontal Navigation Bar Page |
---|
|
Tabs Container |
---|
|
Tabs Page |
---|
| Submit a request through the CCSQ Data & Analytics Request Form to start the contribution process. The request will be reviewed and prioritized by the CMS Business Owners of the CDR. Use the following as a guide for what to include. The following information is required in the request to become a contributor:
- A short description of the data with the minimum required information.
- The data dictionary with the minimum required information.
- The point of contact for the contribution effort.
- The anticipated timeline to load, validate, and release data to end users.
- The list of any validation groups that will require access to the pre-prod data for testing.
- The Enterprise Privacy Policy Engine (EPPE) DUA Entry associated with this data.
- The Access Control Model
The CDR will need one of the following for support purposes, to forward user inquiries specific to the data:
- A CCSQ ServiceNow Support Group name
- A support email address
The team will also need to know:
- The access control model for this data (see below).
- Whether a data catalog entry can be made for this data on the public data catalog or if access to the data documentation should be restricted.
Required Data Documentation
All documentation submitted for contributed data must meet the minimum required standards as outlined below.
Short Description
The short description of the data must include the following:
- Title - The name of the dataset or resource.
- Domain - The category of the resource (e.g., Clinical, Provider, Claims, Beneficiary)
- Source - The system or activity that generates or provides the dataset.
- Granularity - The level of data detail that the resource covers (e.g., Atomic, Aggregated, Hybrid)
- Load/Refresh Frequency - How often this data source will be refreshed/updated.
- Geographic Scope - What level of geographic scope this data entails (e.g., national, regional, state, county, zip)
- Timespan Covered - What timespan this data covers (e.g., 2019-2022)
- Personally Identifiable Information (PII)/Protected Health Information (PHI)/Sensitive Information - Whether this data contains PII/PHI/Other Sensitive information or not.
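For teams that track this metadata programmatically, the short-description fields above map naturally onto a small record type. A minimal sketch, assuming nothing beyond the Python standard library; the class and field names are our own illustration, not a CDR-defined schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class ShortDescription:
    """Minimum short-description metadata for a contributed dataset.

    Field names are illustrative only, not a CDR-defined schema.
    """
    title: str               # name of the dataset or resource
    domain: str              # e.g., Clinical, Provider, Claims, Beneficiary
    source: str              # system or activity that generates the dataset
    granularity: str         # Atomic, Aggregated, or Hybrid
    refresh_frequency: str   # how often the data is refreshed/updated
    geographic_scope: str    # e.g., national, regional, state, county, zip
    timespan_covered: str    # e.g., "2019-2022"
    contains_pii_phi: bool   # whether the data contains PII/PHI/sensitive info

# A hypothetical entry for a request form
example = ShortDescription(
    title="Example Claims Extract",
    domain="Claims",
    source="Example upstream claims system",
    granularity="Atomic",
    refresh_frequency="Monthly",
    geographic_scope="National",
    timespan_covered="2019-2022",
    contains_pii_phi=True,
)

# asdict() yields a plain dict, convenient for serializing the entry
print(asdict(example)["domain"])  # → Claims
```

Capturing the fields in one structure makes it easy to confirm nothing required is missing before the request is submitted.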
Data Dictionary
At minimum, documentation for data contributed to the CDR should be in the form of a data dictionary that includes:
- Table name
- Table description
- Field Name (Short Name)
- Field Name (Long Name)
- Data Type/Length
- Field Description (provide details)
- Possible Values/Range
- Partition Key
- Comments
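Before submitting, a contributor could sanity-check a data dictionary CSV against the field list above. A minimal sketch using only Python's standard csv module; the exact template column headings are an assumption based on the list above, and the sample rows are hypothetical:

```python
import csv
import io

# Columns requested for a CDR data dictionary (assumed headings, from the list above)
REQUIRED_COLUMNS = {
    "Table Name", "Table Description", "Field Name (Short Name)",
    "Field Name (Long Name)", "Data Type/Length", "Field Description",
    "Possible Values/Range", "Partition Key", "Comments",
}

def missing_columns(csv_text: str) -> set:
    """Return the required columns absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return REQUIRED_COLUMNS - set(header)

# A hypothetical submission missing "Partition Key" and "Comments"
sample = (
    "Table Name,Table Description,Field Name (Short Name),"
    "Field Name (Long Name),Data Type/Length,Field Description,"
    "Possible Values/Range\n"
    "claims,Claim header records,clm_id,Claim ID,DECIMAL(38),Unique claim,1-999\n"
)

print(sorted(missing_columns(sample)))  # → ['Comments', 'Partition Key']
```

A check like this is cheap to run on every revision of the dictionary, so gaps surface before review rather than during it.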
Submitted data documentation should be in a comma-separated values (CSV) format that meets the required data dictionary template. The preferred data format for partnering systems is Parquet, due to its superior performance and lower storage cost. However, CSV or JavaScript Object Notation (JSON) may also be used. This data should be stored in the partnering Application Development Organization's (ADO's) Simple Storage Service (S3) bucket. It is also recommended to use directory versioning, so that if Parquet files need to be re-created, they can be written to a different version directory without impacting the current Parquet files.
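The directory-versioning recommendation can be as simple as computing the next version prefix before each re-extract. A minimal sketch; the prefix layout (v1, v2, ...) and bucket names are hypothetical, and listing the existing versions from S3 (e.g., via boto3) is stubbed out with a plain list to keep the example self-contained:

```python
import re

def next_version_prefix(base_prefix: str, existing_versions: list) -> str:
    """Given existing version directories like 'v1', 'v2', return the next prefix.

    In practice existing_versions would come from listing the S3 bucket;
    here it is passed in directly so the sketch runs stand-alone.
    """
    numbers = [int(m.group(1)) for v in existing_versions
               if (m := re.fullmatch(r"v(\d+)", v))]
    next_n = max(numbers, default=0) + 1
    return f"{base_prefix.rstrip('/')}/v{next_n}/"

# Re-created Parquet files go to a fresh directory, leaving v1 and v2 untouched
print(next_version_prefix("s3://ado-bucket/schema_name/table_name", ["v1", "v2"]))
# → s3://ado-bucket/schema_name/table_name/v3/
```

Writing each re-extract to a new version directory means current consumers keep reading a stable path until the switchover is deliberate.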
The example below provides code showing how data may be extracted from a Postgres database and written as Parquet files in S3. We recommend following Apache best practices, with Parquet row groups between 512MB and 1GB in size.
Code Block |
---|
language | py |
---|
firstline | 1 |
---|
title | Write Data to Parquet Files |
---|
collapse | true |
---|
|
# Read the source table from Postgres over JDBC, specifying column data types on read
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("customSchema", "id DECIMAL(38, 0), name STRING") \
    .load()

# Set the DataFrame output path to the targeted data bucket and save as a Spark table
jdbcDF3.write.option("path", "/data/home/schema_name/table_name/").saveAsTable("table_name")

# Export the data to the targeted data bucket as Parquet files
jdbcDF3.write.format("parquet").save("jdbcDF3.parquet") |
Info |
---|
|
This code is intended for example/demonstration purposes only. There may be native functionality available with your data source (e.g., Redshift UNLOAD) that allows contributors to create Parquet files. There are multiple ways to create Parquet files. |
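As one illustration of such native functionality, Redshift's documented UNLOAD command can write Parquet directly to S3 without Spark. A minimal sketch that only builds the statement; the bucket, role ARN, and table names are hypothetical, and in practice the SQL would be executed through a Redshift connection:

```python
def build_unload_sql(query: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift UNLOAD statement that writes Parquet files to S3."""
    # UNLOAD requires single quotes inside the inner query to be doubled
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS PARQUET"
    )

# Hypothetical source table, version directory, and IAM role
sql = build_unload_sql(
    "SELECT * FROM schema_name.table_name",
    "s3://ado-bucket/schema_name/table_name/v1/part_",
    "arn:aws:iam::111122223333:role/ExampleUnloadRole",
)
print(sql)
```

The statement is then run against the cluster like any other SQL; Redshift handles the Parquet encoding and S3 writes itself.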
Partnering ADO source data formats:
- Preferred Data Format: Parquet
- Accepted Data Formats: CSV, JSON
Access Control Model
The access control model refers to how contributors would like to restrict or grant access to the data that they are contributing. The CDR supports two access control models:
- Automated (DUA Based)
- Access to this data will automatically be granted to any end user organization that has the appropriate DUA entry on their DUA. Contributors will provide the DUA entry that corresponds to the contributed data.
- Manual (Notification Based)
- In addition to the DUA entry, access to this data is manually approved by the Business Owners of the data. End user organizations request access to the data, and we notify the Business Owners for either approval or rejection.
Contributors must indicate which access model they would like to utilize for their data. |
Tabs Page |
---|
| CDR Onboarding Steps
Onboarding into the CDR is central to becoming a Data Contributor. If the prospective contributor is not already onboarded, see the training and onboarding page for more information. Being onboarded into the CDR comes with the following automatically provisioned resources for the contributor's organization/group:
- A production database/schema
- A pre-production (test) database/schema
- A Service Principal/Service Account
Once onboarded, the team will generate a schema specific to the contributor's BYOD data in accordance with CMS Data Taxonomy. |
Tabs Page |
---|
| Contribution Workflow
The following is a generalized workflow that outlines the steps involved in contributing to the CDR.
Phase | Step | POC |
---|
Phase 1 | Request Submitted with Required Information | Contributor |
Phase 1 | Onboarding Initiated (if not already onboarded) | Contributor |
Phase 1 | Service Principal Automatically Generated | Automated upon onboarding |
Phase 2 | Bring Your Own Data (BYOD) PROD and pre-PROD Schemas Created | CDR |
Phase 2 | Elevated Service Principal Granted r/w/x Access | CDR |
Phase 2 | Data Documentation Reviewed | CDR |
Phase 2 | Data Catalog Entry Created (if applicable) | CDR |
Phase 3 (optional) | Data Published to pre-PROD Schema | Contributor |
Phase 3 (optional) | Green Light for Validation Groups | Contributor |
Phase 3 (optional) | Validation Groups Granted r/x Access | CDR |
Phase 3 (optional) | Validation Period Conducted | Contributor |
Phase 4 | Data Published to PROD BYOD Schema | Contributor |
Phase 4 | Green Light for End User Access | Contributor |
Phase 4 | End Users Granted r/x Access per Access Model | CDR |
Phase 4 | End User Communication Released | CDR |
|
|
|
Horizontal Navigation Bar Page |
---|
|
Tabs Container |
---|
|
Tabs Page |
---|
| Cross Account Access
A contributor's data (Parquet files) is stored in their own Simple Storage Service (S3) bucket, so cross-account access must be established to give the CDR the necessary access to the S3 bucket. The CDR has adopted the resource-based policies and Amazon Web Services (AWS) Identity and Access Management (IAM) policies method for cross-account S3 bucket access documented on AWS Support. In this case, the partnering ADO contributor is "Account A", and the CDR is "Account B". Data encryption is highly recommended, either with standard AWS server-side encryption or custom key stores. Below is an example of a resource-based policy with a custom key store configuration from a partnering ADO contributor.
Code Block |
---|
language | json |
---|
title | Partnering ADO Contributor Resource-based Policy |
---|
collapse | true |
---|
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GrantS3BucketAccessToCDR",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject*",
        "s3:List*",
        "s3:SelectObjectContent",
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
      ],
      "Resource": [
        "arn:aws:s3:::AccountABucketName/*",
        "arn:aws:s3:::AccountABucketName",
        "arn:aws:kms:Region:AccountA:key/AccountAKMSKey"
      ]
    }
  ]
} |
Info |
---|
|
A resource-based policy must be granted before external tables can be created in the CDR Hive metastore. |
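Once cross-account access is in place, the contributed Parquet files are typically registered as an external table over their S3 location. A minimal sketch that only builds the Hive DDL; the schema, table, columns, and S3 location are hypothetical, and in practice the statement would be run through spark.sql or a Hive client:

```python
def build_external_table_ddl(schema: str, table: str,
                             columns: dict, s3_location: str) -> str:
    """Build a Hive CREATE EXTERNAL TABLE statement over Parquet files in S3."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {schema}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )

# Hypothetical BYOD schema and table over a versioned S3 prefix
ddl = build_external_table_ddl(
    "byod_example",
    "claims_extract",
    {"id": "DECIMAL(38,0)", "name": "STRING"},
    "s3://ado-bucket/schema_name/table_name/v1/",
)
print(ddl)
```

Because the table is external, dropping it removes only the metastore entry; the contributor's Parquet files in S3 are left untouched.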
Tabs Page |
---|
| Connecting to the CDR
Data Contributors may contribute data to the CDR either using the suite of tools that they gain access to through onboarding, or through their own external tools. See the following articles for details on making inbound and outbound connections to or from the CDR. |
Tabs Page |
---|
| Validating Contributed Data
A contributor's service principal is for automated, system-to-system connections only. Validation should be done through user accounts associated with the designated validation group; a contributor's group will be automatically provisioned validation access. Ensure that any users performing validation request the HCQIS Access Roles and Profile (HARP) role for the appropriate group so that they can access the data in the CDR. |
|
|
Horizontal Navigation Bar Page |
---|
|
Tabs Container |
---|
|
Tabs Page |
---|
| Changes to Data or Data Documentation
Data Contributors are expected to notify the CDR team of any changes to their data or data documentation. Submit a request through the CCSQ Data & Analytics Request Form in advance of any changes, outlining the expected changes and attaching any new documentation. |
Tabs Page |
---|
| Reporting Issues
Data Contributors are responsible for informing the Data & Analytics team when there is an issue with their data. Notify the CDR team within one business day of identifying an issue with the data. The following information will be needed:
- Impacted Data
- Issue Description
- Current Actions Taken to Resolve
- Planned Resolution Date
- Data Owner Technical Point of Contact
Note that contributors are also responsible for sending follow-ups and updates via email after the issue is identified. |
Tabs Page |
---|
| User Support
End users of the CDR may have questions specific to contributed data that require support from a data subject matter expert (SME). Data Contributors and Data Source Owners are expected to support the end users of their data with any questions or concerns that cannot be answered through the provided documentation or the CDR support team. In such cases, end user inquiries will be routed to one of the following, to be addressed by the data source SME:
- A CCSQ ServiceNow Support Group name
- A support email address
|
|
|
|