
CDR Contributor Resources



CDR Data Contribution Standard Operating Procedure (SOP)

This SOP provides an overview of the Centralized Data Repository (CDR) Contribution process for contributors. Please explore the tabs below for information regarding Data Issue Reporting as well as the CDR Data Catalog Documentation Requirements for data contributors.


Horizontal Navigation Bar
idCDR Contributor Resources
Horizontal Navigation Bar Page
titleDocumentation Requirements
Panel
borderWidth1 px

This section provides information regarding CDR documentation requirements for CDR Contributors. Please explore this tab for an overview of the CDR Data Catalog, the minimum requested documentation from Data Contributors, an example of approved data dictionary documentation, and expectations for data contributors.

Overview of CDR Data Catalog

The CDR Data Catalog provides a location for data contributors to make documentation available to CDR users. The DAMOD team provides a general layout document for each schema that includes table names, column names, and data types for every source in Hive. This document (located under the CDR Table Layouts column) is produced when the data is made available in the CDR or when data definitions are updated. Data contributors to the CDR are responsible for providing the supplemental documentation necessary to support end users of their data. Examples of such documentation include:

  • Data Dictionaries

  • Data Models

  • User Guides

  • Training Documents

Image: CDR Data Catalog table headers

Minimum Requested Documentation from Data Contributors 

Available documentation on the CDR Data Catalog will vary from source to source. Some mature data sources will have excellent existing documentation/artifacts, while newer sources may provide less. At a minimum, the DAMOD team recommends that data contributors provide the following documents:

Data Dictionary

Data dictionaries may be provided in various formats, including Excel, PDF, and Word. We recommend that all documents provided be 508 compliant. We strongly recommend clear and detailed documentation for columns that contain discrete, non-inferable, or computed values. A data dictionary should include:

  • Table Name

  • Column Name

  • Column Description

  • Where relevant, domains of values, either listed in their entirety or as a link to a reference document

  • Data Types

  • Examples

  • Other relevant information

  • Null Option

Data Model

Data Models provide the keys for each table. This gives users a mechanism for understanding how to join data across tables and for identifying unique records (if applicable to your source). A data model should identify:

  • Surrogate Keys

  • Foreign Keys

  • Primary Keys

  • Indexes

Example Approved Data Dictionary

The example below meets all criteria, including:

  • Table Name

  • Column Name

  • Column Description (Comment)

  • Code values for non-inferable data

  • Null Option

  • Data Type

Image: Example approved data dictionary

Expectations 

  1. Data Contributors provide updated documentation when source data changes
  2. Data Contributors provide consistent versioning on documents 
  3. Data Contributors provide accurate documentation for users
  4. Data Contributors support questions about the data and documentation
Horizontal Navigation Bar Page
titleData Issue Reporting
Panel
borderWidth1 px

As a data contributor, you are responsible for informing the Data & Analytics team when there is an issue with your data. Please submit the form below within one business day of identifying an issue with your data. Note that the following information will be needed to complete this form:

  1. Impacted Data 
  2. Issue Description
  3. Current Actions Taken to Resolve 
  4. Planned Resolution Date
  5. Data Owner Technical Point of Contact

Submit this form to the Data & Analytics team by emailing it to HCQIS_Data@hcqis.org. Note that you are also responsible for sending follow-ups and updates via email after this form is submitted.

Image: CDR data issue reporting form

Horizontal Navigation Bar Page
titleContributor BYOD Guide

Background

The Centralized Data Repository (CDR) provides access to CCSQ data, including claims, provider, beneficiary, and other data, within a secure HCQIS cloud environment. The CDR increases the accessibility, security, quality, and timeliness of data. The goal of the CDR is to make data available from source systems with fewer transformations and better data quality. The result is a reduction in data duplication and data conflict, since all CCSQ/HCQIS users use the same data from the same source. The goal of Bring Your Own Data (BYOD) is to allow CDR data contributors to make data available directly to users with less copying of data.

Data Onboarding Overview

This section describes how to onboard new datasets into the CDR and share them with other organizations. To initiate a request, please submit a CCSQ Data and Analytics Request Form.


Tabs Container
directionhorizontal
Tabs Page
titleSource Parquet Files

How to Create Source Parquet Files in S3

This guidance is intended for data contributors that plan to make data available in the CDR for all organizations with an approved data usage agreement (DUA). These sources and their approved documentation will be posted to the CDR Data Catalog.

Background

The CDR provides access to CCSQ data including claims, provider, beneficiary, and other data within a secure cloud environment. The goal of the CDR is to make data available from source systems with greater accessibility, security, quality, and timeliness of data. The goal of Contribution is to allow CDR Data Contributors to make data available directly to CDR users with less data duplication.

Data may be stored in many different database formats (Postgres, Redshift, Aurora, etc.) depending on the partnering organization. To make this data available in the CDR (Hive), the data must be in an approved format that Hive can read.


Horizontal Navigation Bar
idLakeHouse Data Migration
titleLakeHouse Data Migration


Horizontal Navigation Bar Page
titleGetting Started


Tabs Container
directionvertical


Tabs Page
titleRequirements

Request Form

Submit a request through the CCSQ Data & Analytics Request Form to start the contribution process. The request will be reviewed and prioritized by the CMS Business Owners of the CDR. Use the following as a guide for what information to include in the request.

Required Information

The following information is required to be included in the request to become a contributor:

  • A short description of the data with the minimum required information.
  • The data dictionary with the minimum required information.
  • The point of contact for the contribution effort.
  • The anticipated timeline to load, validate, and release data to end users.
  • The list of any validation groups that will require access to the pre-prod data for testing.
  • The Enterprise Privacy Policy Engine (EPPE) DUA Entry associated with this data.
  • The Access Control Model

For support purposes, the CDR team will need one of the following so that user inquiries specific to the data can be forwarded:

  • A CCSQ ServiceNow Support Group name
  • A support email address

The team will also need to know:

  • The access control model for this data (see below).
  • Whether a data catalog entry can be made for this data on the public data catalog or if access to the data documentation should be restricted.

Required Data Documentation

All documentation submitted for contributed data must meet the minimum required standards as outlined below.

Short Description

The short description of the data must include the following:

  • Title - The name of the dataset or resource.
  • Domain - The category of the resource (e.g., Clinical, Provider, Claims, Beneficiary).
  • Source - The system or activity that generates or provides the dataset.
  • Granularity - The level of data detail that the resource covers (e.g., Atomic, Aggregated, Hybrid).
  • Load/Refresh Frequency - How often this data source will be refreshed/updated.
  • Geographic Scope - The level of geographic scope this data entails (e.g., national, regional, state, county, zip).
  • Timespan Covered - The timespan this data covers (e.g., 2019-2022).
  • Personally Identifiable Information (PII)/Protected Health Information (PHI)/Sensitive Information - Whether or not this data contains PII/PHI/other sensitive information.

Data Dictionary

At a minimum, documentation for the data contributed to the CDR should be in the form of a data dictionary that includes:

  • Table name
  • Table description
  • Field Name (Short Name)
  • Field Name (Long Name)
  • Data Type/Length
  • Field Description (provide details)
  • Possible Values/Range
  • Partition Key
  • Comments

Data Documentation Format

Submitted data documentation should be in a comma-separated values (CSV) format that meets the required data dictionary template.
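
As a minimal sketch of the expected shape (the header names follow the data dictionary fields listed above; the row values and file name are hypothetical, not a prescribed template), a dictionary CSV could be generated as follows:

Code Block
languagepy
firstline1
titleExample: Generating a Data Dictionary CSV (hypothetical values)
collapsetrue
import csv

# Headers follow the minimum data dictionary fields listed above.
headers = [
    "Table Name", "Table Description", "Field Name (Short Name)",
    "Field Name (Long Name)", "Data Type/Length", "Field Description",
    "Possible Values/Range", "Partition Key", "Comments",
]

# A single hypothetical row, for illustration only.
row = {
    "Table Name": "my_table",
    "Table Description": "Example table description",
    "Field Name (Short Name)": "bene_id",
    "Field Name (Long Name)": "Beneficiary Identifier",
    "Data Type/Length": "STRING(15)",
    "Field Description": "Unique identifier for a beneficiary",
    "Possible Values/Range": "N/A",
    "Partition Key": "N",
    "Comments": "",
}

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=headers)
    writer.writeheader()
    writer.writerow(row)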

Required Data Format

The preferred data format for partnering systems is parquet due to its superior performance and lower storage cost. However, CSV or JavaScript Object Notation (JSON) may also be used.

This data should be stored in the partnering Application Development Organization's (ADO's) Simple Storage Service (S3) bucket.

 

It is also recommended to create directory versioning so that, if parquet files need to be re-created, they can be written to a different version directory without impacting the current parquet files.

The example below provides code showing how data may be extracted from a Postgres database and written to parquet files in S3.

We recommend following Apache best practices, with parquet row groups between 512 MB and 1 GB in size.

Code Block
languagepy
firstline1
titleWrite Data to Parquet Files
collapsetrue
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the source table from Postgres over JDBC, specifying DataFrame column
# data types on read via customSchema
jdbcDF3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("customSchema", "id DECIMAL(38, 0), name STRING") \
    .load()

# Set the DataFrame output path to the targeted data bucket and save as a Spark table
jdbcDF3.write.option("path", "/data/home/schema_name/table_name/").saveAsTable("table_name")

# Export the data to a versioned path in the targeted data bucket as parquet
# (placeholder bucket/path); parquet.block.size (in bytes) controls row group size
jdbcDF3.write.format("parquet") \
    .option("parquet.block.size", 536870912) \
    .save("s3a://myBucket/schema_name/table_name/v1/")
Info
titleCaution
This code is intended for example/demonstration purposes only. There may be native functionality available with your data source (e.g., Redshift UNLOAD) that allows contributors to create parquet files. There are multiple ways to create parquet files.

  • Preferred Data Format: Parquet
  • Accepted Data Formats: CSV, JSON

Access Control Model

The access control model refers to how contributors would like to restrict or grant access to the data that they are contributing. The CDR supports two access control models:

  • Automated (DUA Based)
    • Access to this data will automatically be granted to any end user organization that has the appropriate DUA entry on their DUA. Contributors will provide the DUA entry that corresponds to the contributed data.
  • Manual (Notification Based)
    • In addition to the DUA entry, access to this data is manually approved by the Business Owners of the data. End user organizations request access to the data, and we notify the Business Owners for either approval or rejection.

Contributors must indicate which access model they would like to utilize for their data.


Tabs Page
titleCDR Onboarding

CDR Onboarding Steps

Onboarding into the CDR is central to becoming a Data Contributor. If the prospective contributor is not already onboarded, see the training and onboarding page for more information.

Being onboarded into the CDR comes with the following automatically provisioned resources for the contributor's organization/group:

  • A production database/schema
  • A pre-production (test) database/schema
  • A Service Principal/Service Account

Once onboarded, the team will generate a schema specific to the contributor's BYOD data in accordance with CMS Data Taxonomy.


Tabs Page
titleWorkflow

Contribution Workflow

The following is a generalized workflow that outlines the steps involved in contributing to the CDR. 


Phase | Step | POC
Phase 1 | Request Submitted with Required Information | Contributor
Phase 1 | Onboarding Initiated (if not already onboarded) | Contributor
Phase 1 | Service Principal Automatically Generated | Automated upon onboarding
Phase 2 | Bring Your Own Data (BYOD) PROD and Pre-PROD Schemas Created | CDR
Phase 2 | Elevated Service Principal Granted r/w/x Access | CDR
Phase 2 | Data Documentation Reviewed | CDR
Phase 2 | Data Catalog Entry Created (if applicable) | CDR
Phase 3 (optional) | Data Published to Pre-PROD Schema | Contributor
Phase 3 (optional) | Green Light for Validation Groups | Contributor
Phase 3 (optional) | Validation Groups Granted r/x Access | CDR
Phase 3 (optional) | Validation Period Conducted | Contributor
Phase 4 | Data Published to PROD BYOD Schema | Contributor
Phase 4 | Green Light for End User Access | Contributor
Phase 4 | End Users Granted r/x Access per Access Model | CDR
Phase 4 | End User Communication Released | CDR





Horizontal Navigation Bar Page
titleContributing Data


Tabs Container
directionvertical


Tabs Page
titleData Access

Cross Account Access

A contributor's data (parquet files) is stored in their own Simple Storage Service (S3) bucket, so cross-account access must be established to allow the CDR the necessary access to the S3 bucket. The CDR has adapted the resource-based policies and Amazon Web Services (AWS) Identity and Access Management (IAM) policies method for accessing a cross-account S3 bucket, as documented on AWS Support. In this case, the partnering contributor is "Account A" and the CDR is "Account B". Data encryption is highly recommended, either with standard AWS server-side encryption or with custom key stores.

Below is an example of a resource-based policy with a custom key stores configuration from a partnering contributor (the account IDs, bucket name, and key name are placeholders).


Code Block
languagejson
titlePartnering Contributor Resource-based Policy
collapsetrue
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "GrantS3BucketAccessToCDR",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::CDRAccountBId:root"
            },
            "Action": [
                "s3:GetObject*",
                "s3:List*",
                "s3:SelectObjectContent",
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": [
                "arn:aws:s3:::AccountABucketName/*",
                "arn:aws:s3:::AccountABucketName",
                "arn:aws:kms:region:AccountAId:key/AccountAKMSKey"
            ]
        }
    ]
}

Note
titleCaution
A resource-based policy must be granted before external tables can be created in the CDR Hive metastore.
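
Note that the referenced AWS cross-account method also involves a corresponding IAM identity policy on the CDR ("Account B") side; the example above covers only the contributor's ("Account A") resource-based grants.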



Tabs Page
titleHive Service Account & Databases

How to Request Creation of Hive Service Account and Databases in the CDR

For the partnering ADO to register tables in the CDR, a service account and a source Hive database must be created. All database names follow a standard format based on the CMS Data Reference Model: (Taxonomy)_(Line of Business)_(Dataset Name). We rely on data contributors for input on database names; however, all data sources should follow this same standardized format. The Hive service account will have read/write permissions to the databases specified in the request. As an initial part of the integration, submit a request for the creation of the service account and the necessary Hive databases.
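
For illustration only, a hypothetical dataset under this convention might be named something like clinical_esrd_labresults; confirm the actual taxonomy and line-of-business values with the DAMOD team before requesting database creation.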

CDR Connections

Connecting to the CDR

Data Contributors may contribute data to the CDR using either our suite of tools, which they gain access to through onboarding, or their own external tooling.

See the following articles for details on making inbound and outbound connections to or from the CDR.


Tabs Page
titleValidation

Validating Contributed Data

A contributor's service principal is for automated, system-to-system connections only. Validation should be done through user accounts associated with the designated validation group; a contributor's group will be automatically provisioned validation access.

Ensure that any users performing validation request the HCQIS Access Roles and Profile (HARP) role for the appropriate group so that they can access the data in the CDR.




Horizontal Navigation Bar Page
titlePost Contribution


Tabs Container
directionvertical


Tabs Page
titleData Changes

Changes to Data or Data Documentation

Data Contributors are expected to notify the CDR team of any changes to their data or their data documentation. Submit a request through the CCSQ Data & Analytics Request Form in advance of any changes, outlining the expected changes and attaching any new documentation.


Tabs Page
titleData Issues

Reporting Issues

Data Contributors are responsible for informing the Data & Analytics team when there is an issue with their data. Notify the CDR team within one business day of identifying an issue with the data. The following information will be needed:

  • Impacted Data 
  • Issue Description
  • Current Actions Taken to Resolve 
  • Planned Resolution Date
  • Data Owner Technical Point of Contact

Note that contributors are also responsible for sending follow-ups and updates via email after the issue is identified.


Tabs Page
titleUser Support

User Support

End users of the CDR may have questions specific to the contributed data that require support from a data subject matter expert (SME). Data Contributors and Data Source Owners are expected to support the end users of their data with any questions or concerns that cannot be answered through the provided documentation or by the CDR support team. In such cases, end user inquiries will be routed to one of the following, to be addressed by the data source SME:

  • A CCSQ ServiceNow Support Group name
  • A support email address








    Submit a ServiceNow Request to ADO-CDR-Support for the creation of a Hive Service Account.

    Info
    titleCaution

    Once data is made available in Production and access is granted to users, data contributors can still make changes to data definitions. Any changes, such as altering tables, adding new tables, or dropping tables, will impact users.

    Tabs Page
    titleRegister Tables in CDR

    How to Register Tables in the CDR (Hive)

    Once a service account and the necessary Hive databases are created, data contributors will be able to register tables in the CDR. Partnering organizations can make a JDBC connection to the CDR (Hive) and execute queries from their service account. Certain commands should not be run in production during working hours due to the possibility of user impact; please see below for the authorized times for command execution. Comments are recommended when creating tables/columns but are not required. Depending on the individual AWS account's location, VPC peering may be necessary; for certain accounts, a transit gateway is already created.
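
    As an illustrative sketch only (the host, port, database name, and choice of the PyHive client library below are assumptions, not the CDR's documented endpoint or required tooling), a connection from Python might look like this:

    Code Block
    languagepy
    firstline1
    titleExample: Executing Commands over a Hive Connection (placeholder values)
    collapsetrue
    from pyhive import hive

    # Placeholder connection details; use the CDR-provided endpoint and your
    # Hive service account credentials. Authentication mechanisms vary by environment.
    conn = hive.Connection(
        host="cdr-hive.example.com",      # hypothetical host
        port=10000,
        username="my_service_account",    # hypothetical service account
        database="taxonomy_lob_dataset",  # hypothetical database name
    )

    cursor = conn.cursor()
    cursor.execute("SHOW TABLES")  # any of the authorized commands below could be run here
    print(cursor.fetchall())

    cursor.close()
    conn.close()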

    Example Commands

    The authorized time to run each command (EST) is noted after each example below.

    Code Block
    languagesql
    firstline1
    titleCreate Table
    collapsetrue
    Create Table:
    
    --DROP TABLE IF EXISTS <Table_name>;
    CREATE EXTERNAL TABLE <Table_name> 
    ( 
    Column_1 datatype Comment 'name string',
    Column_2 datatype Comment 'age int',
    column_3 datatype Comment '',
    column_n datatype Comment ''
    )
    COMMENT 'Table Description'
    PARTITIONED BY (partition_column data_type, month int, day int)
    STORED AS PARQUET 
    LOCATION 's3a://myBucket/myParquet/';

    Reference URL: https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#specifying-storage-format-for-hive-tables

    Authorized time to run (EST): Anytime

    Code Block
    languagesql
    firstline1
    titleAlter Table
    collapsetrue
    Alter Table:
    
    1. ALTER TABLE table_name RENAME TO new_table_name;
    
    2. ALTER TABLE table_name
    [PARTITION partition_spec]
    
    3. ALTER TABLE table_name SET TBLPROPERTIES table_properties;
    table_properties:
    : (property_name = property_value, property_name = property_value, ... );

    Reference URL:  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL

    Authorized time to run (EST): 0000 - 0600

    Code Block
    languagesql
    firstline1
    titleMSCK Repair
    collapsetrue
    MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

    Authorized time to run (EST): 0000 - 0600

    Info
    titlePartitioning Guidance
    We recommend that all CDR contributors evaluate partitioning attribute(s) and assess performance before submitting a request to grant data access to users. CDR contributors should investigate use cases for their data and determine which partitioning scheme is best, based upon commonly used filters and expected query patterns. Data that has too many partitions can result in parquet file sizes that are too small, which will greatly impact query performance even on small tables. Generally, it is also recommended that the attributes chosen for partitioning have low cardinality. For example, data may be partitioned by a date (e.g., MM/YYYY) attribute if users commonly query for data within a certain date range; see the sketch below.
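
    As a rough illustration of this guidance (the bucket paths and the claim_date column are hypothetical), the PySpark sketch below derives low-cardinality year/month attributes from a date column and partitions on those rather than on the raw date:

    Code Block
    languagepy
    firstline1
    titleExample: Writing Date-Partitioned Parquet (hypothetical paths and columns)
    collapsetrue
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source data with a claim_date column.
    df = spark.read.parquet("s3a://myBucket/myParquet/v1/")

    # Derive low-cardinality partition attributes from the date column.
    df = (df
          .withColumn("year", F.year("claim_date"))
          .withColumn("month", F.month("claim_date")))

    # Partitioning on year/month keeps the partition count small; partitioning
    # on a raw daily date or another high-cardinality attribute would produce
    # many small parquet files and degrade query performance.
    (df.write
       .mode("overwrite")
       .partitionBy("year", "month")
       .parquet("s3a://myBucket/myParquet/v2/"))

    Queries that filter on the partition columns (for example, WHERE year = 2022 AND month = 6) can then prune partitions instead of scanning the full dataset.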
    Info
    titleNote

    The LOCATION statement must contain "s3a" in order to specify the proper file system protocol. To read more about "s3a", please see the Apache Hadoop documentation.

    Tabs Page
    titleGrant Access to Data

    How to Grant Access to Data for CDR Users

    Once tables are registered in production, the partnering ADO should validate that the data is displaying as expected. Once the partnering ADO is prepared to make the data available to CDR users, a ServiceNow request should be submitted to grant access to users. Access to the data is controlled based on the DUA. If there is a specific list of organizations that should be granted access, the partnering ADO can specify this list in the request; however, the organizations must have a valid DUA.

    1. Submit a ServiceNow Request to ADO-CDR-Support to grant access to users for specified hive database(s)
    2. Submit all required CDR Contributor Source Documentation in the request so it can be uploaded to the CDR Data Catalog
    Tabs Page
    titleChanges to Data Definition Language

    How to Notify DAMOD Team of Changes to Data Definition Language

    Once users are accessing your data in the CDR, any changes to data definitions will impact their analysis. It is important for data contributors to ensure that DDL is not updated without proper communication to the DAMOD team. Since data contributors have full read/write access via their Hive service account, they have the ability to make updates at any time; however, we request that contributors follow the documented processes. For any changes that impact existing user code/processes, we request 10 business days' notification. For any other changes, such as new columns or new tables, we request 2 business days' notification. The DAMOD team recommends the following guidance for communications:

    Data Definition Change | Notification Process | Notification Timeframe
    Drop Tables, Drop Columns, Partition Key Changes, Alter Column Names, Alter Data Type | Email notification to CDR data-loads <data-loads@cvpcorp.com> | At least 10 Business Days
    New Columns, New Tables | Email notification to CDR data-loads <data-loads@cvpcorp.com> | At least 2 Business Days
    All Data Definition Updates | Email notification to CDR data-loads <data-loads@cvpcorp.com> | Upon Code Execution in PROD
    Info
    titleNote
    Please include updates to applicable source documentation (e.g., data dictionary) when changes occur. The updated documentation will be posted to the CDR Data Catalog.
    Tabs Page
    titleData Refreshes

    How to Notify DAMOD Team of Data Refreshes

    All data refresh dates are tracked on the CDR Data Catalog.  It is essential for data contributors to establish a process to communicate when data is refreshed.  Once a notification occurs, the data modernization team updates the CDR Data Catalog to reflect the availability of the refreshed data. 

    1. Email notification to CDR data-loads <data-loads@cvpcorp.com> when data is refreshed and when the next data refresh is scheduled to occur (if not on a recurrent schedule)
    2. AWS Simple Notification Service (SNS) can also be set up to automate the notification into DAMOD Slack or email; see the sketch below.
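
    For instance (the topic ARN and message text below are placeholders; the actual topic and its Slack/email subscriptions would be coordinated with the DAMOD team), a refresh job could publish a notification via boto3:

    Code Block
    languagepy
    firstline1
    titleExample: Publishing a Refresh Notification to SNS (placeholder ARN)
    collapsetrue
    import boto3

    # Placeholder topic ARN; replace with the topic agreed upon with DAMOD.
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:cdr-data-refresh"

    sns = boto3.client("sns")

    # Publish a short notice when the refresh completes; subscriptions on the
    # topic can fan this out to Slack or email.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject="CDR data refresh complete",
        Message="schema_name.table_name refreshed; next refresh per recurring schedule.",
    )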