
CCSQ D&A Townhall
Date: February 24, 2022 at 1:00pm ET

Agenda

  • General Announcements
    • User Collaboration Wiki - Demo
    • CDR HDFS Migration
  • System Updates
    • File Lock Issues
    • Lagging Issues
    • GitHub Issue
    • Maintenance Schedule
  • Data Source and Usage Updates
  • Best Practices Demo
    • Partition Pruning
    • Optimizing Joins
    • Breaking up Large Queries
  • Q&A  
General Announcements 
User Collaboration Wiki - New! 

What is the User Collaboration Wiki?

  • A collaboration forum through which you can submit your solutions and best practices for others within the CAP & CDR user community.

How can you use the User Collaboration Wiki?

  • Submit best practices or how-to articles
  • Read through other user submissions, providing feedback, comments, or likes
  • Leverage solutions from other users to enhance your analytic work
  • Articles that have been reviewed and approved by D&A SMEs will be posted out on the Knowledge Base

Important Notes

  • The forum is currently available to those who have HARP Confluence/Atlassian licenses. Either request a license via HARP, or work with your COR to see if an Atlassian license can be granted for your organization.
  • After the demo, we will open the mic for any feedback or suggestions. This is an MVP release, so we are open to your feedback to ensure this space is designed for your benefit.
  • Questions from users should continue to be posted to the #ccsq_data_analytics Slack channel
CDR HDFS Migration 
  • Data & Analytics Team will start reaching out to CDR users to migrate legacy Hive data sets from HDFS to S3 in phases
  • After migration, Data & Analytics Team will provide guidance and documentation on storing future data sets in S3 instead
  • Benefits:
    • Frees up disk space within CDR Cluster
    • S3 storage provides unlimited capacity, better persistence, and lifecycle capabilities
    • Supports future transformations of CDR Architecture to EMR
  • Dependency: Data & Analytics will depend on teams for coordination and validation of migrated data sets
System Updates
SAS Viya Issues
File Lock Issues
  • Issue Description: SAS Viya users have reported file lock errors when reading data from their workbenches during SAS job runs
  • What to Expect: The team has found a solution, which requires re-mounting storage gateway workbenches. To mitigate the impact on production workflows, the D&A team is coordinating with each affected organization to re-mount their workbenches at the most convenient time
  • Affected Communities: Select SAS Viya Users with affected workbenches
  • Call to Action: If you are facing this issue, please open a ServiceNow ticket for our team to remount your workbench
Lagging Issues
  • SAS Viya users have reported intermittent lagging and freezing while utilizing the application.
  • Due to the issue's intermittent nature, some users have reported that their issue has gone away by the time that a help desk ticket has been created.
  • In order to capture the details for the D&A Team to investigate, please follow these instructions found within the Known Issues Log
  • After following these instructions, please provide the information via a helpdesk ticket
GitHub Issue
  • Issue Description: SAS Viya users have reported issues accessing data within Hive due to issues with two of the environment’s worker nodes.
  • What to Expect: The team continues to work with the SAS Vendor to obtain a patch that will resolve the issue. More updates will be shared once more information about the patch is available.
  • Affected Communities: SAS Viya Users
  • Call to Action: A new workaround has been published on this issue. Please follow these instructions found within the Known Issues Log
Maintenance Schedule
  • Updated dates for scheduled CAP & CDR Maintenance events: 
    • March 4
    • April 1
    • May 6
    • June 3
  • All events will begin at 8:00pm ET and end at approximately 11:00pm ET. A communication will be sent out once maintenance is complete. As a reminder, whenever there is maintenance on the environment, make sure all of your code and table changes are saved beforehand.

Data Source and Usage Updates
Known Issues
  • BIC MBI Sequence - 8/25/2021
    • Ongoing – TBD
  • QMARS Appeals - 6/18/2021
    • Ongoing – TBD
    • Consider using healthcare_service_qmars_ng if your DUA supports it
  • QMARS - 1/12/2021
    • Ongoing - TBD
    • Consider using healthcare_service_qmars_ng if your DUA supports it
Best Practices Demo
Learning Objectives:
  • Partition Pruning
  • Optimizing Joins
  • Breaking Up Large Queries 
Partition Pruning
  • Partition pruning is the mechanism by which a query skips reading data files that belong to partitions excluded by the query's filters
    • Critical for reducing the amount of data scanned by a Hive query
  • Reference the recording for an example
  • Best Practice: Always add lower and upper bounds for partitions in query criteria
  • Hint: You can check whether you’re applying proper partition pruning by viewing the execution plan of your Hive query with %hive_explain(<sql>)
  • When viewing the execution plan, there are a few things to watch out for: 
    • Check that your query's filters actually use the partition keys
    • A large number of rows scanned, which suggests pruning is not taking effect
  • Reference the PowerPoint for more examples 
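As a rough illustration of the best practice above (table and partition column names here are hypothetical, not actual CDR tables):

```sql
-- Suppose a claims table is partitioned by clm_year and clm_month.
-- Without bounds on the partition keys, Hive scans every partition:
SELECT clm_id, clm_pmt_amt
FROM claims
WHERE clm_from_dt >= '2021-01-01';

-- With lower and upper bounds on the partition keys in the WHERE
-- clause, only the matching partitions are read:
SELECT clm_id, clm_pmt_amt
FROM claims
WHERE clm_year = 2021
  AND clm_month BETWEEN 1 AND 6
  AND clm_from_dt >= '2021-01-01';
```

Running %hive_explain(<sql>) on each version should show the second query scanning far fewer partitions and rows.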
Optimizing Joins 
  • Optimizing JOINs in Hive queries can significantly reduce cluster resource utilization and improve query performance
  • Strategies include:
    • Always include JOIN conditions
    • Choosing the proper JOIN type
  • INNER JOIN: Returns only the rows from both tables where the join condition is satisfied.
  • LEFT JOIN: Returns all rows from the table on the left side of the join, plus the matching rows from the table on the right side.
  • RIGHT JOIN: Similar to LEFT JOIN, but returns all rows from the table on the right side of the join, plus the matching rows from the table on the left side.
  • Best Practice: Use LEFT JOIN and RIGHT JOIN deliberately

  • Reference the recording for examples 
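A short sketch of the two strategies above, using hypothetical table names (not actual CDR tables):

```sql
-- A JOIN written without a condition produces a cross product,
-- which can explode row counts and cluster usage. Avoid:
--   SELECT * FROM claims c JOIN providers p;

-- Always include the join condition, and choose the join type that
-- matches the question being asked:
SELECT c.clm_id, p.prvdr_name
FROM claims c
INNER JOIN providers p           -- only claims with a matching provider
  ON c.prvdr_id = p.prvdr_id;

SELECT c.clm_id, p.prvdr_name
FROM claims c
LEFT JOIN providers p            -- keep every claim, matched or not
  ON c.prvdr_id = p.prvdr_id;
```

Using LEFT JOIN where an INNER JOIN would do keeps unmatched rows you may not need, so pick the join type deliberately rather than by habit.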
Breaking Up Large Queries
  • Break up queries spanning large time periods into smaller time windows and merge them later
  • Benefits:
    • Reduces the total amount of data scanned in single query
    • Reduces use of cluster resources in single query
    • De-risks your work and allows you to restart where jobs may have stopped
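A minimal sketch of this pattern, assuming a hypothetical claims table partitioned by year and month:

```sql
-- Instead of one query over a full year, process one quarter at a
-- time and merge the results into a target table:
INSERT OVERWRITE TABLE yearly_summary PARTITION (qtr = 1)
SELECT clm_id, SUM(clm_pmt_amt) AS total_pmt
FROM claims
WHERE clm_year = 2021 AND clm_month BETWEEN 1 AND 3
GROUP BY clm_id;

-- Repeat for qtr = 2, 3, and 4. If one window fails, rerun only
-- that window instead of restarting the whole year's query.
```

Each smaller query scans less data and holds cluster resources for a shorter time, and a failure mid-way costs only one window of work.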
Q&A
  1. Would there be a performance improvement moving to S3, or is this being done more for redundancy?
    A - This is being done to accomplish both of those goals. We want to avoid data loss as well as decouple the system so it can scale appropriately. 
  2. Not related to today’s call, but I have a question on Part B data: I would like to pull data about PCP visits. Which table has all the information about PCP visits from Part B? And going forward, how can we get help identifying the right tables?
    A - Please open a help desk ticket for this question.
  3. Is there a feature to "search among files"? Or is it just the search function under Confluence?
    A - There is a search feature in the upper right corner that can search the Confluence pages and attachments. We are also looking for ways to enhance the search capabilities in the space. 
  4. Will there be a vetting process? in other words, will articles be posted immediately or will they go through some sort of QA process?
    A - Technical SMEs will be looking through submissions. Also, articles that are found to be helpful for numerous groups will be posted out on the Knowledge Base. 
  5. Is the search for keyword in the title only?
    A - It can search the title, content, and attachments. 
  6. Not sure how to get to the wiki, though?
    A - Access the page by following this link: User Collaboration Wiki
  7. Can issues be listed anonymously?
    A - We are currently investigating to see if submissions can be anonymous. 
  8. Will the admin flag entries that are similar to each other?
    A - We are looking for ways to link articles and increase ease of navigation. 
  9. Are there search capabilities across files?
    How will this be organized as the volume of files increase?
    A - A linking system will be used to enhance navigation and searchability. 
  10. Is it better to 1) limit a dataset and then join in a separate step or 2) join and limit in one step?
    A - Limit the data in subqueries when you can. After this, then join it with the other tables. 
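The answer above can be sketched as follows (hypothetical table and column names):

```sql
-- Limit the data in a subquery first, then join the smaller result
-- with the other table, rather than joining full tables and
-- filtering afterward:
SELECT f.clm_id, p.prvdr_name
FROM (
  SELECT clm_id, prvdr_id
  FROM claims
  WHERE clm_year = 2021 AND clm_month = 6   -- limit before joining
) f
JOIN providers p
  ON f.prvdr_id = p.prvdr_id;
```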
  11. We use two global macros to help manage our Hive project database: %list_tables() and %usage_hive(). We have noticed that %usage_hive() is significantly lagged (tables dropped today will STILL appear in the listing 24-48 hours later), but it includes the total size of our DB and the size of each table. %list_tables() is very current (shows dropped tables immediately), but it doesn't include the total DB size or the size of the tables. Is there any way that these two macros could be combined, or that %usage_hive() could be made more real-time?
    A - Please submit a help desk ticket for this question.