
tarente/FabricDataEngineering


This repository contains PySpark notebooks that provide the functionality described below.

Import the 3 notebooks described below into the same folder of a Fabric Workspace and change the configuration of the notebook NTB 901 Table History.

Read the rest of this README and enjoy.


The notebook NTB 000 Utils:

  • Imports common libraries.
  • Sets some Spark configuration.
  • Defines some commonly used PySpark functions.
  • Defines two functions to print (or attempt to print) secrets, i.e., values that would otherwise be displayed as [REDACTED]:
    • fnGetPrintableSecretLong (strSecret, strSeparator = chr(29), intIndentationLevel = 0): strPossiblyPrintableSecret
      • strSecret: secret to be modified to be printable. Normally, stored in Azure Key Vault and obtained by using notebookutils.credentials.getSecret.
      • strSeparator (optional): the string to be used to separate each character in the strSecret.
      • intIndentationLevel (optional): defines the indentation level of the information printed by the function.
    • fnGetPrintableSecretShort (strSecret, strSeparator = chr(29), intIndentationLevel = 0): strPossiblyPrintableSecret
      • strSecret: secret to be modified to be printable. Normally, stored in Azure Key Vault and obtained by using notebookutils.credentials.getSecret.
      • strSeparator (optional): the string to be used to separate each character in the strSecret.
      • intIndentationLevel (optional): defines the indentation level of the information printed by the function.
    • The main difference between fnGetPrintableSecretLong and fnGetPrintableSecretShort is that the former returns a longer string representation of strSecret, since it appends strSeparator after each character of strSecret, while the latter only adds a single strSeparator between the first and second characters of strSecret.

These utilities are meant to be used by the other notebooks in this repository.

Limitations:

  • The secret needs to have at least two characters to be printable.
  • To prevent security breaches, fnGetPrintableSecretLong and fnGetPrintableSecretShort will only return a printable secret if the global boolDebug = True. Otherwise, a warning message is displayed instructing you to set the global boolDebug = True.
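As a rough illustration of the behavior described above, the two functions can be sketched as follows. This is illustrative only and the actual implementations in NTB 000 Utils may differ; chr(29) is the invisible ASCII group separator, and the boolDebug gate and two-character minimum mirror the limitations listed:

```python
# Illustrative sketch only; the real notebook's implementations may differ.
# chr(29) is the ASCII "group separator", an invisible character that breaks
# up the secret so it is no longer recognized and redacted when printed.

boolDebug = True  # the notebook gates secret printing on this global flag

def fnGetPrintableSecretLong(strSecret, strSeparator=chr(29), intIndentationLevel=0):
    """Insert strSeparator after every character of the secret."""
    if not boolDebug:
        return " " * intIndentationLevel + "Set global boolDebug = True to print secrets."
    if len(strSecret) < 2:
        return " " * intIndentationLevel + "Secret is too short to print."
    return " " * intIndentationLevel + strSeparator.join(strSecret)

def fnGetPrintableSecretShort(strSecret, strSeparator=chr(29), intIndentationLevel=0):
    """Insert strSeparator only between the first and second characters."""
    if not boolDebug:
        return " " * intIndentationLevel + "Set global boolDebug = True to print secrets."
    if len(strSecret) < 2:
        return " " * intIndentationLevel + "Secret is too short to print."
    return " " * intIndentationLevel + strSecret[0] + strSeparator + strSecret[1:]
```

With a visible separator such as "-", "abc" becomes "a-b-c" with the long variant and "a-bc" with the short variant.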

The notebook NTB 002 Table History defines the following PySpark functions:

  • fnGetDeltaTableHistory (strTableName, strColumnName, strGroupByColumnName="", boolversionAsOf=True, intNrVersions=100, intIndentationLevel = 0): dfTableHist
    • strTableName: the name of the table to get the delta history.
    • strColumnName: a single column name to group by and get the number of rows per group in each delta history version of the table.
    • strGroupByColumnName (optional): an additional column name to group by when counting the rows in each delta history version of the table.
    • boolversionAsOf (optional): if True, the delta history will be based on 'version As Of'. Otherwise, it will be based on 'timestamp As Of'.
    • intNrVersions (optional): the number of delta history versions to retrieve. If fewer versions exist in the delta history, only the available number is retrieved.
    • intIndentationLevel (optional): defines the indentation level of the information printed by the function.
  • fnGetDeltaTableHistoryDynamic (strTableName, strColumnName, strListGroupByColumnNames="", boolversionAsOf=True, intNrVersions=100, intIndentationLevel = 0): dfTableHist
    • strTableName: the name of the table to get the delta history.
    • strColumnName: a single column name to group by and get the number of rows per group in each delta history version of the table.
    • strListGroupByColumnNames (optional): an additional list of column names to group by when counting the rows in each delta history version of the table.
    • boolversionAsOf (optional): if True, the delta history will be based on 'version As Of'. Otherwise, it will be based on 'timestamp As Of'.
    • intNrVersions (optional): the number of delta history versions to retrieve. If fewer versions exist in the delta history, only the available number is retrieved.
    • intIndentationLevel (optional): defines the indentation level of the information printed by the function.

The main difference between fnGetDeltaTableHistory and fnGetDeltaTableHistoryDynamic is that the former only accepts one additional optional column, while the latter accepts an optional list of columns to aggregate the data.
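The per-version row counts described above rely on Delta time travel ('VERSION AS OF') queries. A minimal, hypothetical sketch of how such a query could be assembled per version; fnBuildVersionQuery is illustrative and not part of the notebook:

```python
# Hedged sketch of how a per-version row-count query could be built; the
# actual notebook logic may differ. Delta time travel exposes each history
# entry via "VERSION AS OF", so one GROUP BY query per version yields the
# row counts that the functions described above return.

def fnBuildVersionQuery(strTableName, strColumnName, lstGroupByColumnNames, intVersion):
    """Build a SQL statement counting rows per group for one Delta version."""
    # drop empty entries so the optional group-by list can be left blank
    lstCols = [strColumnName] + [c for c in lstGroupByColumnNames if c]
    strCols = ", ".join(lstCols)
    return (
        f"SELECT {intVersion} AS intVersion, {strCols}, COUNT(*) AS intNrRows "
        f"FROM {strTableName} VERSION AS OF {intVersion} "
        f"GROUP BY {strCols}"
    )
```

Running fnBuildVersionQuery("Sales", "Region", ["Year"], 3) produces a single SELECT with VERSION AS OF 3; looping it over the versions returned by DESCRIBE HISTORY and unioning the results gives one DataFrame covering the whole history.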

Limitations:

  • None identified so far.
The notebook NTB 011 get_lakehouse_tables_schema re-defines Semantic Link Labs' get_lakehouse_tables PySpark function (see https://semantic-link-labs.readthedocs.io/en/stable/sempy_labs.lakehouse.html#sempy_labs.lakehouse.get_lakehouse_tables) to work with a schema-enabled Lakehouse.

Instead of using the redefined function, you can create Shortcuts in a new Lakehouse and then call the original Semantic Link Labs function against the Lakehouse containing the Shortcuts.

Limitations:

  • If any table in the Lakehouse has been written both as a partitioned and as a non-partitioned table, the function will fail. This is also true for the original function.

The notebook NTB 221 Check Direct Lake Guardrails is a sample notebook that calls the new version of get_lakehouse_tables and checks the differences between executing it against the schema-enabled Lakehouse and the Lakehouse with Shortcuts. Assuming that both Lakehouses are in the same Workspace, you only need to change the values of the following variables:

  • lakehouse = "Lakehouse Name containing the shortcuts"
  • lakehouseSchema = "Schema enabled Lakehouse Name"
  • workspace = ""

Limitations:

  • If any table in the Lakehouse has been written both as a partitioned and as a non-partitioned table, the Notebook will fail.

The notebook NTB 701 list_activity_events_multiple_days is a sample notebook that calls the new function list_activity_events_multiple_days, which shows a list of audit activity events for a tenant.

This is a wrapper function for the following API: Admin - Get Activity Events (https://learn.microsoft.com/rest/api/power-bi/admin/get-activity-events).

Service Principal Authentication is supported (see https://github.com/microsoft/semantic-link-labs/blob/main/notebooks/Service%20Principal.ipynb for examples).

Parameters

  • start_time : str - Start date and time of the window for audit event results. Example: "2024-09-25T07:55:00".
  • end_time : str - End date and time of the window for audit event results. Example: "2024-09-25T08:55:00".
  • activity_filter : str, default=None - Filter value for activities. Example: 'viewreport'.
  • user_id_filter : str, default=None - Email address of the user.
  • return_dataframe : bool, default=True - If True, the response is a pandas.DataFrame. If False, a dict is returned.

Returns

  • pandas.DataFrame | dict - A pandas dataframe or dict showing a list of audit activity events for a tenant.
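The underlying Get Activity Events API only accepts a window that starts and ends within the same UTC day, so a multiple-days wrapper has to split the requested range into per-day windows and concatenate the per-day results. A hedged sketch of that splitting step; fnSplitIntoDailyWindows is illustrative, not the notebook's actual code:

```python
from datetime import datetime, timedelta

# Sketch only: the Admin - Get Activity Events API restricts each request to
# a single UTC day, so a multi-day range must be cut into daily windows.

def fnSplitIntoDailyWindows(strStartTime, strEndTime):
    """Yield (start, end) ISO timestamp pairs, one per UTC day in the range."""
    dtStart = datetime.fromisoformat(strStartTime)
    dtEnd = datetime.fromisoformat(strEndTime)
    lstWindows = []
    dtCursor = dtStart
    while dtCursor < dtEnd:
        # end of the current day (23:59:59) or the overall end, whichever is first
        dtDayEnd = datetime.combine(dtCursor.date(), datetime.max.time()).replace(microsecond=0)
        dtWindowEnd = min(dtDayEnd, dtEnd)
        lstWindows.append((dtCursor.isoformat(), dtWindowEnd.isoformat()))
        # jump to midnight of the next day
        dtCursor = datetime.combine(dtCursor.date() + timedelta(days=1), datetime.min.time())
    return lstWindows
```

For example, a range from 2024-09-25T07:55:00 to 2024-09-27T08:55:00 is split into three windows, and each window would then be passed to the single-day activity-events call.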

To use this notebook, you need to set the following values in the sixth cell:

Example 1 - get all activity events in your tenant from the last 28 days as a pandas DataFrame:

  • start_day = (datetime.utcnow().date() - timedelta(days=27)).strftime("%Y-%m-%d")
  • num_days = 28
  • inc_days = 1
  • activity_filter = None
  • user_id_filter = None
  • return_dataframe = True

Example 2 - get all activity events in your tenant for the same weekday across 3 consecutive weeks as a dict:

  • start_day = (datetime.utcnow().date() - timedelta(days=27)).strftime("%Y-%m-%d")
  • num_days = 3
  • inc_days = 7
  • activity_filter = None
  • user_id_filter = None
  • return_dataframe = False
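In both examples the three date variables can be read as: start at start_day and query num_days days, stepping inc_days days between them. A small sketch of that expansion (fnExpandDays is illustrative, not the notebook's code):

```python
from datetime import date, timedelta

def fnExpandDays(strStartDay, intNumDays, intIncDays):
    """Return the list of days the notebook would query: intNumDays days
    starting at strStartDay, spaced intIncDays days apart."""
    dtStart = date.fromisoformat(strStartDay)
    return [
        (dtStart + timedelta(days=i * intIncDays)).strftime("%Y-%m-%d")
        for i in range(intNumDays)
    ]
```

With inc_days = 1 and num_days = 28 this yields 28 consecutive days (Example 1); with inc_days = 7 and num_days = 3 it yields the same weekday in three consecutive weeks (Example 2).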

The notebook NTB 801 delete_capacity_tenant_settings_overrides deletes and returns the list of tenant setting overrides applied at the capacity level, after applying the tenant_setting filter.

To use this notebook, you need to set the following values:

  • Fourth cell - Lakehouse configuration:
    • <capacity>: the capacity name or ID. Defaults to None, which resolves to showing/deleting overrides for all capacities.
    • <tenant_setting>: the tenant setting name. Example: "TenantSettingForCapacityDelegatedSwitch". Defaults to None, which resolves to showing/deleting all tenant settings.
    • <dry_run>: whether to show or delete the tenant setting overrides at the capacities. Defaults to True, which resolves to only showing the tenant setting overrides at the capacities.
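The dry-run behavior described by <dry_run> can be sketched as follows; the dicts and fnProcessOverrides are purely illustrative stand-ins for the notebook's admin API calls:

```python
# Illustrative only: the real notebook calls the Fabric/Power BI admin APIs;
# here the overrides are plain dicts so the dry-run pattern itself is visible.

def fnProcessOverrides(lstOverrides, strCapacity=None, strTenantSetting=None, boolDryRun=True):
    """Return the overrides matching the filters; delete them only when
    boolDryRun is False (None filters match everything)."""
    lstMatches = [
        o for o in lstOverrides
        if (strCapacity is None or o["capacity"] == strCapacity)
        and (strTenantSetting is None or o["setting"] == strTenantSetting)
    ]
    for o in lstMatches:
        if boolDryRun:
            print(f"[dry run] would delete override {o['setting']} on {o['capacity']}")
        else:
            print(f"deleting override {o['setting']} on {o['capacity']}")
            # here the notebook would call the delete API for this override
    return lstMatches
```

Keeping <dry_run> = True lets you review exactly which overrides would be removed before running the destructive pass.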

The notebook NTB 901 Table History allows you to dynamically attach the notebook to a Fabric Lakehouse (as described in Programmatically defining the default lakehouse of a notebook, see https://fabric.guru/how-to-attach-a-default-lakehouse-to-a-notebook-in-fabric).

To use this notebook, you need to set the following values:

  • Second cell - Lakehouse configuration:
    • <Lakehouse name>: the name of the Lakehouse where the table to get the history is located.
    • <Lakehouse Id>: the Id of the Lakehouse where the table to get the history is located.
    • <Workspace Id>: the Id of the Workspace where the Lakehouse is located.
  • Fourth cell - Table configuration:
    • <Table Name>: the name of the table to get the delta history.
    • <Column Name>: a single column name to group by and get the number of rows per group in each delta history version of the table.
    • You can also change the values of the variables strGroupByColumnName, strListGroupByColumnNames and intNrVersions.
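The dynamic-attach technique referenced above is typically implemented with the %%configure magic in an early notebook cell; a hedged sketch, with the placeholders standing in for the values listed above:

```
%%configure
{
    "defaultLakehouse": {
        "name": "<Lakehouse name>",
        "id": "<Lakehouse Id>",
        "workspaceId": "<Workspace Id>"
    }
}
```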

Limitations:

  • None identified so far.
The notebook NTB 911 Print Key Vault Secret shows examples of displaying three secrets stored in an Azure Key Vault. This notebook was inspired by https://www.youtube.com/watch?v=XI94dJyxHwU (Keep your secrets SAFE in Microsoft Fabric and Azure Key Vault) and gets the following values from a user-defined Azure Key Vault:

  • strAKVURL: <replace with your Azure Key Vault URL>
  • key-YouTubeURL: https://www.youtube.com/watch
  • key-YouTubeParameters: ?v=
  • key-YouTubeVideoID: XI94dJyxHwU

You can see from the notebook execution that fnGetPrintableSecretLong and fnGetPrintableSecretShort behave differently when used to print a secret or a concatenation of secrets, but I will let you look at the results and draw your own conclusions.

REMEMBER: the above functions should only be used for debugging purposes! That is why you need to ensure that you set the global boolDebug = True.

About

Sample code to be used in Microsoft Fabric - Data Engineering
