Unstructured File Replication with Snowflake Cortex Search (Private Preview)
Use this tutorial to learn how to leverage replicated unstructured files with Snowflake Cortex Search.
Snowflake Usage Examples
In this section, we outline how to utilize replicated unstructured files within Snowflake for querying, analysis, and integration.
Natural language Q&A based on organizational knowledge
Perform the following steps to create a Cortex Search Service and preview it:
Create a SharePoint connection in Merge Mode.
NOTE: We assume that the destination table name is set to TECHNICAL_MANUAL.
In the connection setup form, go to the File type drop-down menu and select Unstructured to sync any document, image, or plain-text file type to the destination's object storage.
NOTE: Do not select the Unstructured file type option to sync compressed files or structured file formats.
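Once the initial sync completes, you can optionally verify which files were replicated by querying the metadata table. This is a minimal sketch; it assumes the TECHNICAL_MANUAL destination table and relies on the Fivetran metadata columns (file_id, _fivetran_file_path, _fivetran_deleted, modified_at) used by the scripts later in this section:
-- Optional check: list currently synced files, excluding deleted ones.
SELECT file_id, _fivetran_file_path, modified_at
FROM technical_manual
WHERE NOT _fivetran_deleted
ORDER BY modified_at DESC;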
Create a DOCUMENT_CHUNK table to maintain chunked text using the following script:
NOTE: This table is required either to create vector embeddings or for the Cortex Search Service to use.
CREATE OR REPLACE TABLE document_chunk (
    file_id TEXT,
    source_url TEXT,
    modified_at TIMESTAMP,
    relative_path TEXT,
    stage_url TEXT,
    chunk_index NUMBER,
    chunk TEXT
);
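If you want to create your own vector embeddings from this table (the alternative mentioned in the note above) rather than rely solely on Cortex Search, one option is Snowflake's EMBED_TEXT_768 function. The sketch below is illustrative only; the DOCUMENT_EMBEDDING table name and the embedding model choice are assumptions, not part of this tutorial's required setup:
-- Illustrative sketch: materialize a 768-dimension embedding per chunk.
CREATE OR REPLACE TABLE document_embedding AS
SELECT
    file_id,
    chunk_index,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk) AS chunk_embedding
FROM document_chunk;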
After every sync, do the following:
i. Extract the text from the new and newly modified files using Snowflake’s PARSE_DOCUMENT feature:
CREATE OR REPLACE TEMPORARY TABLE parsed_document AS
SELECT
    metadata.file_id,
    metadata.url AS source_url,
    metadata.modified_at,
    stage_dir.relative_path,
    stage_dir.file_url AS stage_url,
    TO_VARCHAR(
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @technical_manual,
            stage_dir.relative_path,
            {'mode': 'LAYOUT'}
        ):content
    ) AS parsed_content
FROM technical_manual metadata
LEFT JOIN DIRECTORY(@technical_manual) stage_dir
    ON metadata._fivetran_file_path = stage_dir.relative_path
WHERE NOT metadata._fivetran_deleted
    AND metadata.modified_at > (
        SELECT COALESCE(MAX(document_chunk.modified_at), '1990-01-01') AS modified_at
        FROM document_chunk
    );
ii. Delete the old chunks for the newly modified and deleted files using the following script:
MERGE INTO document_chunk
USING technical_manual
    ON document_chunk.file_id = technical_manual.file_id
WHEN MATCHED
    AND (technical_manual.modified_at > document_chunk.modified_at
         OR technical_manual._fivetran_deleted)
    THEN DELETE;
iii. Chunk the extracted text and insert the chunks for the new or newly modified files into the DOCUMENT_CHUNK table using the following script:
INSERT INTO document_chunk (file_id, source_url, modified_at, relative_path, stage_url, chunk_index, chunk)
SELECT
    file_id,
    source_url,
    modified_at,
    relative_path,
    stage_url,
    c.index,
    c.value
FROM parsed_document p,
LATERAL FLATTEN(
    input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        parsed_content,
        'markdown',
        2500,
        100
    )
) c;
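After running these three scripts, a quick verification query can confirm that the chunking worked end to end. This is a minimal sketch over the DOCUMENT_CHUNK table defined above:
-- Optional check: chunk counts and freshness per file after a sync cycle.
SELECT
    relative_path,
    MAX(modified_at) AS last_modified,
    COUNT(*) AS chunk_count
FROM document_chunk
GROUP BY relative_path
ORDER BY last_modified DESC;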
Create a Cortex Search Service to augment LLM queries with your organizational context using the following script:
NOTE: Replace <warehouse_name> with the actual name of the warehouse used to materialize the results of the query below initially and to refresh them whenever the DOCUMENT_CHUNK table changes.
CREATE OR REPLACE CORTEX SEARCH SERVICE document_search_service
    ON chunk
    WAREHOUSE = <warehouse_name>
    TARGET_LAG = '6 hours'
    AS (
        SELECT * FROM document_chunk
    );
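Once the service is created, you can optionally confirm that it exists and inspect its configuration using Snowflake's standard inspection commands:
-- Optional check: confirm the service and review its target lag and warehouse.
DESCRIBE CORTEX SEARCH SERVICE document_search_service;
SHOW CORTEX SEARCH SERVICES;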
Preview the Cortex Search Service using the following script:
-- Replace <search_query> with a query relevant to your documents.
SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
        'document_search_service',
        '{
            "query": "<search_query>",
            "columns": ["source_url", "relative_path", "stage_url", "chunk_index", "chunk"],
            "limit": 5
        }'
    )
)['results'] AS results;
To learn more about how to query a Cortex Search Service using REST API endpoints, see Snowflake's Query a Cortex Search Service documentation.
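To round out the natural language Q&A use case, the retrieved chunks can be passed to an LLM as context. The sketch below is illustrative only: it combines the SEARCH_PREVIEW call from the previous step with Snowflake's COMPLETE function, and the model name and prompt wording are assumptions, not part of the required setup:
-- Illustrative RAG sketch: feed the top search hits into an LLM prompt.
WITH hits AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'document_search_service',
            '{"query": "<search_query>", "columns": ["chunk"], "limit": 5}'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',  -- assumed model; use any model available in your region
    'Answer the question using only this context:\n' ||
    TO_VARCHAR(results) ||
    '\nQuestion: <search_query>'
) AS answer
FROM hits;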
Extract structured data from homogeneous, unstructured documents
Perform the following steps to extract structured data from homogeneous, unstructured documents:
Create a SharePoint connection in Merge Mode.
In the connection setup form, go to the File type drop-down menu and select Unstructured.
Skip creating a stage; instead, after you publish a Document AI model build, create a SharePoint connection stream on the existing stage.
(Optional) Unpack the JSON output from Document AI into separate columns and join with the metadata table for further analysis.
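As a hedged sketch of that optional step, the query below assumes a hypothetical DOC_AI_OUTPUT table holding Document AI predictions in a VARIANT column named PREDICTION, an extracted field named serial_number, and a relative_path column to join against; adjust the table, column, field, and join-key names to match your model build and output table:
-- Hypothetical example: unpack Document AI JSON into columns and join metadata.
SELECT
    m.file_id,
    m.url AS source_url,
    o.prediction:serial_number[0]:value::TEXT AS serial_number,   -- assumed field name
    o.prediction:serial_number[0]:score::FLOAT AS extraction_score
FROM doc_ai_output o            -- hypothetical output table
JOIN technical_manual m
    ON o.relative_path = m._fivetran_file_path;  -- assumed join key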