Unstructured File Replication with Snowflake Cortex Search (Private Preview)
Use this tutorial to learn how to leverage replicated unstructured files with Snowflake Cortex Search.
Snowflake Usage Examples
In this section, we outline how to utilize replicated unstructured files within Snowflake for querying, analysis, and integration.
Natural language Q&A based on organizational knowledge
Perform the following steps to create a Cortex Search Service and preview it:
Create a SharePoint connection in Merge Mode.
NOTE: We assume that the destination table name is set to TECHNICAL_MANUAL.
In the connection setup form, go to the File type drop-down menu and select Unstructured to sync any document, image, or plain-text file type to the destination's object storage.
NOTE: Do not select the Unstructured file type option to sync compressed files or structured file formats.
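Once the initial sync completes, you can optionally verify which files were replicated by querying the metadata table. This is a minimal sketch; it assumes the TECHNICAL_MANUAL destination table and relies on the Fivetran metadata columns (file_id, _fivetran_file_path, _fivetran_deleted, modified_at) used by the scripts later in this section:
-- Optional check: list currently synced files, excluding deleted ones.
SELECT file_id, _fivetran_file_path, modified_at
FROM technical_manual
WHERE NOT _fivetran_deleted
ORDER BY modified_at DESC;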
Create a DOCUMENT_CHUNK table to maintain chunked text using the following script:
NOTE: This table is required either to create vector embeddings or for the Cortex Search Service to use.
CREATE OR REPLACE TABLE document_chunk (
    file_id TEXT,
    source_url TEXT,
    modified_at TIMESTAMP,
    relative_path TEXT,
    stage_url TEXT,
    chunk_index NUMBER,
    chunk TEXT
);
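If you want to create your own vector embeddings from this table (the alternative mentioned in the note above) rather than rely solely on Cortex Search, one option is Snowflake's EMBED_TEXT_768 function. The sketch below is illustrative only; the DOCUMENT_EMBEDDING table name and the embedding model choice are assumptions, not part of this tutorial's required setup:
-- Illustrative sketch: materialize a 768-dimension embedding per chunk.
CREATE OR REPLACE TABLE document_embedding AS
SELECT
    file_id,
    chunk_index,
    SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', chunk) AS chunk_embedding
FROM document_chunk;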
After every sync, do the following:
i. Extract the text from the new and newly modified files using Snowflake’s PARSE_DOCUMENT feature:
CREATE OR REPLACE TEMPORARY TABLE parsed_document AS
SELECT
    metadata.file_id,
    metadata.url AS source_url,
    metadata.modified_at,
    stage_dir.relative_path,
    stage_dir.file_url AS stage_url,
    TO_VARCHAR(
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @technical_manual,
            stage_dir.relative_path,
            {'mode': 'LAYOUT'}
        ):content
    ) AS parsed_content
FROM technical_manual metadata
LEFT JOIN DIRECTORY(@technical_manual) stage_dir
    ON metadata._fivetran_file_path = stage_dir.relative_path
WHERE NOT metadata._fivetran_deleted
    AND metadata.modified_at > (
        SELECT COALESCE(MAX(document_chunk.modified_at), '1990-01-01') AS modified_at
        FROM document_chunk
    );
ii. Delete the old chunks for the newly modified and deleted files using the following script:
MERGE INTO document_chunk
USING technical_manual
    ON document_chunk.file_id = technical_manual.file_id
WHEN MATCHED
    AND (technical_manual.modified_at > document_chunk.modified_at
         OR technical_manual._fivetran_deleted)
    THEN DELETE;
iii. Chunk the extracted text and insert the chunks for the new or newly modified files into the DOCUMENT_CHUNK table using the following script:
INSERT INTO document_chunk (file_id, source_url, modified_at, relative_path, stage_url, chunk_index, chunk)
SELECT
    file_id,
    source_url,
    modified_at,
    relative_path,
    stage_url,
    c.index,
    c.value
FROM parsed_document p,
LATERAL FLATTEN(
    input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        parsed_content,
        'markdown',
        2500,
        100
    )
) c;
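After running these three scripts, a quick verification query can confirm that the chunking worked end to end. This is a minimal sketch over the DOCUMENT_CHUNK table defined above:
-- Optional check: chunk counts and freshness per file after a sync cycle.
SELECT
    relative_path,
    MAX(modified_at) AS last_modified,
    COUNT(*) AS chunk_count
FROM document_chunk
GROUP BY relative_path
ORDER BY last_modified DESC;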
Create a Cortex Search Service to augment LLM queries with your organizational context using the following script:
NOTE: Replace <warehouse_name> with the actual name of the warehouse used to materialize the results of the query below initially and to refresh them whenever the DOCUMENT_CHUNK table changes.
CREATE OR REPLACE CORTEX SEARCH SERVICE document_search_service
    ON chunk
    WAREHOUSE = <warehouse_name>
    TARGET_LAG = '6 hours'
    AS (
        SELECT * FROM document_chunk
    );
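Once the service is created, you can optionally confirm that it exists and inspect its configuration using Snowflake's standard inspection commands:
-- Optional check: confirm the service and review its target lag and warehouse.
DESCRIBE CORTEX SEARCH SERVICE document_search_service;
SHOW CORTEX SEARCH SERVICES;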
Preview the Cortex Search Service using the following script:
-- Replace <search_query> with a query relevant to your documents.
SELECT PARSE_JSON(
    SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
        'document_search_service',
        '{
            "query": "<search_query>",
            "columns": ["source_url", "relative_path", "stage_url", "chunk_index", "chunk"],
            "limit": 5
        }'
    )
)['results'] AS results;
To learn more about how to query a Cortex Search Service using REST API endpoints, see Snowflake's Query a Cortex Search Service documentation.
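To round out the natural language Q&A use case, the retrieved chunks can be passed to an LLM as context. The sketch below is illustrative only: it combines the SEARCH_PREVIEW call from the previous step with Snowflake's COMPLETE function, and the model name and prompt wording are assumptions, not part of the required setup:
-- Illustrative RAG sketch: feed the top search hits into an LLM prompt.
WITH hits AS (
    SELECT PARSE_JSON(
        SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            'document_search_service',
            '{"query": "<search_query>", "columns": ["chunk"], "limit": 5}'
        )
    )['results'] AS results
)
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large2',  -- assumed model; use any model available in your region
    'Answer the question using only this context:\n' ||
    TO_VARCHAR(results) ||
    '\nQuestion: <search_query>'
) AS answer
FROM hits;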
Extract structured data from homogeneous, unstructured documents
Perform the following steps to extract structured data from homogeneous, unstructured documents:
Create a SharePoint connection in Merge Mode.
In the connection setup form, go to the File type drop-down menu and select Unstructured.
Skip creating a stage; instead, after you publish a Document AI model build, create a SharePoint connection stream on the existing stage.
(Optional) Unpack the JSON output from Document AI into separate columns and join with the metadata table for further analysis.
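As a hedged sketch of that optional step, the query below assumes a hypothetical DOC_AI_OUTPUT table holding Document AI predictions in a VARIANT column named PREDICTION, an extracted field named serial_number, and a relative_path column to join against; adjust the table, column, field, and join-key names to match your model build and output table:
-- Hypothetical example: unpack Document AI JSON into columns and join metadata.
SELECT
    m.file_id,
    m.url AS source_url,
    o.prediction:serial_number[0]:value::TEXT AS serial_number,   -- assumed field name
    o.prediction:serial_number[0]:score::FLOAT AS extraction_score
FROM doc_ai_output o            -- hypothetical output table
JOIN technical_manual m
    ON o.relative_path = m._fivetran_file_path;  -- assumed join key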