Unstructured File Sync Private Preview
The SharePoint and Google Drive connectors support syncing unstructured files, allowing you to seamlessly sync PDFs, images, and other file types to your Snowflake destination. This feature streamlines the management of unstructured data within your organization and enables you to use Snowflake's AI capabilities, including Cortex Search and Document AI, to extract meaningful data.
Data storage location
Files
Files are stored in a named internal stage in the destination schema. The stage name corresponds to the table name specified during setup.
- The relative source folder structure including the provided parent folder is replicated in the stage.
- A Snowflake directory table is maintained and updated after every sync by the connector.
Metadata
The metadata for each file is maintained in the specified destination schema and table. For more information, see our ERD.
- _fivetran_file_path: relative source path and reference to the respective file’s location in the destination. Use this to join with the Snowflake directory table.
- url: source URL linking to the file’s original location.
Changes in the source
Source observation | Action in stage | Action in metadata table |
---|---|---|
New file | Inserted | A new metadata row is inserted. |
Modified file | Old file is replaced | Old row is updated per metadata of the modified file. |
Deleted file | No change | _fivetran_deleted set to true . |
Limitation
We do not support renaming folders and moving files between folders.
Snowflake usage examples
Natural language Q&A grounded in organizational knowledge
Create a Fivetran SharePoint connection in Merge Mode.
- We assume that the destination table name is set to
technical_manual
. - In the File type drop-down menu, select unstructured. This option enables you to sync any document, image, or plain text file types to the destination’s object storage.
NOTE: We do not sync this option with the compressed files and structured file formats.
- We assume that the destination table name is set to
Create a
document_chunk
table to maintain chunked text which will be used to create vector embeddings or used by a Cortex Search Service.CREATE OR REPLACE TABLE document_chunk ( file_id TEXT, source_url TEXT, modified_at TIMESTAMP, relative_path TEXT, stage_url TEXT, chunk_index NUMBER, chunk TEXT )
After every sync
- Extract text from new and newly modified files using Snowflake’s PARSE_DOCUMENT feature.
CREATE OR REPLACE TEMPORARY TABLE parsed_document AS SELECT metadata.file_id, metadata.url AS source_url, metadata.modified_at, stage_dir.relative_path, stage_dir.file_url AS stage_url, TO_VARCHAR( SNOWFLAKE.CORTEX.PARSE_DOCUMENT( @technical_manual, stage_dir.relative_path, {'mode': 'LAYOUT'}):content ) AS parsed_content FROM technical_manual metadata LEFT JOIN DIRECTORY(@technical_manual) stage_dir ON metadata._fivetran_file_path = stage_dir.relative_path WHERE NOT metadata._fivetran_deleted and metadata.modified_at > ( SELECT COALESCE(max(document_chunk.modified_at), '1990-01-01') as modified_at FROM document_chunk )
- Delete old chunks for newly modified and deleted files.
MERGE INTO document_chunk USING technical_manual ON document_chunk.file_id = technical_manual.file_id WHEN MATCHED AND (technical_manual.modified_at > document_chunk.modified_at OR technical_manual._fivetran_deleted) THEN DELETE
- Chunk extracted text and insert chunks for the new or newly modified files into the
document_chunk
table.INSERT INTO document_chunk (file_id, source_url, modified_at, relative_path, stage_url, chunk_index, chunk) SELECT file_id, source_url, modified_at, relative_path, stage_url, c.index, c.value FROM parsed_document p, LATERAL FLATTEN( input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER ( parsed_content, 'markdown', 2500, 100 )) c;
- Extract text from new and newly modified files using Snowflake’s PARSE_DOCUMENT feature.
Create a Cortex Search Service to augment LLM queries with your organizational context. Replace <warehouse_name> with the warehouse which will be used for materializing the results of the specified query initially and each time the document_chunk table is changed.
CREATE OR REPLACE CORTEX SEARCH SERVICE document_search_service ON chunk WAREHOUSE = <warehouse_name> TARGET_LAG = '6 hours' AS ( SELECT * FROM document_chunk );
Preview the Cortex Search Service
-- replace <search_query> with a query relevant to your documents SELECT PARSE_JSON( SNOWFLAKE.CORTEX.SEARCH_PREVIEW( 'document_search_service', '{ "query": "<search_query>", "columns":[ "source_url", "relative_path", "stage_url", "chunk_index", "chunk" ], "limit":5 }' ) )['results'] as results;
To learn more about how to query a cortex search service using REST API endpoints, see Query a Cortex Search Service documentation.
Extract structured data from homogenous unstructured documents
- Create a Fivetran SharePoint connection in Merge Mode.
- In the File type drop-down menu, select
unstructured
.
- In the File type drop-down menu, select
- Create a document processing pipeline with Document AI.
- Skip stage creation and create a stream on the stage created by the Fivetran SharePoint connection after you publish a Document AI model build.
- (Optional) Unpack the JSON output from Document AI into separate columns and join with the metadata table for further analysis.