
Working with files in Stage 

Snowflake allows us to place files in any stage and query them directly from the stage.

Once a file is available, we can read data from any stage as long as we have access to that location.

There are multiple steps involved in the process of loading data from a stage, which is called bulk loading:

  • Step 1 : Create a stage object
  • Step 2 : Upload (PUT) the file into the stage
  • Step 3 : Read the file from the stage
  • Step 4 : Copy data from the stage into a table (we are not discussing this now)
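As a preview, the first three steps look like this end to end (a sketch; the database, schema, stage, and file names are the placeholders used throughout this post):

-- Step 1 : create a named internal stage
create stage if not exists training.demo.emp_stg;

-- Step 2 : upload a local file (run from SnowSQL; PUT does not work in worksheets)
put file://Downloads/emp.csv @training.demo.emp_stg;

-- Step 3 : read the staged file with $ notation
select $1, $2, $3 from @training.demo.emp_stg/emp.csv;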

As I have already mentioned, a stage is a location where you can store your structured, semi-structured, or unstructured files before loading them into the target tables.

Snowsight (the modern web UI) allows us to upload files to named internal stages from our local system.

Upload a file from Snowsight :

Data => Databases => Schema => Stage => + Files (click in the top-right corner)

But there is a file size limitation when uploading through the UI.

Put files into a stage using the CLI :

The best and easiest way to upload files to a stage is SnowSQL (the CLI).

Sample data, emp.csv:

EMPNO,ENAME,JOB,MGR,HIREDATE,SAL,COMM,DEPTNO
7369,"SMITH","CLERK",7902,"17-DEC-1980",800,,20
7499,"ALLEN","SALESMAN",7698,"20-FEB-1981",1600,300,30
7521,"WARD","SALESMAN",7698,"22-FEB-1981",1250,500,30
7566,"JONES","MANAGER",7839,"02-APR-1981",2975,,20
7654,"MARTIN","SALESMAN",7698,"28-SEP-1981",1250,1400,30
7698,"BLAKE","MANAGER",7839,"01-MAY-1981",2850,,30
7782,"CLARK","MANAGER",7839,"09-JUN-1981",2450,,10
7788,"SCOTT","ANALYST",7566,"09-DEC-1982",3000,,20
7839,"KING","PRESIDENT",,"17-NOV-1981",5000,,10
7844,"TURNER","SALESMAN",7698,"08-SEP-1981",1500,0,30
7876,"ADAMS","CLERK",7788,"12-JAN-1983",1100,,20
7900,"JAMES","CLERK",7698,"03-DEC-1981",950,,30
7902,"FORD","ANALYST",7566,"03-DEC-1981",3000,,20
7934,"MILLER","CLERK",7782,"23-JAN-1982",1300,,10

put file://Downloads/emp.csv @training.demo.emp_stg;
put file://Downloads/dept.csv @training.demo.emp_stg/dept;

Note : we need to use different syntax for different types of stages.
This applies whenever you PUT, GET, LIST, or query files:

list @~;             -- user stage
list @%<table_name>; -- table stage
list @<stage_name>;  -- named stage

In this case one more folder named dept is created inside emp_stg, and the file is uploaded into it.

list @training.demo.emp_stg; -- shows all files in the stage location. If you want to see files in a specific folder, you need to specify the path explicitly:

list @training.demo.emp_stg/dept

Removing / deleting files from stage :

To remove files from any stage we use the REMOVE command. It works equally well in SnowSQL, Snowsight, or the legacy web UI.

remove @training.demo.emp_stg/emp/dept.csv;
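REMOVE also accepts a path prefix, so (as a sketch, assuming the dept folder created by the earlier PUT) you can clear every file under a folder at once:

remove @training.demo.emp_stg/dept; -- removes all files under the dept folder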

Querying files from stage : 

We have the flexibility to query files directly from a stage using $ notation. We can do that on both structured files like CSV and semi-structured files like JSON.

select $1,$2,$3,$4,$5,$6,$7 from @emp_stg/emp.csv;


Observe the headers, the date field, etc.; the output is not clean enough to do any operations on.

To overcome this we can use file formats, which tell Snowflake how to read and interpret the file. For example, if you get a pipe-delimited ('|') or tab-delimited CSV file, we will see a result like this:

We see all fields as NULL except $1. This happens because we are not telling Snowflake how to handle '|'; when reading a CSV file, Snowflake looks for ',' as the delimiter by default.


A cleaner way of reading the file is to use a file format. If you have decided to place a specific type of file (e.g. CSV or JSON) in the stage and you know its exact format, you can create the stage with a file format as below and skip the FILE_FORMAT option when selecting or copying.

CREATE OR REPLACE STAGE my_stage
  file_format = (type = 'CSV' FIELD_DELIMITER = '|' SKIP_HEADER = 1);
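With the format attached to the stage, queries against it pick the format up automatically; a sketch (the file name emp_pipe.csv is a placeholder for any pipe-delimited file you upload):

-- no FILE_FORMAT clause needed; the stage's own format is used
select $1, $2, $3 from @my_stage/emp_pipe.csv;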


Alternatively, create a named file format and reference it in the query:

create or replace file format MY_CSV_PIPE_FORMAT
    TYPE = 'CSV' 
    COMPRESSION = 'AUTO' 
    FIELD_DELIMITER = '|' 
    RECORD_DELIMITER = '\n' 
    SKIP_HEADER = 1 
    FIELD_OPTIONALLY_ENCLOSED_BY = '"' 
    TRIM_SPACE = FALSE 
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE 
    ESCAPE = 'NONE' 
    ESCAPE_UNENCLOSED_FIELD = '\134';


select 
$1::int as emp_ID,
$2::varchar(20) as NAME ,
$3::varchar(20) as job,
$4::int as manager,
$5::date as doj,
$6::int as sal,
$7::int as comm
from @emp_stg/emp1/emp_new.csv
(FILE_FORMAT => MY_CSV_PIPE_FORMAT);

Reading files from multiple folders using pattern :

create file format MY_CSV_FORMAT
    TYPE = 'CSV' 
    COMPRESSION = 'AUTO' 
    FIELD_DELIMITER = ',' 
    RECORD_DELIMITER = '\n' 
    SKIP_HEADER = 1 
    FIELD_OPTIONALLY_ENCLOSED_BY = '"' 
    TRIM_SPACE = FALSE 
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE 
    ESCAPE = 'NONE' 
    ESCAPE_UNENCLOSED_FIELD = '\134';

-- Read a csv file from stage using $notation
select 
$1::int as emp_ID,
$2::varchar(20) as NAME ,
$3::varchar(20) as job,
$4::int as manager,
$5::date as doj,
$6::int as sal,
$7::int as comm
from @emp_stg 
    (FILE_FORMAT => MY_CSV_FORMAT, 
    pattern=>'.*emp.*[.]csv.gz'
    ) tbl ;

Note :
from @emp_stg 
(FILE_FORMAT => MY_CSV_FORMAT, pattern=>'.*emp.*[.]csv.gz') -- all of this is part of the table expression, so when you want to give it an alias, the alias must go at the end, as shown above (tbl).

Working with JSON file format : 

(we will discuss this in more detail in an upcoming session)

When you query semi-structured files, you will see the data as a VARIANT type, which looks like JSON.

Sample JSON file, named nested_json.json.

For better understanding, it is annotated like this:
[
	---------------- Record 1 ---------
	 { <-- each record comes as a dictionary 
		  "book_title":"Food", <-- book_title field: a key : value pair
		  "year_published":2001, <-- year_published field: a key : value pair
		  "authors": [           <-- author details come as a nested list
				{        <-- each author is a dictionary of key : value pairs
					"first_name":"Fiona",
					"middle_name":null,
					"last_name":"Macdonald"
				},
				{
					"first_name":"Gian",
					"middle_name":"Paulo",
					"last_name":"Faleschini"
				}
			]
	 },
	 ---------------- Record 2 ---------
	{
		  "book_title":"Food",
		  "year_published":2006,
		  "authors": 
			[
				{
					"first_name":"Laura",
					"middle_name":"K",
					"last_name":"Egendorf"
				}
			]
	 }
]

Step 1 : put the file into the stage

put file://Downloads/nested_json.json @training.demo.emp_stg/json;

Step 2 : create a file format

CREATE OR REPLACE FILE FORMAT books_json_format 
TYPE = 'json'
STRIP_OUTER_ARRAY = TRUE;

Step 3 : query the JSON file using the file format and resolve the nested structure
SELECT 
    parse_json($1):book_title, 
    parse_json($1):year_published,
    parse_json($1):authors[0]:first_name::varchar(20) as Author1_First_Name, 
    parse_json($1):authors[0]:last_name::varchar(20) as Author1_Last_Name,
    parse_json($1):authors[0]:middle_name::varchar(20) as Author1_Middle_Name,
    parse_json($1):authors[1]:first_name::varchar(20) as Author2_First_Name, 
    parse_json($1):authors[1]:last_name::varchar(20) as Author2_Last_Name,
    parse_json($1):authors[1]:middle_name::varchar(20) as Author2_Middle_Name
FROM 
@emp_stg/json/nested_json.json.gz (file_format => books_json_format);

We can think of the structure as lists and dictionaries combined together:

parse_json($1) --> the JSON is read as a VARIANT field

:book_title --> gets the book_title value

parse_json($1):authors[0] --> gives the author details in VARIANT format (dictionary type);
we take the first element by specifying [0] (list notation)

parse_json($1):authors[0]:first_name --> within the first element, we take first_name using
dictionary notation.
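Indexing authors[0], authors[1] only works when the number of authors is fixed. As a small preview of the upcoming session (a sketch, assuming the same stage and file format), LATERAL FLATTEN turns the nested list into one row per author:

select
    parse_json($1):book_title::varchar(50) as book_title,
    a.value:first_name::varchar(20)        as author_first_name,
    a.value:last_name::varchar(20)         as author_last_name
from @emp_stg/json/nested_json.json.gz (file_format => books_json_format),
     lateral flatten(input => parse_json($1):authors) a;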
