Skip to main content

Load Data from Web Services

In Cloudberry Database, to load data from web services or from any source accessible by command lines, you can create external web tables. The supported data formats are TEXT and CSV.

External web tables allow Cloudberry Database to treat dynamic data sources like regular database tables. Because web table data can change as a query runs, the data is not rescannable.

CREATE EXTERNAL WEB TABLE creates a web table definition. You can define command-based or URL-based external web tables. The definition forms are different. Do not mix command-based and URL-based definitions.

Command-based external web tables

The output of a shell command or script defines command-based web table data. Specify the command in the EXECUTE clause of CREATE EXTERNAL WEB TABLE. The data is current as of the time the command runs. The EXECUTE clause runs the shell command or script on the specified coordinator, and/or segment host or hosts. The command or script must reside on the hosts corresponding to the host(s) defined in the EXECUTE clause.

By default, the command is run on segment hosts when active segments have output rows to process. For example, if each segment host runs 4 primary segment instances that have output rows to process, the command runs 4 times per segment host. You can optionally limit the number of segment instances that run the web table command. All segments included in the web table definition in the ON clause run the command in parallel.

The command that you specify in the external table definition is run from the database and cannot access environment variables from .bashrc or .profile. Set environment variables in the EXECUTE clause. For example:

=# CREATE EXTERNAL WEB TABLE output (output text)
EXECUTE 'PATH=/home/gpadmin/programs; export PATH; myprogram.sh'
FORMAT 'TEXT';

Scripts must be executable by the gpadmin user and reside in the same location on the coordinator or segment hosts.

The following command defines a web table that runs a script. The script runs on each segment host where a segment has output rows to process.

=# CREATE EXTERNAL WEB TABLE log_output 
(linenum int, message text)
EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');

URL-based external web tables

A URL-based web table accesses data from a web server using the HTTP protocol. Web table data is dynamic; the data is not rescannable.

Specify the LOCATION of files on a web server using http://. The web data file(s) must reside on a web server that Cloudberry Database segment hosts can access. The number of URLs specified corresponds to the number of segment instances that work in parallel to access the web table. For example, if you specify 2 external files to a Cloudberry Database system with 8 primary segments, 2 of the 8 segments access the web table in parallel at query runtime.

The following sample command defines a web table that gets data from several URLs.

=# CREATE EXTERNAL WEB TABLE ext_expenses (name text, 
date date, amount float4, category text, description text)
LOCATION (

'http://intranet.company.com/expenses/sales/file.csv',
'http://intranet.company.com/expenses/exec/file.csv',
'http://intranet.company.com/expenses/finance/file.csv',
'http://intranet.company.com/expenses/ops/file.csv',
'http://intranet.company.com/expenses/marketing/file.csv',
'http://intranet.company.com/expenses/eng/file.csv'

)
FORMAT 'CSV' ( HEADER );