API
DB, Array, and Operator
Classes for connecting to SciDB and executing queries.
- class scidbpy.db.DB(scidb_url=None, scidb_auth=None, http_auth=None, verify=None, admin=False, namespace=None, use_arrow=False, page_size=1000000, result_size_limit=256, no_ops=False, inactivity_timeout=None, progress_check_secs=None, progress_check_callback=None, reauth_callback=None, reauth_tries=None, backoff_fn=None)
SciDB connection object.
>>> DB()
DB('https://...', ..., None, ..., False, None, False, 256, False)
>>> print(DB())
scidb_url = https://...
scidb_auth = ...
http_auth = None
verify = ...
admin = False
namespace = None
use_arrow = False
result_size_limit = 256
no_ops = False
Constructor parameters:
- Parameters:
scidb_url (string) – SciDB connection URL. The URL for the Shim server or for the native client API. If None, use the value of the SCIDB_URL environment variable, if present (default http://localhost:8080)
scidb_auth (tuple) – Credentials for connecting to SciDB, if SciDB is configured to use password authentication. Either a (username, password) tuple, or the path to an authentication file, which can be in INI or JSON format with the structure: {"user-name": "name", "user-password": "password"}.
If not provided, credentials are read from a file in the first location that exists among:
- $SCIDB_AUTH_FILE
- $XDG_CONFIG_DIR/scidb/iquery.auth
- ~/.config/scidb/iquery.auth
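The lookup order above can be sketched with the standard library alone. This is an illustration under stated assumptions, not scidbpy's own implementation: the helper names find_auth_file and load_json_auth are hypothetical, only the JSON form of the file is handled, and an unset environment variable is simply skipped.

```python
import json
import os
from pathlib import Path


def find_auth_file(env=None, home=None):
    """Return the first existing candidate path, or None (hypothetical helper)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Candidates in the documented order of precedence.
    candidates = []
    if env.get("SCIDB_AUTH_FILE"):
        candidates.append(Path(env["SCIDB_AUTH_FILE"]))
    if env.get("XDG_CONFIG_DIR"):
        candidates.append(Path(env["XDG_CONFIG_DIR"]) / "scidb" / "iquery.auth")
    candidates.append(home / ".config" / "scidb" / "iquery.auth")

    for path in candidates:
        if path.is_file():
            return path
    return None


def load_json_auth(path):
    """Read a JSON authentication file and return (username, password)."""
    with open(path) as f:
        data = json.load(f)
    return data["user-name"], data["user-password"]
```

The tuple returned by load_json_auth has the (username, password) shape that scidb_auth accepts. According to the description above, the file path itself can also be passed directly.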
http_auth (tuple) – Tuple with username and password for connecting to Shim, if Shim authentication is used (default None)
verify (bool) – Either a bool, or a path to a cert file.
- If True, the HTTPS certificate is verified against the system’s trusted CA store.
- If False, the HTTPS certificate is not verified. This will generate a warning.
- If a string, the string must be a path to a cert or ca-cert file. The connection’s HTTPS certificate is verified against that file.
- If omitted or None, defaults to the setting in the SCIDBPY_VERIFY_HTTPS environment variable if present, otherwise defaults to True.
See Python requests library SSL Cert Verification section for details on the verify argument (default None)
admin (bool) – Set to True to open a higher-priority session. This is identical to the --admin flag for the iquery SciDB client, see SciDB Documentation for details (default False)
namespace (string) – Initial namespace for the connection. Only applicable for SciDB Enterprise Edition. The namespace can be changed at any time using the set_namespace SciDB operator (default None)
use_arrow (bool) – If True, download SciDB array using Apache Arrow library. Requires accelerated_io_tools and aio enabled in Shim. If True, a Pandas DataFrame is returned (as_dataframe has no effect) and null-able types are promoted as per Pandas promotion scheme (dataframe_promo has no effect). It can be overridden for each iquery call (default False)
page_size (int) – Maximum number of cells per page of output when executing paged queries, that is, non-upload queries that save their output. Client API only; ignored for Shim. (default 1,000,000)
result_size_limit (int) – Absolute limit of the output file in megabytes. Effective only when the accelerated_io_tools plug-in is installed in SciDB and aio is enabled in Shim (default 256 MB)
inactivity_timeout (int) – Seconds until the SciDB server will cancel a paged query, unless the client requests another page. Should only need to be increased for multithreaded apps where a thread holding the GIL may interfere with paged response processing. Client API only; ignored for Shim. (default 60s)
no_ops (bool) – If True, the list of operators is not fetched at this time and the connection is not implicitly verified. This speeds up the constructor but prevents calling the SciDB operators directly from the DB instance, e.g., db.scan (default False)
progress_check_secs (int) – Client API only. After every interval of this many seconds, interrupt and resume the HTTP request so the client can check on the query’s progress, and so the server can confirm that this client is still active. The progress_check_callback is called, if provided. Setting this to a lower value provides more frequent progress updates, but might slow the query down because it gets interrupted and resumed more often. (This “progress check” is only enabled if the query is resumable.) (Default: 60 seconds)
progress_check_callback (func) – Client API only. A callback function of the form:
func(query_info, page_number, response_number, cumulative_nbytes, ...)
which returns None. This is called once when the page is first requested, and again after every progress_check_secs seconds while the query is executing. If the callback raises an exception, the query gets canceled.
The callback arguments include:
- query_info: an object with attributes that include id (the ID of the query) and schema (the SciDB schema returned by the query)
- page_number: the page number being fetched (1 for the first page)
- response_number: increments by 1 each time through the callback.
- cumulative_nbytes: the cumulative total number of content bytes received for this page, including the current response and all previous responses for the page.
The callback MUST have *args and **kwargs arguments to allow for future changes to the callback interface. It must ignore any arguments it doesn’t understand. (This callback is only enabled if the query is resumable.)
reauth_callback – An optional function matching the signature:
Callable[[requests.HTTPError], Optional[Tuple[str, str]]]
or:
Callable[[Exception], Optional[Tuple[str, str]]]
The return value should be a (username, password) tuple or None.
This function is called whenever a request fails with a 401 “Unauthorized” or 403 “Forbidden” response. The function can provide stored credentials, display a login prompt, trigger 2FA, etc. If the function returns None, the reauthentication is canceled and the error response is returned as usual.
The function can inspect the HTTPError argument for details in order to generate a message to show the user (e.g. “session on <host> expired”, “second factor required to log in to <domain>”, “incorrect password”, etc.).
reauth_tries – The maximum number of times to call reauth_callback() if a request returns a 401/403 response and subsequent reauth attempts also return 401/403.
- Set this to 0 to disable reauthentication.
- Set this to 1 (default) if the callback returns stored credentials (i.e. if repeating the callback won’t change anything).
- Set this to >1 if the callback prompts the user for credentials and you want to let the user try again after making a mistake.
backoff_fn – A callback function that waits some number of seconds. It should have a signature like:
def backoff_fn(err: requests.HTTPError, delay: int)
If we receive a 429 “Too Many Requests” response from the server, this lets the application perform other tasks while waiting for the server to become available.
The HTTPError argument gives the app information it can use to display a message to the user, e.g. “the server is busy processing other queries”.
Note that backoff_fn may choose to return earlier, or later, than delay. To stop waiting, it can raise the HTTPError it received as its first argument.
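The three callback parameters described above can be illustrated with minimal functions. This is a sketch, not code from scidbpy: make_reauth_callback is a hypothetical helper, the 30-second cap and 300-second give-up threshold are arbitrary choices, and the optional sleep argument on backoff_fn is an addition made here so the function can be exercised without really waiting.

```python
import time


def progress_check_callback(query_info, page_number, response_number,
                            cumulative_nbytes, *args, **kwargs):
    """Report progress; unknown extra arguments are accepted and ignored."""
    print("query {}: page {}, response {}, {} bytes so far".format(
        query_info.id, page_number, response_number, cumulative_nbytes))


def make_reauth_callback(username, password):
    """Build a reauth_callback that always returns the same stored credentials."""
    def reauth_callback(err):
        # err is the exception raised for the 401/403 response; unused here.
        return (username, password)
    return reauth_callback


def backoff_fn(err, delay, sleep=time.sleep):
    """Wait before retrying after a 429 response.

    Returns earlier than `delay` when the delay exceeds 30 seconds, and
    gives up by re-raising `err` when the delay exceeds 300 seconds.
    Both thresholds are arbitrary values chosen for this sketch.
    """
    if delay > 300:
        raise err
    sleep(min(delay, 30))
```

Because make_reauth_callback returns stored credentials, it pairs with reauth_tries=1, as recommended above. The functions would be passed to the constructor through the progress_check_callback, reauth_callback, and backoff_fn parameters.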
- iquery(query, fetch=False, use_arrow=None, use_arrow_stream=False, atts_only=False, as_dataframe=True, dataframe_promo=True, schema=None, page_size=1000000, upload_data=None, upload_schema=None, **kwargs)
Execute query in SciDB
- Parameters:
query (string) – SciDB AFL query to execute
fetch (bool) – If True, download SciDB array (default False)
use_arrow (bool) – If True, download SciDB array using Apache Arrow library. Requires accelerated_io_tools and aio enabled in Shim. If True, a Pandas DataFrame is returned (as_dataframe has no effect) and null-able types are promoted as per Pandas promotion scheme (dataframe_promo has no effect). If None, the use_arrow value set at connection time is used (default None)
use_arrow_stream (bool) – If True, return a RecordBatchStreamReader object to the user. The user will extract the records from the stream reader. This parameter only has an effect if use_arrow is set to True (default False)
atts_only (bool) – If True, download only SciDB array attributes without dimensions (default False)
as_dataframe (bool) – If True, return a Pandas DataFrame. If False, return a NumPy array (default True)
dataframe_promo (bool) – If True, null-able types are promoted as per Pandas promotion scheme. If False, object records are used for null-able types (default True)
schema – Schema of the SciDB array to use when downloading the array. Schema is not verified. If schema is a Schema instance, it is copied. Otherwise, a Schema object is built using Schema.fromstring (default None)
page_size (int) – Maximum number of cells per page of output when executing paged queries, that is, non-upload queries that save their output. Client API only; ignored for Shim. (default 1,000,000)
>>> DB().iquery('build(<x:int64>[i=0:1; j=0:1], i + j)', fetch=True)
   i  j    x
0  0  0  0.0
1  0  1  1.0
2  1  0  1.0
3  1  1  2.0
>>> DB().iquery("input({sch}, '{fn}', 0, '{fmt}')",
...             fetch=True,
...             upload_data=numpy.arange(3, 6))
   i  x
0  0  3
1  1  4
2  2  5
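The fetch, atts_only, and as_dataframe options can be combined in a small wrapper. The sketch below assumes db is a DB instance created with use_arrow left at False (otherwise as_dataframe has no effect). Only the documented iquery keyword arguments are used, and the name fetch_squares is hypothetical.

```python
def fetch_squares(db, n):
    """Fetch x = i * i for i in 0..n-1, attributes only, as a NumPy array.

    `db` is assumed to be a scidbpy DB instance; only its iquery method is used.
    """
    query = "build(<x:int64>[i=0:{}], i * i)".format(n - 1)
    return db.iquery(query, fetch=True, atts_only=True, as_dataframe=False)
```

Taking db as an argument keeps the connection setup out of the helper. It also means the helper can be tested with a stand-in object that records the call.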
- iquery_readlines(query, page_size=1000000, **kwargs)
Execute query in SciDB
>>> DB().iquery_readlines('build(<x:int64>[i=0:2], i * i)')
[...'0', ...'1', ...'4']
>>> DB().iquery_readlines(
...     'apply(build(<x:int64>[i=0:2], i), y, i + 10)')
[[...'0', ...'10'], [...'1', ...'11'], [...'2', ...'12']]
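As the examples show, iquery_readlines returns values as strings: a flat list for a single attribute and a list of lists for several. A small post-processing helper can convert them. This sketch assumes every value is an integer literal, as in the examples above. The name readlines_to_ints is hypothetical, and nulls or other types would need separate handling.

```python
def readlines_to_ints(lines):
    """Convert iquery_readlines output to ints, for flat or nested results."""
    return [
        [int(v) for v in row] if isinstance(row, (list, tuple)) else int(row)
        for row in lines
    ]
```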
- class scidbpy.db.Operator(db, name, upload_data=None, upload_schema=None, *args)
Store SciDB operator and arguments. Hungry operators (e.g., remove, store, etc.) evaluate immediately. Lazy operators evaluate on data fetch.
- scidbpy.db.iquery(self, query, fetch=False, use_arrow=None, use_arrow_stream=False, atts_only=False, as_dataframe=True, dataframe_promo=True, schema=None, page_size=1000000, upload_data=None, upload_schema=None, **kwargs)
Execute query in SciDB
Attribute, Dimension, and Schema
Classes for accessing SciDB data and schemas.
- class scidbpy.schema.Attribute(name, type_name, not_null=False, default=None, compression=None)
Represent SciDB array attribute
Construct an attribute using the Attribute constructor:
>>> Attribute('foo', 'int64', not_null=True)
Attribute(name='foo', type_name='int64', not_null=True, default=None, compression=None)
>>> Attribute('foo', 'int64', default=100, compression='zlib')
Attribute(name='foo', type_name='int64', not_null=False, default=100, compression='zlib')
Construct an attribute from a string:
>>> Attribute.fromstring('foo:int64')
Attribute(name='foo', type_name='int64', not_null=False, default=None, compression=None)
>>> Attribute.fromstring(
...     "taz : string NOT null DEFAULT '' compression 'bzlib'")
Attribute(name='taz', type_name='string', not_null=True, default="''", compression='bzlib')
- class scidbpy.schema.Dimension(name, low_value=None, high_value=None, chunk_overlap=None, chunk_length=None)
Represent SciDB array dimension
Construct a dimension using the Dimension constructor:
>>> Dimension('foo')
Dimension(name='foo', low_value=None, high_value=None, chunk_overlap=None, chunk_length=None)
>>> Dimension('foo', -100, '10', '?', '1000')
Dimension(name='foo', low_value=-100, high_value=10, chunk_overlap='?', chunk_length=1000)
Construct a dimension from a string:
>>> Dimension.fromstring('foo')
Dimension(name='foo', low_value=None, high_value=None, chunk_overlap=None, chunk_length=None)
>>> Dimension.fromstring('foo=-100:*:?:10')
Dimension(name='foo', low_value=-100, high_value='*', chunk_overlap='?', chunk_length=10)
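The string form used in these examples reads as name=low:high:overlap:length, with trailing fields optional and placeholders such as * and ? kept as text. The standalone function below mimics that mapping to make the field order explicit. It is an independent sketch inferred from the examples, not the parser scidbpy uses, and parse_dimension is a hypothetical name.

```python
def _to_int_or_keep(token):
    """Convert numeric text to int; keep placeholders such as '*' or '?'."""
    if token is None or token == "":
        return None
    try:
        return int(token)
    except ValueError:
        return token


def parse_dimension(text):
    """Split 'name=low:high:overlap:length' into a dict of five fields."""
    name, _, bounds = text.strip().partition("=")
    parts = [p.strip() for p in bounds.split(":")] if bounds else []
    parts += [None] * (4 - len(parts))  # pad missing trailing fields
    low, high, overlap, length = (_to_int_or_keep(p) for p in parts[:4])
    return {
        "name": name.strip(),
        "low_value": low,
        "high_value": high,
        "chunk_overlap": overlap,
        "chunk_length": length,
    }
```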
- class scidbpy.schema.Schema(name=None, atts=(), dims=())
Represent SciDB array schema
Construct a schema using Schema, Attribute, and Dimension constructors:
>>> Schema('foo', (Attribute('x', 'int64'),), (Dimension('i', 0, 10),))
Schema(name='foo', atts=(Attribute(name='x', type_name='int64', not_null=False, default=None, compression=None),), dims=(Dimension(name='i', low_value=0, high_value=10, chunk_overlap=None, chunk_length=None),))
Construct a schema using the Schema constructor and the fromstring methods of Attribute and Dimension:
>>> Schema('foo',
...        (Attribute.fromstring('x:int64'),),
...        (Dimension.fromstring('i=0:10'),))
Schema(name='foo', atts=(Attribute(name='x', type_name='int64', not_null=False, default=None, compression=None),), dims=(Dimension(name='i', low_value=0, high_value=10, chunk_overlap=None, chunk_length=None),))
Construct a schema from a string:
>>> Schema.fromstring(
...     'foo@1<x:int64 not null, y:double>[i=0:*; j=-100:0:0:10]')
Schema(name='foo@1', atts=(Attribute(name='x', type_name='int64', not_null=True, default=None, compression=None), Attribute(name='y', type_name='double', not_null=False, default=None, compression=None)), dims=(Dimension(name='i', low_value=0, high_value='*', chunk_overlap=None, chunk_length=None), Dimension(name='j', low_value=-100, high_value=0, chunk_overlap=0, chunk_length=10)))
Print a schema constructed from a string:
>>> print(Schema.fromstring('<x:int64,y:float> [i=0:2:0:1000000; j=0:*]'))
<x:int64,y:float> [i=0:2:0:1000000; j=0:*]
Format a Schema object to print only the schema part, without the array name:
>>> '{:h}'.format(Schema.fromstring('foo<x:int64>[i]'))
'<x:int64> [i]'
- make_dims_atts()
Make attributes from dimensions and prepend them to the attributes list.
>>> s = Schema(None, (Attribute('x', 'bool'),), (Dimension('i'),))
>>> print(s)
<x:bool> [i]
>>> s.make_dims_atts()
>>> print(s)
<i:int64 NOT NULL,x:bool> [i]
>>> s = Schema.fromstring('<x:bool>[i;j]')
>>> s.make_dims_atts()
>>> print(s)
<i:int64 NOT NULL,j:int64 NOT NULL,x:bool> [i; j]
- make_unique()
Make dimension and attribute names unique within the schema. Return True if any dimension or attribute was renamed.
>>> s = Schema(None, (Attribute('i', 'bool'),), (Dimension('i'),))
>>> print(s)
<i:bool> [i]
>>> s.make_unique()
True
>>> print(s)
<i:bool> [i_1]
>>> s = Schema.fromstring('<i:bool, i:int64>[i;i_1;i]')
>>> s.make_unique()
True
>>> print(s)
<i:bool,i_2:int64> [i_3; i_1; i_4]
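The renaming in these two examples is consistent with a simple rule: walk the attributes and then the dimensions, keep the first occurrence of each name, and give each repeat the smallest numeric suffix not already in use anywhere in the schema. The sketch below implements that rule on plain lists of names. It is inferred from the examples only, so the library's actual rule may differ in cases they do not cover, and make_unique_names is a hypothetical name.

```python
def make_unique_names(att_names, dim_names):
    """Rename repeated names with a numeric suffix.

    Returns (new_att_names, new_dim_names, changed).
    """
    names = list(att_names) + list(dim_names)
    taken = set(names)  # every original name stays reserved
    seen = set()        # names already emitted in the result
    result = []
    changed = False

    for name in names:
        if name in seen:
            # Find the smallest suffix that collides with nothing.
            k = 1
            while "{}_{}".format(name, k) in taken:
                k += 1
            name = "{}_{}".format(name, k)
            taken.add(name)
            changed = True
        seen.add(name)
        result.append(name)

    n = len(att_names)
    return result[:n], result[n:], changed
```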
- promote(data)
Promote nullable attributes in the DataFrame to types that support null values, as per the Pandas promotion scheme.