LOOKUP FILE
A lookup file is an indexed dataset that consists of two files: one file holds the data, and the other holds a hash index into the data file. We commonly use a lookup file to hold in physical memory the data that a transform component frequently needs to access.
How to use a LOOKUP FILE component:
To perform a memory-resident lookup using a LOOKUP FILE component:
• Place a LOOKUP FILE component in the graph and open its Properties dialog.
• On the Description tab, set the Label to the name we will use in the lookup functions that reference this file.
• Click Browse to locate the file to use as the lookup file.
• Set the RecordFormat parameter to the record format of the lookup file.
• Set the key parameter to specify the fields to search.
• Set the Special attribute of the key to the type of lookup we want to do.
• Add a lookup function to the transform of the component that will use the lookup file.
• The first argument to a lookup function is the name of the lookup file. The remaining arguments are values to be matched against the fields named by the key parameter of the lookup file. For example:
lookup("MyLookupFile", in.key)
• If the lookup file key's Special attribute (in the Key Specifier Editor) is exact, the lookup functions return a record that matches the key values and has the format specified by the RecordFormat parameter.
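For example, here is a minimal DML sketch of an exact lookup inside a transform. The file label Customers and the fields cust_id and cust_name are hypothetical; lookup_count guards against keys that have no matching record:
out :: reformat(in) =
begin
    // Guard the exact lookup with lookup_count so a missing key does
    // not dereference a nonexistent record.
    out.cust_name :: if (lookup_count("Customers", in.cust_id) > 0)
        lookup("Customers", in.cust_id).cust_name
    else
        "UNKNOWN";
    out.cust_id :: in.cust_id;
end;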
Partitioned lookup files:
Lookup files can be either serial or partitioned (multifiles). The lookup functions we use to access lookup data come in both local and non-local varieties, depending on whether the lookup data files are partitioned.
When a component accesses a serial lookup file, the Co>Operating System loads the entire file into the component's memory. If the component is running in parallel (and you use a _local lookup function), the Co>Operating System splits the lookup file into partitions.
The benefits of partitioning lookup files are:
1. The per-process footprint is lower. This means the lookup file as a whole can exceed the 2 GB limit.
2. If the component is partitioned across machines, the total memory needed on any one machine is reduced.
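As a sketch of the local variety, with hypothetical names: when a parallel component and its lookup multifile are partitioned on the same key, lookup_local searches only the partition belonging to the current process:
out :: reformat(in) =
begin
    // The component and the "Accounts" multifile are assumed to be
    // partitioned on the same key, so each process searches only its
    // own partition of the lookup data.
    out.balance :: lookup_local("Accounts", in.acct_id).balance;
    out.acct_id :: in.acct_id;
end;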
DYNAMIC LOOKUP
A disadvantage of statically loading a lookup file is that the dataset occupies a fixed amount of memory even when the graph isn't using the data.
By dynamically loading lookup data, we control how many lookup datasets are loaded, which lookup datasets are loaded, and when lookup datasets are loaded. This control is useful in conserving memory; applications can unload datasets that are not immediately needed and load only the ones needed to process the current input record.
The idea behind dynamically loading lookup data is to:
1. Load the dataset into memory when it is needed.
2. Retrieve data with your graph.
3. Free up memory by unloading the dataset after use.
How to look up data dynamically:
1. Prepare a LOOKUP TEMPLATE component:
a. Add a LOOKUP TEMPLATE component to the graph and open its Properties dialog.
b. On the Description tab of the Properties dialog, enter a label in the Label text box.
c. On the Parameters tab, set the RecordFormat parameter. Here, we specify the DML record format of the lookup data file.
d. Set the key parameter to the key we will use for the lookup.
2. Load the lookup file using the lookup_load function inside a transform function.
For example, enter:
let lookup_identifier_type LID = lookup_load(MyData, MyIndex, "MyTemplate", -1)
where:
LID is a variable that holds the lookup ID returned by the lookup_load function. This ID references the lookup file in memory; the lookup ID is valid only within the scope of the transform.
MyData is the pathname of the lookup data file.
MyIndex is the pathname of the lookup index file. If no index file exists, we must enter the DML keyword NULL; the graph then creates an index on the fly.
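Putting the three steps of dynamic loading together, here is a minimal DML sketch. The record type, pathname, and field names are hypothetical; the quoting of the pathname and the use of lookup_unload as a standalone statement follow common DML usage, and lookup and lookup_unload accept the identifier returned by lookup_load (see the function reference later in this section):
// Hypothetical record type matching the RecordFormat of "MyTemplate".
type region_rec_t =
record
    decimal(",") region_code;
    string("\n") region_name;
end;

out :: reformat(in) =
begin
    // Step 1: load the dataset into memory when it is needed; NULL asks
    // the Co>Operating System to build the index on the fly.
    let lookup_identifier_type LID =
        lookup_load("$DATA/mydata", NULL, "MyTemplate", -1);
    // Step 2: retrieve data through the identifier.
    let region_rec_t r = lookup(LID, in.region_code);
    out.region_name :: r.region_name;
    // Step 3: free memory by unloading the dataset after use.
    lookup_unload(LID);
end;
In practice, loading and unloading around every record would be costly; graphs often keep a global identifier initialized with lookup_not_loaded and call lookup_load only when the required dataset changes.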
LOOKUP TEMPLATE component:
A LOOKUP TEMPLATE component substitutes for a LOOKUP FILE component when we want to load lookup data dynamically.
Defining a lookup template:
When we place a LOOKUP TEMPLATE component in our graph, we define it by specifying two parameters:
RecordFormat — A DML description of the data
key — The field or fields by which the data is to be searched
Note: In a lookup template, we do not provide a static URL for the dataset's location as we do with a lookup file. Instead, we specify the dataset's location in a call to the lookup_load function when the data is actually loaded.
Appendable lookup files (ALFs):
Data has a tendency to change and grow. In situations where new data is arriving all the time, static datasets loaded into memory at graph execution time are not up to the task. Even dynamically loaded lookup datasets may require complex logic to check whether the data has changed before using it.
Appendable lookup files (ALFs) are a special kind of dynamically loaded lookup file in which a newly arriving record is made available to our graph as soon as the complete record appears on disk. ALFs can enable applications to process new data quickly, often less than a second after it lands on disk.
How to create an appendable lookup file (ALF):
To create an ALF, call the lookup_load function, specifying:
• The integer -2 as the load-behavior argument to lookup_load
• The DML constant NULL for the index
For example, this command creates an ALF using the data in the existing disk file mydata, following the record format specified in the lookup template My_Template:
let lookup_identifier_type LID = lookup_load($DATA/mydata, NULL, "My_Template", -2)
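Once created, an ALF is queried with the same lookup functions as any other dynamically loaded file. A one-line sketch reusing the LID from the call above, with a hypothetical key and field:
// Repeated lookups through LID see records appended to mydata after the
// load, as soon as each complete record has been written to disk.
out.status :: lookup(LID, in.txn_id).status;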
Compressed lookup data:
The data stored in a lookup file can be either uncompressed or block-compressed. Block-compressed data forms the basis of indexed compressed flat files (ICFFs). This kind of data is both compressed and divided into blocks of roughly equal size.
Obviously, we can store data more efficiently when it is compressed. On the other hand, raw data can be read faster, since it does not need to be decompressed first.
Typically, we would use compressed lookup data when the total size of the data is large but only a relatively small amount of it is needed at any given time.
Block-compressed lookup data:
With block-compressed data, only the index resides in memory. The lookup function uses the index file to locate the proper block, reads the indicated block from disk, decompresses the block in memory, and searches it for matching records.
• The only lookup operations we can perform on block-compressed lookup data are exact and range; interval and regex lookup operations are not supported.
• In addition, we must use only fixed-length keys for block-compressed lookup operations.
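To make these constraints concrete, here is a hedged sketch of an exact match against block-compressed data; LID is assumed to come from a prior lookup_load call, and the field names are hypothetical:
// Exact match: the in-memory index locates the proper block, which is
// read from disk, decompressed, and searched for the key.
out.rate :: lookup(LID, in.currency_code).rate;
// Range matches go through lookup_range and lookup_range_count, which
// apply only to block-compressed lookup files.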
Handling compressed versus uncompressed data:
The Co>Operating System manages memory differently when handling block-compressed and uncompressed lookup data.
Uncompressed lookup data:
Any file can serve as an uncompressed lookup file as long as the data is not compressed and has a field we can define as a key.
We can also create an uncompressed lookup file using the WRITE LOOKUP (or WRITE MULTIPLE LOOKUPS) component. The component writes two files: a file containing the lookup data and an index file that references the data file.
With an uncompressed lookup file, both the data and its index reside in memory. The lookup function uses the index to find the probable location of the lookup key value in the data file. Then it goes to that location and retrieves the matching record.
ICFF
An indexed compressed flat file (ICFF) is a specific kind of lookup file that can store large volumes of data while also providing quick access to individual records.
Why use indexed compressed flat files?
A disadvantage of using a lookup file is that there is a limit to how much data we can keep in it. What happens when the dataset grows large? Is there a way to maintain the benefits of a lookup file without swamping physical memory? Yes, there is a way: it involves using indexed compressed flat files.
ICFFs present advantages in a number of categories:
• Disk requirements — Because ICFFs store compressed data in flat files without the overhead associated with a DBMS, they require much less disk storage capacity than databases — on the order of 10 times less.
• Memory requirements — Because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded in memory at any one time.
• Speed — ICFFs allow us to create successive generations of updated information without any pause in processing. This means the time between a transaction taking place and the results of that transaction being accessible can be a matter of seconds.
• Performance — Making large numbers of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.
• Volume of data — ICFFs can easily accommodate very large amounts of data — so large, in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.
ICFFs are usually dynamically loaded. To define an ICFF dataset, place a BLOCK-COMPRESSED LOOKUP TEMPLATE component in your graph.
About the BLOCK-COMPRESSED LOOKUP TEMPLATE component:
A BLOCK-COMPRESSED LOOKUP TEMPLATE component is identical to a LOOKUP TEMPLATE, except that in the former the block_compressed and keep_on_disk parameters are set to True by default, while in the latter they are False.
Defining a BLOCK-COMPRESSED LOOKUP TEMPLATE component:
When we place a BLOCK-COMPRESSED LOOKUP TEMPLATE component in the graph, we define it by specifying two parameters:
RecordFormat — A DML description of the data
key — The field or fields by which the data is to be searched
Note: In a BLOCK-COMPRESSED LOOKUP TEMPLATE component, we do not provide a static URL for the dataset's location as we do with a lookup file. Instead, we specify the dataset's location in a call to the lookup_load function when the data is actually loaded.
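For instance, a hedged sketch of loading and querying an ICFF dataset through a BLOCK-COMPRESSED LOOKUP TEMPLATE might look like the following; the pathnames, template label, key, and fields are all hypothetical:
out :: reformat(in) =
begin
    // "MyICFFTemplate" is a BLOCK-COMPRESSED LOOKUP TEMPLATE, so its
    // block_compressed and keep_on_disk parameters default to True and
    // only the index is held in memory at lookup time.
    let lookup_identifier_type LID =
        lookup_load("$DATA/prices_icff", "$DATA/prices_icff.idx",
                    "MyICFFTemplate", -1);
    out.price :: lookup(LID, in.ticker).price;
    lookup_unload(LID);
end;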
Lookup functions:
lookup -- Returns the first record from a lookup file that matches a specified expression.
lookup_local -- Behaves like lookup, except that this function searches only one partition of a lookup file.
lookup_match -- Searches for records matching a specified expression in a lookup file.
lookup_match_local -- Behaves like lookup_match, except that this function searches only one partition of a lookup file.
lookup_first -- Returns the first record from a lookup file that matches a specified expression. In Co>Operating System Version 2.15.2 and later, this is another name for the lookup function.
lookup_first_local -- Returns the first record from a partition of a lookup file that matches a specified expression. In Co>Operating System Version 2.15.2 and later, this is another name for the lookup_local function.
lookup_last -- Returns the last record from a lookup file that matches a specified expression.
lookup_last_local -- Behaves the same as lookup_last, except that this function searches only one partition of a lookup file.
lookup_count -- Returns the number of records in a lookup file that match a specified expression.
lookup_next -- Returns the next successive matching record or the next successive record in a range, if any, that appears in the lookup file.
lookup_nth -- Returns a specific record from a lookup file.
lookup_previous -- Returns the record from the lookup file that precedes the record returned by the last successful call to a lookup function.
lookup_add -- Adds a record or vector to a specified lookup table.
lookup_create -- Creates a lookup table in memory.
lookup_load -- Returns a lookup identifier that you can pass to other lookup functions.
lookup_not_loaded -- Initializes a global lookup identifier for a lookup operation.
lookup_range -- Returns the first record whose key matches a value in a specified range. For use only with block-compressed lookup files.
lookup_range_count -- Returns the number of records whose keys match a value in a specified range. For use only with block-compressed lookup files.
lookup_range_last -- Returns the last record whose key matches a value in a specified range.
lookup_unload -- Unloads a lookup file previously loaded by lookup_load.
Ab Initio errors and resolution details:
What does the error message "Mismatched straight flow" mean?
Answer: This error message appears when you have two components that are connected by a straight flow and running at different levels of parallelism. A straight flow requires the depths — the number of partitions in the layouts — to be the same for both the source and destination. Very often, this error occurs after a graph is moved to a new environment; a common cause is that the depths set in development do not match the depths in the new environment.
What does the error message "File table overflow" mean?
Answer: This error message indicates that the system-wide limit on open files has been exceeded. Either there are too many processes running on the system, or the kernel configuration needs to be changed. This error message might occur if the maximum number of open files allowed on the machine is set too low, or if max-core is set too low in the components that are processing large amounts of data. In the latter case, much of the data processed in a component (such as a SORT or JOIN component) spills to disk, causing many files to be opened. Increasing the value of max-core is an appropriate first step in the case of a sort, because it reduces the number of separate merge files that must be opened at the conclusion of the sort.
NOTE: Because increasing max-core also increases the memory requirements of your graph, be careful not to increase it too much (you might instead need to consider changing the graph's phasing to reduce memory requirements). It is seldom necessary to increase max-core beyond 100 MB.
If the error still occurs, see your system administrator. Note that the kernel setting for the maximum number of system-wide open files is operating-system-dependent (for example, it is the nfile parameter on some Unix systems) and, on many platforms, requires a reboot to take effect. See the Co>Operating System Installation Notes for the recommended settings.
What does the error message "Broken pipe" mean?
Answer: This error message means that a downstream component has gone away unexpectedly, so the flow is broken. In general, broken pipe errors indicate the failure of a downstream component, often a custom component or a database component: when the downstream component failed, the named pipe it was writing to broke. In the majority of cases, the problem is that the database ran out of memory, making database components in the graph unavailable, or some other problem occurred during database load. There could also be a networking problem, seen in graphs running across multiple machines, where a TCP/IP problem causes the sender to see a "Connection reset by peer" message from the remote machine. If a component has failed, you typically see one of two scenarios.
What does the error message "Trouble
writing to socket: No space left on device" mean?
Answer:
This error message means your work directory (AB_WORK_DIR) is full. NOTE: Any
jobs running when AB_WORK_DIR fills up are unrecoverable. An error message like
the following means you have run out of space in your work directory,
AB_WORK_DIR: ABINITIO: host.foo.bar: Trouble writing to socket: No space left
on device Trouble creating layout "layout1": [B9]
/~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000
[/var/abinitio/host/a0c5540- 3dd4143c-412c/history.000]: No space left on
device [Hide Details] Url: /~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000
[/var/abinitio/host/a0c5540- 3dd4143c-412c/history.000] Check the disk where
this directory resides to see if it is full. If it is, you can try to clean it
up. Note t,hat although utilities are provided to clean up AB_WORK_DIR, they
succeed only for those files for which you have permissions (nonprivileged
users can clean up only the temporary files from their own jobs; root should be
able to clean up any jobs It is critically important that you not clean up
files that are associated with a job that is still running, or that you want to
be able to recover later. Be aware that some types of Unix filesystems allocate
a fixed number of inodes (information nodes) when the filesystem is created,
and you cannot make more files than that. Use df -i to see the status of
inodes. If you make many little files, inodes can run out well ahead of data
space on the disk. The way to deal with that would be to make sure any
extraneous files on your system are backed up and removed.
What does the error message "Failed to allocate bytes" mean?
Answer: This error message is generated when an Ab Initio process has exceeded its limit for some type of memory allocation. Three things can prevent a process from being able to allocate memory:
• The user data limit (ulimit -Sd and ulimit -Hd). These settings do not apply to Windows systems.
• The address space limit.
• The entire computer being out of swap space.
What is ABLOCAL and how can I use it to resolve failures when unloading in parallel ("Failed parsing SQL")?
Answer: Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. In this case you can use the ABLOCAL construct to prevent the input component from parsing the SQL (it will get passed through to the database). It also specifies which table to use for the parallel clause.
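To illustrate the construct, here is a hypothetical source SQL for a parallel unload. The table and column names are invented, and the exact placement of the ABLOCAL(...) expression should be confirmed against the documentation for your Co>Operating System version:
-- ABLOCAL(account_history) names the table that drives the parallel
-- clause; the rest of the statement is passed through to the database
-- without being parsed by Ab Initio.
select t.acct_id, t.balance
from account_history t
where t.status = 'OPEN'
  and ABLOCAL(account_history)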