Wednesday, May 27, 2015

Lookups and Abinitio errors

LOOKUP FILE
  A lookup file is an indexed dataset. It actually consists of two files: one file holds the data, and the other holds a hash index into the data file. We commonly use a lookup file to hold in physical memory the data that a transform component frequently needs to access.

How to use a LOOKUP FILE component:
      To perform a memory-resident lookup using a LOOKUP FILE component:
      1. Place a LOOKUP FILE component in the graph and open its Properties dialog.
      2. On the Description tab, set the Label to a name we will use in the lookup functions that reference this file.
      3. Click Browse to locate the file to use as the lookup file.
      4. Set the RecordFormat parameter to the record format of the lookup file.
      5. Set the key parameter to specify the fields to search.
      6. Set the Special attribute of the key to the type of lookup we want to do.
      7. Add a lookup function to the transform of the component that will use the lookup file.
      The first argument to a lookup function is the name of the lookup file. The remaining arguments are values to be matched against the fields named by the key parameter of the lookup file:
            lookup("MyLookupFile", in.key)
      If the lookup file key's Special attribute (in the Key Specifier Editor) is exact, the lookup functions return a record that matches the key values and has the format specified by the RecordFormat parameter.
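
For illustration, here is a minimal sketch of a Reformat transform that uses such a lookup; the label "MyLookupFile" and the fields cust_id and cust_name are illustrative assumptions, not part of the original example:

      /* Enrich each input record from the lookup file "MyLookupFile",
         which is keyed on cust_id. Field names are illustrative. */
      out :: reformat(in) =
      begin
        out.cust_id   :: in.cust_id;
        out.cust_name :: lookup("MyLookupFile", in.cust_id).cust_name;
      end;
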
Partitioned lookup files:
Lookup files can be either serial or partitioned (multifiles). The lookup
functions we use to access lookup data come in both local and non-local
varieties, depending on whether the lookup data files are partitioned.
When a component accesses a serial lookup file, the Co>Operating System
loads the entire file into the component's memory. When the lookup file is
partitioned and the component runs in parallel using the _local lookup
functions, each partition of the component loads only its own partition of
the lookup data.
The benefits of partitioning lookup files are:
1. The per-process footprint is lower. This means the lookup file as a whole can exceed the 2 GB limit.
2. If the component is partitioned across machines, the total memory needed on any one machine is reduced.
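
In a component running in parallel, the corresponding _local call searches only the partition that process loaded. A minimal sketch, assuming a partitioned lookup file labeled "MyPartitionedLookup" keyed on cust_id (names are illustrative):

      out.cust_name :: lookup_local("MyPartitionedLookup", in.cust_id).cust_name;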

DYNAMIC LOOKUP
A disadvantage of statically loading a lookup file is that the dataset occupies a fixed amount of
memory even when the graph isn’t using the data.
By dynamically loading lookup data, we control how many lookup datasets are loaded, which
lookup datasets are loaded, and when lookup datasets are loaded. This control is useful in
conserving memory; applications can unload datasets that are not immediately needed and load
only the ones needed to process the current input record.
                The idea behind dynamically loading lookup data is to:
          1. Load the dataset into memory when it is needed.
          2. Retrieve data with your graph.
          3. Free up memory by unloading the dataset after use.
               
How to look up data dynamically:
                To look up data dynamically:
1. Prepare a LOOKUP TEMPLATE component:
   a. Add a LOOKUP TEMPLATE component to the graph and open its Properties dialog.
   b. On the Description tab of the Properties dialog, enter a label in the Label text box.
   c. On the Parameters tab, set the RecordFormat parameter. Here, we specify the DML record format of the lookup data file.
   d. Set the key parameter to the key we will use for the lookup.
2. Load the lookup file using the lookup_load function inside a transform function.
For example, enter:
let lookup_identifier_type LID =
lookup_load(MyData, MyIndex, "MyTemplate", -1)
where:
                LID is a variable to hold the lookup ID returned by the lookup_load function. This ID references the lookup file in memory. The lookup ID is valid only within the scope of the transform.
                MyData is the pathname of the lookup data file.
                MyIndex is the pathname of the lookup index file. If no index file exists, we must enter the DML keyword NULL, and the graph creates an index on the fly.
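Putting the pieces together, here is a minimal sketch of a transform that loads the data through the template and then performs the lookup; the pathname, template label, and field names are illustrative assumptions:

      /* Load lookup data dynamically via the template "MyTemplate"
         (keyed on cust_id), then use the returned lookup ID. */
      out :: reformat(in) =
      begin
        let lookup_identifier_type LID =
            lookup_load("$DATA/mydata.dat", NULL, "MyTemplate", -1);
        out.cust_id   :: in.cust_id;
        out.cust_name :: lookup(LID, in.cust_id).cust_name;
      end;

When the dataset is no longer needed, lookup_unload can be called on the lookup ID to free the memory (see the lookup functions list later in this post).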

LOOKUP TEMPLATE component:
A LOOKUP TEMPLATE component substitutes for a LOOKUP FILE
component when we want to load lookup data dynamically.
Defining a lookup template:
When we place a LOOKUP TEMPLATE component in our graph, we define
it by specifying two parameters:
RecordFormat — A DML description of the data               
key — The field or fields by which the data is to be searched      
Note: In a lookup template, we do not provide a static URL for the
dataset’s location as we do with a lookup file. Instead, we specify the
dataset’s location in a call to the lookup_load function when the data is
actually loaded.

Appendable lookup files (ALFs)
Data has a tendency to change and grow. In situations where new data is arriving all the time,
static datasets loaded into memory at graph execution time are not up to the task. Even
dynamically loaded lookup datasets may require complex logic to check whether the data has
changed before using it.
Appendable lookup files (ALFs) are a special kind of dynamically loaded lookup file in which a
newly arriving record is made available to our graph as soon as the complete record appears on
disk. ALFs enable applications to process new data quickly, often less than a second
after it is landed to disk.
How to create an appendable lookup file (ALF):
                To create an ALF:
1. Call the lookup_load function, specifying:
   - The integer -2 as the load-behavior argument to lookup_load
   - The DML constant NULL for the index
For example, this call creates an ALF using the data in the existing disk file mydata, following the record format specified in the lookup template My_Template:
let lookup_identifier_type LID =
lookup_load($DATA/mydata, NULL, "My_Template", -2)

Compressed lookup data:
The data stored in a lookup file can be either uncompressed or block-compressed. Block-compressed data
forms the basis of indexed compressed flat files (ICFFs). This kind of data is both compressed and divided
into blocks of roughly equal size.
Obviously, we can store data more efficiently when it is compressed. On the other hand, raw data can be
read faster, since it does not need to be decompressed first.
Typically, we would use compressed lookup data when the total size of the data is large but only a
relatively small amount of it is needed at any given time.
Block-compressed lookup data:
With block-compressed data, only the index resides in memory. The lookup function uses the index file to
locate the proper block, reads the indicated block from disk, decompresses the block in memory, and
searches it for matching records.
Exact and range lookup operations only:
- The only lookup operations we can perform on block-compressed lookup data are exact and range.
- Interval and regex lookup operations are not supported.
- In addition, we must use only fixed-length keys for block-compressed lookup operations.
Handling compressed versus uncompressed data:
The Co>Operating System manages memory differently when handling block-compressed
and uncompressed lookup data.
Uncompressed lookup data
Any file can serve as an uncompressed lookup file as long as the data is not compressed
and has a field we can define as a key.
We can also create an uncompressed lookup file using the WRITE LOOKUP (or
WRITE MULTIPLE LOOKUPS) component. The component writes two files:
a file containing the lookup data
and an index file that references the data file.
With an uncompressed lookup file, both the data and its index reside in memory. The
lookup function uses the index to find the probable location of the lookup key value in the
data file. Then it goes to that location and retrieves the matching record.

ICFF
An indexed compressed flat file (ICFF) is a specific kind of lookup file that can store large
volumes of data while also providing quick access to individual records.
Why use indexed compressed flat files?
A disadvantage of using a lookup file is that there is a limit to how much data we can
keep in it. What happens when the dataset grows large? Is there a way to maintain the
benefits of a lookup file without swamping physical memory? Yes, there is a way: it
involves using indexed compressed flat files.
ICFFs present advantages in a number of categories:
- Disk requirements — Because ICFFs store compressed data in flat files without the overhead associated with a DBMS, they require much less disk storage capacity than databases — on the order of 10 times less.
- Memory requirements — Because ICFFs organize data in discrete blocks, only a small portion of the data needs to be loaded in memory at any one time.
- Speed — ICFFs allow us to create successive generations of updated information without any pause in processing. This means the time between a transaction taking place and the results of that transaction being accessible can be a matter of seconds.
- Performance — Making large numbers of queries against database tables that are continually being updated can slow down a DBMS. In such applications, ICFFs outperform databases.
- Volume of data — ICFFs can easily accommodate very large amounts of data — so large, in fact, that it can be feasible to take hundreds of terabytes of data from archive tapes, convert it into ICFFs, and make it available for online access and processing.

      ICFFs are usually dynamically loaded. To define an ICFF dataset, place a BLOCK-COMPRESSED LOOKUP TEMPLATE component in your graph.
About the BLOCK-COMPRESSED LOOKUP TEMPLATE component :
      A BLOCK-COMPRESSED LOOKUP TEMPLATE component is identical to a LOOKUP TEMPLATE, except that in the former the block_compressed and keep_on_disk parameters are set to True by default, while in the latter they are False.
Defining a BLOCK-COMPRESSED LOOKUP TEMPLATE component:
      When we place a BLOCK-COMPRESSED LOOKUP TEMPLATE component in the graph, we define it by specifying two parameters:
            RecordFormat — A DML description of the data
            key — The field or fields by which the data is to be searched

   Note: In a BLOCK-COMPRESSED LOOKUP TEMPLATE component, we do not provide a static URL for the dataset's location as we do with a lookup file. Instead, we specify the dataset's location in a call to the lookup_load function when the data is actually loaded.

Lookup Functions :

lookup -- Returns the first record from a lookup file that matches a specified expression.

lookup_local -- Behaves like lookup, except that this function searches only one partition of a lookup file

lookup_match -- Searches for records matching a specified expression in a lookup file. 

lookup_match_local -- Behaves like lookup_match, except that this function searches only one partition of a lookup file. 

lookup_first -- Returns the first record from a lookup file that matches a specified expression. In Co>Operating System Version 2.15.2 and later, this is another name for the lookup function. 

lookup_first_local -- Returns the first record from a partition of a lookup file that matches a specified expression. In Co>Operating System Version 2.15.2 and later, this is another name for the lookup_local function. 

lookup_last -- Returns the last record from a lookup file that matches a specified expression. 

lookup_last_local -- Behaves the same as lookup_last, except that this function searches only one partition of a lookup file.

lookup_count -- Returns the number of records in a lookup file that match a specified expression.

lookup_next -- Returns the next successive matching record or the next successive record in a range, 
if any, that appears in the lookup file.

lookup_nth -- Returns a specific record from a lookup file.

lookup_previous -- Returns the record from the lookup file that precedes the record returned by the last successful call to a lookup function.

lookup_add -- Adds a record or vector to a specified lookup table.

lookup_create -- Creates a lookup table in memory.

lookup_load -- Returns a lookup identifier that you can pass to other lookup functions.

lookup_not_loaded -- Initializes a global lookup identifier for a lookup operation.

lookup_range -- Returns the first record whose key matches a value in a specified range. For use only with block-compressed lookup files.

lookup_range_count -- Returns the number of records whose keys match a value in a specified range. For use only with block-compressed lookup files.

lookup_range_last -- Returns the last record whose key matches a value in a specified range.


lookup_unload -- Unloads a lookup file previously loaded by lookup_load.
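
As a small illustration of how several of these functions combine, here is a minimal sketch that sums a field across every record matching a key; the lookup label and the cust_id, amount, and total fields are illustrative assumptions:

      /* Count the matching records with lookup_count, then walk them
         with lookup_next, summing the amount field. */
      out :: compute_total(in) =
      begin
        let decimal(18) total = 0;
        let integer(8) n = lookup_count("MyLookupFile", in.cust_id);
        let integer(8) i;
        for (i, i < n)
          total = total + lookup_next("MyLookupFile").amount;
        out.cust_id :: in.cust_id;
        out.total   :: total;
      end;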

----------------------------------------------------------------------------------------------------------------------




Ab Initio errors and resolution details:
What does the error message "Mismatched straight flow" mean?
 Answer: This error message appears when you have two components that are connected by a straight flow and running at different levels of parallelism. A straight flow requires the depths — the number of partitions in the layouts — to be the same for both the source and destination. Very often, this error occurs after a graph is moved to a new environment; a common cause is that the depths set in the development environment do not match those in the new environment.
  What does the error message "File table overflow" mean?
 Answer: This error message indicates that the system-wide limit on open files has been exceeded. Either there are too many processes running on the system, or the kernel configuration needs to be changed.

This error message might occur if the maximum number of open files allowed on the machine is set too low, or if max-core is set too low in the components that are processing large amounts of data. In the latter case, much of the data processed in a component (such as a SORT or JOIN component) spills to disk, causing many files to be opened. Increasing the value of max-core is an appropriate first step in the case of a sort, because it reduces the number of separate merge files that must be opened at the conclusion of the sort.

NOTE: Because increasing max-core also increases the memory requirements of your graph, be careful not to increase it too much (and you might need to consider changing the graph's phasing to reduce memory requirements). It is seldom necessary to increase max-core beyond 100 MB.

If the error still occurs, see your system administrator. Note that the kernel setting for the maximum number of system-wide open files is operating system-dependent (for example, this is the nfile parameter on Unix systems), and, on many platforms, requires a reboot in order to take effect. See the Co>Operating System Installation Notes for the recommended settings.

What does the error message "broken pipe" mean?
Answer: This error message means that a downstream component has gone away unexpectedly, so the flow is broken. For example, the database might have run out of memory, making database components in the graph unavailable.

In general, broken pipe errors indicate the failure of a downstream component, often a custom component or a database component. When the downstream component failed, the named pipe the component was writing to broke. In the majority of cases, the problem is that the database ran out of memory, or some other problem occurred during database load. There could also be a networking problem, seen in graphs running across multiple machines, where a TCP/IP problem causes the sender to see a "Connection reset by peer" message from the remote machine. If a component has failed, you typically see one of two scenarios.
 What does the error message "Trouble writing to socket: No space left on device" mean?
Answer: This error message means your work directory (AB_WORK_DIR) is full. NOTE: Any jobs running when AB_WORK_DIR fills up are unrecoverable.

An error message like the following means you have run out of space in your work directory, AB_WORK_DIR:

ABINITIO: host.foo.bar: Trouble writing to socket: No space left on device
Trouble creating layout "layout1": [B9] /~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000 [/var/abinitio/host/a0c5540-3dd4143c-412c/history.000]: No space left on device
Url: /~ab_work_dir/host/a0c5540-3dd4143c-412c/history.000 [/var/abinitio/host/a0c5540-3dd4143c-412c/history.000]

Check the disk where this directory resides to see if it is full. If it is, you can try to clean it up. Note that although utilities are provided to clean up AB_WORK_DIR, they succeed only for those files for which you have permissions (nonprivileged users can clean up only the temporary files from their own jobs; root should be able to clean up any jobs). It is critically important that you not clean up files that are associated with a job that is still running, or that you want to be able to recover later.

Be aware that some types of Unix filesystems allocate a fixed number of inodes (information nodes) when the filesystem is created, and you cannot make more files than that. Use df -i to see the status of inodes. If you make many little files, inodes can run out well ahead of data space on the disk. The way to deal with that is to make sure any extraneous files on your system are backed up and removed.

 What does the error message "Failed to allocate bytes" mean?
 Answer: This error message is generated when an Ab Initio process has exceeded its limit for some type of memory allocation. Three things can prevent a process from being able to allocate memory:
- The user data limit (ulimit -Sd and ulimit -Hd). These settings do not apply to Windows systems.
- The address space limit.
- The entire computer is out of swap space.
 What is ABLOCAL and how can I use it to resolve failures when unloading in parallel (Failed parsing SQL)?

 Answer: Some complex SQL statements contain grammar that is not recognized by the Ab Initio parser when unloading in parallel. In this case you can use the ABLOCAL construct to prevent the input component from parsing the SQL (it will get passed through to the database). It also specifies which table to use for the parallel clause.

Monday, May 4, 2015

We know the Rollup component in Abinitio is used to summarize groups of data records, so why do we use Aggregate?

- Aggregation and Rollup, both are used to summarize the data.

- Rollup is much better and convenient to use.

- Rollup can perform some additional functionality, like input filtering and output filtering of records.

- Aggregate does not display the intermediate results in main memory, whereas Rollup can.

- Analyzing a particular summarization is much simpler with Rollup compared to Aggregate.


What kind of layouts does Abinitio support?

- Abinitio supports serial and parallel layouts.

- A graph can contain both serial and parallel layouts at the same time.

- A parallel layout depends on the degree of data parallelism.

- For example, a 4-way multifile system is a 4-way parallel system.

- A component laid out on that multifile system runs 4-way parallel.


How do you add default rules in transformer?

The following is the process to add default rules in transformer

- Double click on the transform parameter in the parameter tab page in component properties

- Click on Edit menu in Transform editor

- Select Add Default Rules from the dropdown list box.

- It shows Match Names and Wildcard options. Select either of them.

What is a look-up?

- A lookup file represents a set of serial files / flat files

- A lookup is a specific data set that is keyed.

- The key is used for mapping values based on the data available in a particular file

- The data set can be static or dynamic.

- A hash join can be replaced by a Reformat that uses a lookup, provided the lookup input to the join contains a small number of records with a small record length.

- Abinitio has certain functions for retrieval of values using the key for the lookup

What is a ramp limit?

- A limit is an integer parameter which represents the number of reject events allowed

- The ramp parameter contains a real number representing the rate of reject events per processed record

- The formula is: number of bad records allowed = limit + (number of records processed x ramp)

- A ramp is a rate expressed as a fraction, for example a value from 0 to 1.

- Together, these two provide the threshold value of bad records.
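
For example, with an illustrative limit of 10 and a ramp of 0.01, after 5,000 records have been processed the graph tolerates 10 + 5,000 x 0.01 = 60 bad records before the component fails.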

What is a Rollup component? Explain about it.

- The Rollup component allows users to group the records on certain field values.

- It is a multi-stage transform and contains

- 1. Initialize, 2. Rollup, and 3. Finalize functions, which are mandatory

- To keep counts for a particular group, Rollup needs a temporary variable

- The initialize function is invoked first for each group

- Rollup is called for each of the records in the group.

- The finalize function is called only once, at the end of the last rollup call.
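
As an illustration, here is a minimal sketch of an expanded (multi-stage) Rollup transform; the key field cust_id and the amount field are illustrative assumptions:

      /* Temporary record that carries the running totals for each group. */
      type temporary_type =
      record
        decimal(18) total;
        integer(8) cnt;
      end;

      /* Called once at the start of each group. */
      temp :: initialize(in) =
      begin
        temp.total :: 0;
        temp.cnt   :: 0;
      end;

      /* Called once for every record in the group. */
      temp :: rollup(temp, in) =
      begin
        temp.total :: temp.total + in.amount;
        temp.cnt   :: temp.cnt + 1;
      end;

      /* Called once at the end of the group to produce the output record. */
      out :: finalize(temp, in) =
      begin
        out.cust_id :: in.cust_id;
        out.total   :: temp.total;
        out.cnt     :: temp.cnt;
      end;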

How to add default rules in transformer?

- Open the Add Default Rules dialog box.

- Select Match Names – matching names generates a set of rules that copy input fields to output fields with the same name

- Use Wildcard (.*) Rule – this generates only one rule to copy input fields to output fields with the same name

- If the Transform Editor Grid is not displayed, display it

- Click the Business Rules tab, then select Edit > Add Default Rules

- In the case of a Reformat, nothing needs to be written in the .xfr file if there is no need for any real transform other than reducing the set of fields.

What is the difference between partitioning with key / hash and round robin?

Partitioning by Key / Hash Partition :

- The partitioning technique that is used when the keys are diverse

- Large data skew can occur when a particular key value is present in a large volume of records

- It is apt for parallel data processing

Round Robin Partition :

- This partition technique uniformly distributes the data on every destination data partitions

- When number of records is divisible by number of partitions, then the skew is zero.

- For example – a pack of 52 cards is distributed among 4 players in a round-robin fashion.
Explain the methods to improve performance of a graph?

The following are the ways to improve the performance of a graph :

- Make sure that a limited number of components are used in a particular phase

- Implement the usage of optimum value of max core values for the purpose of sorting and joining components.

- Utilize the minimum number of sort components

- Utilize the minimum number of sorted join components and replace them by in-memory join / hash join, if needed and possible

- Restrict only the needed fields in sort, reformat, join components

- Use phasing or flow buffers with merged or sorted joins

- Use a sorted join when the two inputs are huge; otherwise use a hash join

What is the function that transfers a string into a decimal?

- Use a decimal cast with the size in the transform function when the size of the string and the decimal is the same.

- Ex: If the source field is defined as string(8).

- The destination is defined as decimal(8)

- Let us assume the field name is salary.

- The function is out.field :: (decimal(8)) in.salary

- If the size of the destination field is less than the input, then the string_substring() function can be used

- Ex : Say the destination field is decimal(5) then use…

- out.field :: (decimal(5))string_lrtrim(string_substring(in.field,1,5))

- The string_lrtrim function removes leading and trailing spaces from the string
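
A minimal sketch combining both cases (field names are illustrative assumptions):

      /* Same size: a simple decimal cast. */
      out.salary_dec :: (decimal(8)) in.salary;

      /* Smaller destination: trim and substring before casting. */
      out.salary_5 :: (decimal(5)) string_lrtrim(string_substring(in.salary, 1, 5));
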
Describe the Evaluation of Parameters order.

Following is the order of evaluation:

- Host setup script will be executed first

- All common parameters (that is, included parameters) are evaluated

- All Sandbox parameters are evaluated

- The project script – project-start.ksh is executed

- All form parameters are evaluated

- Graph parameters are evaluated

- The Start Script of graph is executed

Explain PDL with an example?

- To make a graph behave dynamically, PDL is used

- Suppose there is a need to have a dynamic field that is to be added to a predefined DML while executing the graph

- Then a graph level parameter can be defined

- Utilize this parameter while embedding the DML in output port.

- For example: define a parameter named myfield with a value such as: string("|") name;

- Use ${myfield} at the time of embedding the DML in the out port.

- Use $ substitution as the interpretation option
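
For instance, a minimal sketch of a record format embedded in the out port that picks up the parameter; the surrounding fields are illustrative assumptions:

      record
        decimal("|") id;
        ${myfield}
        string("\n") comments;
      end;

With $ substitution as the interpretation option, ${myfield} expands to string("|") name; before the DML is parsed, so the extra field becomes part of the output record format.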

State the working process of decimal_strip function.

- A decimal strip takes the decimal values out of the data.

- It trims any leading zeros

- The result is a valid decimal number

Ex:
decimal_strip("-0184o") = "-184"
decimal_strip("oxyas97abc") = "97"
decimal_strip("+$78ab=-*&^*&%cdw") = "78"
decimal_strip("Honda") = "0"

State the first_defined function with an example.

- This function is similar to the function NVL() in Oracle database

- It returns the first non-NULL value among the values supplied to the function and assigns it to the variable

Example: A set of variables, say v1,v2,v3,v4,v5,v6 are assigned with NULL.
Another variable num is assigned with value 340 (num=340)
num = first_defined(NULL, v1, v2, v3, v4, v5, v6, num)
The result of num is 340

What is MAX CORE of a component?

- MAX CORE is the maximum amount of memory a component may consume for its calculations

- Each component has a different MAX CORE

- Component performance is influenced by the MAX CORE setting

- The process may slow down or speed up depending on whether the right MAX CORE value is set

What are the operations that support avoiding duplicate record?

Duplicate records can be avoided by using the following:

- Using Dedup sort

- Performing aggregation

- Utilizing the Rollup component


What parallelisms does Abinitio support?

AbInitio supports 3 parallelisms. They are

- Data Parallelism : Data is divided into segments (partitions) and the segments are processed simultaneously

- Component Parallelism : Different components of the same application run simultaneously on separate data

- Pipeline Parallelism : Data is passed from one component to the next, and both components work on the data at the same time.

State the relation between EME, GDE and the Co>Operating System.

EME:

- EME stands for Enterprise Metadata Environment

- It is a repository to AbInitio. It holds transformations, database configuration files, metadata and target information

GDE:

- GDE – Graphical Development Environment

- It is an end user environment. Graphs are developed in this environment

- It provides GUI for editing and executing AbInitio programs

Co>Operating System:

- The Co>Operating System is the server of AbInitio.

- It is installed on a specific OS platform known as the Native OS.

- All graphs generated in the GDE are later deployed and executed in the Co>Operating System



What is a deadlock and how it occurs?

- A graph or program hang is known as a deadlock.

- The progress of a program stops when a deadlock occurs.

- The graph's data flow pattern is what typically causes a deadlock

- If flows in a graph diverge and converge in a single phase, there is potential for a deadlock

- Where the flows converge, a component might wait for records to arrive on one flow even though unread data accumulates on the others.

- In GDE version 1.8, the occurrence of a deadlock is very rare

What is the difference between check point and phase?

Check point:

- A checkpoint is a recovery point; when a graph fails in the middle of the process, it can be recovered from the last completed checkpoint

- The rest of the process continues after the checkpoint

- Data from the checkpoint is fetched, and execution continues from there after correction.

Phase:

- If a graph is created with phases, each phase is assigned its own portion of memory and resources, one after another.

- All the phases run one by one

- The intermediate files written between phases are deleted when they are no longer needed