SQL/Data Warehousing— Interview Prep.
- Order of Operations
From — choose Where the data comes from and what tables to be joined
Join is the first thing that happens when a query is executed. If you are joining tables and then using a filter on the joined data, it is a good practice to filter the data first before joining as it is quicker and less memory intensive. Pre-filtering and Pre-aggregation can be used to speed up queries by almost 50 times.
Where — Filter the base data
Yo can’t use a where clause to filter aggregated data as the Where is executed first before the aggregation, so if you need to filter data after aggregation you must use the Having clause.
Group By — Aggregate the data -Groups the data
Having — Filter the aggregated data. Having is analogous to where but for aggregated data, before they added this clause to SQL, you had to use a nested where query to filter on aggregated data. If you are using any window functions, they are calculated just after the having clause.
Select — Returns the final data
Order By — Sort the final data
Limit — Limit number of Rows - Truncate vs Delete vs Drop.
Truncate is a DDL command, uses a table lock to remove all records in the table, need alter permission on table, minimal logging in transaction log so performance wise it is faster.
Delete is a DML command, deletes rows using a row lock, We can use where clause to filter or delete only specific rows, maintains a log so slower than truncate.
Drop is a DDL command, drops the whole table including indexes and privileges, operation cannot be rolled back, Delete operations can be rolled back but not Drop and Truncate. - View/Materialized View. Views are not stored on disk but fetch the data from the database every time they are executed where as Materialized Views are physically stored on disk and therefor do not actually fetch the underlying data. Materialized Views are updated periodically based on parameters defined when they are created.
- Dynamic SQL allows you to run queries whose full text is not known until execution time. If you want to run a complex query with a user selectable sort order. Instead of coding the query twice, with different
ORDER BY
clauses, you can construct the query dynamically to include a user specifiedORDER BY
clause. You can use theEXECUTE IMMEDIATE
statement to execute anonymous PL/SQL blocks. - Stored Procedures are a batch of SQL statements that are stored inside the database which can be executed multiple times.
Oracle traditionally uses PL/SQL where as SQL Server uses T/SQL. A stored procedure usually takes a parameter, executes a query and returns a result.
6. The MERGE command in SQL is actually a combination of three SQL statements: INSERT, UPDATE and DELETE. In simple words, the MERGE statement in SQL provides a convenient way to perform all these three operations together which can be very helpful when it comes to handle the large running databases. But unlike INSERT, UPDATE and DELETE statements MERGE statement requires a source table to perform these operations on the required table which is called as target table.
//Target Table
PRODUCT_DETAILS
P_ID P_NAME P_PRICE
101 TEA 10.00
102 COFFEE 15.00
103 BISCUIT 20.00//Source Table
UPDATED_DETAILS
P_ID P_NAME P_PRICE
101 TEA 10.00
102 COFFEE 25.00
104 CHIPS 22.00
--------------------------------------MERGE PRODUCT_DETAILS AS TARGET
USING UPDATED_DETAILS AS SOURCE
ON (TARGET.P_ID = SOURCE.P_ID)
WHEN MATCHED
AND TARGET.P_NAME SOURCE.P_NAME
OR TARGET.P_PRICE SOURCE.P_PRICE THEN
//when records are matched (on the basis
//of P_ID) then do the update operation
UPDATE SET TARGET.P_Name = SOURCE.P_NAME,
//if there are changes in P_NAME OR P_PRICE
TARGET.P_PRICE = SOURCE.P_PRICE
WHEN NOT MATCHED BY TARGET THEN
//When no records are matched with target table
//then insert the records in the target table
INSERT (P_ID,P_NAME,P_PRICE)
VALUES (SOURCE.P_ID,SOURCE.P_NAME,SOURCE.P_PRICE)
WHEN NOT MATCHED BY SOURCE THEN
//when no records are matched with source table
//the delete that record in target table
DELETE;
7. Union vs Union All. Union removes duplicate rows after merging the two tables while union all preserves duplicate rows.
8. A SQL Join statement is used to combine data or rows from two or more tables based on a common field between them.
Types of Joins:
a. Inner Join
b. Left Join
c. Right Join
d. Full Outer Join
e. Cartesian join (Cross Join) - There is a join for each row of one table to each row of the other table. This usually happens when the Where condition is absent from join or there is no matching column. In the presence of a Where clause it acts exactly as an Inner Join.
f. Self Join
9. The purpose of an index is to provide pointers to the rows in a table that contain a given key value. In a regular index, this is achieved by storing a list of row-ids for each key corresponding to the rows with that key value. In a Bitmap index, a bitmap for each key value is used instead of a list of row-ids . By Default, Oracle creates B-tree Indexes.
Bitmap Indexes are useful on columns with low cardinality i.e. columns which have a small number of distinct values that are repeated where as B-tree indexes are more useful with columns that have high cardinality.
In a Clustered Index data is stored on the database in the same order as the index and therefore there is only one clustered index where as in a Non-Clustered Index there is a second list that has pointers to the actual location of data. It is generally faster to read from a clustered index
10. SQL constraints are used to specify rules for the data in a table. These are used to limit the type of data that can go into a table. This ensures the accuracy and reliability of the data in the table. If there is any violation between the constraint and the data action, the action is aborted.
Commonly used Constraints:
NOT NULL — Ensures that a column cannot have a NULL value.
UNIQUE — Ensures that all values in a column are different.
PRIMARY KEY — A combination of a NOT NULL and UNIQUE. Uniquely identifies each row in a table.
FOREIGN KEY — Uniquely identifies a row/record in another table.
CHECK — Ensures that all values in a column satisfies a specific condition.
DEFAULT — Sets a default value for a column when no value is specified.
INDEX — Used to create and retrieve data from the database very quickly.
11. A table can have multiple foreign keys but can have only a single primary key, it might have multiple primary keys which are called candidate keys but only one out of them is chosen as the primary key which best represents the table. A composite key is a combination of more than one columns that uniquely identifies each row in a table.
A surrogate key is a system generated value with no business meaning that is solely generated for the purpose of uniquely identifying each row in a table as opposed to a natural key which is a column or a set of columns that already exist in a table and can uniquely identify each row in a table.
Natural Keys may need to be reworked if business requirements change, difficult to maintain if key requires multiple columns and also performs poorly since the key values are usually large. Surrogate keys need to be generated so they take up extra disk space and require extra IO when performing insert/update operations and requires more table joins to child tables since data has no meaning on it’s own.
12. Temporary tables. These are tables in SQL that exist temporarily on the database server, these are particularly useful when you have a large number of records in a table but only want to interact with a small subset of that data, instead of filtering it every time to fetch the subset, you can just create a temporary table. These are stored inside a system database “tempdb”.
To create a temporary table you use the INTO
statement within a SELECT
query as below. The name must start with a #.
//Temporary table
SELECT name, age, gender
INTO #MaleStudents
FROM student
WHERE gender = ‘Male'//Global Temporary Table
SELECT name, age, gender
INTO ##FemaleStudents
FROM student
WHERE gender = ‘Female'
The temporary table exists only within the connection from which it is created and is automatically deleted once the connection that created the table is closed.
We can also create temporary tables that are accessible from all open connections by creating a global temporary table. This is done by using a name followed by “##”.
Data Warehousing
13. Star Schema separates the business process data into facts which hold quantitative and measurable data and dimensions which hold descriptive and categorical attributes related to fact data. This type of schema usually consists of a central fact table surrounded my multiple dimension tables resembling the shape of a star, hence the name.
14. Kimball Architecture is a bottom-up approach to designing a data warehouse. Dimensional data marts are first created to provide reporting and analytical capabilities for specific business areas such as “Sales” or “Production”. These data marts are eventually integrated together to create a data warehouse using a bus architecture, which consists of conformed dimensions between all the data marts. So the data warehouse ends up being segmented into a number of logically self-contained and consistent data marts, rather than a big and complex centralized model.
Business value can be returned as quickly as the first data marts can be created, and the method lends itself well to an exploratory and iterative approach to building data warehouses so that no master plan is required upfront.
Inmon Architecture on the other hand is a top-down approach to data warehouse design using a normalized enterprise data model. Dimensional data marts containing data needed for specific business processes or specific departments are created from the enterprise data warehouse only after the complete data warehouse has been created.
This normalized model makes loading the data less complex, but using this structure for querying is hard as it involves many tables and joins. The up-front cost is significant and construction lasts a long time, but the return is expected to be a long-lasting and reliable data architecture.
15. A Conformed Dimension is a dimension that has the same exact meaning when being referred from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization. For two dimension tables to be considered as conformed, they must either be identical or one must be a subset of another.
16. Slowly Changing Dimensions (SCD) are dimensions that contain relatively static data like the geographic locations, customers or products but which change slowly but unpredictably rather than on a regular schedule. You might have a fact table with the sales records and a dimension table with salespersons data, but the salespersons are occasionally moved or transferred to different regions, for historical reporting purposes you might want to keep track of this change. Dealing with these issues involves SCD management methodologies which are as follows:
Type 0 : Retain Original
Type 1: This methodology overwrites old with new data, and therefore does not track historical data.
Type 2: add new row with a column for version flag.
Type 3: add new attribute. A column for original_value and another column for current_value.
Type 4: add history table
Type 6: Combined Approach
Foot Notes:
Check out this great article written by Jay Feng from InterviewQuery for a more comprehensive guide on tackling SQL Interview questions.
If you want to learn more about SQL in Google cloud, check out these articles on BigQuery 101 and Using BigQuery with R/Python.