Friday, December 7, 2012

Why Is CPU Load High?


Oftentimes I come across situations where a server comes under high CPU load. There is a simple way to find out which application thread(s) are responsible for the load. To see per-thread CPU usage for a process, issue:

ps -mo pid,lwp,stime,time,cpu -p <pid>

LWP stands for Light-Weight Process and, on Linux, corresponds to a kernel thread. To identify the thread, take the LWP with the highest CPU load and convert its ID (xxxx) into a hexadecimal number (0xxxx).

Get the Java stack dump using the jstack -l command and find the thread whose nid field matches the hex number identified above.
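
For example, suppose ps shows that LWP 12345 is the busiest thread in Java process 4242 (both IDs are hypothetical here). jstack reports each thread's native thread ID as a hex number in its nid field, so you can locate the thread like this:

# Convert the busiest LWP to hex (12345 prints as 0x3039)
printf "0x%x\n" 12345

# Find the matching thread in the stack dump via its nid field
jstack -l 4242 | grep -i 0x3039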

Tuesday, September 11, 2012

Simple command line parser in Bash


I wanted a simple yet flexible way to parse command-line arguments in Bash. I used a case statement and parameter expansion to read the arguments in a simple manner. I find this very handy, and I hope you will find it useful in solving or simplifying your tasks as well. Whether it is a serious script or a quick hack, clean programming makes your script more efficient and also easier to understand.

usage() {
    echo -e "No command-line argument\n"
    echo "Usage: $0 [options]"
    echo "Arguments:"
    echo -e "  --copy-from-hdfs\tcopy data set residing in HDFS"
    echo -e "  --copy-to-s3\t\tcopy files to S3 in AWS"
    echo -e "  --gzip\t\t\tcompress source files, recommended before sending data set to S3"
    echo -e "  --remote-dir=\t\tpath to input directory (HDFS directory)"
    echo -e "  --local-dir=\t\tlocal tmp directory (local directory)"
    echo -e "  --s3-bucket-dir=\ts3 bucket directory in AWS"
    exit 1
}

# Check command-line args
if [ -z "$1" ]
then
    usage
else
    # Parse command-line args
    for i in "$@"
    do
        case $i in
            -r=*|--remote-dir=*)
                # DM_DATA_DIR=`echo $i | sed 's/[-a-zA-Z0-9]*=//'` works too,
                # but the parameter expansion below is nicer and more compact
                DM_DATA_DIR=${i#*=}
                ;;
            -l=*|--local-dir=*)
                AMAZON_DATA_DIR=${i#*=}
                ;;
            -s3=*|--s3-bucket-dir=*)
                # S3_DIR=`echo $i | sed 's/[-a-zA-Z0-9]*=//'`
                S3_DIR=${i#*=}
                ;;
            --copy-from-hdfs)
                COPY_FROM_HDFS=YES
                ;;
            --copy-to-s3)
                COPY_TO_S3=YES
                ;;
            -c|--gzip)
                COMPRESS=YES
                ;;
            *)
                # Unknown option
                ;;
        esac
    done
fi
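
As a quick illustration, here is how ${i#*=} strips everything up to and including the first =, and how a script built on this parser (the script name and paths below are hypothetical) might be invoked:

# ${i#*=} removes the shortest prefix matching '*=' from $i
arg="--remote-dir=/user/hadoop/input"
echo "${arg#*=}"    # prints: /user/hadoop/input

# Hypothetical invocation of a script using this parser
./copy_dataset.sh --copy-from-hdfs --gzip \
    --remote-dir=/user/hadoop/input \
    --local-dir=/tmp/staging \
    --s3-bucket-dir=my-bucket/backups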

Thoughts and suggestions are welcome!

Wednesday, September 5, 2012

Speed up slow queries on information_schema tables


Set innodb_stats_on_metadata=0, which prevents InnoDB from updating index statistics every time you query information_schema tables.

mysql> select count(*),sum(data_length) from information_schema.tables;
+----------+------------------+
| count(*) | sum(data_length) |
+----------+------------------+
| 5581     | 3051148872493    |
+----------+------------------+
1 row in set (3 min 21.82 sec)
mysql> show variables like '%metadata';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_stats_on_metadata | ON    |
+--------------------------+-------+
mysql> set global innodb_stats_on_metadata=0;
mysql> show variables like '%metadata';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_stats_on_metadata | OFF   |
+--------------------------+-------+
mysql> select count(*),sum(data_length) from information_schema.tables;
+----------+------------------+
| count(*) | sum(data_length) |
+----------+------------------+
| 5581     | 3051148872493    |
+----------+------------------+
1 row in set (0.49 sec)
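
Note that SET GLOBAL does not survive a server restart. To make the setting permanent, you can also put it in the MySQL configuration file (the path below is a typical location and may differ on your system):

# /etc/my.cnf
[mysqld]
innodb_stats_on_metadata = 0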

Wednesday, February 1, 2012

Single-word Wang/Jenkins Hash in ConcurrentHashMap


ConcurrentHashMap is a hash table supporting full concurrency of retrievals and adjustable expected concurrency for updates. I recently came across this code during testing, and one part really got my attention. To generate the hash, ConcurrentHashMap uses an algorithm based on bit shifts and bitwise operations.

========================================
Variant of single-word Wang/Jenkins hash
========================================
private static int hash(int h) {
    // Spread bits to regularize both segment and index locations,
    // using variant of single-word Wang/Jenkins hash.
    h += (h <<  15) ^ 0xffffcd7d;
    h ^= (h >>> 10);
    h += (h <<   3);
    h ^= (h >>>  6);
    h += (h <<   2) + (h << 14);
    return h ^ (h >>> 16);
}

According to the comment in the code, this method applies a supplemental hash function to a given hashCode, which defends against poor-quality hash functions. Good hash functions are important: in the worst case, with all keys landing in the same bucket, a hash table effectively degrades from a map into a linked list. There are also other considerations that come into play, such as the cost of computing the hash and the number of buckets. Dr. Heinz M. Kabutz explains the power of a power-of-two number of buckets, which gives us a good starting point for understanding what is really going on here.

Let’s look at the code above and see how things change, line by line. To keep things simple, I use the int value 1 as the input for all the operations.

In Java, the int data type is a 32-bit signed two’s complement integer. Represented in binary, int 1 looks like this:

h = 1  =  0000-0000-0000-0000-0000-0000-0000-0001

Now, let’s dissect the following line:

h += (h << 15) ^ 0xffffcd7d

First, let’s rewrite this in an easier-to-read format (at least for me):

h1  = h << 15      =  0000-0000-0000-0000-1000-0000-0000-0000
hex = 0xffffcd7d   =  1111-1111-1111-1111-1100-1101-0111-1101
h2  = h1 ^ hex     =  1111-1111-1111-1111-0100-1101-0111-1101
h2 + h             =  1111-1111-1111-1111-0100-1101-0111-1110

Applying the same thought process to each line, we end up with:

h += (h << 15) ^ 0xffffcd7d = 1111-1111-1111-1111-0100-1101-0111-1110
h ^= (h >>> 10)             = 1111-1111-1100-0000-1011-0010-1010-1101
h += (h << 3)               = 1111-1101-1100-0110-0100-1000-0001-0101
h ^= (h >>> 6)              = 1111-1110-0011-0001-0101-0001-0011-0101
h += (h << 2) + (h << 14)   = 0100-1011-0100-0011-1101-0110-0000-1001
h ^= (h >>> 16)             = 0100-1011-0100-0011-1001-1101-0100-1010

Result:
Bin = 0100-1011-0100-0011-1001-1101-0100-1010
Decimal = 1,262,722,378
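
To see why all this bit spreading matters with a power-of-two number of buckets, here is a minimal sketch (in Bash, to stay with the shell examples elsewhere on this blog; the bucket count of 16 is just an assumption) of how a table maps the final hash to a bucket index with a bitwise AND:

# With a power-of-two table size N, the bucket index is h & (N - 1),
# i.e. only the low-order bits of the hash are used -- hence the
# supplemental hash spreads entropy into those bits.
h=1262722378   # the final hash computed above
N=16           # assumed bucket count (must be a power of two)
printf "bucket index: %d\n" $(( h & (N - 1) ))   # prints: bucket index: 10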