August 08, 2012

Moving On

I am an idealist. I have an idealistic notion of how the world should work and I want to work to make this world a better place. Over the last few months this thought process has been through hell. All that I stand for, all that I care about has been disregarded left, right and center. At first, it felt like a phase that would pass. I genuinely believed things would change, would become better. Then the downward spiral began and I became desperate. Realization dawned - this is it, it’s not going to change.

Having had to do things that go against my beliefs and ethics has challenged me to no end. I have had to find my own code of ethics, boundaries which I was unwilling to cross. Finding what’s legal was the easy part. Finding what was right was the tougher one. The discoveries have come through trial and error, but each discovery has been important, and has made me understand myself more.

Working with a code of ethics in this world is not easy. Some people don’t share similar ethics, but most people just don’t care about it. I care about the end result of what I do. I care about bringing a smile to the faces of people who use what I build and I strive to build things that people would love to use. Nothing is more sacred to me than the set of people who use what I build, and my whole set of ethics is built around the notion of doing what is best for the user. This is something that I would’ve taken for granted in any team I work for. I don’t take it for granted anymore. Caring for the user is a mindset that needs to be carefully nurtured and treated as sacred. The moment anyone starts to break this, it disintegrates fast and this is something I have seen with my own eyes.

This last phase was one of the most challenging phases of my life. I could’ve stayed but I have decided to move on. Being given the choice between making the world better, and making money off an imperfect world, I would choose the former every single time. Every. Single. Time.

August 04, 2012

Scraping Flipkart - Part 1

Scraping is an extremely brittle technique, but it’s also an invaluable tool if used correctly. For example, many sites don’t have an API, and occasionally, the information one needs can be better extracted by using programmatic techniques. I was pretty clear about the concepts involved in scraping, but when it came to implementation, I didn’t know where I could start.

So last week, I decided to embark on a simple experiment - to scrape Flipkart using a Ruby toolbox. I didn’t want to blindly spider Flipkart (although I found an interesting library called Anemone that can do just that). I was out looking for structured data of items that are present on the Flipkart store. I also wondered what data would be okay to scrape, and what would be considered an intrusion. In the end, I settled on a simple solution - anything not mentioned in the robots.txt file (which acts as somewhat of an exclusion list for bots and web crawlers) would be fair game.
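The "anything not excluded is fair game" rule can be sketched in a few lines of Ruby. This is only a rough illustration under stated assumptions: the sample file and the helper names `disallowed_prefixes` and `allowed?` are hypothetical (this is not Flipkart's actual robots.txt), only the `User-agent: *` rules are considered, and matching is by plain prefix, so wildcards and `Allow` lines are ignored.

```ruby
# Hypothetical sample file; NOT Flipkart's actual robots.txt.
SAMPLE_ROBOTS = <<~ROBOTS
  User-agent: *
  Disallow: /checkout
  Disallow: /account/

  Sitemap: http://www.example.com/sitemap/sitemap_index.xml
ROBOTS

# Collect the Disallow prefixes listed under `User-agent: *`.
def disallowed_prefixes(robots_txt)
  applies = false
  robots_txt.each_line.with_object([]) do |line, prefixes|
    # Drop comments, then split "Field: value" on the first colon only.
    field, value = line.sub(/#.*/, '').split(':', 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when 'user-agent' then applies = (value == '*')
    when 'disallow'   then prefixes << value if applies && !value.empty?
    end
  end
end

# A path is treated as fair game when no Disallow prefix matches it.
def allowed?(path, robots_txt)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

puts allowed?('/books/some-item', SAMPLE_ROBOTS)
puts allowed?('/checkout/cart', SAMPLE_ROBOTS)
```

A real crawler would probably lean on a dedicated robots.txt library instead, since the format has more corner cases than this sketch handles.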

I visited Flipkart’s robots.txt file at http://www.flipkart.com/robots.txt . It was a pretty uneventful visit, until I reached the end of the file, which mentioned -

Sitemap: http://www.flipkart.com/sitemap/sitemap_index.xml

I could’ve kicked myself for not thinking about the sitemap earlier, but then, better late than never. So I visited the XML page, and it was like hitting the jackpot. It had 89 gzipped files covering each category present on Flipkart. Opening one up showed the individual items in the category. Some categories were divided into multiple files, but it looked like this was everything that Flipkart had to offer.

After finding the perfect source for data, I dug into finding the correct tools to extract the data. I came across Nokogiri, which is an XML parser. For HTTP requests, I decided to give HTTParty a try, and for unzipping the files, I used zlib. For persistence, I decided to store data in JSON format, so I used multi_json.

The code was surprisingly terse, and I was able to come up with it within an hour. It first downloads the main sitemap and parses a list of archives in it. It then sequentially downloads the archives that hold individual links, extracts them and adds the list of links present in them to the results.

require 'nokogiri'
require 'httparty'
require 'zlib'
require 'stringio'
require 'multi_json'

# Get the main document
document = Nokogiri::XML(HTTParty.get('http://www.flipkart.com/sitemap/sitemap_index.xml').body)

# Extract the archive link list
xml_archives = document.css('sitemap loc').map(&:content)

sitemap = xml_archives.inject({}) do |result, archive_url|

  p "> GET " + archive_url
  response = HTTParty.get(archive_url)

  p "Unzipping"
  string = StringIO.new(response.body.to_s)
  unzipped = Zlib::GzipReader.new(string).read

  p "Selecting details"
  url_list = []

  # Using a SAX parser instead of CSS selectors because CSS
  # selectors are mind-numbingly slow in comparison.
  archive_document = Nokogiri::XML::Reader(unzipped)
  archive_document.each do |node|
    # node_type == 1 is for the opening tag
    if node.name == 'loc' && node.node_type == 1
      # Read content in between the `loc` tag.
      node.read
      print '#' 
      url_list.push(node.value)
    end 
  end 

  result[archive_url] = url_list 
  result
end

File.open('sitemap.dump', 'w') do |file|
  file.write(MultiJson.dump(sitemap))
end
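The unzip step in the script above can be tried on its own, without touching the network. The sketch below uses only the standard library; the XML string is a hypothetical stand-in for one archive body, and `gunzip` is a helper name introduced here for illustration.

```ruby
require 'stringio'
require 'zlib'

# Decompress gzipped bytes held in memory. GzipReader expects an IO-like
# object, which is why the bytes are wrapped in a StringIO first.
def gunzip(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Hypothetical stand-in for the body of one gzipped sitemap archive.
xml = '<urlset><url><loc>http://www.example.com/item/1</loc></url></urlset>'
compressed = Zlib.gzip(xml)

puts gunzip(compressed) == xml
```

Keeping everything in memory avoids writing temporary files, at the cost of holding each archive in RAM while it is parsed.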

With the data I have, several interesting metrics can be calculated. However, I decided to calculate the simplest, the number of unique items in Flipkart -

sitemap.inject(0) do |total, (_archive_url, links)|
  total + links.size
end

Flipkart currently has 1416441 listed items. That’s approximately 1.4 million unique items.
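One caveat: summing the list sizes counts a link twice if it shows up in more than one archive. Whether that actually happens in Flipkart's sitemap is not something I checked; the sketch below only shows how the two counts could differ, using a hypothetical miniature of the `sitemap` hash and illustrative helper names.

```ruby
# Hypothetical miniature of the `sitemap` hash: archive URL => item URLs.
SAMPLE_SITEMAP = {
  'http://www.example.com/sitemap/books_1.xml.gz' => ['/item/1', '/item/2'],
  'http://www.example.com/sitemap/books_2.xml.gz' => ['/item/2', '/item/3']
}

# Every link in every archive, duplicates included.
def total_links(sitemap)
  sitemap.values.sum(&:size)
end

# Links counted once, however many archives mention them.
def unique_links(sitemap)
  sitemap.values.flatten.uniq.size
end

puts total_links(SAMPLE_SITEMAP)   # 4
puts unique_links(SAMPLE_SITEMAP)  # 3
```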

Next up, I am planning to do a more thorough scrape of individual items. By calculation, if I do 1 request/second, it will take me about 16.4 days to scrape Flipkart completely. Let’s see. :-)
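The estimate is simple division: 1416441 requests at one per second is 1416441 seconds, and a day has 86400 seconds, which works out to about 16.4 days. A small sketch (the helper name `days_to_scrape` is illustrative) makes it easy to try other request rates:

```ruby
ITEM_COUNT = 1_416_441           # items counted from the sitemap
SECONDS_PER_DAY = 24 * 60 * 60   # 86400

# Days needed to fetch `items` pages at a steady request rate.
def days_to_scrape(items, requests_per_second)
  items / requests_per_second.to_f / SECONDS_PER_DAY
end

puts days_to_scrape(ITEM_COUNT, 1).round(1)  # 16.4
puts days_to_scrape(ITEM_COUNT, 4).round(1)  # 4.1
```

This ignores retries, failures and any throttling, so the real figure would likely be higher.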

July 27, 2012

Schönfinkelization

I was amused by the name when I came across this technique while reading the book JavaScript Patterns by Stoyan Stefanov. Schönfinkelization is just another name for a technique called Currying, but somehow, this name seems cool, right? ;-)

To understand what Currying is, we first need to understand partial function application. Intuitively, it just means that “if I fix the first arguments of a function, I get a function that takes the remaining arguments”.

Currying, then, is the process of making a function capable of handling partial application. The basic idea of Currying is to transform a function requiring n arguments into a chain of functions, each taking a single argument. This allows us to fix the first few arguments to get a new function that will take the rest of the arguments, which is exactly what we need for partial application. In the book, Stoyan gives a generic function, schonfinkelize, that enables this behavior for any function (strictly speaking, by fixing the leading arguments in one step rather than by building a chain of single-argument functions) -

function schonfinkelize(fn) {
  var slice = Array.prototype.slice,
    oldArgs = slice.call(arguments, 1);
  return function() {
    var newArgs = slice.call(arguments),
      args = oldArgs.concat(newArgs);
    return fn.apply(null, args);
  };
}

If you look carefully, most of the code is just there to get over the limitation that arguments is not an array but an array-like object. If arguments were an array, the function would’ve been much simpler, and it would be easier to see the gist of what the schonfinkelize function does.

// This is NOT working code. It assumes `arguments` is an array.

function schonfinkelize(fn) {
  var oldArgs = arguments.slice(1);
  return function() {
    return fn.apply(null, oldArgs.concat(arguments));
  };
}

Of course, once this generalized function is defined, it is easy to use it.

// Function that we want to schonfinkelize (or curry)
function multiply(a, b) {
  return a * b;
}

// Partial application
var multiplySc = schonfinkelize(multiply, 10);

// Final result
var result = multiplySc(10); // 100

REFERENCES:

  1. JavaScript Patterns by Stoyan Stefanov
  2. Wikipedia - Currying

July 25, 2012

Apple's Outlook For India

Tim Cook, when asked in the Q3 earnings call about why Apple hasn’t been more successful in India:

I love India, but I believe Apple has some higher potential in the intermediate term in some other countries. That doesn’t mean we’re not putting emphasis in India, we are. We have a business there and it’s growing but the multi-layered distribution there really adds to the cost of getting products to market.

I believe the high cost of Apple products in a price-conscious nation like India is a huge factor that is crippling Apple’s growth. I’ve often wondered why Apple products cost more in India than in, let’s say, the US. One reason that I know of is that the Indian Government levies a duty of 12% + 5% VAT on laptops. However, taxation alone cannot justify the significantly higher price. Today, Tim Cook mentioned another reason - multi-layered distribution.

P.S. I’d like to see an Apple Store coming to India soon.

July 24, 2012

Next Steps

The most important thing?

Write more.

What’s still broken with this blog?

  1. The stylesheet has plenty of room for pruning.
  2. There’s redundant code for the post template.
  3. The archive page is not up to the mark. I haven’t figured out what’s going wrong with the font sizes.
  4. Google Analytics isn’t present yet. It was stupid to not have it ready in the first place.
  5. jQuery is redundant for now. If I ain’t using it, it’s best to comment it out for now.

What are the good-to-do plans?

  1. Convert the styles to Sass + Compass.
  2. Make the theme responsive and mobile-friendly.
  3. Add a way to enable GitHub Flavored Markdown.
  4. Pagination.
  5. Add icons for the navigation menu.
  6. Package the theme for Jekyll Bootstrap, maybe.

NOTE: The working assumption according to which I am putting up this post is that I will go through with the steps I put down here, since I am writing them in public.

July 24, 2012

Finally, Jekyll

Phase 1: WordPress

I started with WordPress because that was the only blogging platform I knew of (that I could heavily customize). A few weeks and a few major designs later, I had my blog up and running. It wasn’t perfect, but it worked. The problem, however, was that the platform sucked for technical blogging. Putting in code itself was a pain. Moreover, I felt restricted to typing in a very substandard editor. It just didn’t work the way I would’ve liked it to.

Phase 2: Octopress

I never got started with Octopress. Customizing the default theme was a major headache, and after a few hackish attempts, I stopped trying. It never felt right. There was too much jazz with plugins I didn’t need, structure I didn’t want, and Octopress magic that I didn’t care about. It’s good for people who just want some out-of-the-box functionality to get things going. Nothing wrong with that, but I have been seeing too much of the default Octopress theme elsewhere to have it on my blog as well, which meant I HAD to customize it far beyond its default look.

Phase 3: Jekyll

Being so frustrated with the status quo, I made a list of what I felt I really needed from a blogging platform:

  1. Markdown Support
  2. Code Syntax Highlighting
  3. No dependence on a web-based interface
  4. Support for heavy customization

WordPress sucks at the first three, Octopress sucks at the fourth, but Jekyll is good with all of the things I need. Hence, I decided to go with Jekyll, which is about as barebones as one can get. Accompanying it, I made the simplest theme I could, taking inspiration from Daring Fireball, Dustin Curtis and my previous theme, Sense. Still thinking of a name for it, but I like it for its simplicity. The whole thing was hacked up in less than a day.

P.S. It’s ridiculous that design held me back from writing, but it’s good to be back, writing again.

November 11, 2011

Be Human

Matt Gemmell on Adobe Communications:

Be human, and engage directly with people - they’ll respect you for it, and be more willing to give your business a chance.

Having that human touch is what most companies try to avoid rather than embrace. Wonder why.

November 07, 2011

Why I Work

He who seeks rest finds boredom. He who seeks work finds rest.

- Dylan Thomas

October 24, 2011

Celebrating Steve

[Image: Young Steve Jobs]

There can't be a better way to say goodbye than for the whole of Apple to get together and celebrate all that Steve stood for.

August 25, 2011

That iMac I Saw

[Image: iMac]

I was introduced to an iMac in 2003.

It was set up for demonstration purposes in a corner in Delhi Public School, Noida, where I had gone for a programming competition. I had no words for this device. It was … unusual. That’s an interesting word to have for something that was so gorgeously beautiful. To put it in perspective, the PC I had at that time was a clunky machine, clearly showing its age. It was a worn out shade of white, or rather a faded out yellow, with the dullest-looking exterior, the kind that could induce slumber. So to me, the description of the iMac started with this word. Unusual.

I remember going up to it and admiring it. There was no box for a CPU. Everything was inside the monitor itself - the curvy translucent bluish monitor through which you could see the motherboard. I grabbed hold of the hockey-puck mouse, and tried to press the left and the right of it, because I knew how to work a mouse. Or rather, I thought I knew. There’s a left button and a right button on every mouse, right? Wrong. This mouse had just ONE button - and that button was the mouse itself. It was a unique experience, considering I had been using the PC for about 10 years without even thinking about the existence of some other system that was similar but still so fundamentally different. How could anyone not have a right mouse button?

I started using it, falling in love with the way the genie effect worked, the way the dock buttons bounced up when I hovered above them and simply by the way in which everything was smooth. I had read somewhere that it contained this amazing new technology called Quartz which was awesome at anti-aliasing and rendering 2D stuff, and it was just wonderful seeing it come alive in front of me for the first time. I was fascinated. However, at that point of time in my life, it was just too expensive a computer. It wasn’t that I didn’t dream about buying it, it was that I couldn’t.

7 years later, I have a MacBook Pro, an iPod nano and an iPhone. Something changed.