GSoC 2016 - Apache Zeppelin

Tuesday, 28 June 2016

Communication WebSocket and concurrency

The goal is to let multiple notes be downloaded from peers concurrently, so downloading must not block the main thread. This applies to both IpfsNotebookRepo and a BitTorrent repo. So what should the design be?

Here is the current IpfsNotebookRepo class.

The get(hash : Multihash) and get(url : MagnetURL) calls are blocking: each waits until the download from the peer finishes. They therefore have to run in a separate thread. The possible approaches are:

Make IpfsNotebookRepo implement Runnable and submit it to a scheduler. But then I would have to create a new IpfsNotebookRepo instance every time.
Create a class IpfsDownloadTask that implements Runnable or Callable. Should this class be nested, inner, or separate? If it is separate, it needs an IpfsNotebookRepo instance as a member so it can call the get method.
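As a rough sketch of the second approach: a separate IpfsDownloadTask that holds a reference to the repo and is submitted to an executor. NoteSource is a hypothetical stand-in for the blocking part of IpfsNotebookRepo, and downloadAll and the pool size are illustrative only, not taken from the actual project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class DownloadDemo {
    // Submits one task per hash so the downloads run concurrently on worker threads.
    static List<String> downloadAll(NoteSource repo, List<String> hashes) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String hash : hashes) {
                pending.add(pool.submit(new IpfsDownloadTask(repo, hash)));
            }
            List<String> notes = new ArrayList<>();
            for (Future<String> f : pending) {
                notes.add(f.get()); // only this collecting thread waits
            }
            return notes;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("download failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder download: upper-cases the hash instead of contacting a peer.
        NoteSource fake = hash -> hash.toUpperCase();
        System.out.println(downloadAll(fake, Arrays.asList("qma", "qmb")));
    }
}

// Hypothetical stand-in for the blocking get of IpfsNotebookRepo.
interface NoteSource {
    String getNote(String hash) throws Exception;
}

// Separate task class that keeps the repo as a member, as described above.
class IpfsDownloadTask implements Callable<String> {
    private final NoteSource repo;
    private final String hash;

    IpfsDownloadTask(NoteSource repo, String hash) {
        this.repo = repo;
        this.hash = hash;
    }

    @Override
    public String call() throws Exception {
        return repo.getNote(hash);
    }
}
```

With this shape the repo itself stays a single instance; only the small task objects are created per download.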

I have created a separate example project just focusing on the main part.

Here is the code.

Currently I use callbacks from Google Guava. After the download completes, the send method is called with the appropriate operation to notify the user.
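The callback idea can be sketched with the JDK alone. This is not the Guava code from the example project: it swaps in CompletableFuture so the snippet has no dependencies, models send as a plain Consumer, and the operation names NOTE_DOWNLOADED and DOWNLOAD_FAILED are made up for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;
import java.util.function.Function;

class CallbackDemo {
    // Runs the blocking get off the calling thread and calls `send` when it finishes.
    static CompletableFuture<Void> downloadAndNotify(
            String hash, Function<String, String> blockingGet, Consumer<String> send) {
        return CompletableFuture
                .supplyAsync(() -> blockingGet.apply(hash))   // executes on the common pool
                .thenAccept(note -> send.accept("NOTE_DOWNLOADED:" + note))
                .exceptionally(err -> {                       // reached if the download threw
                    send.accept("DOWNLOAD_FAILED:" + err.getMessage());
                    return null;
                });
    }

    public static void main(String[] args) {
        List<String> sent = new CopyOnWriteArrayList<>();
        // join() is only here so the demo does not exit before the callback runs.
        downloadAndNotify("qmexample", String::toUpperCase, sent::add).join();
        System.out.println(sent);
    }
}
```

In a server, the caller would not join; it would return immediately and let the callback push the result over the WebSocket.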

So here are my questions:

The call method of the IpfsTask class currently calls getNote, which just returns an uppercase string; in the real implementation it will return the note downloaded from the peer as a string. Where should this class live: inner or separate? If separate, should it hold an IpfsNotebookRepo instance?
After the note is downloaded I need to call importNote from the NotebookServer class, which adds the note and broadcasts it. How can I achieve this?
The IPFS servlet listens for WebSocket connections on a separate URL path. Should it be part of the NotebookServer path?

I think the design will be common to BitTorrent as well, so I would be grateful for your help and advice on the design of the communication.

Monday, 6 June 2016

DHT in Java

Available BitTorrent libraries for Java:

frostwire-jlibtorrent
ttorrent

Comparison

Sr No.	Feature	Frostwire-jlibtorrent	Ttorrent
1	DHT	Yes	No
2	Magnet uri	Yes	No
3	License	MIT license	Apache Software License 2.0

The remaining main features are present in both.
Ttorrent has more stars than frostwire-jlibtorrent on GitHub.

References
1] https://github.com/frostwire/frostwire-jlibtorrent
2] https://github.com/mpetazzoni/ttorrent

Sunday, 5 June 2016

Dat

Dat is similar to IPFS: peer-to-peer file sharing. Earlier, dat's goal was to allow sharing and versioning of tabular data (CSV) and JSON. dat alpha was more about syncing non-tabular files with a single centralized repository, much like Dropbox. You can read more about how dat evolved here:

http://dat-data.com/blog/2016-01-19-brief-history-of-dat

The current 1.0 API has only two commands: dat link and dat <share-link>. The debug flag prints more output, such as bittorrent-dht node queries.

Unlike IPFS, dat's SHA-256 hash also takes file modes (permissions) into account, along with other filesystem metadata. Dat uses a variety of different methods to discover peers that have the data it's looking for, including DNS, Multicast DNS, UDP, and TCP. Like IPFS, it also has bootstrap nodes.

Key differences to BitTorrent

Although file sharing using Hyperdrive on the surface could seem similar to tools such as BitTorrent there are a few key differences.

Not all metadata needs to be synced up front

Flexible and consistently small block sizes

Deduplication

Multiplexed swarms

The documentation [1] is a good read, as is the list of dependencies [2].

1] https://dat-data.readthedocs.io/en/latest/how-dat-works/#how-dat-works

2] https://dat-data.readthedocs.io/en/latest/ecosystem/

Friday, 3 June 2016

IPFS

IPFS is the InterPlanetary File System. It was presented by Juan Benet of Stanford. IPFS can be seen as a P2P exchange of Git objects using the BitTorrent protocol, in a single swarm, on a single repository. In his paper he talks about how it can become the permanent distributed web. IPFS provides a high-throughput content-addressed block storage model with content-addressed hyperlinks. This forms a generalized Merkle DAG, a data structure upon which one can build a versioned file system.

Key features :

IPFS uses the S/Kademlia DHT to find peers in the network, query for providers, and get and put values. Each node has a public key, and its NodeId is the hash of that key.
Kademlia uses the XOR distance to store values on the closest nodes. S/Kademlia adds resistance to Sybil attacks: it requires nodes to create a PKI key pair, derive their identity from it, and sign their messages to each other.
It also uses some features of Coral.
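To make the XOR distance concrete, here is a small sketch that picks the node closest to a key. The 4-bit IDs are toy values for readability; real node IDs are hashes of public keys, and the class and method names are only illustrative.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

class XorDistanceDemo {
    // Kademlia distance: treat both IDs as unsigned integers and XOR them.
    static BigInteger distance(BigInteger a, BigInteger b) {
        return a.xor(b);
    }

    // Returns the node ID with the smallest XOR distance to the key.
    static BigInteger closest(BigInteger key, List<BigInteger> nodes) {
        BigInteger best = nodes.get(0);
        for (BigInteger node : nodes) {
            if (distance(key, node).compareTo(distance(key, best)) < 0) {
                best = node;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<BigInteger> nodes = Arrays.asList(
                new BigInteger("0001", 2),
                new BigInteger("0100", 2),
                new BigInteger("1110", 2));
        BigInteger key = new BigInteger("0110", 2);
        // Distances are 0111, 0010 and 1000, so node 0100 is the closest.
        System.out.println(closest(key, nodes).toString(2)); // prints 100
    }
}
```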

Block Exchange

The BitTorrent protocol exchanges data in blocks (pieces). These pieces are exchanged according to a strategy such as tit-for-tat or rarest-piece-first. IPFS uses the BitSwap strategy.
Unlike BitTorrent, BitSwap is not limited to the blocks in one torrent. BitSwap operates as a persistent marketplace where nodes can acquire the blocks they need, regardless of what files those blocks are part of.
This strategy makes use of BitSwap credit and a debt ratio. A peer's debt ratio grows as it receives more bytes than it has sent back. Nodes send blocks to debtor peers probabilistically.
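The IPFS paper gives the debt ratio as r = bytes_sent / (bytes_recv + 1), where bytes_sent is what this node has sent to the peer, and the probability of sending to that peer as 1 - 1 / (1 + exp(6 - 3r)). A minimal sketch of those two formulas follows; the class and method names are my own.

```java
class BitswapDemo {
    // Debt ratio of a peer, seen from the node that is deciding whether to send.
    static double debtRatio(long bytesSentToPeer, long bytesReceivedFromPeer) {
        return (double) bytesSentToPeer / (bytesReceivedFromPeer + 1);
    }

    // Probability of sending another block to a peer with debt ratio r.
    static double sendProbability(double r) {
        return 1.0 - 1.0 / (1.0 + Math.exp(6.0 - 3.0 * r));
    }

    public static void main(String[] args) {
        // A peer that owes nothing is almost always served.
        System.out.println(sendProbability(debtRatio(0, 99)));
        // At r = 2 the exponent is zero, so the probability is exactly one half.
        System.out.println(sendProbability(debtRatio(200, 99))); // prints 0.5
        // Beyond that the probability falls off quickly.
        System.out.println(sendProbability(debtRatio(300, 99)));
    }
}
```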

Merkle DAG

IPFS objects are closely related to Git objects. IPFS builds a Merkle DAG, a directed acyclic graph where links between objects are cryptographic hashes of the targets embedded in the sources.
The hash is a multihash, defined as
```
<1-byte hash function code><1-byte digest size in bytes><hash function output>
```
Most hashes start with "Qm" because the hash function is SHA-256 and the digest length is 32 bytes: the two prefix bytes 0x12 and 0x20 become "Qm" once the multihash is Base58-encoded.
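The "Qm" prefix can be checked with a few lines of code. This sketch builds a SHA-256 multihash by hand and Base58-encodes it. It hashes raw bytes rather than a serialized IPFS object, so the result will not match what ipfs add prints for the same content; only the format is the same. The class and method names are illustrative.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class MultihashDemo {
    private static final String ALPHABET =
            "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

    // Prefixes a SHA-256 digest with <0x12 = sha2-256><0x20 = 32 bytes>.
    static byte[] sha256Multihash(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            byte[] out = new byte[2 + digest.length];
            out[0] = 0x12;
            out[1] = 0x20;
            System.arraycopy(digest, 0, out, 2, digest.length);
            return out;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Base58 encoding with the Bitcoin alphabet.
    static String base58(byte[] bytes) {
        BigInteger n = new BigInteger(1, bytes);
        BigInteger base = BigInteger.valueOf(58);
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(base);
            sb.append(ALPHABET.charAt(qr[1].intValue()));
            n = qr[0];
        }
        // Each leading zero byte is written as the first alphabet character.
        for (int i = 0; i < bytes.length && bytes[i] == 0; i++) {
            sb.append('1');
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        String encoded = base58(sha256Multihash(data));
        System.out.println(encoded.startsWith("Qm") + " " + encoded.length());
    }
}
```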

Mutable Namespace

IPFS Objects and Merkle DAG

An IPFS object has the following structure:

Links - an array of the links it references
Data - a byte array; a blob smaller than 256 KB

An IPFS link has the following structure:

Name - string name of the link
Hash - hash of the linked IPFS object
Size - total size of the target object

So here I have an example directory. file.txt and ss have the same content, so their hashes have to be the same.

ipfs object get QmawZYe7nVgbonstM9YLkbJPrwaSMAJ7nkWsPFxHJbCLRF
on the root object gives this output.

{
  "Links": [
    {
      "Name": "2A94M5J1Z",
      "Hash": "QmNhPUwuUQ1uD1n22h2CEBFLKPCExCiVc7rcgHmMftmzsv",
      "Size": 12562
    },
    {
      "Name": "bank-full.csv",
      "Hash": "QmXhyWEd21XEv4pJGHbxoFq6oud3HhADQjw6f5xR4NwDvo",
      "Size": 4611473
    },
    {
      "Name": "file.txt",
      "Hash": "QmXrP2yBFo1jvWw2WnY1mdCYJdiabW1WCmQwsYw1Ltfd2M",
      "Size": 32
    },
    {
      "Name": "shogun",
      "Hash": "QmdWtUhQzAX6e2xpDxZTJEwobHzUTuuVBWaYM8D5rzMTQs",
      "Size": 622130
    }
  ],
  "Data": "\u0008\u0001"
}

As you can see, the link names are the names of the files or directories, but the links inside an individual file have no names. Also, if a file is smaller than 256 KB it does not reference any other objects, i.e. its Links array is empty. file.txt is small and bank-full.csv is large.
ipfs object get QmXhyWEd21XEv4pJGHbxoFq6oud3HhADQjw6f5xR4NwDvo
on bank-full.csv

{
  "Links": [
    {
      "Name": "",
      "Hash": "QmRA9jHW1DFa4brtGSSmWeEpXRX5apS7zxvAfgbJ3F599N",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmNN8xinNToC6sz7xHMcBe6YPyd8Ryx3wWqkEeRYUTEEhn",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmbSXZPGz7GiMz3iP6r7V6zMCxhT2EzTZGVkdJc3mcXPkj",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmUEEzoSFDVQwKSZmQMW8U79jUptjLkJAcjMbZjoWrsnKa",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmQwWkwAiHDuuuYTX8S1Hbks7USkfaD7A5Vf8Qmpyz1uaP",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmY4QEsrrCWdmqAKtSUZWtpTPd58niySdHsq4YXH59ZpiK",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "Qmbp6oskBFZGE3AQhjmm8ZRzZ1rCRaWzUg34zQK7SP8Mxm",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmQyn37YawL1mCGs3SNmyLNRi1AuXsaNNwWVxkzuomTvQX",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmbjD9fqBk9kGF9W5vFFLHcnfiiXH8zE2pVRwBTWjxGdV3",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmU42pLqrKNp3hDNfgw74omWaqLrjMBWw3Uvx98d2CNn2u",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmPNrToiZUfUEC2w75bw51GizQPP9xwm6wa56vKgGfHZW3",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmSu1UK8xqvDbWZSTvHzYxEPz2qLNTcii5NVd7NSnDcSAm",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmVxGKfp77DfPUjvzKfKx8bpYDbSHZtrmSXzz8wyD7t7nH",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmZaiEbhTiXt7rvwPSR9FS6WyEosj2KmZdLqxPeZ8WCYrt",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmWhkGkiw5REEqntnke2v6SbzqpF5SctuwKtwngu28sARv",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmQsv8Nbfvt1RjtxbU5gyQLVprJ6Uz81N5HdAiffJ6zRoX",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmTZwqWCVjs876DQBQxZhG5XngVPpXAz8h8fsRonMQGruW",
      "Size": 262158
    },
    {
      "Name": "",
      "Hash": "QmaMFU4hByFEpAX6ZEvcvia2Su8xwbKMfTTL74VMh3rYRM",
      "Size": 153914
    }
  ],
  "Data": "\b\u0002\u0018���\u0002 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\u0010 ��\t"
}

ipfs object get QmXrP2yBFo1jvWw2WnY1mdCYJdiabW1WCmQwsYw1Ltfd2M
on file.txt or ss

{
  "Links": [],
  "Data": "\b\u0002\u0012\u0018dfad\nf\nc\na\nadkfakdfmaaa\n\u0018\u0018"
}

Merkle DAG

A Merkle tree/DAG is used in Git objects, Bitcoin, and cryptography in general. Each node has a hash, which is the hash of its children's hashes combined. The root hash is the final hash of the object.

The leaf nodes contain the data, and their Links arrays are empty. A large file made of many IPFS objects/blocks does not have link names for each one. graphmd can also be used to visualize the graph.
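The "hash of its children's hashes" idea can be sketched as follows. This is a simplification: a real IPFS object hashes its serialized form, including link names and sizes, whereas here the parent hash is just SHA-256 over the concatenated child hashes. All names are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

class MerkleDemo {
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A leaf is identified by the hash of its data.
    static String leafHash(String data) {
        return sha256Hex(data.getBytes(StandardCharsets.UTF_8));
    }

    // A parent is identified by the hash of its children's hashes, in order.
    static String parentHash(List<String> childHashes) {
        return sha256Hex(String.join("", childHashes).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String before = parentHash(Arrays.asList(leafHash("block-1"), leafHash("block-2")));
        String after = parentHash(Arrays.asList(leafHash("block-1"), leafHash("block-2 edited")));
        // Editing one leaf changes the root hash as well.
        System.out.println(before.equals(after)); // prints false
    }
}
```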

Versioning
IPFS uses Git-like commit trees. Unchanged files point to previous objects. In IPFS, files are split into blocks, so if part of a large file changes, only the new objects are added to the tree; the rest is deduplicated. The versioned object types are:

block : a variable-size block of data.
list : a collection of blocks or other lists.
tree : a collection of blocks, lists, or other trees.
commit : a snapshot in the version history of a tree.
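The deduplication described above can be illustrated with a toy content-addressed store. This is only a sketch: it uses fixed-size chunks and a set of chunk hashes, and the names addFile and store are made up. Note that with fixed-size chunks an edit that keeps the byte offsets intact changes only one chunk, while an insertion that shifts the following bytes would change every chunk after it.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

class ChunkDedupDemo {
    static String hash(byte[] chunk) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(chunk);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Splits the file into fixed-size chunks, records each chunk hash in the store,
    // and returns how many chunks were not already present.
    static int addFile(Set<String> store, byte[] file, int chunkSize) {
        int added = 0;
        for (int from = 0; from < file.length; from += chunkSize) {
            int to = Math.min(from + chunkSize, file.length);
            if (store.add(hash(Arrays.copyOfRange(file, from, to)))) {
                added++;
            }
        }
        return added;
    }

    public static void main(String[] args) {
        byte[] v1 = new byte[1024];
        for (int i = 0; i < v1.length; i++) {
            v1[i] = (byte) (i / 256); // four chunks with different content
        }
        byte[] v2 = v1.clone();
        v2[700] = 99; // change one byte inside the third chunk

        Set<String> store = new HashSet<>();
        System.out.println(addFile(store, v1, 256)); // prints 4
        System.out.println(addFile(store, v2, 256)); // prints 1
    }
}
```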

Sharing Files

Coming to the main use case: sharing files with peers. I added some files, sent the hash to my friend, and asked him to run

ipfs get QmawZYe7nVgbonstM9YLkbJPrwaSMAJ7nkWsPFxHJbCLRF

But it did not download on his PC even after waiting a long time, and I didn't understand why. He was also not in my list of peers (ipfs swarm peers), yet he was able to download the files via the browser.

https://ipfs.io/ipfs/QmawZYe7nVgbonstM9YLkbJPrwaSMAJ7nkWsPFxHJbCLRF

The problem was that he had a different version of ipfs than mine. You can check the version via ipfs id.

"AgentVersion": "go-libp2p/0.1.0",
"ProtocolVersion": "ipfs/0.1.0"

Once we were both on the same version, I was able to download the file he sent instantly. His id is the highlighted one.

I was also able to download the file via java-ipfs-api.

Things to note: java-ipfs-api requires JDK 1.8 as the target. When I tried to run my code I got a major.minor version error. Also, the daemon must be running before the code is run:

ipfs daemon

IPFS objects added via ipfs add are pinned. You can list them all:

ipfs pin ls shows all the pinned IPFS objects, and you serve them while the daemon is running.

IPNS

If the content changes, the hash changes. If you need to serve mutable content you can do so via IPNS. All you have to do is publish the IPFS path under your public key.

ipfs name publish /ipfs/QmawZYe7nVgbonstM9YLkbJPrwaSMAJ7nkWsPFxHJbCLRF

/ipns/<your public key> will then resolve to the contents linked above. With this IPNS link we can later publish a new IPFS path under our public key, and other users do not need to be given the new IPFS link.

Monday, 9 May 2016

Some silly questions & xml-pull request

Questions

To debug, I add the following line to zeppelin-env.sh:
```
export ZEPPELIN_JAVA_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=9009"
```
I use the IntelliJ IDEA IDE and attach a remote debugger. I can set breakpoints and watch all the variables. But when I run a paragraph, it takes a very long time and I get a SocketException as the paragraph output; if I comment out the debug line in zeppelin-env.sh, I don't get the error. Why?
The dependency structure of the project is as follows:
zeppelin-server <= zeppelin-zengine <= zeppelin-interpreter
So do I always have to package the root project, or just the module in which I made the change, followed by the zeppelin root package?
```
<dependency>
      <groupId>${project.groupId}</groupId>
      <artifactId>zeppelin-interpreter</artifactId>
      <version>${project.version}</version>
</dependency>
```
Where does it look for the dependency: in the target folder of zeppelin-interpreter or in the .m2 local repository?

XML-related issues

The config member of the Paragraph class is a Map<String,Object>, or to be precise a HashMap<String,Object>. Following the flow, this is how it is set:

In NotebookServer, the onMessage() function has a switch on the operation. The updateParagraph and runParagraph methods take the config value, and other values such as params, from Message.data. The runtime type of the config value is com.google.gson.internal.StringMap. Its values can themselves be Float, Integer, Boolean, or ArrayList<StringMap>. So the issue is that only the root-level entries are mapped to XML, and the graph element of config is empty in the XML.
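To show why the nested entries need special handling, here is a hand-rolled sketch that walks a nested map recursively and writes every level as XML. It is not JAXB and not the XmlNotebookRepo code; a JAXB solution would more likely go through an XmlAdapter. The keys colWidth, graph, and mode are only sample values, and the sketch assumes every key is a valid XML element name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NestedConfigDemo {
    // Writes a value as an element; maps and lists are descended into recursively.
    static String toXml(String name, Object value) {
        StringBuilder sb = new StringBuilder();
        sb.append('<').append(name).append('>');
        if (value instanceof Map) {
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append(toXml(String.valueOf(e.getKey()), e.getValue()));
            }
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                sb.append(toXml("item", item)); // "item" is an arbitrary wrapper name
            }
        } else {
            sb.append(escape(String.valueOf(value)));
        }
        sb.append("</").append(name).append('>');
        return sb.toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Map<String, Object> graph = new LinkedHashMap<>();
        graph.put("mode", "table");
        Map<String, Object> config = new LinkedHashMap<>();
        config.put("colWidth", 12.0);
        config.put("graph", graph);
        System.out.println(toXml("config", config));
    }
}
```

A mapper that stops at the first level would emit an empty graph element here, which matches the symptom described above.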

The second issue is mapping back from XML to Note. The error I am getting just after unmarshalling is an InvocationTargetException, not a JAXBException, so I was not able to debug that line properly.

To work around this, I delete the notes in notebook-xml and rerun, so that NotebookRepoSync's sync(0,1) converts them back to XML for me.

Monday, 25 April 2016

Week 1 - XmlNotebookRepo (Store Notebooks in XML format)

Goal

The goal is to have an .xml representation of the notebook persisted in the local filesystem alongside the existing .json one. It could be just note.xml in the same folder, or it could be `./notebook-xml/<noteId>/note.xml`.
It should save the same notebook, but in XML format, just in the local filesystem.

So here is how I approached it.

I created the XmlNotebookRepo Java class in the package org.apache.zeppelin.notebook.repo, copied the code from VFSNotebookRepo, and changed the storage directory at this line:
```
this.filesystemRoot = new URI(new File(
        conf.getRelativeDir(filesystemRoot.getPath() + "-xml")).getAbsolutePath()); 
```

I added the following to zeppelin-site.xml:

<property>
        <name>zeppelin.notebook.storage</name>
        <value>org.apache.zeppelin.notebook.repo.XmlNotebookRepo</value>
        <description>notebook persistence layer implementation</description>
</property>

So now zeppelin.notebook.storage should have two values, but while remote debugging I found that it has only one. I also tried uncommenting the GitNotebookRepo storage property, but there was still only one value.

The allStorageClassNames variable did not contain comma-separated class names.
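The variable name suggests that several repos are expected as one comma-separated property value rather than as repeated property entries. Under that assumption, which I have not verified against the Zeppelin source, the parsing would look roughly like this sketch; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class StorageNamesDemo {
    // Splits a single property value into the individual class names.
    static List<String> parse(String allStorageClassNames) {
        List<String> names = new ArrayList<>();
        for (String part : allStorageClassNames.split(",")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String value = "org.apache.zeppelin.notebook.repo.VFSNotebookRepo, "
                + "org.apache.zeppelin.notebook.repo.XmlNotebookRepo";
        System.out.println(parse(value));
    }
}
```

If that assumption holds, a second property block with the same name would simply replace or be shadowed by the first, which would explain seeing only one value.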

So I proceeded with XmlNotebookRepo by itself.

JAXB Usage

I read about JAXB and wrote some examples with Employee, Student, and Address classes using composition. I generated the XML output and tried different annotations: @XmlRootElement, @XmlElementWrapper, @XmlElement, @XmlAccessorType(XmlAccessType.FIELD), @XmlTransient, etc. This blog [1] was quite useful.

A no-arg constructor is required.
Public getters/setters or @XmlElement are needed for a field to be mapped.
Java collections such as Map, List, and Set are also mapped to XML. Use @XmlElementWrapper to create a wrapping element.

Mapping of Interfaces.

Caused by: com.sun.xml.bind.v2.runtime.IllegalAnnotationsException: 2 counts of IllegalAnnotationExceptions java.util.List is an interface, and JAXB can't handle interfaces

The solution is to use @XmlAnyElement together with @XmlRootElement, and to pass the .class of each implementing class to JAXBContext.newInstance. The following classes demonstrate this [2]:

Address.java
Cow.java
Employee.java
Student.java
XMLTest.java

Now moving to Zeppelin: Note.java has to be modified with the annotations, and Paragraph.java needs to be modified too, but I thought I would start with the simple fields first and later move to complex members like AngularObjects, paragraphs, config, and Info.

Fields like NoteInterpreterLoader, JobListenerFactory, and NotebookRepo have to be transient, since we don't want them mapped in the XML file. I used the @XmlTransient annotation, but I am still getting an error at this line in the XmlNotebookRepo save() method.

Please help. I have spent a day on this and I don't know how to proceed further.
Before running the server again, please delete the note in notebook-xml, as I have not yet handled loadallNotes(), which loads all saved notes. The breakpoint is in the save method of XmlNotebookRepo, which is hit after creating a note in the UI.

Errors are temporary, giving up is permanent

So I have figured out what was wrong. I read about XmlAccessType values such as FIELD, PROPERTY, and PUBLIC_MEMBER, and also about XmlAdapters. Now my Note is partially saved as XML; AngularObjects and the GUI config remain, as does loading of notes, i.e. unmarshalling.

note.xml

working ...

Links

1] http://blog.bdoughan.com/search/label/JAXB

2] Github gist codes

3] My Github repo on branch xml-feature
