Initializing Huge Graphs

Topics: graph
May 26, 2009 at 4:07 AM

Hi,

I need to work with huge graphs (10 million nodes, 200 million edges) with edges Edge<int, int>. Is there a way to allocate and initialize the edges with one memory allocation in an array, preferably as value type, as opposite to creating 200 million tiny objects? How much memory should I expect such a graph to take per vertex/edge?

Thanks,

Vlad 

Coordinator
May 26, 2009 at 4:49 AM

Yes, use SEdge<int,int> instead which is a value type. Let's see: 8-byte * 200 * 10^6 = 1600Mb  and that's a lower bound. If you have a 64bit machine it should not be an issue, otherwize QuickGraph might not be the best tool to deal with large graphs. The PBGL (http://www.osl.iu.edu/research/pbgl/) is probably a more scalable solution.

May 28, 2009 at 2:19 AM
Edited May 28, 2009 at 2:27 AM

Thanks for your suggestion.  I'm trying to replace edges with SEdges, as now without it I'm running out of memory building a graph of 10M edges. (I have a 64bit, 4G Ram machine). But unless QuickGraph has some significant overhead over Boost, i.e. much more than 8 bytes per int-int edge, plus 4 bytes per cost, I would be fine.

 

Just one more question. How would you initialize a static undirected clique on integers, say range [1..10000], with score equal to 1 for every edge (I would like to run a spanning tree algorithm on it)?

Would you use AddVerticesAndEdge range? And how would store costs efficiently? If I understand correctly, just using Dictionary will create a linked list of all values for given hash key, i.e. every int will be stored as an object.

 

Thanks,

 Vlad

May 28, 2009 at 2:25 AM
Edited May 28, 2009 at 2:26 AM

A clarification: I can either write a function

int Cost(SEdge e)

or write my own bijective hash function Edges->[0..num_edges-1] and store the Costs in the array or List.

 

Thanks,

Vlad

Coordinator
May 28, 2009 at 5:52 AM

do you plan to mutate the costs? If not, you could simple use a delegate that returns 1 and be done. If some of the costs would have to be mutated, you could have a small dictionary containing the mutated costs - in that sense you would pay a lesser cost on memory.

You could also implement your own struct that implements IEdge and has a cost field. You probably want to add the vertices (AddVertexRange), then the edges (AddEdgeRange) - it should be slightly more efficient.

May 28, 2009 at 7:30 AM

Thanks for you quick response. No, I will not mutate neither costs nor the graph - everything is static. Costs of 1 in a clique were just an example, I now build an array of costs, but will use STaggedEdge<TVertex, TTag> edge. I assume using tag for edge costs is OK.

 

I just don't quite understand what AddEdgeRange does internally. If I pass it List<SEdge>, since the function expects IEnumerable, it would not reuse the array and instead would allocate memory for Edges individually (and copy them to a List<Edge> anyway ?), 200 million times in my case, am I right?

 

Thanks,

Vlad

 

 

 

Coordinator
May 28, 2009 at 12:45 PM

In your case where the graph is immutable, we probably need a new/better data structure to represent it.

Coordinator
May 28, 2009 at 3:59 PM

I've added ArrayAdjacencyGraph which uses a more consice representation. There's a bug in the contracts rewritter so I have not tested it thorousgly yet.

May 29, 2009 at 12:54 AM

Thanks, that was fast!

I'm reading through the code now, it looks it does not support undirected graph yet. Sorry, if I was not particularly clear, I'm running a spanning tree algorithms on an undirected graph.

Thanks,

Vlad

 

 

May 30, 2009 at 4:44 PM

Hi,

I have an idea how to overcome the scalability problem and I'm interested if it's feasible or, maybe, even already present in QuickGraph.

Is it possible to provide instead of edges enumerators or delegates that given a vertex produce the set of its neighbors? For example, in my particular case vertices are short strings and edges connect similar strings, with weigh equal to the degree of similarity. Instead of storing a huge set of outgoing or incoming edges anywhere to be used just once, I could compute them on as-needed basis. This recomputing will work for near O(n) algorithms, like spanning tree or strongly connected components and not so good for, say, flow. But the good thing is that the graph interface is decoupled from the data structure, and if the recomputation is expensive, the user can store the data in array or what ever is efficient for her and just compute the delegates with ability to reuse all the algorithms implemented in QuickGraph.


One problem that I see here is that pre-computing and storing O(V) delegates is not particularly space efficient, higher order functions would be required, AFAIK, here - QuickGraph can receive a function that given a vertex index computes a function that enumerate the neighbors/edges.  I'm using QuickGraph with F#, and this is easily done there, not familiar with C# yet.


Is this something feasible, already available or I'm missing something major?


Thanks,

Vlad

 

 

Coordinator
May 30, 2009 at 9:26 PM

Take a look at the FunctionalImplicitGraph to see if this is something that could fit your needs. It relies solely on a delegate of the form:

     bool TryFunc<TVertex, IEnumerable<TEdge>>(TVertex v, out IEnumerable<TEdge> edges);

so you can lazily get the out edges and enumerate them.

Coordinator
May 30, 2009 at 9:42 PM

I've renamed it to DelegateImplicitUndirectedGraph. Let's what you can do with this one.

May 31, 2009 at 1:45 AM
Edited May 31, 2009 at 7:00 PM

Thanks,

It looks this will work for me. Do you have any idea how much memory will it take internally if I run a spanning tree algorithm? I guess Kruskal will not be efficient as it has to put in advance all edges into the heap or sort them by cost, so laziness in graph representation will not help much. But Prim's heap contains only the edges between the "supernode" and its complement, so theoretically, we are talking about $O(V)$ worst case memory used internally. Is it how it works?


One newbie question - I could not find build instructions anywhere. How do I create QuickGraph.dll from the source? I opened sources/QuickGraph.sln in Visual Studio 2010 Beta 1, it offered to convert something; then, when I tried to build the project I got a bunch of errors.

 

Coordinator
May 31, 2009 at 5:07 AM

Yeah with respect to the build, just unload whatever project that does not build. Msal/Glee require different components that are not checked in the depot. QuickGraph is the root project and should build.

Coordinator
May 31, 2009 at 5:39 AM

QuickGraph contains Kruskal's algorithm only. I did not migrate Prim's implementation from QuickGraph 1.0.

May 31, 2009 at 7:09 PM

I just noticed that even superlinear algorithms may have no time overhead due to lazy edge computation in implicit graphs. For example, while Prim is superlinear, each edge is iterated only once, so disregarding the differences in cache performance, the time will be exactly the same as if all E edges are precomputed.

Coordinator
May 31, 2009 at 11:34 PM

I correct a previous comment: Prim is supported see http://quickgraph.codeplex.com/Wiki/View.aspx?title=Minimum%20Spanning%20Tree&referringTitle=Home.

Curious to see, how far you can go with the lazy graphs.

Jul 17, 2009 at 5:23 AM

Something wrong goes on with the Delegate graphs. Below I attach an example of a very simple graph where I try to find a spanning tree. Sorry for F# code, in C# I cannot write, only read.

getMEM_filter_simple uses chains_scores_1_read_simple to generate a graph of 99 nodes, with edge i<- >i+1 of weight 0 for all i in [0..97] and 4 edges of non-zero weight. This graph has a maximum spanning tree of weight 40.

When I use Kruskal (the weights are set as "minus tag") I get the correct result of 40. If MinimumSpanningTreeKruskal is replaced with MinimumSpanningTreePrim I get a result of 32 with edge 68-93 missing from the optimal tree. I tried to replace FibonacciHeap

with the BinaryQueue and got  another result for MST: 24! (Side note, I head to modify in "class BinaryQueue<TVertex, TDistance> :     IQueue<TVertex>" IQueue to IPriorityQueue. For some reason it was different than in FibonacciQueue).

 

I did not have such a discrepancy between Kruskal and Prim prior to switching to the Delegate graph. Notice that Kruskal has to precompute all the edges in advance so for Kruskal the Delegates does not matter.

---------------------------------------------------------------------------

 

(*bug.fs*)

#light
open System
open QuickGraph
open QuickGraph.Algorithms
open QuickGraph.Collections
open System.Collections.Generic
open System.IO

let chains_scores_1_read_simple input_read =     
    let edge =
        match input_read with
        |19 -> seq[new SUndirectedTaggedEdge<int, int> (19, 47, 11);new SUndirectedTaggedEdge<int, int> (input_read, input_read+1, 0)]
        |61 ->  seq[new SUndirectedTaggedEdge<int, int> (61, 87, 13);new SUndirectedTaggedEdge<int, int> (input_read, input_read+1, 0)]
        |39 -> seq[new SUndirectedTaggedEdge<int, int> (39, 96, 8);new SUndirectedTaggedEdge<int, int> (input_read, input_read+1, 0)]
        |68 -> seq[new SUndirectedTaggedEdge<int, int> (68, 93, 8);new SUndirectedTaggedEdge<int, int> (input_read, input_read+1, 0)]
        |98 -> Seq.empty
        |input_read -> seq[ new SUndirectedTaggedEdge<int, int> (input_read, input_read+1, 0)]
    edge    
  //  let link_squares (squares: List<Rect>) (output_chain:bool)

let getMEM_filter_simple () =
    new TryFunc<int, IEnumerable<SUndirectedTaggedEdge<int, int>>  >
      (
      fun i ([<Out>] res: byref <IEnumerable<SUndirectedTaggedEdge<int, int>>>  ) ->

           res <-   (chains_scores_1_read_simple   i  )
           Printf.printf "edges=%A\n" (Seq.to_array res)
           not (Seq.isEmpty res))    
     

let build_graph ()= //: seq<int * Array<int>> ) =
    let encoding=new System.Text.ASCIIEncoding()
    let num_reads = 99
    let edge_enum = getMEM_filter_simple ()
    let graph = new DelegateUndirectedGraph<int,SUndirectedTaggedEdge<int, int>>
                     ((seq[0..num_reads - 1]), edge_enum, false)
    let mst_edges = graph.MinimumSpanningTreeKruskal  (fun key -> -float (key.Tag))
    let mutable weight = 0
    for edge in mst_edges do
        weight <- weight + edge.Tag
        Printf.printf "Edge %A\n" edge

    Printf.printf "Total weight=%d\n" weight        

let _ = build_graph()