.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/07_wiki/plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_examples_07_wiki_plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_07_wiki_plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py:


====================================================
Scalable training of HDP topic models
====================================================

In this demo, we'll look at scalable memoized training of HDP topic models.

To review, our memoized VB algorithm (Hughes and Sudderth, NeurIPS 2013) proceeds like
this pseudocode:

.. code-block:: python

    n_laps_completed = 0
    while n_laps_completed < nLap:

        n_batches_completed_this_lap = 0
        while n_batches_completed_this_lap < nBatch:

            batch_data = next_minibatch()

            # Batch-specific local step
            LPbatch = model.calc_local_params(batch_data, **local_step_kwargs)

            # Batch-specific summary step
            SSbatch = model.get_summary_stats(batch_data, LPbatch)

            # Increment global summary statistics
            SS = update_global_stats(SS, SSbatch)

            # Update global parameters
            model.update_global_params(SS)

            n_batches_completed_this_lap += 1

        n_laps_completed += 1

From a runtime perspective, the important settings a user can control are:

* nBatch : the number of batches
* nLap : the number of required laps (passes through the full dataset) to perform
* local_step_kwargs : dict of keyword arguments that control local step optimization

What happens at each step?
--------------------------

In the local step, we visit each document in the current batch. At each document, we
estimate its local (document-specific) variational posterior. This is done via an
*iterative* algorithm, which is rather expensive: we might need 50, 100, or 200
iterations per document, although each iteration is relatively cheap (roughly linear in
the number of word types in the document and the number of topics).

The summary step simply computes the sufficient statistics for the batch. Usually this
is far faster than the local step, since it is a closed-form computation, not an
iterative estimation.

The global parameter update step is similarly quite fast, because we're using a model
that enjoys conjugacy (e.g., the observation model's global posterior is a Dirichlet,
arising from a Multinomial likelihood paired with a Dirichlet prior).

Thus, the *local step* is the runtime bottleneck.

Runtime vs nBatch
-----------------

It may be tempting to think that smaller minibatches (increasing nBatch) will make the
code go "faster". However, if you fix the number of laps to be completed, increasing the
number of batches leads to strictly *more* work. To see why, consider the work performed
during each requested lap:

* the *same* number of per-document local update iterations are completed
* the *same* number of per-document summaries are completed
* the total number of global parameter updates is exactly nBatch

For scaling to large datasets, the important thing is *not* to keep the number of laps
the same, but to keep the wallclock runtime the same, and then to ask how much progress
is made in reducing the loss (either training loss or validation loss, whichever is more
relevant). Running with larger nBatch values will usually give improved progress in the
same amount of time.
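
To make this lap-by-lap accounting concrete, here is a small back-of-the-envelope sketch
(added for illustration, not part of the original example script). The variable names
are hypothetical, and it assumes the worst case in which every document uses its full
local iteration budget:

.. code-block:: python

    # Hypothetical work accounting for one lap through a fixed corpus.
    # Assumes every document runs all of its local iterations (no early stopping).
    n_docs_total = 6400      # documents visited per lap
    iters_per_doc = 100      # nCoordAscentItersLP, with convThrLP < 0

    for n_batch in [1, 16, 64]:
        local_iters_per_lap = n_docs_total * iters_per_doc   # unchanged by n_batch
        global_updates_per_lap = n_batch                     # one update per batch
        print('nBatch=%3d | local iters per lap: %d | global updates per lap: %d'
              % (n_batch, local_iters_per_lap, global_updates_per_lap))

The per-document local work is identical for every setting of nBatch; only the (much
cheaper) global updates become more frequent.
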
Runtime vs Local Step Convergence Thresholds
--------------------------------------------

Since the local step dominates the cost of each update, managing the runtime of the
local iterations is important. There are two settings in the code that control this:

* nCoordAscentItersLP : the number of local step iterations to perform per document
* convThrLP : the threshold used to decide if local step updates have converged

The local step pseudocode is:

.. code-block:: python

    for each document d:
        for iter in [1, 2, ..., nCoordAscentItersLP]:

            # Update q(\pi_d), the variational posterior for document d's
            # topic probability vector

            # Update q(z_d), the variational posterior for document d's
            # topic-word discrete assignments

            # Compute N_d1, ... N_dK, the expected count of each topic k in document d

            if iter % 5 == 0: # every 5 iterations, check for early convergence

                # Quit early if no N_dk entry changes by more than convThrLP

Thus, setting these local step optimization hyperparameters can be of great practical
importance. Setting convThrLP to -1 (or any number less than zero) will always perform
all the requested iterations. Setting convThrLP to something moderate (like 0.05) will
often reduce the local step cost by 2x or more.
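
To make the early-stopping rule concrete, here is a toy, self-contained sketch (added
for this write-up; it is not bnpy's implementation). The function toy_local_sweep is a
hypothetical stand-in for the real coordinate ascent updates of q(pi_d) and q(z_d);
only the every-5-iterations convergence check mirrors the pseudocode above:

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(0)
    K = 25
    nCoordAscentItersLP = 100
    convThrLP = 0.05

    # N_d holds the expected topic counts N_d1, ..., N_dK for one document
    N_d = 100.0 * rng.random(K)

    def toy_local_sweep(N_d, it):
        # stand-in for one coordinate ascent sweep; changes shrink as iterations proceed
        return N_d + rng.standard_normal(K) / (it ** 2)

    N_d_at_last_check = N_d.copy()
    for it in range(1, nCoordAscentItersLP + 1):
        N_d = toy_local_sweep(N_d, it)
        if it % 5 == 0:
            # quit early if no N_dk entry changed by more than convThrLP since the last check
            if np.max(np.abs(N_d - N_d_at_last_check)) < convThrLP:
                print('converged early after %d iterations' % it)
                break
            N_d_at_last_check = N_d.copy()
    else:
        print('used all %d iterations' % nCoordAscentItersLP)

With a negative convThrLP the check can never be satisfied, so all
nCoordAscentItersLP iterations run, exactly as in the experiments below.
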
.. GENERATED FROM PYTHON SOURCE LINES 104-111

.. code-block:: default


    import bnpy
    import numpy as np
    import os

    import matplotlib.pyplot as plt


.. GENERATED FROM PYTHON SOURCE LINES 112-115

Read the text dataset from file.
Keep the first 6400 documents so we have a nice even number.

.. GENERATED FROM PYTHON SOURCE LINES 116-126

.. code-block:: default


    dataset_path = os.path.join(bnpy.DATASET_PATH, 'wiki')
    dataset = bnpy.data.BagOfWordsData.LoadFromFile_ldac(
        os.path.join(dataset_path, 'train.ldac'),
        vocabfile=os.path.join(dataset_path, 'vocab.txt'))

    # Keep 6400 documents with at least 50 words
    doc_ids = np.flatnonzero(dataset.getDocTypeCountMatrix().sum(axis=1) >= 50)
    dataset = dataset.make_subset(docMask=doc_ids[:6400], doTrackFullSize=False)


.. GENERATED FROM PYTHON SOURCE LINES 127-131

Train scalable HDP topic models
-------------------------------

Vary the number of batches and the local step convergence threshold.

.. GENERATED FROM PYTHON SOURCE LINES 131-171

.. code-block:: default


    # Model kwargs
    gamma = 25.0
    alpha = 0.5
    lam = 0.1

    # Initialization kwargs
    K = 25

    # Algorithm kwargs
    nLap = 5
    traceEvery = 0.5
    printEvery = 0.5
    convThr = 0.01

    for row_id, convThrLP in enumerate([-1.00, 0.25]):

        local_step_kwargs = dict(
            # perform at most this many iterations at each document
            nCoordAscentItersLP=100,
            # stop local iters early when max change in doc-topic counts < this thr
            convThrLP=convThrLP,
            )

        for nBatch in [1, 16]:
            output_path = '/tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=%d-nCoordAscentItersLP=%s-convThrLP=%.3g/' % (
                nBatch, local_step_kwargs['nCoordAscentItersLP'], convThrLP)

            trained_model, info_dict = bnpy.run(
                dataset, 'HDPTopicModel', 'Mult', 'memoVB',
                output_path=output_path,
                nLap=nLap, nBatch=nBatch, convThr=convThr,
                K=K, gamma=gamma, alpha=alpha, lam=lam,
                initname='randomlikewang',
                moves='shuffle',
                traceEvery=traceEvery, printEvery=printEvery,
                **local_step_kwargs)


.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Dataset Summary:
    BagOfWordsData
      total size: 6400 units
      batch size: 6400 units
      num. batches: 1
    Allocation Model: HDP model with K=0 active comps. gamma=25.00. alpha=0.50
    Obs. Data Model: Multinomial over finite vocabulary.
    Obs. Data Prior: Dirichlet over finite vocabulary
      lam = [0.1 0.1] ...
    Initialization:
      initname = randomlikewang
      K = 25 (number of clusters)
      seed = 1607680
      elapsed_time: 0.0 sec
    Learn Alg: memoVB | task 1/1 | alg. seed: 1607680 | data order seed: 8541952
    task_output_path: /tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=1-nCoordAscentItersLP=100-convThrLP=-1/1
        1.000/5 after 21 sec. | 629.3 MiB | K 25 | loss 8.524931808e+00 |
        2.000/5 after 41 sec. | 642.8 MiB | K 25 | loss 8.424868135e+00 | Ndiff64262.723
        3.000/5 after 62 sec. | 642.8 MiB | K 25 | loss 8.281559226e+00 | Ndiff96336.179
        4.000/5 after 82 sec. | 642.8 MiB | K 25 | loss 8.189309395e+00 | Ndiff92668.664
        5.000/5 after 103 sec. | 642.8 MiB | K 25 | loss 8.132816746e+00 | Ndiff73088.704
    ... done. not converged. max laps thru data exceeded.

    Dataset Summary:
    BagOfWordsData
      total size: 6400 units
      batch size: 400 units
      num. batches: 16
    Allocation Model: HDP model with K=0 active comps. gamma=25.00. alpha=0.50
    Obs. Data Model: Multinomial over finite vocabulary.
    Obs. Data Prior: Dirichlet over finite vocabulary
      lam = [0.1 0.1] ...
    Initialization:
      initname = randomlikewang
      K = 25 (number of clusters)
      seed = 1607680
      elapsed_time: 0.0 sec
    Learn Alg: memoVB | task 1/1 | alg. seed: 1607680 | data order seed: 8541952
    task_output_path: /tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=16-nCoordAscentItersLP=100-convThrLP=-1/1
        0.062/5 after 1 sec. | 337.2 MiB | K 25 | loss 9.531189882e+00 |
        0.125/5 after 3 sec. | 359.2 MiB | K 25 | loss 9.006368522e+00 |
        0.188/5 after 4 sec. | 339.5 MiB | K 25 | loss 8.806502436e+00 |
        0.500/5 after 11 sec. | 359.8 MiB | K 25 | loss 8.467480192e+00 |
        1.000/5 after 22 sec. | 383.4 MiB | K 25 | loss 8.320662377e+00 |
        1.500/5 after 32 sec. | 383.4 MiB | K 25 | loss 8.248913541e+00 | Ndiff 6033.618
        2.000/5 after 43 sec. | 379.6 MiB | K 25 | loss 8.164549711e+00 | Ndiff52349.841
        2.500/5 after 54 sec. | 384.2 MiB | K 25 | loss 8.151080023e+00 | Ndiff52349.841
        3.000/5 after 65 sec. | 406.7 MiB | K 25 | loss 8.116495124e+00 | Ndiff35271.864
        3.500/5 after 76 sec. | 360.5 MiB | K 25 | loss 8.109997041e+00 | Ndiff 3247.260
        4.000/5 after 87 sec. | 363.0 MiB | K 25 | loss 8.092532637e+00 | Ndiff58174.730
        4.500/5 after 98 sec. | 380.8 MiB | K 25 | loss 8.087635863e+00 | Ndiff 2104.671
        5.000/5 after 109 sec. | 403.6 MiB | K 25 | loss 8.077369841e+00 | Ndiff61984.353
    ... done. not converged. max laps thru data exceeded.

    Dataset Summary:
    BagOfWordsData
      total size: 6400 units
      batch size: 6400 units
      num. batches: 1
    Allocation Model: HDP model with K=0 active comps. gamma=25.00. alpha=0.50
    Obs. Data Model: Multinomial over finite vocabulary.
    Obs. Data Prior: Dirichlet over finite vocabulary
      lam = [0.1 0.1] ...
    Initialization:
      initname = randomlikewang
      K = 25 (number of clusters)
      seed = 1607680
      elapsed_time: 0.0 sec
    Learn Alg: memoVB | task 1/1 | alg. seed: 1607680 | data order seed: 8541952
    task_output_path: /tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=1-nCoordAscentItersLP=100-convThrLP=0.25/1
        1.000/5 after 13 sec. | 680.6 MiB | K 25 | loss 8.532773413e+00 |
        2.000/5 after 24 sec. | 693.8 MiB | K 25 | loss 8.433359333e+00 | Ndiff59669.458
        3.000/5 after 34 sec. | 693.8 MiB | K 25 | loss 8.289494765e+00 | Ndiff65081.816
        4.000/5 after 43 sec. | 693.8 MiB | K 25 | loss 8.197018858e+00 | Ndiff77516.592
        5.000/5 after 51 sec. | 693.8 MiB | K 25 | loss 8.140545185e+00 | Ndiff73205.620
    ... done. not converged. max laps thru data exceeded.
    Dataset Summary:
    BagOfWordsData
      total size: 6400 units
      batch size: 400 units
      num. batches: 16
    Allocation Model: HDP model with K=0 active comps. gamma=25.00. alpha=0.50
    Obs. Data Model: Multinomial over finite vocabulary.
    Obs. Data Prior: Dirichlet over finite vocabulary
      lam = [0.1 0.1] ...
    Initialization:
      initname = randomlikewang
      K = 25 (number of clusters)
      seed = 1607680
      elapsed_time: 0.0 sec
    Learn Alg: memoVB | task 1/1 | alg. seed: 1607680 | data order seed: 8541952
    task_output_path: /tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=16-nCoordAscentItersLP=100-convThrLP=0.25/1
        0.062/5 after 1 sec. | 362.3 MiB | K 25 | loss 9.545032888e+00 |
        0.125/5 after 1 sec. | 370.9 MiB | K 25 | loss 9.017962537e+00 |
        0.188/5 after 2 sec. | 370.9 MiB | K 25 | loss 8.817299651e+00 |
        0.500/5 after 5 sec. | 371.9 MiB | K 25 | loss 8.476303437e+00 |
        1.000/5 after 9 sec. | 411.0 MiB | K 25 | loss 8.328292116e+00 |
        1.500/5 after 13 sec. | 391.5 MiB | K 25 | loss 8.255930038e+00 | Ndiff 5696.282
        2.000/5 after 17 sec. | 414.0 MiB | K 25 | loss 8.171153832e+00 | Ndiff60840.642
        2.500/5 after 21 sec. | 368.2 MiB | K 25 | loss 8.157750198e+00 | Ndiff60840.642
        3.000/5 after 25 sec. | 411.6 MiB | K 25 | loss 8.123241075e+00 | Ndiff56677.641
        3.500/5 after 29 sec. | 413.5 MiB | K 25 | loss 8.116762516e+00 | Ndiff 3222.854
        4.000/5 after 33 sec. | 411.8 MiB | K 25 | loss 8.098976267e+00 | Ndiff49797.809
        4.500/5 after 36 sec. | 391.5 MiB | K 25 | loss 8.093910662e+00 | Ndiff 2099.082
        5.000/5 after 40 sec. | 411.8 MiB | K 25 | loss 8.083408441e+00 | Ndiff34527.081
    ... done. not converged. max laps thru data exceeded.
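
Before plotting, it can help to tabulate the headline numbers from the logs above. The
snippet below is an added illustration (not produced by the example script): it simply
re-enters, by hand, the final wallclock times and training losses copied from the
printed output.

.. code-block:: python

    # Final numbers after 5 laps, copied from the run logs printed above.
    # (convThrLP, nBatch) -> (seconds to finish 5 laps, final training loss)
    final_results = {
        (-1.00,  1): (103, 8.133),
        (-1.00, 16): (109, 8.077),
        ( 0.25,  1): ( 51, 8.141),
        ( 0.25, 16): ( 40, 8.083),
    }
    for (convThrLP, nBatch), (sec, loss) in sorted(final_results.items()):
        print('convThrLP=%5.2f nBatch=%2d : %4d sec for 5 laps, final loss %.3f'
              % (convThrLP, nBatch, sec, loss))

The moderate convergence threshold finishes the same 5 laps in roughly half the time,
and for either threshold the nBatch=16 runs reach a lower training loss.
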
.. GENERATED FROM PYTHON SOURCE LINES 172-179

Plot: Training Loss and Laps Completed vs. Wallclock time
----------------------------------------------------------

* Left column: training loss progress vs. wallclock time
* Right column: laps completed vs. wallclock time

Remember: one lap is a complete pass through the entire training set (6400 docs).

.. GENERATED FROM PYTHON SOURCE LINES 179-209

.. code-block:: default


    H = 3; W = 4
    fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(2*W, 2*H), sharex=True, sharey=False)

    for row_id, convThrLP in enumerate([-1.00, 0.25]):
        for nBatch in [1, 16]:
            output_path = '/tmp/wiki/scalability-model=hdp_topic+mult-alg=memoized-nBatch=%d-nCoordAscentItersLP=%s-convThrLP=%.3g/' % (
                nBatch, local_step_kwargs['nCoordAscentItersLP'], convThrLP)

            elapsed_time_T = np.loadtxt(os.path.join(output_path, '1', 'trace_elapsed_time_sec.txt'))
            elapsed_laps_T = np.loadtxt(os.path.join(output_path, '1', 'trace_lap.txt'))
            loss_T = np.loadtxt(os.path.join(output_path, '1', 'trace_loss.txt'))

            ax[row_id, 0].plot(elapsed_time_T, loss_T, '.-',
                label='nBatch=%d, batch_size = %d' % (nBatch, 6400 / nBatch))
            ax[row_id, 1].plot(elapsed_time_T, elapsed_laps_T, '.-',
                label='nBatch=%d' % nBatch)

        ax[row_id, 0].set_ylabel('training loss')
        ax[row_id, 1].set_ylabel('laps completed')
        ax[row_id, 0].set_xlabel('elapsed time (sec)')
        ax[row_id, 1].set_xlabel('elapsed time (sec)')
        ax[row_id, 0].legend(loc='upper right')
        ax[row_id, 0].set_title(('Loss vs Time, local conv. thr. %.2f' % (convThrLP)).replace(".00", ""))
        ax[row_id, 1].set_title(('Laps vs Time, local conv. thr. %.2f' % (convThrLP)).replace(".00", ""))

    plt.tight_layout()
    plt.show()


.. image-sg:: /examples/07_wiki/images/sphx_glr_plot-02-demo=scalable_topic_model-model=hdp_topic+mult_001.png
   :alt: Loss vs Time, local conv. thr. -1, Laps vs Time, local conv. thr. -1, Loss vs Time, local conv. thr. 0.25, Laps vs Time, local conv. thr. 0.25
   :srcset: /examples/07_wiki/images/sphx_glr_plot-02-demo=scalable_topic_model-model=hdp_topic+mult_001.png
   :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 210-222

Lessons Learned
---------------

The local step is the most expensive step in terms of runtime (far more costly than the
summary or global update steps).

Generally, increasing the number of batches has the following effects:

* It increases the total computational work that must be done for a fixed number of laps.
* It improves the model quality achieved in a limited amount of time, unless the batch
  size becomes so small that the global parameter estimates are poor.

We generally recommend:

* using a batch size of roughly 250 to 2000 documents (which means setting
  nBatch = nDocsTotal / batch_size)
* carefully setting the local step convergence threshold (convThrLP could be 0.05 or
  0.25 during training, but probably needs to be smaller when computing likelihoods for
  a document)
* setting the number of iterations per document sufficiently large (you might get away
  with nCoordAscentItersLP = 10 or 25 during training, but you may need at least 50 or
  100 iterations when evaluating likelihoods to be confident in the value)

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 5 minutes 4.281 seconds)


.. _sphx_glr_download_examples_07_wiki_plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py <plot-02-demo=scalable_topic_model-model=hdp_topic+mult.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot-02-demo=scalable_topic_model-model=hdp_topic+mult.ipynb <plot-02-demo=scalable_topic_model-model=hdp_topic+mult.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_