[Biomod-commits] Re : prevalence and pseudoabsences
Wilfried Thuiller
wilfried.thuiller at ujf-grenoble.fr
Tue Apr 26 07:10:27 CEST 2011
> I was originally modeling my species with presences = pseudo-absences based on my reading of the literature. Then I found the Jimenez-Valverde, Lobo and Hortal paper in Community Ecology (2009) titled "The effect of prevalence and its interaction with sample size on the reliability of species distribution models". Using a virtual species, they found that biased prevalence is only significant with extremely unbalanced samples, given many caveats (such as reliable training data & relevant predictors). In practice, they recommend using as large a sample size as possible, to improve model stability & improve sample coverage over the environmental and spatial gradients of the study area. This includes using as many absences as possible, down to a prevalence of 0.01. They include discussion of appropriate use of probabilities (since they are skewed due to prevalence) & appropriate assessment of model performance (e.g. don't use kappa).
Well... one important thing to remember is that you cannot generalize from virtual species. Virtual species are usually built within a particular context. They are used to understand what could be a problem and sometimes to propose solutions. Concluding that one should use 0.01 is nonsense because it will always depend on the species, the study area, its history and everything else.
I would recommend looking at other papers as well (Phillips et al. 2009, Wisz & Guisan, etc.), which may help give a broader picture.
> Anyway - it is a very interesting paper & made me want to try modeling my species using a prevalence of 0.1. So I ran my models in three ways:
>
> prevalence = 0.1 (presence = 304 / PA = 2736)
> prevalence = 0.5 (presence = 304 / PA = 304)
> prevalence = 0.5 (presence = 304 / PA = 2736 weighted)
>
> I compared ROC and TSS scores for cross-validation, sensitivity and specificity. Models with prevalence = 0.1 had the best specificity scores, but worst CV and sensitivity. Prevalence = 0.5 (304/304) had the best CV and sensitivity scores (with one exception), with specificity second to the prevalence = 0.1 models. Prevalence = 0.5 (304/2736 weighted) was in the middle. Most of these differences were relatively small.
When you have pseudo-absences, it does not make much sense to look at specificity, as your absences are not "true" absences. The general evaluation of the model is sufficient, bearing in mind that the model tries to fit the absences as absences, not as potential presences. This is what I like about using weights: you give pseudo-absences to the models, but you let them know that you do not trust them too much, or at least less than the presences. This is, from my point of view, relatively logical. At the same time, if you have "true" absences, you could give them more weight than the pseudo-absences.
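As a rough illustration of the weighting idea, here is a minimal base R sketch. It assumes the simplest scheme: presences keep weight 1 and every pseudo-absence gets the same weight, chosen so that the weighted prevalence reaches a target value (0.5 by default). The helper name `pa_weights` and its `target_prev` argument are hypothetical, and this is not necessarily how BIOMOD computes its internal weights.

```r
# Hypothetical helper: one common weight for all pseudo-absences.
# status: 1 = presence, 0 = pseudo-absence
pa_weights <- function(status, target_prev = 0.5) {
  n_pres <- sum(status == 1)
  n_abs  <- sum(status == 0)
  # Solve n_pres / (n_pres + w_abs * n_abs) = target_prev for w_abs
  w_abs <- n_pres * (1 - target_prev) / (target_prev * n_abs)
  ifelse(status == 1, 1, w_abs)
}

# Counts taken from the thread: 304 presences, 2736 pseudo-absences
status <- c(rep(1, 304), rep(0, 2736))
w <- pa_weights(status)

raw_prev      <- mean(status)                  # 304 / 3040
weighted_prev <- sum(w[status == 1]) / sum(w)  # prevalence seen by a weighted model
```

With these counts the raw prevalence is 0.1, each pseudo-absence receives a weight of 304/2736 (about 0.11), and the weighted prevalence is 0.5. A vector like `w` is the kind of object one would pass as case weights; trusted "true" absences could simply be given a larger weight than the pseudo-absences.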
> Right now I'm assessing the stability of my 304/304 models to PA pulls, since 304 PAs samples a small number of possible absences (total grid cells in my study area = 6808).
>
> With real (not virtual) data sets, there are obviously many interacting factors that influence final CV/sens/spec scores. I was surprised to see the relatively small differences made by changing prevalence and # of PA records.
I am not surprised. This is what I told you before. Usually the differences are in the probabilities, but as soon as you have transformed them into binary presence-absence, the threshold equalizes the results anyway.
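The point about thresholds can be sketched in base R. The example below uses made-up toy values and picks the cutoff that maximizes TSS (sensitivity + specificity - 1). The helper names `tss_at` and `best_cutoff` are hypothetical, and dividing the probabilities by 3 is only a stand-in for the downward shift caused by a lower prevalence, not a model of it.

```r
# Sensitivity, specificity and TSS of a binary prediction at one cutoff
tss_at <- function(obs, prob, cutoff) {
  pred <- as.integer(prob >= cutoff)
  sens <- sum(pred == 1 & obs == 1) / sum(obs == 1)
  spec <- sum(pred == 0 & obs == 0) / sum(obs == 0)
  c(sensitivity = sens, specificity = spec, TSS = sens + spec - 1)
}

# Scan candidate cutoffs and keep the one with the highest TSS
best_cutoff <- function(obs, prob, cutoffs = sort(unique(prob))) {
  tss <- vapply(cutoffs, function(k) tss_at(obs, prob, k)[["TSS"]], numeric(1))
  cutoffs[which.max(tss)]
}

# Toy data, invented for illustration
obs    <- c(1, 1, 1, 0, 0, 0, 0, 0)
prob_a <- c(0.90, 0.80, 0.60, 0.50, 0.30, 0.20, 0.20, 0.10)
prob_b <- prob_a / 3   # same ranking, lower probabilities

cut_a <- best_cutoff(obs, prob_a)
cut_b <- best_cutoff(obs, prob_b)

# Binary maps obtained with each model's own optimized cutoff
bin_a <- as.integer(prob_a >= cut_a)
bin_b <- as.integer(prob_b >= cut_b)
```

The two probability vectors differ, and so do their optimized cutoffs, but `bin_a` and `bin_b` are identical because the ranking of the cells is unchanged.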
> I'd be interested to hear your & others' thoughts on these issues. I wonder how your upcoming paper using virtual datasets compares with the Jimenez-Valverde et al. paper?
Quite different, I am afraid... We used 10000 simulations, 2 very different virtual species, different sampling strategies, seven modeling techniques from BIOMOD, weighted vs unweighted, background vs pseudo-absence, etc.
The main problem is that the recommendations obviously change with the technique (machine learning vs regression), which makes the paper difficult to read ;-)
Cheers,
Wilfried
>
> Thanks for an interesting discussion!
> Brenna
>
> From: Wilfried Thuiller [wilfried.thuiller at ujf-grenoble.fr]
> Sent: Saturday, April 23, 2011 6:29 AM
> To: Brenna Forester
> Cc: biomod-commits at lists.r-forge.r-project.org
> Subject: Re: Re : [Biomod-commits] prevalence and pseudoabsences
>
> Dear Brenna,
>
>> Thanks Bruno & Wilfried,
>>
>> So to clarify: I run pseudo.abs - in my case as so:
>>
>> # Draw 2736 random pseudo-absences outside of Models()
>> PA1 <- pseudo.abs(coor=Sp.Env[,2:3], status=Sp.Env[,1], strategy="random",
>>     env=Sp.Env[,4:10], nb.points=2736, species.name="Rhodiola",
>>     add.pres=FALSE, create.dataset=TRUE, plot=TRUE, pcol="red", acol="grey80")
>>
>> This creates two objects, "PA1" (a vector of cell numbers chosen as absences) and "Dataset.Rhodiola.random.partial", a dataframe of coordinates and "status" (zero).
>>
>> I would then create a new dataset that has just my presence records (304) and these 2736 absences. I would run that dataset (Sp.Env.PA1) in the Initial.State() and Models() functions, for example, as so:
>>
>> # Column 1 = status (1/0), columns 4 to 10 = environmental predictors
>> Initial.State(Response=Sp.Env.PA1[,1], Explanatory=Sp.Env.PA1[,4:10],
>>     IndependentResponse=NULL, IndependentExplanatory=NULL,
>>     sp.name="Rhodiola")
>>
>> # NbRepPA=0: no internal pseudo-absence pulls, the whole dataset is used
>> Models(GLM = TRUE, TypeGLM = "simple", Test = "AIC", GBM = TRUE, No.trees = 5000,
>>     GAM = TRUE, CTA = TRUE, CV.tree = 100, ANN = TRUE, CV.ann = 5, SRE = FALSE, FDA = TRUE,
>>     MARS = TRUE, RF = TRUE, NbRunEval = 10, DataSplit = 70, Yweights=NULL,
>>     NbRepPA=0, Roc=TRUE, Optimized.Threshold.Roc=TRUE, Kappa=TRUE, TSS=TRUE,
>>     KeepPredIndependent = FALSE, VarImport=5)
>>
>> I keep NbRepPA = 0 so it uses the entire dataset to evaluate the model, maintaining my prevalence at 0.1 (304 presence records/3040 total records in the dataset).
>> I think I am correct on everything to this point?
>
> Yes, you are correct.
>
>> So my question is: I want to do 5 PA pulls (as I would if I ran it in the Models() function, NbRepPA = 5), maintaining my 0.1 prevalence. But I would then have run Models() five times on 5 datasets (each with different PA pulls). How does BIOMOD create a final model when using PA pulls (e.g. NbRepPA = 5) within the Models() function, and can I replicate that when I run my PA pulls manually as above?
>
> There is no final model when using several PA sets. There are as many "final models" as PA sets.
> If you want to use several sets of PA yourself, make predictions from every model (using the Projections function for instance on the overall area). Then you'll need to combine them yourself.
> There are several alternatives for combining projections from different models from different PA sets and from different repetitions from cross-validation:
>
> You can compute a simple average and standard deviation of the projections in probability values. You can then derive a confidence interval if you want.
> You could also perform a weighted sum using weights derived from TSS or ROC, for instance. It will give more weight to the best models (from the cross-validation column in Evaluation.results.TSS).
> You could also perform what we usually call committee averaging, where you let the models vote for a presence or an absence. For this, you do not use the probability of occurrence anymore, but rather the presence-absence data directly. You then sum the presence-absence maps. If you have 5 repetitions, 5 models and 5 sets of PA, the maximum is thus 125. When the sum is equal to 125, it means all repetitions, PA sets and models agree that this is a presence, and when you get zero, it obviously means the reverse. Values between 0 and 125 give you the degree of agreement among the models for a presence (after dividing everything by 125, for instance). This ensemble approach is very close to the Bayesian philosophy with posterior probabilities. I really like this approach, much better than looking at the probabilities of occurrence themselves.
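The three ways of combining projections listed above can be sketched in base R as follows. All inputs are toy values invented for illustration: `prob` stands for a cells-by-runs matrix of projected probabilities, `tss` for per-run cross-validation scores, and `cutoff` for per-run binarization thresholds. None of these object names come from BIOMOD.

```r
# Toy inputs (made up): rows = grid cells, columns = individual model runs
prob <- cbind(run1 = c(0.9, 0.6, 0.2),
              run2 = c(0.8, 0.4, 0.1),
              run3 = c(0.7, 0.7, 0.3))
tss    <- c(run1 = 0.8, run2 = 0.6, run3 = 0.6)   # hypothetical CV scores
cutoff <- c(run1 = 0.5, run2 = 0.5, run3 = 0.5)   # hypothetical thresholds

# 1. Simple mean and standard deviation across runs
ens_mean <- rowMeans(prob)
ens_sd   <- apply(prob, 1, sd)

# 2. Mean weighted by evaluation score (weights rescaled to sum to 1)
ens_wmean <- as.vector(prob %*% (tss / sum(tss)))

# 3. Committee averaging: binarize each run with its own cutoff,
#    then take the share of runs voting "presence" in each cell
votes     <- sweep(prob, 2, cutoff, ">=") * 1
committee <- rowSums(votes) / ncol(votes)
```

With 5 repetitions, 5 models and 5 PA sets, `prob` would simply have 125 columns, and `committee` would be the vote count divided by 125.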
>
> Now, I am not entirely sure why you want to keep your prevalence. Regression-like models are not really good with artificially unbalanced datasets (prevalence different from 0.5). They are supposed to work well if the prevalence is the true prevalence of the species. This is the case with a perfect stratified sampling, but absolutely not when using random sets of pseudo-absences.
> Therefore, the results are usually similar anyway. The main difference is the "true" probability from the models, which will be higher when the pseudo-absences are down-weighted. However, once the probabilities are transformed into 0 and 1, results are usually very similar.
> I think Wisz and Guisan recently showed that using weighted pseudo-absences was better. We also have a paper close to being accepted in Methods in Ecology and Evolution showing the same with virtual datasets.
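To make the effect of down-weighting concrete, here is a small sketch with simulated data and a plain `glm()`. It is not a BIOMOD run: the predictors `x1` and `x2`, their distributions and the seed are all invented for illustration, and only the counts (304 presences, 2736 pseudo-absences) come from the thread.

```r
set.seed(42)  # simulated data, for illustration only

n_pres <- 304
n_abs  <- 2736
dat <- data.frame(
  status = c(rep(1, n_pres), rep(0, n_abs)),
  x1     = c(rnorm(n_pres, mean = 1),   rnorm(n_abs, mean = 0)),
  x2     = c(rnorm(n_pres, mean = 0.5), rnorm(n_abs, mean = 0))
)

# Down-weight pseudo-absences so both classes carry the same total weight
dat$w <- ifelse(dat$status == 1, 1, n_pres / n_abs)

# quasibinomial gives the same coefficients as binomial but avoids the
# warning that binomial() raises for non-integer weights
fit_raw <- glm(status ~ x1 + x2, family = quasibinomial, data = dat)
fit_wtd <- glm(status ~ x1 + x2, family = quasibinomial, data = dat, weights = w)

p_raw <- predict(fit_raw, type = "response")
p_wtd <- predict(fit_wtd, type = "response")

mean_raw  <- mean(p_raw)   # tracks the raw prevalence
mean_wtd  <- mean(p_wtd)   # shifted upwards by the weighting
rank_corr <- cor(p_raw, p_wtd, method = "spearman")
```

In this setup the weighted model returns higher probabilities overall, while the ranking of the records stays almost the same, which is why the binary maps end up so close once each model is cut at its own threshold.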
>
> Hope it helps,
>
> Wilfried
>
>>
>> I hope this isn't too confusing!
>> Thank you!
>> Brenna
>>
>>
>> From: Bruno Lafourcade [brunolafourcade at aol.com]
>> Sent: Thursday, April 21, 2011 11:37 PM
>> To: wilfried.thuiller at ujf-grenoble.fr; Brenna Forester
>> Cc: biomod-commits at r-forge.wu-wien.ac.at
>> Subject: Re : [Biomod-commits] prevalence and pseudoabsences
>>
>>
>> Hi Brenna,
>>
>> The pseudo-absence procedure within the Models function is automated and generates a
>> weighting to give a prevalence of 0.5 for each run.
>>
>> To make sure that the prevalence doesn't change, you have to build your own pseudo-absence
>> data outside of the Models function (even prior to Initial.State). In that way, the Models function
>> will not recognize your data as being pseudo.abs and will not weight them, just like for any
>> standard input data.
>>
>> Use the pseudo.abs() function for this purpose. Don't hesitate to ask for details on how to use it.
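For readers who want to see what building pseudo-absence sets outside of Models() amounts to, here is a base R sketch of repeated random pulls that keep the prevalence at 0.1. It does not call or reproduce pseudo.abs(). The cell layout (IDs 1 to 6808, with the first 304 treated as presences) and all object names are assumptions made for illustration; only the counts come from the thread.

```r
set.seed(1)  # reproducible draws

n_cells <- 6808   # grid cells in the study area
n_pres  <- 304    # presence records
n_pa    <- 2736   # pseudo-absences per pull
n_pulls <- 5

# Hypothetical layout: cell IDs 1..n_cells, the first 304 being presences
cell_id   <- seq_len(n_cells)
pres_id   <- cell_id[seq_len(n_pres)]
candidate <- setdiff(cell_id, pres_id)   # cells eligible as pseudo-absences

# One data frame per pull: all presences plus a fresh random PA sample
pulls <- lapply(seq_len(n_pulls), function(i) {
  pa_id <- sample(candidate, n_pa)       # sampling without replacement
  data.frame(cell   = c(pres_id, pa_id),
             status = c(rep(1, n_pres), rep(0, n_pa)))
})

# Prevalence of each pull: 304 / 3040 in every case
prevalence <- vapply(pulls, function(d) mean(d$status), numeric(1))
```

Each element of `pulls` could then be joined to the environmental table and run as a separate dataset, giving one set of models per pull.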
>>
>> Best,
>> Bruno
>>
>>
>> -------
>> Bruno Lafourcade
>> Statistical tools engineer
>>
>> Laboratoire d'Ecologie Alpine, bureau 308
>> CNRS - UMR 5553, 2233 rue de la piscine
>> 38400 Saint Martin d'Hères
>> -------
>>
>>
>> -----Original message-----
>> From: Wilfried Thuiller <wilfried.thuiller at ujf-grenoble.fr>
>> To: Brenna Forester <forestb at students.wwu.edu>
>> Cc: biomod-commits at lists.r-forge.r-project.org <biomod-commits at r-forge.wu-wien.ac.at>
>> Sent: Friday, 22 April 2011 7:09
>> Subject: Re: [Biomod-commits] prevalence and pseudoabsences
>>
>> Dear Brenna,
>>
>> Yes and no...
>>
>> If you do not ask for pseudo-absences (NbPA=0), there is no weighting and all your pseudo-absences will be used at once. Prevalence = 0.1
>> If you add NbPA = 3040 (or more), yes, there is weighting. Prevalence = 0.5
>>
>> Does it help?
>> Wilfried
>>
>>
>>
>> On 22 Apr 2011, at 00:53, Brenna Forester wrote:
>>
>>> Hello,
>>>
>>> I see in the "Presentation Manual for BIOMOD" (page 18) the following statement: "In all procedures, BIOMOD ensures that the prevalence of the original data is conserved in the calibration and evaluation datasets."
>>>
>>> I have 304 presence records and am running my pseudoabsence pulls with 3040 absences (a prevalence of 0.1). The number of pixels in my study area is 6808.
>>>
>>> From the above quote, I think that BIOMOD is maintaining the original prevalence of 0.1. Is that correct? I just want to be sure that there is no weighting of absence records (e.g. weighting to simulate a prevalence of 0.5).
>>>
>>> Thank you,
>>> Brenna
>>> _______________________________________________
>>> Biomod-commits mailing list
>>> Biomod-commits at lists.r-forge.r-project.org
>>> https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/biomod-commits
>>
>> --------------------------
>> Dr. Wilfried Thuiller
>> Laboratoire d'Ecologie Alpine, UMR CNRS 5553
>> Université Joseph Fourier
>> BP53, 38041 Grenoble cedex 9, France
>> tel: +33 (0)4 76 51 44 97
>> fax: +33 (0)4 76 51 42 79
>>
>> Email: wilfried.thuiller at ujf-grenoble.fr
>> Personal website: http://www.will.chez-alice.fr
>> Team website: http://www-leca.ujf-grenoble.fr/equipes/emabio.htm
>>
>> FP6 European MACIS project: http://www.macis-project.net
>> FP6 European EcoChange project: http://www.ecochange-project.eu
>