Genome-Wide Association Study: 2008

Monday, December 15, 2008

RMA

RMA (Robust Multi-array Analysis)
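RMA is a method for normalizing and summarizing the signal intensities of Affymetrix GeneChips at the probe level. Starting from probe-level data, the PM values are background-corrected, normalized, and finally summarized into expression measures. It consists of the following three steps.

Background correction
Background correction is the most important step in probe-level processing. The background correction used in RMA is a non-linear correction, performed per chip. It is based on the distribution of PM values across the probes on an Affymetrix chip. Each PM value is a mixture of true signal and a background signal caused by optical noise, non-specific binding, and so on. The background-corrected value is estimated as the expectation of the signal (S) conditioned on the observed PM value (O), using a kernel density estimation in both GeneSpring GX 7.3.1 and GeneSpring GX 9.0. However, GeneSpring GX 7.3.1 uses direct convolution while GeneSpring GX 9.0 uses a Fast Fourier Transform.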

Normalization
Normalization is necessary so that multiple chips can be compared to each other and analyzed together. The normalization procedure is aimed at making the intensity distributions identical across arrays. The normalization used in RMA is quantile normalization, which usually gives very sharp normalizations. Both GeneSpring GX 7.3.1 and GeneSpring GX 9.0 use quantile normalization. Note that, in this procedure, all the arrays are used and no chip is discarded based on extreme value considerations.
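
As a rough illustration of quantile normalization, here is a minimal pure-Python sketch. It is not the GeneSpring or Bioconductor implementation; the function name and the toy intensities are made up, and ties are simply broken by position.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity lists, one list per chip."""
    n = len(arrays[0])
    # Sort each chip, then average the values that share a rank across chips.
    sorted_cols = [sorted(a) for a in arrays]
    rank_means = [sum(col[i] for col in sorted_cols) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Rank of each probe within its own chip (ties broken by position).
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Three toy chips with four probes each (illustrative values only).
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

After this step every chip holds exactly the same set of values, only in a different probe order, which is what "making the distributions identical" means in practice.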

Summarization
Once the probe-level PM values have been background-corrected and normalized, they need to be summarized into expression measures, so that the result is a single expression measure per probe-set, per chip. The summarization used is motivated by the assumption that observed log-transformed PM values follow a linear additive model containing a probe affinity effect, a gene-specific effect (the expression level) and an error term. For RMA, the probe affinity effects are assumed to sum to zero, and the gene effect (expression level) is estimated using median polishing. Median polishing is a robust model-fitting technique that protects against outlier probes. Both GeneSpring GX 7.3.1 and GeneSpring GX 9.0 use the same methodology for summarization.
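
The additive model and median polish can be sketched as follows. This is a plain Tukey median polish written for illustration, not the RMA source code; the toy probe set (three probes, three chips, one outlier probe value) is invented.

```python
from statistics import median


def median_polish(matrix, n_iter=10):
    """Fit rows (probes) x columns (chips) as overall + row + column + residual."""
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    row_eff, col_eff, overall = [0.0] * n_rows, [0.0] * n_cols, 0.0
    for _ in range(n_iter):
        # Sweep the row (probe) medians out of the residuals.
        for i in range(n_rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep the column (chip) medians out of the residuals.
        for j in range(n_cols):
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid


# log2 PM values: probe effects (-1, 0, 1) plus chip levels (8, 9, 10),
# with probe 0 on chip 0 inflated by 5 to act as an outlier.
log_pm = [[12.0, 8.0, 9.0], [8.0, 9.0, 10.0], [9.0, 10.0, 11.0]]
overall, probe_eff, chip_eff, resid = median_polish(log_pm)
expression = [overall + c for c in chip_eff]  # one value per chip
```

In this toy case the outlier ends up entirely in the residuals and the per-chip expression values come out as 8, 9 and 10, which is the robustness the text refers to.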

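Sunday, December 14, 2008

Affy chip data analysis

Analysis of Affymetrix chip data

Normalize the raw Affymetrix CEL files with the RMA method.

Why? Studies have found RMA to be the best-performing method for detecting expression differences between groups.

Run RMA with the affy package from R/Bioconductor; it performs background correction, quantile normalization, and median polishing. Quantile normalization is a between-chip normalization, while median polishing works within each gene (probe set), summarizing its probes into a single value.

At this stage, use the Affymetrix-related Bioconductor packages to run various QC checks:

MA plots, RLE, NUSE, and so on.

For the subsequent expression comparison, the limma or maanova package should do the job.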

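Thursday, December 11, 2008

Is further normalization needed after RMA/GCRMA normalization?

RMA/GCRMA normalization is performed at the sample (chip) level. Applying a per-chip normalization to chip data that has already been processed this way is therefore redundant. What should be applied instead is per-gene normalization.

Wednesday, December 3, 2008

Meaning of the p-values when --linear or --logistic is used with --covar?

If --covar is used together with --linear/--logistic, PLINK fits a multiple regression and reports the regression coefficient and p-value for the SNP, for each covariate, and for any interaction terms. Only the intercept term is left out.

The p-value on a covariate row is not the p-value for the SNP-phenotype association after adjusting for that covariate. The ADD row gives the p-value for the SNP-phenotype association after the covariates have been accounted for. The rows that follow it give the p-values for the association between each covariate and the phenotype. Such a value may be extremely significant, but that does not mean the SNP has a highly significant effect.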

CHR SNP BP A1 TEST NMISS BETA STAT P
1 rs1234567 742429 G ADD 1495 -0.03335 -0.1732 0.8625
1 rs1234567 742429 G COV1 1495 0.1143 9.748 8.321e-022

In the example above, the covariate is very strongly associated with the phenotype, but that is no evidence that the SNP is associated with the phenotype.
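
To avoid reading the wrong row, one option is to keep only the ADD rows when collecting SNP p-values. A small sketch, using the two result rows above as input (the function name is made up; in practice the text would come from the PLINK output file):

```python
output = """\
CHR SNP BP A1 TEST NMISS BETA STAT P
1 rs1234567 742429 G ADD 1495 -0.03335 -0.1732 0.8625
1 rs1234567 742429 G COV1 1495 0.1143 9.748 8.321e-022
"""


def snp_pvalues(text):
    """Return {SNP: p-value} taken from the ADD rows only."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        if row["TEST"] == "ADD":  # covariate rows (COV1, ...) are skipped
            result[row["SNP"]] = float(row["P"])
    return result


pvals = snp_pvalues(output)
```

Here the SNP is reported with p = 0.8625, not with the tiny p-value that belongs to COV1.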

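Tuesday, October 28, 2008

PLINK manual (Korean translation)

5. Summary statistics

PLINK provides a number of standard summary statistics that are useful for QC, including missing genotype rate, MAF, HWE, and non-Mendelian transmission.

Important: all of the standard summary statistics described below are computed after individuals with a high missing genotype rate have been removed. This is controlled by the --mind option; the default threshold is 0.1, i.e. individuals with more than 10% missing genotypes are excluded.

5.1 Missing genotypes
To calculate genotype missingness rates:
plink --file data --missing
This creates two files, plink.imiss and plink.lmiss.
plink.imiss reports missingness per individual and has the following fields: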

FID Family ID
IID Individual ID
MISS_PHENO Missing phenotype? (Y/N)
N_MISS Number of missing SNPs
N_GENO Number of non-obligatory missing genotypes
F_MISS Proportion of missing SNPs

plink.lmiss reports missingness per SNP and has the following fields:
SNP SNP identifier
CHR Chromosome number
N_MISS Number of individuals missing this SNP
N_GENO Number of non-obligatory missing genotypes
F_MISS Proportion of sample missing for this SNP

HINT: To test for differences in missingness between cases and controls, see the --test-missing option.
HINT: To produce a missingness summary stratified by a categorical cluster variable, use the --within option together with --missing. In this way, the missing rates will be given separately for each level of the categorical variable. For example, the categorical variable could be which plate that sample was on in the genotyping. Details on the format of a cluster file can be found in the PLINK documentation.

5.2 Obligatory missing genotypes
Sometimes genotypes are missing by design rather than because genotyping failed. For example, part of the sample may have been genotyped on only a subset of the SNPs, and you would not want individuals and SNPs to be filtered out because of this kind of missing data. Conversely, specific genotypes (sets of SNPs/individuals) may have been blanked out with the --zero-cluster option, yet you would still want to be able to set sensible missing-data thresholds.

HINT: See the data management section for how to blank out specific sets of genotypes.

Obligatory missingness is specified with two options:
plink --bfile mydata --oblig-missing myfile.zero --oblig-clusters myfile.clst --assoc

This command applies the default genotyping thresholds (90% per
individual and per SNP) but accounts for the fact that certain SNPs
are obligatory missing (so that, for example, the 90% refers only to those
SNPs actually attempted).

The file specified by --oblig-clusters has the same format as
a cluster file (except only a single cluster
field is allowed here, i.e. only 3 columns). For example, given the PED file test.ped


1  1 0 0 1 1   A A  C C  A A
2  1 0 0 1 1   C C  A A  C C
3  1 0 0 1 1   A C  A A  A C
4  1 0 0 1 1   A A  C C  A A
5  1 0 0 1 1   C C  A A  C C
6  1 0 0 1 1   A C  A A  A C
1b 1 0 0 1 1   A A  0 0  0 0
2b 1 0 0 1 1   C C  0 0  0 0
3b 1 0 0 1 1   A C  0 0  0 0
4b 1 0 0 1 1   A A  0 0  0 0
5b 1 0 0 1 1   C C  0 0  0 0
6b 1 0 0 1 1   A C  0 0  0 0

and MAP file test.map


    1 snp1 0 1000
    1 snp2 0 2000
    1 snp3 0 3000

If the obligatory missing file, test.oblig is


    snp2   C1
    snp3   C1

it implies that SNPs snp2 and snp3 are obligatory
missing for all individuals belonging to cluster C1. The corresponding
cluster file is test.clst

    1b 1 C1
    2b 1 C1
    3b 1 C1
    4b 1 C1
    5b 1 C1
    6b 1 C1

indicating that the last six individuals belong to
cluster C1. (Not all individuals need be specified in this
file.)

NOTE You can have more than one cluster category
specified in these files (i.e. implying different patterns of
obligatory missing data for different sets of individuals).

Running a --missing command on the basic fileset, ignoring the
obligatory missing nature of some of the data, results in the following:

plink --file test --missing

which shows in the LOG file that 6 individuals were removed because of missing data


    ...
    6 of 12 individuals removed for low genotyping ( MIND > 0.1 )
    ...

and the corresponding output files (plink.imiss
and plink.lmiss) indicate no missing data (purely because the
six individuals with 2 of 3 genotypes missing were already filtered
out and everybody else left happens to have complete genotyping).


     FID  IID MISS_PHENO   N_MISS   F_MISS
       1    1          N        0        0
       2    1          N        0        0
       3    1          N        0        0
       4    1          N        0        0
       5    1          N        0        0
       6    1          N        0        0

and


     CHR  SNP   N_MISS   F_MISS
       1 snp1        0        0
       1 snp2        0        0
       1 snp3        0        0

In contrast, if the obligatory missing data are specified as follows:

plink --file test --missing --oblig-missing test.oblig --oblig-clusters test.clst

we now see


    ...
    0 of 12 individuals removed for low genotyping ( MIND > 0.1 )
    ...

and the corresponding output files now include an extra field, N_GENO,
which indicates the number of non-obligatory missing genotypes, which is the denominator
for the genotyping rate calculations


     FID  IID MISS_PHENO   N_MISS   N_GENO   F_MISS
       1    1          N        0        3        0
       2    1          N        0        3        0
       3    1          N        0        3        0
       4    1          N        0        3        0
       5    1          N        0        3        0
       6    1          N        0        3        0
      1b    1          N        0        1        0
      2b    1          N        0        1        0
      3b    1          N        0        1        0
      4b    1          N        0        1        0
      5b    1          N        0        1        0
      6b    1          N        0        1        0

and


     CHR  SNP   N_MISS   N_GENO   F_MISS
       1 snp1        0       12        0
       1 snp2        0        6        0
       1 snp3        0        6        0

Seen another way, if one specified --mind 1 to include all
individuals (i.e. not apply the default 90% genotyping rate threshold
for each individual before this step), then the results would not
change with the obligatory missing specification in place, as
expected; in contrast, without the specification of obligatory missing
data, we would see


     FID  IID MISS_PHENO   N_MISS   F_MISS
       1    1          N        0        0
       2    1          N        0        0
       3    1          N        0        0
       4    1          N        0        0
       5    1          N        0        0
       6    1          N        0        0
      1b    1          N        2 0.666667
      2b    1          N        2 0.666667
      3b    1          N        2 0.666667
      4b    1          N        2 0.666667
      5b    1          N        2 0.666667
      6b    1          N        2 0.666667

and


     CHR  SNP   N_MISS   F_MISS
       1 snp1        0        0
       1 snp2        6      0.5
       1 snp3        6      0.5

In this not particularly exciting example, there are no missing
genotypes that are non-obligatory missing (i.e. that are not specified by
the two files); if there were, they would be counted appropriately in the
above files, and used to filter appropriately also.
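
The bookkeeping behind these tables can be reproduced with a short sketch. It only imitates the counting described above on the test.ped example; it is not PLINK's code, it keys individuals by family ID alone (enough here because every IID is 1), and the function and variable names are made up.

```python
# Example data from the text: individuals 1-6 are fully genotyped,
# 1b-6b were only attempted on snp1 ("0 0" marks a missing call).
snps = ["snp1", "snp2", "snp3"]
full = [["A A", "C C", "A A"], ["C C", "A A", "C C"], ["A C", "A A", "A C"]] * 2
genotypes = {}
for i, calls in enumerate(full, start=1):
    genotypes[str(i)] = calls
    genotypes[f"{i}b"] = [calls[0], "0 0", "0 0"]

cluster = {fid: "C1" for fid in genotypes if fid.endswith("b")}  # test.clst
oblig = {("snp2", "C1"), ("snp3", "C1")}                          # test.oblig


def individual_missingness(genotypes, snps, cluster=None, oblig=frozenset()):
    """Return {FID: (N_MISS, N_GENO, F_MISS)}; obligatory-missing calls are skipped."""
    rows = {}
    for fid, calls in genotypes.items():
        n_miss = n_geno = 0
        for snp, call in zip(snps, calls):
            if cluster is not None and (snp, cluster.get(fid)) in oblig:
                continue  # never attempted: left out of numerator and denominator
            n_geno += 1
            n_miss += call == "0 0"
        rows[fid] = (n_miss, n_geno, n_miss / n_geno)
    return rows


plain = individual_missingness(genotypes, snps)
adjusted = individual_missingness(genotypes, snps, cluster, oblig)
# Individuals that a 0.1 threshold (the --mind default) would remove.
removed_plain = [fid for fid, row in plain.items() if row[2] > 0.1]
removed_adjusted = [fid for fid, row in adjusted.items() if row[2] > 0.1]
```

Without the obligatory-missing specification the six "b" individuals have F_MISS = 0.666667 and are removed; with it, their denominator drops to 1 and nobody is removed.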

NOTE All subsequent analyses do not distinguish
whether genotypes were missing due to failure or were obligatory
missing; that is, this option only affects the behavior of
the --mind and --geno filters.

NOTE If a genotype is set to be obligatory missing
but actually in the genotype file it is not missing, then it will be
set to missing and treated as if missing.

5.3 Cluster individuals based on missing genotypes

Systematic batch effects that induce missingness in parts of the sample will induce
correlation between the patterns of missing data that different individuals display.
One approach to detecting correlation in these patterns, that might possibly identify
such biases, is to cluster individuals based on their identity-by-missingness (IBM).
This approach uses exactly the same procedure as the IBS clustering for population
stratification, except the distance between two individuals is based not on which (non-missing)
allele they have at each site, but rather the proportion of sites for which two individuals are
both missing the same genotype.

To use this option:

plink --file data --cluster-missing

which creates the files:


    plink.matrix.missing
    plink.cluster3.missing

which have similar formats to the corresponding IBS clustering files. Specifically, the
plink.mdist.missing file can be subjected to a visualisation technique such as
multidimensional scaling to reveal any strong systematic patterns of missingness.

Note The values in the .mdist file are distances rather than
similarities, unlike for standard IBS clustering. That is, a value of 0 means that two
individuals have the same profile of missing genotypes. The exact value represents the
proportion of all SNPs that are discordantly missing (i.e. where one member of the pair
is missing that SNP but the other individual is not).
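
The distance described in this note is easy to state in code. A minimal sketch, with made-up missingness profiles (True means the genotype is missing at that SNP):

```python
def ibm_distance(missing_a, missing_b):
    """Proportion of SNPs where exactly one of the two individuals is missing."""
    discordant = sum(a != b for a, b in zip(missing_a, missing_b))
    return discordant / len(missing_a)


# Hypothetical missingness profiles over four SNPs.
ind1 = [True, False, False, True]
ind2 = [True, True, False, False]
d_same = ibm_distance(ind1, ind1)  # identical profiles
d_diff = ibm_distance(ind1, ind2)  # discordant at the 2nd and 4th SNP
```

Identical profiles give 0, and the second pair gives 0.5 because two of the four SNPs are missing in only one of the two individuals.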

The other constraints (significance test, phenotype, cluster size and external matching
criteria) are not used during IBM clustering. Also, by default, all individuals and all SNPs
are included in an IBM clustering analysis, unlike IBS clustering, i.e. even individuals or
SNPs with very low genotyping, or monomorphic alleles. By explicitly specifying
--mind or --geno or --maf certain individuals or SNPs can be
excluded (although the default is probably what is usually required for quality control
procedures).

5.4 Test of missingness by case/control status

To obtain a missing chi-sq test (i.e. does, for each SNP,
missingness differ between cases and controls?), use the option:

plink --file mydata --test-missing

which generates a file


    plink.missing

which contains the fields


    CHR         Chromosome number
    SNP         SNP identifier
    F_MISS_A    Missing rate in cases
    F_MISS_U    Missing rate in controls
    P           Asymptotic p-value (Fisher's exact test)

The actual counts of missing genotypes are available in the
plink.lmiss file, which is generated by the --missing option.

Note This test is only applicable to case/control
data.
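
For a feel of what such a test does with one SNP, here is a sketch of a two-sided Fisher's exact test on a 2x2 table of missing versus genotyped counts in cases and controls. The counts are hypothetical and the code is a textbook version, not PLINK's own routine.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


# Hypothetical counts for one SNP: (missing, genotyped) in cases, then controls.
p_diff = fisher_exact_two_sided(10, 90, 2, 98)  # 10% vs 2% missing
p_same = fisher_exact_two_sided(5, 95, 5, 95)   # 5% vs 5% missing
```

With equal missing rates the p-value is 1; with 10% against 2% missing in 100 cases and 100 controls it falls to roughly 0.03.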

5.5 Haplotype-based test for non-random missing genotype data

The previous test asks whether genotypes are missing at random
or not with respect to phenotype. This test asks whether or not
genotypes are missing at random with respect to the true
(unobserved) genotype, based on the observed genotypes of nearby SNPs.

Note This test assumes dense SNP genotyping such that
flanking SNPs are typically in LD with each other. Also bear in mind that
a negative result on this test may simply reflect the fact that there is
little LD in the region.

This test works by taking a SNP at a time (the 'reference' SNP) and asking
whether the haplotype formed by the two flanking SNPs can predict whether or
not the individual is missing at the reference SNP. The test is a simple
haplotypic case/control test, where the phenotype is missing status at the
reference SNP. If missingness at the reference is not random with respect
to the true (unobserved) genotype, we may often expect to see an
association between missingness and flanking haplotypes.

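Note: Just because we do not see such an association does not necessarily mean that genotypes are missing at random; this test has higher specificity than sensitivity. That is, the test will miss a lot of cases. However, when it is used as a QC screening tool, one should pay attention to SNPs that show highly significant patterns of non-random missingness.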

This option is run with the command:

plink --file data --test-mishap

which generates an output file called


    plink.missing.hap

which has the fields


    LOCUS        Reference SNP
    HAPLOTYPE    Flanking haplotype, or heterozygosity
    F_0          Frequency of HAPLOTYPE if missing reference SNP
    F_1          Frequency of HAPLOTYPE if not missing reference SNP
    M_H1         N missing/not missing for HAPLOTYPE
    M_H2         N missing/not missing for not-HAPLOTYPE
    CHISQ        Chisquare test for non-random missingness
    P            Asymptotic p-value
    SNPS         Identifier for flanking SNPs

The HAPLOTYPE typically represents each two-SNP flanking haplotype
(i.e. not including the reference SNP itself); each reference SNP will also
have a row labelled HETERO in this column, which means
we are testing whether or not being heterozygous for the flanking haplotypes
predicts missingness at the reference SNP (flanking heterozygosity would,
under many sets of haplotype frequencies, increase the chance of being heterozygous
for the reference SNP). SNPs with no or very little missing genotype data are skipped. Only haplotypes above the
--maf threshold are used in analysis.

Here is an example from real data (rows split into two sets for clarity):


    LOCUS         HAPLOTYPE        F_0       F_1         M_H1         M_H2   
    rs17012390           CT     0.5238   0.01949       55/104      50/5233   
    rs17012390           TC     0.4762    0.9805      50/5233       55/104   
    rs17012390       HETERO          1   0.04252       56/114       0/2567   

    LOCUS         HAPLOTYPE      CHISQ         P  SNPS
    rs17012390           CT      923.4         0  rs17012387|rs17012393
    rs17012390           TC      923.4         0  rs17012387|rs17012393
    rs17012390       HETERO      863.3         0  rs17012387|rs17012393

This clearly shows a huge chi-square (the sample is large, N of over 2500 individuals).
We see that of 56 missing genotypes for this reference SNP, all occur when the flanking
haplotypic background is heterozygous (i.e. M_H1 shows 56/114, indicating that there
are 114 other instances of a heterozygous haplotypic background when the reference SNP is not missing)
whereas we see not a single missing call when the flanking SNP background is homozygous, of which we see
2567 observations. This is clearly indicative of non-random association between the unobserved genotype and
missing status.

Looking at the same data a different way, F_1 indicates that the majority of the sample (people not
missing at the reference SNP) have CT and TC haplotype frequencies of approximately
0.02 and 0.98 respectively. In contrast, because all people missing this SNP are on heterozygous backgrounds,
these frequencies become approximately 50:50 in this group (shown in F_0).
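
The CHISQ, F_0 and F_1 columns can be recomputed from the M_H1 and M_H2 counts with an ordinary 2x2 Pearson chi-square (1 df, no continuity correction). This sketch is only a consistency check on the numbers printed above; the function name is made up and it is not taken from PLINK.

```python
def mishap_row(m_h1, m_h2):
    """Return (F_0, F_1, CHISQ) from two 'missing/not missing' count strings."""
    a, c = (int(x) for x in m_h1.split("/"))  # HAPLOTYPE: missing, not missing
    b, d = (int(x) for x in m_h2.split("/"))  # not-HAPLOTYPE: missing, not missing
    n = a + b + c + d
    f_0 = a / (a + b)  # haplotype frequency among those missing the reference SNP
    f_1 = c / (c + d)  # haplotype frequency among those genotyped
    chisq = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return f_0, f_1, chisq


hetero = mishap_row("56/114", "0/2567")  # HETERO row of rs17012390
ct = mishap_row("55/104", "50/5233")     # CT row of rs17012390
```

The HETERO row gives F_0 = 1, F_1 = 0.0425 and a chi-square of about 863.3, and the CT row gives about 923.4, in agreement with the table.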

In the particular dataset this example comes from, this SNP would have passed a
standard quality control test. The --hardy command shows that this SNP does
not fail the HWE test; also, it does not show excessive amounts
of missing data (the --missing command indicates a missing rate of 0.021). The genotype
counts (obtained by the --hardy option) are, for the whole sample, 0/104/2584.

In contrast, here are the same results for a different SNP that does not show any evidence
of non-random missingness.


         LOCUS    HAPLOTYPE        F_0       F_1         M_H1         M_H2   
     rs3912752           CC    0.07692   0.06507        2/354      24/5086
     rs3912752           TT     0.1154     0.205       3/1115      23/4325
     rs3912752           CT     0.8077      0.73      21/3971       5/1469
     rs3912752       HETERO     0.2308    0.4279       3/1164      10/1556

         LOCUS    HAPLOTYPE      CHISQ          P   SNPS
     rs3912752           CC    0.05967      0.807   rs3912751|rs351596
     rs3912752           TT      1.276     0.2586   rs3912751|rs351596
     rs3912752           CT     0.7938     0.3729   rs3912751|rs351596
     rs3912752       HETERO      2.056     0.1516   rs3912751|rs351596

Here we do not see any difference in the flanking haplotype frequencies between people
missing versus genotyped for the reference SNP. Of course, there is less missingness for this SNP
(26 missing genotypes) so we might expect power is lower, even if there were non-random missingness.

This only highlights the point made above that, in general, significant results are more interpretable
than non-significant results for this test. But more importantly, if there are only a handful of missing genotypes,
we do not particularly care whether or not they are missing at random, as they would not bias the association with disease
in any case. Of course, whether there is non-random genotyping error is another question...

By default, we currently just select exactly two flanking SNPs. This can be changed with the option --mishap-window. For
example,

plink --bfile mydata --test-mishap --mishap-window 4

Future releases will feature a more intelligent selection of flanking markers.

Note This routine currently skips the SNPs on the X and Y chromosomes.

Hardy-Weinberg Equilibrium

To generate a list of genotype counts and Hardy-Weinberg test
statistics for each SNP, use the option:

plink --file data --hardy

which creates a file:


    plink.hwe

This file has the following format


    SNP             SNP identifier
    TEST            Code indicating sample
    A1              Minor allele code
    A2              Major allele code
    GENO            Genotype counts: 11/12/22
    O(HET)          Observed heterozygosity
    E(HET)          Expected heterozygosity
    P               H-W p-value

For case/control samples, each SNP will have three entries (rows) in this
file, with TEST being either ALL, AFF (cases
only) or UNAFF (controls only). For quantitative traits, only a
single row will appear for each SNP, labelled ALL(QT).
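
To show how the GENO, O(HET) and E(HET) columns relate, here is a small sketch that also computes a plain asymptotic chi-square statistic. It assumes E(HET) is 2pq from the observed allele frequency and that both alleles are present; PLINK's exact estimator may differ slightly. The counts are the 0/104/2584 quoted earlier for rs17012390.

```python
def hardy_weinberg_summary(n11, n12, n22):
    """Return (O(HET), E(HET), chi-square) from 11/12/22 genotype counts."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # frequency of allele 1
    o_het = n12 / n
    e_het = 2 * p * (1 - p)        # expected under Hardy-Weinberg proportions
    expected = [n * p * p, n * e_het, n * (1 - p) ** 2]
    chisq = sum((o - e) ** 2 / e for o, e in zip((n11, n12, n22), expected))
    return o_het, e_het, chisq


o_het, e_het, chisq = hardy_weinberg_summary(0, 104, 2584)
```

Observed and expected heterozygosity are close (about 0.0387 and 0.0379) and the chi-square is about 1.05, well below the 3.84 cut-off for p = 0.05 at 1 df, which fits the statement that this SNP does not fail the HWE test.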

Only founders are considered for the Hardy-Weinberg calculations, i.e.
for family data, any offspring are ignored.

WARNING By default, this procedure only considers founders, so
no HW results would be given for sibling-only datasets (i.e. if no parents exist).
To perform a rough, somewhat biased test, use the --nonfounders option
which means that all individuals will be included. Alternatively, manually extract
one person per family for this calculation and recode these individuals as founders
(see the --keep option to facilitate this).

The default test is an exact one, described and implemented by Wigginton et al
(see reference below), which is more accurate for rare genotypes. You can still perform
the standard asymptotic test with the --hardy2 option.


A Note on Exact Tests of Hardy-Weinberg Equilibrium.
Wigginton JE, Cutler DJ and Abecasis GR
Am J Hum Genet (2005) 76: 887-93

Allele frequency

To generate a list of minor allele frequencies (MAF) for each SNP,
based on all founders in the sample:

plink --file data --freq

will create a file:


    plink.frq

with six columns:


    CHR       Chromosome
    SNP       SNP identifier
    A1        Allele 1 code (minor allele)
    A2        Allele 2 code (major allele)
    MAF       Minor allele frequency
    NCHROBS   Non-missing allele count
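
A sketch of how the A1, A2, MAF and NCHROBS values follow from the genotypes of one SNP. The genotype list is invented, "0" is taken as the missing allele code, and the function assumes exactly two alleles are observed.

```python
from collections import Counter


def allele_frequency(genotypes, missing="0"):
    """Return (A1, A2, MAF, NCHROBS) for one SNP from 'X Y' genotype strings."""
    counts = Counter(a for g in genotypes for a in g.split() if a != missing)
    nchrobs = sum(counts.values())
    # most_common(2) lists the major allele first, then the minor allele.
    (a2, _), (a1, n_minor) = counts.most_common(2)
    return a1, a2, n_minor / nchrobs, nchrobs


# Six hypothetical founders, one of them with a missing genotype.
freq = allele_frequency(["A A", "A C", "A A", "0 0", "C C", "A A"])
```

Here C is the minor allele, seen on 3 of the 10 non-missing chromosomes, so MAF is 0.3 and NCHROBS is 10.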

HINT
To produce a summary of allele frequencies that is stratified by
a categorical cluster variable, use the --within filename
option as well as --freq. In this way, the frequencies
will be given separately for each level of the categorical variable. Details
on the format of a cluster file can be found in the PLINK documentation.

NOTE
If a SNP fails the genotyping rate threshold (as set by the
--geno value, which is by default 0.10) the frequency
will appear as NA in the plink.frq output file. To
obtain frequencies on all SNPs irrespective of genotyping rate, set
--geno 1.

Linkage disequilibrium based SNP pruning

Sometimes it is useful to generate a pruned subset of SNPs that are in
approximate linkage equilibrium with each other. This can be achieved
via two commands: --indep which prunes based on
the variance inflation factor (VIF), which recursively
removes SNPs within a sliding window;
second, --indep-pairwise which is similar, except it is based
only on pairwise genotypic correlation.

Hint The output of either of these commands is two
lists of SNPs: those that are pruned out and those that are not. A
separate command using the --extract or --exclude
option is necessary to actually perform the pruning.

The VIF pruning routine is performed:

plink --file data --indep 50 5 2

will create files


    plink.prune.in
    plink.prune.out

Each is a simple list of SNP IDs; both these files can subsequently be
specified as the argument for a
--extract or --exclude command.

The parameters for --indep are: window size in SNPs
(e.g. 50), the number of SNPs to shift the window at each step
(e.g. 5), and the VIF threshold (e.g. 2). The VIF is 1/(1-R^2)
where R^2 is the squared multiple correlation coefficient for a SNP
being regressed on all other SNPs simultaneously. That is, this
considers the correlations between SNPs but also between linear
combinations of SNPs. A VIF of 10 is often taken to represent near
collinearity problems in standard multiple regression analyses
(i.e. implies R^2 of 0.9). A VIF of 1 would imply that the SNP is
completely independent of all other SNPs. Practically, values between
1.5 and 2 should probably be used; particularly in small samples, if
this threshold is too low and/or the window size is too large, too
many SNPs may be removed.
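
The relation between a VIF threshold and the R^2 it implies is simple arithmetic; a tiny sketch with made-up helper names:

```python
def vif_from_r2(r2):
    """Variance inflation factor for a squared multiple correlation r2 (< 1)."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Invert VIF = 1/(1 - R^2) to get the R^2 that a VIF threshold implies."""
    return 1.0 - 1.0 / vif


for threshold in (1.5, 2.0, 10.0):
    print(f"VIF {threshold:>4} corresponds to R^2 = {r2_from_vif(threshold):.3f}")
```

So a threshold of 2 removes SNPs whose R^2 with the rest of the window exceeds 0.5, and a threshold of 1.5 is stricter, at about 0.333.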

The second procedure is performed:

plink --file data --indep-pairwise 50 5 0.5

This generates the same output files as the first version; the only
difference is that a simple pairwise threshold is used. The first two
parameters (50 and 5) are the same as above (window size and step);
the third parameter represents the
r^2 threshold. Note: this represents the pairwise SNP-SNP
metric now, not the multiple correlation coefficient; also note, this
is based on the genotypic correlation, i.e. it does not involve
phasing.

To give a concrete example: the command above that specifies 50 5 0.5
would a) consider a window of 50 SNPs, b) calculate LD between each pair of SNPs in the
window, c) remove one of a pair of SNPs if the LD is greater than 0.5, d) shift the window
5 SNPs forward and repeat the procedure.
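
The windowed procedure can be sketched in a few lines. This is a simplified illustration, not PLINK's algorithm: genotypes are coded as 0/1/2 allele counts with no missing data, and which SNP of a correlated pair gets dropped (here always the later one) is an assumption of this sketch.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy * sxy / (sxx * syy)


def prune_pairwise(genos, window=50, step=5, threshold=0.5):
    """genos: per-SNP genotype vectors in map order. Returns indices to keep."""
    removed = set()
    start = 0
    while start < len(genos):
        end = min(start + window, len(genos))
        idx = [i for i in range(start, end) if i not in removed]
        for pos, i in enumerate(idx):
            if i in removed:
                continue
            for j in idx[pos + 1:]:
                if j not in removed and r_squared(genos[i], genos[j]) > threshold:
                    removed.add(j)  # assumption: drop the later SNP of the pair
        start += step
    return [i for i in range(len(genos)) if i not in removed]


# Three toy SNPs over six individuals: SNP 1 nearly duplicates SNP 0.
genos = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0, 1, 1], [2, 0, 1, 1, 0, 2]]
kept = prune_pairwise(genos, window=3, step=1, threshold=0.5)
```

In the toy data SNP 1 has r^2 of about 0.79 with SNP 0 and is pruned, while SNP 2 is uncorrelated with SNP 0 and stays.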

To make a new, pruned file, then use something like (in this example,
we also convert the standard PED fileset to a binary one):

plink --file data --extract plink.prune.in --make-bed --out pruneddata

Mendel errors

To generate a list of Mendel errors for SNPs and families, use the option:

plink --file data --mendel

which will create files:


    plink.mendel
    plink.imendel
    plink.fmendel
    plink.lmendel

The *.mendel file contains all Mendel errors (i.e. one line
per error); the *.imendel file contains a summary of
per-individual error rates; the *.fmendel file contains a
summary of per-family error rates; the *.lmendel file
contains a summary of per-SNP error rates.

The *.mendel file has the following columns:


    FID            Family ID
    KID            Child individual ID
    CHR            Chromosome
    SNP            SNP ID
    CODE           A numerical code indicating the type of error (see below)
    ERROR          Description of the actual error

The error codes are as follows:


     Code   Pat  ,   Mat   ->    Offspring

    1      AA   ,   AA    ->    AB      
    2      BB   ,   BB    ->    AB

    3      BB   ,   **    ->    AA
    4      **   ,   BB    ->    AA
    5      BB   ,   BB    ->    AA

    6      AA   ,   **    ->    BB
    7      **   ,   AA    ->    BB
    8      AA   ,   AA    ->    BB

    9      **   ,   AA    ->    BB    (X chromosome male offspring)
    10     **   ,   BB    ->    AA    (X chromosome male offspring)
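
The autosomal part of this table (codes 1 to 8) can be written as a small lookup function. This is a sketch based only on the table: "**" is read as "any genotype", the X-chromosome codes 9 and 10 are left out, and the genotype strings AA/AB/BB are a coding chosen for the example.

```python
def mendel_error_code(pat, mat, child):
    """Return the autosomal error code (1-8) from the table, or None if consistent."""
    if child == "AB":
        if pat == mat == "AA":
            return 1
        if pat == mat == "BB":
            return 2
    elif child == "AA":
        if pat == mat == "BB":
            return 5
        if pat == "BB":
            return 3
        if mat == "BB":
            return 4
    elif child == "BB":
        if pat == mat == "AA":
            return 8
        if pat == "AA":
            return 6
        if mat == "AA":
            return 7
    return None


codes = [
    mendel_error_code("AA", "AA", "AB"),  # both parents AA, child AB
    mendel_error_code("BB", "AB", "AA"),  # father BB cannot transmit A
    mendel_error_code("AB", "AB", "AA"),  # consistent transmission
]
```

The checks on both parents come first so that, for example, BB x BB -> AA is reported as code 5 rather than code 3.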

The *.lmendel file has the following columns:


    CHR            Chromosome
    SNP            SNP ID
    N              Number of Mendel errors for this SNP

The *.imendel file has the following columns:


    FID            Family ID
    IID            Individual ID
    N              Number of errors this individual was implicated in

The following heuristic is used to provide a rough estimate of Mendel
error rate 'per individual': error types 1 and 2 count for all 3
individuals (child, father, mother); error types 5 and 8 count only
for the child (i.e. otherwise requires two errors, one in each
parent); error types 3 and 6 count for the child and the father; all
other types (4, 7, 9 and 10) count for the offspring and the mother.
This metric might indicate that, for example, in a nuclear family with
two parents and two offspring, many more Mendel errors can be
associated with the first sibling; the remaining trio might not show
any increased rate.

Currently, PLINK only scans full trios for Mendel errors. Families
with fewer than 2 parents in the dataset will not be tested.

Finally, the *.fmendel file has the following columns:


    FID            Family ID
    PAT            Paternal individual ID
    MAT            Maternal individual ID
    CHLD           Number of offspring in this (nuclear) family
    N              Number of Mendel errors for this (nuclear) family

Sex check

This option uses X chromosome data to determine sex (i.e. based on
heterozygosity rates) and flags individuals for whom the reported sex in
the PED file does not match the estimated sex (given genomic data). To
run this analysis, use the flag:

plink --bfile data --check-sex

which generates a file


    plink.sexcheck

which contains the fields


    FID     Family ID
    IID     Individual ID
    PEDSEX  Sex as determined in pedigree file (1=male, 2=female)
    SNPSEX  Sex as determined by X chromosome
    STATUS  Displays "PROBLEM" or "OK" for each individual
    F       The actual X chromosome inbreeding (homozygosity) estimate

A PROBLEM arises if the two sexes do not match, or if the
SNP data or pedigree data are ambiguous with regard to sex. A male
call is made if F is more than 0.8; a female call is made if
F is less than 0.2.
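
The calling rule can be written out as follows. The thresholds are the ones given above; using 0 for "no call" and the function names are assumptions of this sketch.

```python
def call_sex(f):
    """SNPSEX from the X-chromosome F estimate: 1 = male, 2 = female, 0 = no call."""
    if f > 0.8:
        return 1
    if f < 0.2:
        return 2
    return 0


def sex_status(pedsex, f):
    """'OK' only when a call can be made and it matches the pedigree sex."""
    snpsex = call_sex(f)
    return "OK" if snpsex != 0 and snpsex == pedsex else "PROBLEM"


statuses = [sex_status(1, 0.98), sex_status(2, 0.95), sex_status(2, 0.5)]
```

A reported male with F = 0.98 is OK, a reported female with F = 0.95 is flagged, and F = 0.5 is flagged whatever the reported sex because no call can be made.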

The command

plink --bfile data --impute-sex --make-bed --out newfile

will impute the sex codes based on the SNP data, and create a new file
with the revised assignments, in this case a new binary fileset.

Pedigree errors

PLINK can accept multigenerational family data for family-based
tests and Mendel error checks. It will break multigenerational families
down into nuclear family units where appropriate. Extended family
information is not used in an optimal manner, however (e.g. to help find
Mendel errors using grandparental genotypes if parental genotypes are
missing).

Unless PLINK is explicitly told to perform a family-based
analysis, it will ignore any pedigree structure in the sample and
analyse the data as if all individuals are unrelated (i.e.
the --assoc option, for example, will ignore family
structure). It is therefore the responsibility of the user to ensure
that the data are appropriate for the type of test (e.g. if performing
a standard association test with --assoc, this implies that
all individuals should be unrelated for asymptotic significance values
to be correct). The exception to this general rule is that certain
summary statistics are based only on founders.

PLINK will spot most pedigree errors (e.g. if an
individual has two fathers). For a more comprehensive
evaluation of pedigree errors (invalid or incompletely specified
pedigree structures) please use a different software package such as
PEDSTATS or famtypes.

Friday, October 3, 2008

PLINK

PLINK

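1. A PED file consists of six information columns followed by the genotype data columns.

FamilyID IndividualID PaternalID MaternalID Sex Affected Genotypes

The information columns are separated by spaces or tabs.
One or more genotypes follow, in this form:

A A
? ?
N N

The two alleles of a genotype are separated by a space.
Adjacent SNPs are separated by a space or a tab.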
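
A minimal sketch of reading one PED line along these rules. The function and dictionary keys are made up for the example; the sample line is one of the test.ped rows shown earlier.

```python
def parse_ped_line(line):
    """Split one PED line into its six information columns and genotype pairs."""
    fields = line.split()  # splits on any run of spaces or tabs
    fid, iid, pat, mat, sex, pheno = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("each genotype needs two alleles")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {"fid": fid, "iid": iid, "pat": pat, "mat": mat,
            "sex": sex, "pheno": pheno, "genotypes": genotypes}


record = parse_ped_line("1b 1 0 0 1 1   A A  0 0  0 0")
```

The result has family ID 1b, individual ID 1, and three genotypes, the last two of which are the missing call ("0", "0").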

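Monday, August 18, 2008

Epistasis

Epistasis

An interaction between two or more loci that produces a phenotype different from the one seen when each gene is expressed on its own. In statistical studies that assess the sources of phenotypic differences, the term epistasis refers to phenotypic variation that arises from interactions between non-allelic genes.

In epistatic gene action, two independent genes would each be expected to show their own phenotype, but only the effect of one gene is expressed while the effect of the other is masked. The gene whose phenotype is expressed is called the epistatic gene, and the gene that is masked is called the hypostatic gene.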

Sunday, July 27, 2008

PED file

A PED file is a space- or tab-delimited file whose first six columns are mandatory.

Family_ID Individual_ID Paternal_ID Maternal_ID Sex Phenotype

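* Sex: 1 = male, 2 = female, other = unknown
* IDs are alphanumeric. The combination of family ID and individual ID must uniquely identify a person. A PED file must have one and only one phenotype, in the sixth column.
* The phenotype is either a quantitative trait or an affection status coded 0, 1, 2, or the missing-value code.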

Tuesday, July 15, 2008

Quantile normalization

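Quantile normalization

A non-parametric normalization method introduced by Terry Speed's group, used mainly for synthesized (oligonucleotide) chips.
It assumes that the distribution of gene abundances is nearly the same in all samples. For convenience, the pooled distribution of the probes over all chips is taken as the reference.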

Monday, July 14, 2008

GWA-WTCCC data analysis methods

CHIAMO
CHIAMO is a program for calling genotypes from the Affymetrix 500K Mapping chip. The program allows for multiple cohorts which have potentially different intensity characteristics that can lead to elevated false-positive rates in genome-wide studies. [I take this to mean that multi-cohort analysis is prone to an increased false-positive rate.] The model used has a hierarchical structure, which allows for correlation between the parameters of each cohort. A more detailed description will probably appear in a paper soon. The files produced by CHIAMO are used by the programs SNPTEST and IMPUTE. CHIAMO was used to call the genotypes for the seven GWAS carried out by the WTCCC.

SNPTEST
SNPTEST is a program for the analysis of single-SNP association in genome-wide studies. The tests can be carried out for both binary (case-control) and quantitative phenotypes, can condition on an arbitrary set of covariates, and can account for genotype uncertainty. The program is designed to work seamlessly with the output files of IMPUTE and GTOOL and with genotypes called by CHIAMO.

IMPUTE
IMPUTE is a program for imputing unobserved genotypes in genome-wide case-control studies based on a set of known haplotypes (such as the HapMap Phase II haplotypes). It can take the output files of CHIAMO and HAPGEN as input, and the output files of IMPUTE can be used as input to SNPTEST.

HAPGEN
HAPGEN simulates case-control datasets at SNP markers and can write the simulated data in the file formats used by IMPUTE, SNPTEST, and GTOOL. The method can handle markers in LD and can simulate datasets over large regions such as whole chromosomes.

Thursday, July 10, 2008

HapMap project

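HapMap project

Humans have 23 pairs of chromosomes. Sequencing showed that the genome consists of about 3 billion bases. Human genome sequences are almost identical from person to person, but on average there is one base difference per 1,200 bases, so any two genomes differ at roughly 2.5 million single-base positions. The variation includes a different base, or the insertion or deletion of a base. Single-base variants are called SNPs (pronounced "snips"). About ten million SNPs are expected to exist in the human population as a whole.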

Sunday, June 22, 2008

Linkage disequilibrium

In population genetics, linkage disequilibrium (LD) is the non-random association of alleles at two or more loci (which need not be on the same chromosome). It is not the same as linkage, which describes the association of two or more loci on one chromosome with limited recombination between them. Linkage disequilibrium describes a situation in which some combinations of alleles or genetic markers occur more or less frequently in a population than would be expected from a random formation of haplotypes from alleles based on their frequencies. Non-random association between polymorphisms at different loci is measured by the degree of LD.
LD is generally shaped by genetic linkage and the rate of recombination, as well as by mutation rate, random drift, non-random mating, and population structure. For example, some organisms (such as bacteria) may show LD because they reproduce asexually and there is no recombination to break down the LD.
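
The "more or less frequently than expected" idea is usually quantified with the coefficient D and with r^2 for a pair of biallelic loci. These standard measures are not spelled out in the post, so the sketch below is an added illustration with made-up frequencies.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) from the AB haplotype frequency and the allele frequencies."""
    d = p_ab - p_a * p_b  # observed minus expected under random association
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


complete = ld_stats(0.5, 0.5, 0.5)    # A always travels with B
independent = ld_stats(0.25, 0.5, 0.5)  # AB exactly as frequent as expected
```

When the AB haplotype occurs exactly as often as the product of the allele frequencies, D and r^2 are 0; when A is only ever found with B, r^2 reaches 1.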

Tuesday, June 10, 2008

Genome-Wide Association Study

Genome-Wide Association Study

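Genome-wide association study. Rendered in Korean, it comes out as something like "association study at the genome-wide scale (or level)"; even the translation is difficult. It is abbreviated as GWAS. GWAS has been attracting a great deal of attention recently, so let us look at what it is. The following website was used as a reference.
1) www.genome.gov/20019523

1. What is a GWAS?
A GWAS is an approach that involves rapidly scanning markers across the complete sets of DNA, or genomes, of many people to find genetic variations associated with a particular disease. Once new genetic associations are identified, researchers can use the information to develop better strategies to detect, treat, and prevent the disease. Such studies are particularly useful for finding genetic variations that contribute to common, complex diseases such as asthma, cancer, diabetes, heart disease, and mental illness.

2. Why are such studies possible now?
With the completion of the Human Genome Project in 2003 and the International HapMap Project in 2005, researchers now have a set of research tools that make it possible to find the genetic contributions to common diseases. The tools include computerized databases containing the reference human genome sequence and a map of human genetic variation, together with a set of new technologies that can quickly and accurately analyze whole-genome samples for the genetic variations that contribute to the onset of a disease.

3. How will GWAS benefit human health?
The impact of GWAS on medical care could potentially be substantial. This research is laying the groundwork for the era of personalized medicine, in which the current one-size-fits-all approach to medical care gives way to more individualized strategies.
In the future, after improvements in the cost and efficiency of genome-wide scans and other innovative technologies, health professionals may be able to use such tools to give patients individualized information about their risks of developing certain diseases. This information will let health professionals tailor prevention programs to each person's unique genetic makeup. In addition, if a patient does become ill, the information can be used to select the treatments most likely to be effective and least likely to cause adverse reactions in that particular patient.

4. What have GWAS found?
Researchers have already reported considerable success using this new strategy. For example, in 2005, three independent studies found that a common form of blindness is associated with variation in the gene for complement factor H, which produces a protein involved in regulating inflammation. Few had previously thought that inflammation might contribute so significantly to this type of blindness, which is called age-related macular degeneration.
Similar successes have been reported using GWAS to identify genetic variations that contribute to type 2 diabetes, Parkinson's disease, heart disorders, obesity, Crohn's disease (regional ileitis), and prostate cancer, as well as genetic variations that influence the response to anti-depressant medications.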